Performance profile

Profiling results from motel's engine benchmarks on an Apple M1 (8-core, 16 GB). Benchmarks were run with Go 1.25 and the OTel Go SDK v1.40.

Throughput ceiling

The engine's throughput is primarily controlled by the inter-arrival sleep in the main loop, not by CPU. However, trace generation time is added to the sleep interval, so the effective rate is lower than the target. The gap is largest at high rates, where the per-trace generation time becomes a significant fraction of the interval:

Target rate (traces/s)	Actual spans/sec	Spans per run (1s)	Allocs/run
100/s	363	364	5,209
1,000/s	3,440	3,443	49,080
5,000/s	16,470	16,475	234,120
10,000/s	32,070	32,077	457,120

The test topology produces ~3.6 spans per trace (gateway, backend, database, cache). At 10,000 traces/s with a noop exporter, the engine produces ~32,000 spans/s rather than the theoretical ~36,000, because each ~4 µs trace generation is added on top of the 100 µs inter-arrival interval. This is still well above the rate at which most collectors start dropping data, so motel is not the bottleneck at any rate in the stress-test guide.

Per-span cost

From BenchmarkWalkTrace with a 4-service topology (4 spans/trace):

~3,900 ns/trace (~975 ns/span)
6,112 bytes/trace (~1,528 bytes/span)
54 allocs/trace (~13.5 allocs/span)
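For readers who want to collect the same three figures for their own code, the sketch below shows one way to do it with `testing.Benchmark` from the standard library. The `fakeSpan` type and `buildTrace` function are hypothetical stand-ins for real trace generation, and `perSpan` is an illustrative helper; none of them are motel or OTel APIs.

```go
package main

import (
	"fmt"
	"testing"
)

// perSpan divides a per-trace figure by the number of spans in the trace.
func perSpan(perTrace float64, spansPerTrace int) float64 {
	return perTrace / float64(spansPerTrace)
}

// fakeSpan and buildTrace are hypothetical stand-ins for real span creation.
type fakeSpan struct {
	name  string
	attrs map[string]string
}

func buildTrace(services []string) []*fakeSpan {
	spans := make([]*fakeSpan, 0, len(services))
	for _, s := range services {
		spans = append(spans, &fakeSpan{
			name:  s,
			attrs: map[string]string{"service.name": s},
		})
	}
	return spans
}

// sink keeps the compiler from optimising the benchmarked work away.
var sink []*fakeSpan

func main() {
	services := []string{"gateway", "backend", "database", "cache"}

	// testing.Benchmark runs the function with increasing b.N and
	// returns timing and allocation statistics per iteration.
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = buildTrace(services)
		}
	})

	n := len(services)
	fmt.Printf("%d ns/trace (~%.0f ns/span)\n",
		res.NsPerOp(), perSpan(float64(res.NsPerOp()), n))
	fmt.Printf("%d bytes/trace (~%.0f bytes/span)\n",
		res.AllocedBytesPerOp(), perSpan(float64(res.AllocedBytesPerOp()), n))
	fmt.Printf("%d allocs/trace (~%.1f allocs/span)\n",
		res.AllocsPerOp(), perSpan(float64(res.AllocsPerOp()), n))
}
```

The numbers this prints describe the stand-in workload only. The per-span values are simple divisions of the per-trace values by the span count, which is how the figures above relate to each other.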

Most allocation comes from the OTel SDK, not from motel's engine code:

Source	Share of alloc_space
OTel `newRecordingSpan`	40%
OTel `SetAttributes` (slice grow)	27%
motel `walkTrace` (attr slices, maps)	24%
OTel span options	9%

CPU hot paths

From a CPU profile of BenchmarkWalkTrace (4-service topology, noop exporter):

Figure: CPU flamegraph from BenchmarkWalkTrace.

The flamegraph shows walkTrace and executeCall as the widest application frames, with the OTel SDK ((*tracer).Start, newRecordingSpan, SetAttributes) underneath. Runtime and GC are visible but modest.

To reproduce this flamegraph:

go test -run=NONE -bench=BenchmarkWalkTrace -benchtime=5s \
  -cpuprofile=cpu.prof ./pkg/synth/
go tool pprof -raw cpu.prof | stackcollapse-go.pl | flamegraph.pl > flamegraph.svg

Sample pprof -text output from the same profile:

      flat  flat%   sum%        cum   cum%
     0.24s  2.11% 46.52%         4s 35.24%  synth.(*Engine).walkTrace
     0.03s  0.26% 46.78%      2.91s 25.64%  synth.(*Engine).executeCall
     0.05s  0.44% 47.22%      1.76s 15.51%  sdk/trace.(*tracer).Start
     0.25s  2.20% 49.43%      1.15s 10.13%  sdk/trace.(*tracer).newSpan
     0.17s  1.50% 54.01%      0.67s  5.90%  sdk/trace.(*recordingSpan).SetAttributes

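Adding the cumulative shares of walkTrace (35.24%) and executeCall (25.64%) puts the engine's call path at about 60% of CPU, with the remaining ~40% attributed to the OTel SDK. These are cumulative figures, which in pprof include time spent in callees, so treat the split as approximate. GC pressure is modest at ~5%.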

stdout vs OTLP export

Benchmark allocations at 1,000 traces/s for 1 second (noop exporter vs stdout writing to io.Discard):

Exporter	Spans/sec	Allocs/run	Bytes/run
noop	3,456	49,270	5.5 MB
stdout (discard)	3,130	187,640	13.4 MB

With stdout serialisation, a run makes ~3.8x as many allocations and uses ~2.4x as much memory as with the noop exporter.

CPU usage

Measured with /usr/bin/time on the real binary over a 10-second run with a 4-service topology. CPU percentage is user+sys time divided by wall time, representing usage of a single core:

Target rate	stdout CPU	OTLP CPU
1,000/s	14%	6%
5,000/s	28%	13%
10,000/s	39%	20%

OTLP export (to a local collector with a noop exporter) uses roughly half the CPU of stdout, because the OTel SDK batches spans and the collector handles serialisation.

When does motel fall behind?

With OTLP export, motel keeps pace at all tested rates up to 10,000 traces/s.

With --stdout, the synchronous JSON serialisation becomes a bottleneck at high rates. At a target of 10,000 traces/s, the stdout exporter achieves only ~5,900 traces/s (~23,500 spans/s). For rates above ~5,000 traces/s, use OTLP export instead.

If you see interval drift (actual rate lower than target), check:

Whether you're using --stdout — switch to OTLP for high rates
Collector queue depth and retry backoff
Network latency if sending OTLP to a remote endpoint

Profiling a live run

motel includes a --pprof flag that starts a pprof HTTP server:

motel run --pprof :6060 --stdout --duration 30s your-topology.yaml > /dev/null

Then in another terminal:

# CPU profile (30-second sample)
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'

# Heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Goroutine dump
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'

To generate a flamegraph from a live run:

# Save a 30-second CPU profile
curl -o cpu.prof 'http://localhost:6060/debug/pprof/profile?seconds=30'

# Convert to flamegraph (requires flamegraph.pl and stackcollapse-go.pl)
go tool pprof -raw cpu.prof | stackcollapse-go.pl | flamegraph.pl > flamegraph.svg

Recommendations

For collector stress testing, 5,000-10,000 traces/s is safe with OTLP export. At 10,000 traces/s motel uses ~20% of one core, leaving ample headroom.
--stdout tops out at ~5,000-6,000 traces/s. Use it for debugging and small-scale testing. For sustained high-rate runs, use OTLP export.
The per-span allocation cost (~1.5 KB, ~13.5 allocs) is dominated by the OTel SDK. Reducing motel's own allocations would yield diminishing returns.
If profiling a specific topology, use --pprof :6060 and go tool pprof to identify whether the bottleneck is motel, the exporter, or the collector.