Recovery Benchmarks

Measuring failure detection and recovery speed using Toxiproxy.

Metrics

| Metric | Description | Unit |
|---|---|---|
| error_detection_latency | Time from failure injection to first detection | ms |
| error_propagation_latency | Time until all threads detect failure | ms |
| full_recovery_latency | Time until all threads fully recover | ms |
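As a rough illustration of how the three metrics relate, the sketch below derives them from per-thread timestamps. The `ThreadTimeline` type and `recovery_metrics` function are hypothetical names, not the benchmark's actual code. It also assumes that propagation is measured up to the last thread that detected the failure at all, and that all timestamps come from the same monotonic clock.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ThreadTimeline:
    """Timestamps in ms for one worker thread (hypothetical structure)."""
    detected_at_ms: Optional[float]   # first error observed, None if never seen
    recovered_at_ms: Optional[float]  # first successful operation after the failure


def recovery_metrics(injected_at_ms: float, timelines: Sequence[ThreadTimeline]) -> dict:
    """Compute the three latencies relative to the injection timestamp."""
    detections = [t.detected_at_ms for t in timelines if t.detected_at_ms is not None]
    recoveries = [t.recovered_at_ms for t in timelines if t.recovered_at_ms is not None]
    return {
        # Earliest detection by any thread.
        "error_detection_latency": min(detections) - injected_at_ms if detections else None,
        # Latest detection among the threads that detected anything (assumption).
        "error_propagation_latency": max(detections) - injected_at_ms if detections else None,
        # Latest recovery across threads.
        "full_recovery_latency": max(recoveries) - injected_at_ms if recoveries else None,
        "threads_detected": len(detections),
        "threads_recovered": len(recoveries),
    }
```

A thread that reconnects without ever observing an error counts as recovered but not as detected. Under this assumption, the detected and recovered thread counts can differ.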

Failure Types

TCP RST (reset_peer)

Simulates an immediate connection close from the server side.

toxic_reset_peer(rabbitmq_proxy, timeout_ms=0)

Real scenarios:

  • RabbitMQ process crash
  • OOM kill
  • Hardware failure
  • Network device reset
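For readers without the test helpers at hand, here is a minimal sketch of what a helper like `toxic_reset_peer` might do underneath, assuming it talks to Toxiproxy's HTTP API directly. The proxy name `rabbitmq`, the API address `http://localhost:8474` (Toxiproxy's default port), and the function names `build_toxic` and `add_toxic` are illustrative assumptions. The project's helpers may instead use a client library.

```python
import json
import urllib.request


def build_toxic(toxic_type, attributes, stream="downstream", toxicity=1.0):
    """Build the JSON body Toxiproxy expects when creating a toxic."""
    return {
        "name": f"{toxic_type}_{stream}",
        "type": toxic_type,
        "stream": stream,
        "toxicity": toxicity,
        "attributes": attributes,
    }


def add_toxic(proxy_name, toxic, api_url="http://localhost:8474"):
    """POST the toxic to a running Toxiproxy instance (address is an assumption)."""
    request = urllib.request.Request(
        f"{api_url}/proxies/{proxy_name}/toxics",
        data=json.dumps(toxic).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# timeout=0 asks Toxiproxy to reset the connection immediately.
reset_peer = build_toxic("reset_peer", {"timeout": 0})
```

With Toxiproxy running, `add_toxic("rabbitmq", reset_peer)` would apply the toxic to a proxy of that (hypothetical) name.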

Network Partition (timeout)

Simulates a complete loss of network connectivity.

toxic_timeout(rabbitmq_proxy, timeout_ms=100)

Real scenarios:

  • Network cable disconnect
  • Firewall block
  • Network partition in distributed systems
  • Cloud provider network issues

Degraded Network + Failure

Simulates a slow network that then fails completely.

toxic_latency(rabbitmq_proxy, 200, jitter_ms=100)  # 200ms latency
# ... some time passes ...
toxic_reset_peer(rabbitmq_proxy, timeout_ms=0)      # then failure
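One way to make the "some time passes" step explicit is to describe the scenario as an ordered list of stages. The sketch below is hypothetical. `run_stages`, the stage format, and the 5-second gap are placeholders rather than values from the benchmark, and `apply` stands in for whatever function actually installs a toxic.

```python
import time


def run_stages(stages, apply, sleep=time.sleep):
    """Apply each (delay_s, toxic) pair in order, waiting delay_s before each."""
    for delay_s, toxic in stages:
        sleep(delay_s)
        apply(toxic)


# Hypothetical scenario: add latency first, then reset the connection later.
# The 5.0 s gap is a placeholder; the source does not specify the real delay.
DEGRADE_THEN_FAIL = [
    (0.0, {"type": "latency", "attributes": {"latency": 200, "jitter": 100}}),
    (5.0, {"type": "reset_peer", "attributes": {"timeout": 0}}),
]
```

Passing `sleep` as a parameter keeps the sequence testable without real waiting.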

Results

Recovery Time Comparison

[Figure: Recovery Times]

Detailed Results

TCP RST Recovery (50 threads)

| Metric | Value |
|---|---|
| Error Detection | -0.67 ms* |
| Error Propagation | 513.47 ms |
| Full Recovery | 515.63 ms |
| Threads Detected | 11/50 |
| Threads Recovered | 50/50 |

Negative detection time

* A negative value means that some threads detected the error before the injection timestamp was recorded. This is a race in the measurement, not a real negative latency.
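The race can be illustrated with a small sketch. It assumes the injection timestamp is taken only after the injection call returns, which is a guess about the harness rather than a confirmed detail. Recording the clock on both sides of the call gives bounds on the true latency instead of a single value that may be negative. The function names are hypothetical.

```python
import time


def inject_with_bracket(inject, clock=time.monotonic):
    """Run the injection and return clock readings taken before and after it."""
    before = clock()
    inject()
    after = clock()
    return before, after


def detection_latency_bounds_ms(detected_at, before, after):
    """Return (lower, upper) bounds in ms for the true detection latency.

    The failure really happened somewhere between `before` and `after`, so
    the latency lies between (detected_at - after) and (detected_at - before).
    The lower bound is clamped at zero because detection cannot precede
    the failure itself.
    """
    lower = max(0.0, (detected_at - after) * 1000.0)
    upper = (detected_at - before) * 1000.0
    return lower, upper
```

If a thread detects the error while the injection call is still returning, `detected_at - after` is negative. The bracketed form reports that case as a latency between 0 and `upper` milliseconds.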

Network Partition Recovery (50 threads)

| Metric | Value |
|---|---|
| Error Detection | 105.31 ms |
| Error Propagation | 105.83 ms |
| Full Recovery | 1530.35 ms |
| Threads Detected | 50/50 |
| Threads Recovered | 50/50 |

Analysis

Why is Network Partition slower?

TCP RST:
  Client ←──[RST]── Server
  └─ Immediate notification

Network Partition:
  Client ──────X──── Server
  └─ Must wait for timeout (default ~100ms in test)
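The difference can be reproduced with plain loopback sockets, without Toxiproxy or RabbitMQ. The sketch below is only an illustration of the mechanism, not the benchmark's code. It assumes a Linux or macOS host, where `SO_LINGER` takes two C ints, and uses a hypothetical 100 ms read timeout to mirror the test setting.

```python
import socket
import struct
import threading
import time


def _serve_once(listener, mode, release):
    conn, _ = listener.accept()
    if mode == "rst":
        # SO_LINGER with a zero linger time makes close() send RST, not FIN.
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
        conn.close()
    else:
        # Stay silent, like a partition: hold the socket until the client gives up.
        release.wait(timeout=5)
        conn.close()


def time_to_detect(mode, read_timeout_s=0.1):
    """Return (outcome, elapsed_ms) for a client read against the given server mode."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    release = threading.Event()
    server = threading.Thread(target=_serve_once, args=(listener, mode, release))
    server.start()
    client = socket.create_connection(listener.getsockname(), timeout=read_timeout_s)
    start = time.monotonic()
    try:
        outcome = "data" if client.recv(1) else "closed"
    except ConnectionResetError:
        outcome = "reset"      # the peer told us explicitly
    except socket.timeout:
        outcome = "timeout"    # nobody told us; we only know because we waited
    elapsed_ms = (time.monotonic() - start) * 1000.0
    release.set()
    server.join()
    client.close()
    listener.close()
    return outcome, elapsed_ms
```

Calling `time_to_detect("rst")` should return almost immediately, while `time_to_detect("silent")` returns only after the read timeout has elapsed.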

Target Metrics

| Metric | Target | Result |
|---|---|---|
| TCP RST Detection | < 10 ms | ✅ 0.48 ms |
| Network Partition Detection | < 200 ms | ✅ 105 ms |
| Full Recovery | < 5 s | ✅ 1.5 s |

Reproduction

Requirements

Recovery tests require Toxiproxy:

docker compose -f docker-compose.test.yml up -d

# All recovery tests
pytest tests/benchmarks/bench_recovery_latency.py -v

# TCP RST only
pytest tests/benchmarks/bench_recovery_latency.py::TestRecoveryLatency::test_recovery_after_reset_peer -v