Recovery Benchmarks
Measuring failure detection and recovery speed using Toxiproxy.
Metrics
| Metric | Description | Unit |
|---|---|---|
| `error_detection_latency` | Time from failure injection to first detection | ms |
| `error_propagation_latency` | Time until all threads detect failure | ms |
| `full_recovery_latency` | Time until all threads fully recover | ms |
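
As a rough illustration of how the three metrics relate, the sketch below derives them from per-thread timestamps. The `ThreadTimeline` type and `summarize` function are hypothetical names, not part of the benchmark code. It also assumes that propagation is measured over the threads that actually detected the failure.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class ThreadTimeline:
    """Timestamps (seconds, same monotonic clock) observed by one worker thread."""
    detected_at: Optional[float]   # None if the thread never saw the error
    recovered_at: Optional[float]  # None if the thread never recovered


def _ms(start: float, end: float) -> float:
    return (end - start) * 1000.0


def summarize(injected_at: float, timelines: Sequence[ThreadTimeline]) -> Dict[str, object]:
    """Compute the three latency metrics plus detected/recovered thread counts."""
    detected = [t.detected_at for t in timelines if t.detected_at is not None]
    recovered = [t.recovered_at for t in timelines if t.recovered_at is not None]
    return {
        # First thread to notice the failure.
        "error_detection_latency": _ms(injected_at, min(detected)) if detected else None,
        # Last thread to notice the failure (assumption: only detecting threads count).
        "error_propagation_latency": _ms(injected_at, max(detected)) if detected else None,
        # Last thread to become healthy again.
        "full_recovery_latency": _ms(injected_at, max(recovered)) if recovered else None,
        "threads_detected": len(detected),
        "threads_recovered": len(recovered),
    }


example = summarize(
    injected_at=10.0,
    timelines=[
        ThreadTimeline(detected_at=10.002, recovered_at=10.50),
        ThreadTimeline(detected_at=10.010, recovered_at=10.52),
        ThreadTimeline(detected_at=None, recovered_at=10.51),
    ],
)
```

A thread can recover without ever being counted as having detected the failure, for example if it was idle during the outage. This is why the two thread counts are reported separately.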
Failure Types
TCP RST (reset_peer)
Simulates an immediate connection close from the server side.
Real scenarios:
- RabbitMQ process crash
- OOM kill
- Hardware failure
- Network device reset
Network Partition (timeout)
Simulates a complete loss of network connectivity.
Real scenarios:
- Network cable disconnect
- Firewall block
- Network partition in distributed systems
- Cloud provider network issues
Degraded Network + Failure
Simulates a slow network that then fails completely.
toxic_latency(rabbitmq_proxy, 200, jitter_ms=100)  # add 200 ms latency with 100 ms jitter
# ... some time passes ...
toxic_reset_peer(rabbitmq_proxy, timeout_ms=0)  # then reset the connection immediately
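
The helpers above belong to the benchmark and their implementation is not shown here. As a minimal sketch of what such helpers might send, the code below builds the JSON bodies for Toxiproxy's HTTP API (`POST /proxies/{name}/toxics`). The function names, the `rabbitmq` proxy name, and the API address are assumptions for illustration.

```python
import json
import urllib.request
from typing import Any, Dict

# Assumption: Toxiproxy is reachable on its default API port.
TOXIPROXY_API = "http://localhost:8474"


def latency_toxic(latency_ms: int, jitter_ms: int = 0) -> Dict[str, Any]:
    """Body for a `latency` toxic: delay data by latency_ms +/- jitter_ms."""
    return {
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }


def reset_peer_toxic(timeout_ms: int = 0) -> Dict[str, Any]:
    """Body for a `reset_peer` toxic: send TCP RST after timeout_ms (0 = at once)."""
    return {
        "type": "reset_peer",
        "stream": "downstream",
        "toxicity": 1.0,
        "attributes": {"timeout": timeout_ms},
    }


def add_toxic(proxy_name: str, toxic: Dict[str, Any], api: str = TOXIPROXY_API) -> Dict[str, Any]:
    """POST the toxic to a running Toxiproxy server and return its JSON reply."""
    request = urllib.request.Request(
        f"{api}/proxies/{proxy_name}/toxics",
        data=json.dumps(toxic).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


# Usage against a live server (hypothetical proxy name "rabbitmq"):
#   add_toxic("rabbitmq", latency_toxic(200, jitter_ms=100))
#   ... some time passes ...
#   add_toxic("rabbitmq", reset_peer_toxic(timeout_ms=0))
```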
Results
Recovery Time Comparison

Detailed Results
TCP RST Recovery (50 threads)
| Metric | Value |
|---|---|
| Error Detection | -0.67 ms* |
| Error Propagation | 513.47 ms |
| Full Recovery | 515.63 ms |
| Threads Detected | 11/50 |
| Threads Recovered | 50/50 |
Negative detection time
A negative value means at least one thread detected the error before the injection timestamp was recorded. This is a race in the measurement, not a real negative latency.
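
One way to make this race visible is to take a timestamp on both sides of the injection call and report detection latency as a range. The sketch below is only an illustration under that assumption. `timed_injection` and `detection_bounds_ms` are hypothetical names, and the benchmark may record its timestamp differently.

```python
import time
from typing import Callable, Tuple


def timed_injection(
    inject: Callable[[], None],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, float]:
    """Run the injection and return (timestamp before the call, timestamp after it)."""
    before = clock()
    inject()
    after = clock()
    return before, after


def detection_bounds_ms(before: float, after: float, detected_at: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds on detection latency in milliseconds.

    Measuring from `after` can go negative when a thread reacts while the
    injection call is still returning; measuring from `before` cannot go
    negative for a detection that the injection caused.
    """
    lower = (detected_at - after) * 1000.0
    upper = (detected_at - before) * 1000.0
    return lower, upper


# Deterministic illustration with a fake clock: the injection call takes 2 ms
# and a worker notices the failure 1.3 ms after the call started.
_ticks = iter([1.000, 1.002])
before, after = timed_injection(lambda: None, clock=lambda: next(_ticks))
lower_ms, upper_ms = detection_bounds_ms(before, after, detected_at=1.0013)
```

In this illustration the lower bound is negative and the upper bound is positive, which is the same pattern as the starred value in the table.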
Network Partition Recovery (50 threads)
| Metric | Value |
|---|---|
| Error Detection | 105.31 ms |
| Error Propagation | 105.83 ms |
| Full Recovery | 1530.35 ms |
| Threads Detected | 50/50 |
| Threads Recovered | 50/50 |
Analysis
Why is Network Partition slower?
TCP RST:
Client ←──[RST]── Server
└─ Immediate notification
Network Partition:
Client ──────X──── Server
└─ Must wait for timeout (default ~100ms in test)
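
The difference can be reproduced locally without Toxiproxy. The sketch below is an approximation, not the benchmark itself. One server closes its connection with an RST by setting `SO_LINGER` to zero, and the other simply stays silent, which stands in for a partition. The 100 ms read timeout mirrors the value mentioned above. The `struct.pack("ii", ...)` layout for `SO_LINGER` is an assumption that holds on Linux and macOS.

```python
import socket
import struct
import threading
import time
from typing import Tuple


def _serve_once(listener: socket.socket, mode: str, release: threading.Event) -> None:
    conn, _ = listener.accept()
    if mode == "rst":
        # Linger enabled with a zero timeout makes close() send RST instead of FIN.
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
        conn.close()
    else:
        # Silent peer: hold the connection open and send nothing.
        release.wait(timeout=5)
        conn.close()


def measure(mode: str, read_timeout_s: float = 0.1) -> Tuple[str, float]:
    """Return (how the client noticed the failure, elapsed milliseconds)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    release = threading.Event()
    server = threading.Thread(target=_serve_once, args=(listener, mode, release), daemon=True)
    server.start()

    client = socket.create_connection(listener.getsockname())
    client.settimeout(read_timeout_s)
    start = time.monotonic()
    try:
        data = client.recv(1)
        outcome = "eof" if data == b"" else "data"
    except ConnectionResetError:
        outcome = "reset"      # the kernel delivered the RST
    except socket.timeout:
        outcome = "timeout"    # nothing arrived; only the timer fired
    elapsed_ms = (time.monotonic() - start) * 1000.0

    release.set()
    client.close()
    server.join(timeout=1)
    listener.close()
    return outcome, elapsed_ms


if __name__ == "__main__":
    print("rst:", measure("rst"))
    print("silent:", measure("silent"))
```

With a silent peer the client can never detect the failure faster than its own read timeout, so the timeout setting puts a floor under detection latency for partitions.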
Target Metrics
| Metric | Target | Result |
|---|---|---|
| TCP RST Detection | < 10ms | ✅ 0.48ms |
| Network Partition Detection | < 200ms | ✅ 105ms |
| Full Recovery | < 5s | ✅ 1.5s |
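
A check like the one in this table is easy to automate. The sketch below uses the targets and results from the table, converted to milliseconds. The dictionary keys and the `check_targets` function are hypothetical names.

```python
from typing import Dict

# Targets from the table above, in milliseconds.
TARGETS_MS: Dict[str, float] = {
    "tcp_rst_detection": 10.0,
    "network_partition_detection": 200.0,
    "full_recovery": 5000.0,
}

# Results from the table above, in milliseconds.
RESULTS_MS: Dict[str, float] = {
    "tcp_rst_detection": 0.48,
    "network_partition_detection": 105.0,
    "full_recovery": 1500.0,
}


def check_targets(results_ms: Dict[str, float], targets_ms: Dict[str, float]) -> Dict[str, bool]:
    """Map each metric to True if its result is strictly below the target."""
    return {name: results_ms[name] < limit for name, limit in targets_ms.items()}


status = check_targets(RESULTS_MS, TARGETS_MS)
all_passed = all(status.values())
```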