Circuit Breakers: Stopping a Cascading Failure Before It Spreads
A recommendation service starts responding in 8 seconds instead of 80 milliseconds. It is not down — just slow, maybe a bad deploy or a struggling database. Your product-page service calls it on every request. Each request now holds a thread for 8 seconds waiting. Your thread pool fills. New requests queue, then time out. The product page, which only decorates itself with recommendations, is now fully down. Meanwhile the checkout service that shares a load balancer with the product page starts getting starved too. One slow, non-critical dependency just took out your storefront.
The circuit breaker exists to break this chain. It is named after the electrical kind for a reason: when current surges, the breaker trips and cuts the circuit so the house doesn't burn down. You flip it back on once things are safe.
The failure it prevents
The danger isn't the dependency being slow. The danger is your service waiting on it. Every thread parked on a slow call is a thread not serving other traffic. Retries make it worse: with three retries, every failing call becomes up to four attempts, so the struggling service can see as much as four times its normal load at the exact moment it can least handle it. That is a self-reinforcing death spiral, and it is how a localized problem becomes a cascading failure that crosses service boundaries.
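To put rough numbers on the retry problem, here is a back-of-the-envelope sketch. It assumes the worst case, where every attempt fails and every layer in the call chain retries independently; the function names are made up for illustration.

```python
def worst_case_attempts(retries: int) -> int:
    """Attempts one logical call can generate if every attempt fails."""
    return 1 + retries  # the original attempt plus each retry


def amplified_load(base_rps: float, retries: int, layers: int = 1) -> float:
    """Worst-case request rate reaching the deepest dependency.

    Assumption: each layer retries on its own, so the multiplier compounds.
    """
    return base_rps * worst_case_attempts(retries) ** layers


print(amplified_load(1000, retries=3))            # one retrying caller
print(amplified_load(1000, retries=3, layers=2))  # two stacked retrying layers
```

With a single retrying caller, 1,000 requests/second becomes up to 4,000. Stack two layers that each retry three times and the worst case is 16,000.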
A circuit breaker changes the default. Instead of "call the dependency and hope," it tracks the dependency's recent health and, when health is bad, fails immediately without making the call at all. Fast failure frees your threads, sheds load off the sick dependency, and lets you serve a degraded-but-alive experience.
Three states
The breaker is a small state machine wrapped around a remote call.
- Closed — normal operation. Calls pass through. The breaker counts failures.
- Open — tripped. Calls fail instantly (or return a fallback) without touching the dependency. After a cooldown, it moves to half-open.
- Half-open — probationary. A limited number of trial calls are allowed through. If they succeed, the breaker closes and traffic resumes. If they fail, it snaps back open and the cooldown restarts.
             failures exceed threshold
    CLOSED ─────────────────────────────▶ OPEN
      ▲                                    │
      │ trial calls succeed        cooldown elapses
      │                                    ▼
      └────────────── HALF-OPEN ◀──────────┘
            (one bad trial → back to OPEN)
The half-open state is the part people skip, and skipping it is a mistake. Without it you either stay open forever (and never recover) or slam the full load back onto a dependency that may still be sick. Half-open is a single careful knock on the door before you let everyone back in.
A minimal implementation
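import time


class CircuitBreaker:
    """Minimal breaker. Not thread-safe: add a lock before sharing across threads."""

    def __init__(self, fail_threshold=5, cooldown=30, trial_calls=1):
        self.fail_threshold = fail_threshold  # consecutive failures before tripping
        self.cooldown = cooldown              # seconds to stay open before probing
        self.trial_calls = trial_calls        # successes needed in half-open to close
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.trials = 0

    def call(self, fn, *args, fallback=None):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown:
                # Cooldown is over: let trial calls through.
                self.state = "half_open"
                self.trials = 0
            else:
                # Still cooling down: fail fast without touching the dependency.
                return self._reject(fallback)
        try:
            result = fn(*args)
        except Exception:
            self._on_failure()
            if fallback is not None:
                return fallback
            raise
        else:
            self._on_success()
            return result

    def _on_success(self):
        if self.state == "half_open":
            self.trials += 1
            if self.trials >= self.trial_calls:
                # Enough trial calls succeeded: resume normal traffic.
                self.state = "closed"
                self.failures = 0
        else:
            # Any success while closed resets the consecutive-failure count.
            self.failures = 0

    def _on_failure(self):
        self.failures += 1
        # One failed trial reopens the breaker; while closed, wait for the threshold.
        if self.state == "half_open" or self.failures >= self.fail_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()  # (re)start the cooldown

    def _reject(self, fallback):
        if fallback is not None:
            return fallback
        raise RuntimeError("circuit open")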
Production libraries — Resilience4j on the JVM, Polly in .NET, the breakers built into most service meshes — add rolling time windows, percentage-based thresholds, and metrics. But this is the whole idea. A breaker is not complicated; getting the thresholds right is the hard part.
Tuning is where people get it wrong
The defaults above are illustrative, not gospel. Bad thresholds make the breaker either useless or itself an outage:
- Too sensitive and the breaker trips on a normal traffic blip, cutting off a healthy dependency and causing the outage it was meant to prevent.
- Too lax and it never trips when it should, so you get no protection.
Prefer a failure rate measured over a rolling window ("trip if more than 50% of the last 20 calls failed") to a raw count, because a flat count behaves wildly differently at 10 requests/second versus 10,000. And count slow calls as failures, not just errors: a dependency returning 200 OK in 9 seconds is doing more damage than one returning a clean 503 in 5 milliseconds. The whole point is to protect your threads; a slow success still parks a thread.
Fallbacks: what to do while the breaker is open
Failing fast is only half the value. The other half is what you return instead. Good fallbacks degrade gracefully:
- Serve a cached or slightly stale version of the data.
- Return a sensible default (an empty recommendations list, not an error page).
- Drop the optional feature entirely and render the rest of the page.
The principle: a non-critical dependency should never be able to fail a critical path. If recommendations are down, the product page still loads — just without recommendations. Decide the fallback deliberately at design time, because the worst time to discover you have no fallback is when the breaker has just tripped in production.
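As a rough sketch of the first two options combined, the helper below remembers the last good response and serves it when the live call fails, falling back to an empty list if nothing usable is cached. `StaleCache`, `recommendations_for`, and the `fetch` callable are hypothetical names. In practice `fetch` would be the breaker-wrapped remote call.

```python
import time


class StaleCache:
    """Keeps the last good value per key so it can be served as a fallback."""

    def __init__(self, max_age=300.0):
        self.max_age = max_age  # seconds a stale value is still acceptable
        self._store = {}        # key -> (value, stored_at)

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get(self, key, default):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, stored_at = entry
        if time.monotonic() - stored_at > self.max_age:
            return default  # too old to be worth showing
        return value


def recommendations_for(product_id, fetch, cache):
    """Return live recommendations, else a stale copy, else an empty list."""
    try:
        recs = fetch(product_id)  # may raise, e.g. when the circuit is open
    except Exception:
        return cache.get(product_id, default=[])
    cache.put(product_id, recs)   # refresh the fallback on every success
    return recs
```

The caller never sees the exception, so the product page renders either way.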
Where it fits — and where it doesn't
Circuit breakers belong on every cross-service and cross-network boundary: calls to other microservices, third-party APIs, and anything else that can be slow or unavailable independently of you. Pair them with timeouts (so a call can't hang forever in the first place) and bulkheads (isolated thread pools per dependency, so one saturated pool can't starve the others). The three together — timeout, bulkhead, breaker — are the standard kit for resilient service-to-service calls.
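The timeout and bulkhead pieces can be sketched with the standard library alone. This is an illustration under assumptions, not a production design: `Bulkhead` is a hypothetical name, each dependency would get its own instance, and the timeout only releases the caller. Python cannot kill the worker thread, so the underlying client should also set its own socket timeout.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class Bulkhead:
    """A dedicated, bounded worker pool for a single dependency."""

    def __init__(self, max_concurrent=4):
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, timeout=1.0):
        # Reject immediately when every slot is busy instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full")
        try:
            future = self._pool.submit(self._run, fn, *args)
        except BaseException:
            self._slots.release()  # submission failed, so give the slot back
            raise
        # Raises concurrent.futures.TimeoutError if the call overruns.
        return future.result(timeout=timeout)

    def _run(self, fn, *args):
        try:
            return fn(*args)
        finally:
            # The slot frees only when the work really ends, so calls that
            # timed out keep counting against this dependency's budget.
            self._slots.release()
```

A caller that hits the timeout or the "bulkhead full" error gets an exception right away, which is exactly the kind of failure a breaker wrapped around `Bulkhead.call` would count.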
They do not help with in-process logic, a slow local computation, or a fundamentally overloaded system where every dependency is the problem. A breaker contains a localized failure; it cannot conjure capacity you do not have. But for the overwhelmingly common case — one dependency goes bad and threatens to take everything down with it — the circuit breaker is the difference between a degraded feature and a front-page outage.