Rishi · 7 min read

The Incident That Taught Me to Never Skip Staging — A Production Outage Story


It was a Tuesday afternoon, and the fix was one line of code.

A customer had reported that their invoice PDFs were showing the wrong tax rate. I traced it in under ten minutes — a config value that should have been 0.0825 was hardcoded as 0.08 in one of the service's environment-specific settings. Not even a code change. Just a config update.

I had shipped hundreds of config changes before. Staging seemed like overkill for this one. The change was trivial. The customer was frustrated. My manager was in a meeting and I wanted to have it fixed by the time they got back. So I pushed the config change directly to production.

That decision would cost me a full night of sleep and our team three days of incident response.

The First Sign

Nothing happened for about 45 minutes. I closed the support ticket, replied to some emails, and started working on a feature branch. Then Slack lit up.

@channel — Getting reports of 500 errors on the billing API. Anyone deploy something?

My stomach dropped. I checked the deployment logs — nothing had shipped except my config change. But the billing API was a different service. My change was to the invoice renderer. There was no way my config update caused this. Right?

Wrong.

The Cascade

Here is what actually happened. I did not piece it together until 2am.

The invoice renderer and the billing API shared an Azure App Configuration store. This is something I knew in the abstract but had never really internalized. When I pushed my config change through the Azure portal, I updated a key that the billing API also read at startup.

But here is the part that made it a real outage: the App Configuration store had a feature flag that was keyed similarly to my config entry. My change to tax.rate.default triggered a refresh event that the billing API interpreted as a configuration reload signal. During the reload, the billing API momentarily lost its database connection string — not because I changed it, but because the reload logic had a race condition that had existed for months. In dev and staging, the database was local and reconnected instantly. In production, the Azure SQL connection required re-authentication through a managed identity, which took 8-12 seconds.
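To make the trigger concrete, the refresh registration in the Azure App Configuration provider works roughly like the sketch below. This is a hedged reconstruction, not our actual code: the key name matches the story, but the cache interval and the `refreshAll` setting are assumptions about how the billing API was wired up.

```csharp
// Sketch: how a single registered key can trigger a full configuration
// reload in Azure App Configuration. Key name and timings are illustrative.
builder.Configuration.AddAzureAppConfiguration(options =>
{
    options.Connect(connectionString)
           .ConfigureRefresh(refresh =>
               // refreshAll: true means a change to this ONE key reloads the
               // entire configuration -- including keys that other services
               // sharing the store also read at runtime.
               refresh.Register("tax.rate.default", refreshAll: true)
                      .SetCacheExpiration(TimeSpan.FromSeconds(30)));
});
```

With a registration like this, any write to the sentinel key — from any service, through any tool, including the portal — fans out as a reload to every consumer of the store.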

During those 8-12 seconds, every billing request failed with a 500.

The retry storms from the frontend made it worse. Each failed request retried three times with no backoff. The billing API's thread pool saturated. The health check endpoint — which shared the same thread pool — started timing out. The load balancer marked the instance as unhealthy and removed it from rotation. Now all traffic hit the remaining two instances, which were already struggling. Within four minutes, all three instances were marked unhealthy.

Total billing API outage. From a config change to the invoice renderer.

The 3am Scramble

By 8pm, we had the billing API back up by manually restarting all three instances. But the damage was spreading. Failed billing calls had left transactions in a half-committed state. The order processing queue was backing up because orders could not complete without billing confirmation. Customers were seeing "payment processing" spinners that never resolved.

At 11pm, I was still at my desk writing a query to identify the stuck transactions. My teammate Priya joined from home. By 1am, we had the list — 847 transactions needed manual reconciliation.

At 3am, the on-call engineer for the payment gateway team called me. Their system was flagging our account for unusual retry patterns. We were close to hitting a rate limit that would have cut off payment processing entirely.

We fixed it by 4:30am. I drove home as the sun was coming up.

What Was Actually Broken

The postmortem took two days. We found five contributing factors:

  1. Shared configuration store without isolation. The invoice renderer and billing API should not have been reading from the same App Configuration namespace. A change to one service's config should never trigger a reload in another service.

  2. Race condition in configuration reload logic. The billing API's IOptionsMonitor callback disposed the current database connection before the new one was established. In local development, this was invisible because SQLite reconnects are instant. In production with Azure SQL and managed identity, the gap was fatal.

  3. No retry backoff on the frontend. Three immediate retries with no exponential backoff turned a 12-second blip into a sustained load spike that crashed the service.

  4. Health check shared the application thread pool. When the app was overloaded, the health check was overloaded too. The load balancer could not distinguish "app is busy" from "app is dead" and removed healthy-but-overloaded instances.

  5. No staging validation for config changes. We had a full CI/CD pipeline for code changes that went through staging. Config changes deployed through the portal bypassed all of it. There was no process, no gate, no smoke test.

None of these were new problems. They had existed for months. But they had never been triggered simultaneously. My config change was the match, not the dynamite.

The Process Changes

We came out of the postmortem with six concrete changes:

1. Config changes go through the pipeline

No more portal changes. Every configuration update is now a pull request to a config-as-code repository. It deploys through the same staging-then-production pipeline as application code.

# Config is now versioned and deployed like code
resources:
  repositories:
    - repository: config
      type: git
      name: myorg/app-config
      ref: main

2. Isolated configuration namespaces

Each service gets its own App Configuration namespace. Shared settings are explicitly replicated, not implicitly shared.
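A minimal sketch of what per-service isolation looks like with the App Configuration provider, assuming a key-prefix convention (the prefix names here are illustrative, not our real ones):

```csharp
// Sketch: the billing API selects only its own prefix, so a write under
// another service's prefix can never reach it. Prefix names are assumed.
builder.Configuration.AddAzureAppConfiguration(options =>
{
    options.Connect(connectionString)
           // Load only billing-api:* keys from the shared store...
           .Select("billing-api:*")
           // ...and strip the prefix so application code sees clean names.
           .TrimKeyPrefix("billing-api:");
});
```

Labels or entirely separate stores work too; the property that matters is that one service's writes are invisible to every other service by construction, not by convention.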

3. Resilient configuration reload

We rewrote the reload logic to keep the old connection alive until the new one is confirmed healthy:

// Before: dispose-then-create (broken)
_connection.Dispose();
_connection = CreateNewConnection(newConfig);

// After: create-then-swap (safe)
var newConnection = CreateNewConnection(newConfig);
await newConnection.OpenAsync(); // verify it works
var old = Interlocked.Exchange(ref _connection, newConnection);
old.Dispose();

4. Exponential backoff on all HTTP clients

Every service-to-service HTTP call now uses Polly with exponential backoff and circuit breaking. No more retry storms.
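The Polly setup looks roughly like the sketch below, assuming `IHttpClientFactory` with the Microsoft.Extensions.Http.Polly package. The retry counts and timings here are illustrative, not our production values:

```csharp
// Sketch: exponential backoff plus circuit breaking on a named HttpClient.
services.AddHttpClient("billing")
    // Retry transient failures at ~2s, 4s, 8s instead of three
    // instant retries -- no more retry storms during a blip.
    .AddPolicyHandler(HttpPolicyExtensions
        .HandleTransientHttpError()
        .WaitAndRetryAsync(3, attempt =>
            TimeSpan.FromSeconds(Math.Pow(2, attempt))))
    // After 5 consecutive failures, stop calling the service for
    // 30 seconds so it has room to recover.
    .AddPolicyHandler(HttpPolicyExtensions
        .HandleTransientHttpError()
        .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)));
```

Adding jitter to the backoff is a further refinement worth considering, since synchronized retries from many clients can themselves form a spike.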

5. Dedicated health check endpoint

Health checks run on a separate thread pool and check only liveness, not readiness. The load balancer no longer removes instances that are busy but functional.
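In ASP.NET Core terms, a liveness-only endpoint can be as small as the sketch below. The endpoint path is an assumption; the key point is that it registers no dependency checks, so it answers quickly even when the request pipeline is saturated. (ASP.NET Core does not expose a per-endpoint thread pool directly — common approximations include serving health checks from a separate Kestrel port or simply keeping them dependency-free, as here.)

```csharp
// Sketch: a liveness-only health endpoint. No database, queue, or
// downstream checks are registered, so "busy" never reads as "dead".
builder.Services.AddHealthChecks();

var app = builder.Build();
app.MapHealthChecks("/healthz/live");
```

Readiness — "can I actually serve traffic?" — belongs on a separate endpoint with its own policy, so the load balancer can tell the two states apart.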

6. Staging is not optional

This is the cultural change, and it was the hardest. We made it a team agreement: nothing reaches production without passing through staging first. Not code, not config, not infrastructure changes. No exceptions, no matter how small the change looks.

We enforced it technically by removing direct production deployment permissions from individual contributors. The only path to production is through the pipeline, and the pipeline always goes through staging.

The Lesson

The temptation to skip staging is always the same: "this change is too small to break anything." But outages are not caused by big, obvious changes. They are caused by small changes interacting with hidden assumptions. My one-line config update was correct. It did exactly what I intended. But it triggered a chain of failures that no one anticipated because no one had tested it in an environment that looked like production.

Staging is not a checkbox. It is the place where your assumptions meet reality. Skip it, and you are betting that you understand every interaction in your system perfectly. You do not. I did not. No one does.

The fix for the invoice tax rate? It worked perfectly. It just needed to go through staging first.
