Rishi · 9 min read

Monitoring What Matters: Setting Up Alerts That Don't Cry Wolf

Your team has 340 active alerts. Twelve fired last night. Nobody looked at any of them. The on-call engineer's phone buzzed, they glanced at the notification, saw it was the same "CPU above 80%" alert that fires every night during the batch job, and went back to sleep. This morning, a customer reported that the payment API has been returning errors since 2am. There was an alert for it — buried in the noise.

This is alert fatigue, and it is the most dangerous failure mode in operations. It is not that you lack monitoring. It is that your monitoring has trained your team to ignore it.

Let's fix that.

The Four Golden Signals

Google's SRE book identified four signals that matter for any service. If you monitor nothing else, monitor these:

| Signal | What It Measures | Example |
|---|---|---|
| Latency | Time to serve a request | P95 response time > 500ms |
| Traffic | Demand on your system | Requests per second |
| Errors | Rate of failed requests | HTTP 5xx > 1% of total traffic |
| Saturation | How full your resource is | Memory > 90%, connection pool > 80% |

Most teams over-index on saturation (CPU alerts, disk alerts, memory alerts) and under-index on everything else. But saturation is a leading indicator — it tells you something might break. Errors and latency are the signals that tell you something is broken. Prioritize accordingly.

Designing Thresholds That Mean Something

A threshold is only useful if crossing it means something has changed and action is required. Here is how to set them:

Step 1: Establish your baseline

Before you set any threshold, you need to know what normal looks like. Run your system for two weeks and observe:

  • What is the normal P95 latency? Not the average — the P95. Averages hide problems
  • What is the normal error rate? Most systems have a non-zero baseline error rate from bots, bad clients, and transient issues
  • What does traffic look like by hour, by day, by week? A traffic drop at 2am on Sunday is normal. A traffic drop at 2pm on Tuesday is an outage
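The baseline step comes down to computing percentiles over collected samples. A minimal sketch, using only the standard library and hypothetical latency values (in practice you would pull these from your metrics store):

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers (0 < pct <= 100)."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latency samples in milliseconds, including one outlier.
latencies_ms = [180, 190, 200, 205, 210, 220, 195, 850, 198, 202]

avg = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

# The average is dragged up only slightly by the outlier;
# the P95 exposes it directly.
print(f"avg={avg:.0f}ms  p50={p50}ms  p95={p95}ms")
```

One slow request out of ten barely moves the average, but it dominates the P95, which is exactly why the baseline should be built on percentiles.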

Step 2: Set thresholds above the noise floor

If your normal P95 latency is 200ms and it spikes to 250ms during peak hours, an alert at 300ms will fire during every peak. Set it at a level that indicates genuine degradation:

# Bad: fires during normal peak
P95 latency > 300ms for 5 minutes

# Good: fires when something is actually wrong
P95 latency > 500ms for 5 minutes

The key question: would I take action if this fires? If the answer is "probably not," raise the threshold or delete the alert.

Step 3: Use sustained conditions, not spikes

A single data point above threshold is noise. Require the condition to persist:

// Azure Monitor alert rule (KQL)
requests
| where timestamp > ago(10m)
| summarize
    p95_duration = percentile(duration, 95),
    error_rate = countif(success == false) * 100.0 / count()
| where p95_duration > 500 or error_rate > 2.0

The 10-minute window means a single slow request does not page anyone. A sustained problem does.
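The same sustained-not-spiked logic can be sketched outside of any query language. A minimal Python version (class and parameter names are illustrative) that fires only when every recent evaluation breached the threshold:

```python
from collections import deque

class SustainedAlert:
    """Fire only when a condition holds for `window` consecutive checks.

    A single breach is treated as noise; `window` breaches in a row
    (e.g. ten one-minute evaluations) indicate a sustained problem.
    """

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def observe(self, value):
        self.recent.append(value > self.threshold)
        # Fire only when the buffer is full and every sample breached.
        return len(self.recent) == self.recent.maxlen and all(self.recent)

alert = SustainedAlert(threshold=500, window=3)
# One spike, then three sustained breaches, then recovery:
print([alert.observe(v) for v in [900, 520, 510, 200]])
# → [False, False, True, False]
```

The first spike at 900ms does not fire; only the third consecutive breach does, and a single healthy sample resets the condition.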

Severity Levels and Routing

Not every alert deserves the same response. Define clear severity levels and route them appropriately:

| Severity | Meaning | Response | Channel |
|---|---|---|---|
| Sev 0 | Customer-facing outage | Immediate page, all hands | PagerDuty phone call |
| Sev 1 | Degraded experience | On-call responds within 15 min | PagerDuty push notification |
| Sev 2 | Potential issue developing | Investigate during business hours | Slack #ops-alerts |
| Sev 3 | Informational / trend | Review in weekly ops meeting | Slack #ops-metrics |

The routing matters as much as the threshold. A Sev 2 alert that pages someone at 3am will erode trust in the system just as fast as a noisy threshold.
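One way to keep routing consistent is to make it a lookup rather than a per-alert decision. A sketch with illustrative channel names:

```python
# Map severity levels to notification channels. Channel names here
# are examples; substitute your own integrations.
ROUTES = {
    0: "pagerduty-phone",   # customer-facing outage: call
    1: "pagerduty-push",    # degraded experience: push notification
    2: "slack-ops-alerts",  # investigate during business hours
    3: "slack-ops-metrics", # review in weekly ops meeting
}

def route(severity):
    try:
        return ROUTES[severity]
    except KeyError:
        # An unknown severity is a configuration bug; fail loudly
        # rather than silently dropping the alert.
        raise ValueError(f"unknown severity: {severity}")

print(route(2))  # → slack-ops-alerts
```

With a single routing table, a Sev 2 alert physically cannot reach the paging channel unless someone changes the table, which is the kind of change that shows up in code review.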

The 3am Test

For every alert, ask: "If this fires at 3am, is it worth waking someone up?"

  • "CPU is at 85%" — No. It might come back down. Create a ticket
  • "Error rate jumped from 0.5% to 15%" — Yes. Customers are failing
  • "Disk is at 90%" — Maybe. How fast is it growing? If it'll hit 100% by morning, yes. If it's been at 90% for a week, no
  • "Zero traffic on the payment API" — Absolutely yes. Something is very wrong

If an alert is not worth waking someone up, it should not be routed to PagerDuty. Period.
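The disk question above ("will it hit 100% by morning?") is answerable with arithmetic rather than judgment. A sketch, assuming you can sample usage at two points in time:

```python
def hours_until_full(pct_now, pct_earlier, hours_between):
    """Linear projection of hours until disk usage reaches 100%.

    Returns float('inf') if usage is flat or shrinking.
    """
    growth_per_hour = (pct_now - pct_earlier) / hours_between
    if growth_per_hour <= 0:
        return float("inf")
    return (100 - pct_now) / growth_per_hour

# Disk went 84% -> 90% over 6 hours: 1%/hour, so 10 hours to full. Page.
print(hours_until_full(90, 84, 6))    # → 10.0
# Flat at 90% for a week: not urgent. Ticket.
print(hours_until_full(90, 90, 168))  # → inf
```

Page only when the projected time to full is shorter than the time until someone would look at a ticket anyway; otherwise file the ticket.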

Composite Alerts

Single-metric alerts often lack context. Composite alerts combine signals to reduce false positives:

# Single metric (noisy):
Alert if: CPU > 80%

# Composite (meaningful):
Alert if: CPU > 80% AND response_time_p95 > 500ms AND error_rate > 2%

The composite version fires only when high CPU is actually causing user impact. CPU at 85% during a batch job with normal response times? No alert. CPU at 85% with errors spiking? That is real.
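The AND logic is trivial to express in code. A sketch of the evaluation (metric names and thresholds are illustrative):

```python
def should_alert(metrics):
    """Composite check: high CPU alone is not actionable; high CPU
    plus slow responses plus elevated errors indicates user impact."""
    return (
        metrics["cpu_pct"] > 80
        and metrics["p95_ms"] > 500
        and metrics["error_rate_pct"] > 2.0
    )

# Batch job: CPU hot, users unaffected -> no page.
print(should_alert({"cpu_pct": 85, "p95_ms": 220, "error_rate_pct": 0.4}))  # → False
# CPU hot and customers failing -> page.
print(should_alert({"cpu_pct": 85, "p95_ms": 900, "error_rate_pct": 6.0}))  # → True
```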

In Azure Monitor, a single alert rule can evaluate multiple criteria that must all be met, and Application Insights smart detection automatically correlates related signals for you.

Here is a practical composite alert using Log Analytics:

let latencyIssue = requests
| where timestamp > ago(5m)
| summarize p95 = percentile(duration, 95)
| where p95 > 500
| extend joinKey = 1;

let errorIssue = requests
| where timestamp > ago(5m)
| summarize errorRate = countif(success == false) * 100.0 / count()
| where errorRate > 2.0
| extend joinKey = 1;

// A row is produced only when both subqueries found a breach.
latencyIssue
| join kind=inner errorIssue on joinKey
| project AlertTime = now(), P95_ms = p95, ErrorRate = errorRate

Anomaly Detection vs Static Thresholds

Static thresholds work well for signals with predictable baselines. But some metrics have patterns — traffic follows business hours, batch jobs run at specific times, usage varies by day of week.

For these, anomaly detection (also called dynamic thresholds in Azure Monitor) learns the pattern and alerts when the metric deviates from its expected range:

  • Traffic drops 60% compared to the same hour last week? Alert
  • Traffic drops 60% at 11pm on a Saturday? Expected — no alert
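A crude version of that week-over-week comparison can be sketched directly; the 60% cutoff and values are illustrative, and managed dynamic thresholds learn the expected band instead of hard-coding it:

```python
def traffic_anomaly(current_rps, same_hour_last_week_rps, max_drop_pct=60):
    """Alert when traffic drops more than max_drop_pct versus the
    same hour last week. Comparing like-for-like hours means a quiet
    Saturday night is judged against last Saturday night, not a
    weekday peak."""
    if same_hour_last_week_rps == 0:
        return False  # no baseline to compare against
    drop_pct = (1 - current_rps / same_hour_last_week_rps) * 100
    return drop_pct > max_drop_pct

print(traffic_anomaly(100, 1000))  # 90% drop vs last week → True
print(traffic_anomaly(45, 60))     # mild dip in a quiet hour → False
```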

Azure Monitor dynamic thresholds support three sensitivities:

| Sensitivity | False Positives | Missed Alerts |
|---|---|---|
| High | More | Fewer |
| Medium | Balanced | Balanced |
| Low | Fewer | More |

Start with Medium and adjust based on alert quality. If you are getting false positives, lower the sensitivity. If you are missing real issues, raise it.

Runbooks: Every Alert Needs a Response Plan

An alert without a runbook is just anxiety. When the on-call engineer gets paged at 3am, they should not be improvising. Every alert should link to a runbook that answers:

  1. What does this alert mean? Plain language, not metric names
  2. What is the customer impact? Is anyone affected right now?
  3. What are the common causes? Top 3 reasons this alert fires
  4. What are the immediate steps? Check X, run Y, restart Z
  5. When to escalate? If step 4 does not resolve it, who to call
  6. How to verify resolution? What metric should you watch to confirm the fix

Store runbooks in your wiki, linked directly from the alert definition. In Azure Monitor, create an action group so every alert on a service routes through the same channels:

az monitor action-group create \
  --resource-group rg-monitoring \
  --name 'PaymentAPI-Sev1' \
  --short-name 'PayAPISev1' \
  --action webhook oncall-webhook 'https://hooks.pagerduty.com/...' \
  --action email ops-team 'ops@company.com'

Include the runbook link in the alert's custom properties so it appears in every notification.

A Real Alert Setup

Here is what a well-designed alerting configuration looks like for a payment API:

| Alert | Threshold | Window | Severity | Route |
|---|---|---|---|---|
| Error rate > 5% | 5% of requests return 5xx | 5 min sustained | Sev 0 | PagerDuty call |
| Error rate > 1% | 1% of requests return 5xx | 10 min sustained | Sev 1 | PagerDuty push |
| P95 latency > 2s | 95th percentile > 2000ms | 10 min sustained | Sev 1 | PagerDuty push |
| P95 latency > 800ms | 95th percentile > 800ms | 15 min sustained | Sev 2 | Slack |
| Zero traffic | 0 requests received | 3 min sustained | Sev 0 | PagerDuty call |
| Connection pool > 80% | Active connections > 80% of max | 10 min sustained | Sev 2 | Slack |
| Dependency failure rate > 10% | Calls to downstream services failing | 5 min sustained | Sev 1 | PagerDuty push |

Notice what is not on this list: CPU utilization, memory usage, disk I/O. Those are on a dashboard for investigation, but they do not page anyone. They are causes, not symptoms. Alert on symptoms — the things your customers feel.

Reducing Alert Count

If you currently have hundreds of alerts, here is a practical cleanup process:

  1. Export every alert that fired in the last 30 days
  2. Categorize each as: led to action, was investigated and dismissed, or was ignored
  3. Delete every alert that was ignored more than 3 times. If no one acts on it, it is noise
  4. Raise thresholds on alerts that were investigated and dismissed. The threshold is too sensitive
  5. Keep alerts that led to action. These are your real signals
  6. Review monthly. Alert quality degrades over time as systems change

Target: fewer than 20 active alerts per service. If you need more, your service is too complex or your alerts are too granular.
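The cleanup steps above amount to a triage function. A sketch over a hypothetical export of 30-day alert history (field names and cutoffs are illustrative):

```python
def triage(alert):
    """Classify an alert from 30 days of firing history.

    `alert` has 'fired', 'actioned', and 'dismissed' counts;
    anything fired but neither actioned nor dismissed was ignored.
    """
    ignored = alert["fired"] - alert["actioned"] - alert["dismissed"]
    if alert["actioned"] > 0:
        return "keep"              # real signal
    if ignored > 3:
        return "delete"            # pure noise
    if alert["dismissed"] > 0:
        return "raise-threshold"   # too sensitive
    return "review"

history = [
    {"name": "payment-errors", "fired": 2,  "actioned": 2, "dismissed": 0},
    {"name": "cpu-80pct",      "fired": 30, "actioned": 0, "dismissed": 0},
    {"name": "latency-300ms",  "fired": 8,  "actioned": 0, "dismissed": 8},
]
for a in history:
    print(a["name"], "->", triage(a))
```

Running this monthly over your real alert export keeps the review in step 6 from becoming a judgment call each time.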

The Takeaway

Good alerting is not about catching everything. It is about catching the right things and routing them to the right people at the right urgency. Every alert should pass the 3am test. Every alert should have a runbook. And every alert that nobody acts on should be deleted.

Your monitoring system is not a safety net if your team has learned to ignore it. Fix the signal-to-noise ratio, and the monitoring starts working again.
