May 12, 2026 · 8 min read

The Incidents That Hide Between Alerts

The worst outages don't trip a threshold. A guide to missed incidents: silent services, slow cascades, new errors, latency drift — and how to catch them.

Tags: detection, incidents, production, observability, alerting

The incidents that don't alert are the ones that cost the most. The dashboard was green, the error rate was within tolerance, nothing paged — and the outage still ran for four hours and lost you a customer. That's not bad luck. It's the predictable result of a monitoring model that only catches the failures you thought to write a rule for.

Threshold alerts are good at one job: telling you when a number you already named crosses a line you already drew. Latency over 500ms, pages. Error rate over 5%, pages. CPU over 80%, pages. The problem is that the expensive incidents almost never look like that. They live in the gaps — in the *absence* of normal behavior, in slow drift that never crosses the line, in errors that didn't exist yesterday, so no rule could match them.

Here are the failure classes we see hide between the alerts. None of them reliably trip a static threshold. All of them are catchable, if your detection layer is watching for change and not just for limits.

The error that has never appeared before

At 3:14am a line drops into the auth service logs: connection pool exhausted after 30s waiting for slot. Twelve times in three minutes. Then it spreads. By morning, logins are silently failing for 4% of users.

Nobody had a threshold for this, because nobody could. The message didn't exist in the codebase until a deploy three hours earlier. There's no historical baseline, no rule that matches, and twelve errors in three minutes won't move a count-based alarm. The signal isn't the volume. It's the novelty. This specific pattern has *never existed before*.

Catching it means fingerprinting every error message, normalizing out the parts that change (IDs, timestamps, ports), and flagging a new fingerprint within minutes of its first occurrence, not after it scales into a count threshold. We go deep on the mechanics in "Catch New Errors in Production Before Your Users Do."
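A minimal sketch of that fingerprinting step, in Python. The normalization rules, the `NoveltyDetector` class, and the in-memory set are assumptions made for illustration; they are not a description of how Epok implements it, and a real system would persist fingerprints and use a richer rule set.

```python
import hashlib
import re

# Illustrative normalization rules (assumptions, not a complete set).
# Each replaces a part of the message that changes between occurrences.
# Order matters: specific patterns run before the catch-all digit rule.
NORMALIZERS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<ts>"),
    (re.compile(r"\b[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\b", re.I), "<uuid>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<addr>"),
    (re.compile(r"\d+"), "<n>"),
]


def normalize(message: str) -> str:
    """Replace variable parts with placeholders and collapse whitespace."""
    for pattern, placeholder in NORMALIZERS:
        message = pattern.sub(placeholder, message)
    return " ".join(message.split())


def fingerprint(service: str, message: str) -> str:
    """Hash the service name plus the normalized message shape."""
    key = f"{service}|{normalize(message)}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


class NoveltyDetector:
    """Reports a fingerprint the first time it is seen, and never again."""

    def __init__(self) -> None:
        self._seen = set()

    def observe(self, service: str, message: str) -> bool:
        fp = fingerprint(service, message)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


detector = NoveltyDetector()
first = detector.observe("auth", "connection pool exhausted after 30s waiting for slot")
repeat = detector.observe("auth", "connection pool exhausted after 45s waiting for slot")
```

Here `first` is true and `repeat` is false: the wait time differs, but both messages normalize to the same shape, so only the first occurrence counts as novel.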

The service that went quiet

A worker that processes Stripe webhooks gets OOM-killed at midnight. The kernel takes it down before the app can write anything, so there's no error log. The logs just stop. Six hours later, support starts asking about missing receipts.

This is the deadliest class, because zero is technically within tolerance. No error rate to breach, no spike to catch. Absence is the failure.

And absence is multi-signal. A service can go dark in three ways at once: its log cadence drops to nothing, its trace throughput flatlines, and the exporter that used to publish its metrics stops reporting. Watching only one of those leaves the other two as blind spots: a service can stop emitting traces while still logging heartbeats, or stop logging while a sidecar keeps metrics flowing. Detection also has to be per-service and per-time-of-day, or you page on the 59 minutes of expected silence between hourly batch runs. We unpack absence detection in "Silent Failures: The Bug That Won't Page You."
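One way to picture per-service, per-time-of-day absence detection is the Python sketch below. Everything in it is an assumption for illustration: the `SilenceDetector` name, the three signal labels, the one-hour bucket width, and the use of a median as the expected volume. Windows shorter than an hour would need a finer bucket key.

```python
from collections import defaultdict
from statistics import median

SIGNALS = ("logs", "traces", "metrics")


class SilenceDetector:
    """Flags signals that are at zero now but normally active at this hour."""

    def __init__(self, min_expected: float = 1.0) -> None:
        # (service, signal, hour_of_day) -> historical event counts per hour
        self._history = defaultdict(list)
        self.min_expected = min_expected

    def learn(self, service: str, signal: str, hour: int, count: int) -> None:
        self._history[(service, signal, hour)].append(count)

    def silent_signals(self, service: str, hour: int, current: dict) -> list:
        quiet = []
        for signal in SIGNALS:
            past = self._history.get((service, signal, hour))
            if not past:
                continue  # no baseline yet, so silence is not evidence
            expected = median(past)
            if current.get(signal, 0) == 0 and expected >= self.min_expected:
                quiet.append(signal)
        return quiet


detector = SilenceDetector()
for _day in range(7):
    for hour in range(24):
        detector.learn("webhook-worker", "logs", hour, 120)
        detector.learn("webhook-worker", "traces", hour, 40)
        detector.learn("webhook-worker", "metrics", hour, 60)
        # A nightly job that only ever logs during the 02:00 hour.
        detector.learn("nightly-batch", "logs", hour, 500 if hour == 2 else 0)

worker_quiet = detector.silent_signals(
    "webhook-worker", 0, {"logs": 0, "traces": 0, "metrics": 60}
)
batch_quiet = detector.silent_signals("nightly-batch", 14, {"logs": 0})
```

In this invented data the worker is flagged on logs and traces even though a sidecar still reports metrics, while the nightly job's silence at 14:00 is not flagged because zero is its normal volume for that hour.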

The latency that crept

Your checkout API runs at p99 80ms most days. Over three hours it drifts — 140ms, then 200ms, then 240ms. Your alert fires at 500ms, set high on purpose so nobody gets paged every Friday afternoon.

By the time 240ms becomes 500ms, you've lost hours of conversion and customers are filing tickets. Then you spend forty minutes diffing commits to find what changed.

A fixed threshold can't catch drift, because drift is defined by *what's normal for this service right now*, not by an absolute number. 240ms on a Tuesday at 2pm can be a five-sigma event for a checkout API that normally sits at 80ms. Nowhere near 500ms, and badly wrong. The fix is a baseline that knows your weekly and daily shape, evaluated against the latency you actually see in metrics and trace spans, with the deploy that introduced the regression linked to the anomaly automatically.
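To make the contrast with a fixed threshold concrete, here is a small Python sketch that scores a latency reading against a baseline keyed by weekday and hour. The `SeasonalBaseline` class and the eight historical p99 values are invented for the example; a production baseline would use far more samples and probably a more robust statistic than mean and standard deviation.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev
from typing import Optional


class SeasonalBaseline:
    """Keeps past p99 latencies per (weekday, hour) slot."""

    def __init__(self) -> None:
        self._samples = {}  # (weekday, hour) -> list of p99 values in ms

    def record(self, when: datetime, p99_ms: float) -> None:
        self._samples.setdefault((when.weekday(), when.hour), []).append(p99_ms)

    def sigmas_above(self, when: datetime, p99_ms: float) -> Optional[float]:
        """Standard deviations above the slot's mean, or None without history."""
        samples = self._samples.get((when.weekday(), when.hour), [])
        if len(samples) < 2:
            return None
        sigma = stdev(samples)
        if sigma == 0:
            return None
        return (p99_ms - mean(samples)) / sigma


now = datetime(2026, 5, 12, 14, 0)  # a Tuesday at 2pm
baseline = SeasonalBaseline()
# Invented p99 readings for the same slot over the previous eight weeks.
for weeks_ago, p99 in enumerate([55, 110, 70, 105, 80, 50, 115, 55], start=1):
    baseline.record(now - timedelta(weeks=weeks_ago), p99)

STATIC_THRESHOLD_MS = 500
for observed in (140, 200, 240):
    score = baseline.sigmas_above(now, observed)
    fires = observed > STATIC_THRESHOLD_MS
    print(f"{observed}ms: {score:.1f} sigma, static alert fires: {fires}")
```

With this sample history the 240ms reading scores above five standard deviations while the static 500ms alert stays silent for all three readings.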

The cascade that fired seven pages

The auth service starts returning 5xx on 2% of requests. Seven services that depend on it begin failing downstream: checkout can't validate sessions, the user service can't refresh profiles, notifications can't resolve recipients. Seven separate alerts fire within ninety seconds. The on-call now has seven pages, seven Slack threads, and seven dashboards open, trying to work out which alerts are symptoms and where the root actually is.

The trap is that the loudest service is rarely the cause. Sort by error count and you'll debug the victim while the culprit keeps burning. Worse, you might restart a downstream service when the root is upstream — which often makes the cascade worse.

Two things break this. First, the seven pages need to collapse into one incident, grouped by what they share (overlapping trace IDs, timing, correlated failure signatures), so you triage one thing instead of seven. That's the alert-storm problem; we cover the grouping mechanics in "When Fifty Alerts Are One Incident." Second, the incident needs a cited root cause: which service moved first, who points at whom in the error text, and what the dependency edges in the traces say. The service named in everyone else's errors, whose metrics inflected before the downstream noise, is the suspect, even with a fraction of the error volume. That's multi-signal root cause.
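Both steps can be sketched in a few lines of Python. This is a deliberately simplified illustration: the `Alert` and `Incident` types, the 120-second window, the greedy single-pass grouping, and the substring match on service names are all assumptions, and it uses only the "who points at whom" signal, not timing of metric inflections or trace dependency edges.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    fired_at: float  # seconds since an arbitrary start
    trace_ids: frozenset
    message: str


@dataclass
class Incident:
    alerts: list = field(default_factory=list)
    trace_ids: set = field(default_factory=set)


def group_alerts(alerts, window_s: float = 120.0):
    """Greedily merge alerts that are close in time and share a trace ID."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        for inc in incidents:
            close = alert.fired_at - inc.alerts[0].fired_at <= window_s
            if close and inc.trace_ids & alert.trace_ids:
                inc.alerts.append(alert)
                inc.trace_ids |= alert.trace_ids
                break
        else:
            incidents.append(Incident([alert], set(alert.trace_ids)))
    return incidents


def suspect(incident, known_services):
    """Pick the service named in the most other services' error text."""
    mentions = {}
    for svc in known_services:
        accusers = {
            a.service for a in incident.alerts
            if a.service != svc and svc in a.message
        }
        mentions[svc] = len(accusers)
    if not mentions:
        return None
    best = max(mentions, key=mentions.get)
    return best if mentions[best] > 0 else None


alerts = [
    Alert("checkout", 10.0, frozenset({"t1", "t2"}),
          "session validation failed: auth-svc returned 503"),
    Alert("user-svc", 25.0, frozenset({"t2", "t3"}),
          "profile refresh failed: auth-svc returned 503"),
    Alert("notifications", 40.0, frozenset({"t3"}),
          "cannot resolve recipients: auth-svc timeout"),
    Alert("search", 50.0, frozenset({"t9"}), "index lag above limit"),
]
services = ["auth-svc", "checkout", "user-svc", "notifications", "search"]
incidents = group_alerts(alerts)
root = suspect(incidents[0], services)
```

In this invented example the three related alerts collapse into one incident, the unrelated `search` alert stays separate, and `root` is `auth-svc` even though it raised no alert of its own. A greedy pass like this will not merge two incidents that a later alert bridges; a union-find over shared trace IDs would handle that case.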

The "transient" that fires every Tuesday at 4am

An alert has fired every Tuesday at 4:00am for eight weeks. Same service, same signature, gone in eight minutes before anyone looks. It shows up as "resolved" in the morning review and gets waved off as a blip.

It's not a blip. It's a pattern: same hour, same duration, same signature, eight Tuesdays running. And it almost always traces to a scheduled job, a backup window, or a cron-driven side effect nobody connected to the symptom. Detection that remembers recurrence surfaces it as one recurring incident with its full history, instead of eight isolated transients that each looked harmless alone. Small recurring incidents are how serious problems hide, right up until the day they stop auto-resolving.
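Remembering recurrence can be as simple as bucketing resolved alerts by signature, weekday, and hour. The Python sketch below assumes alerts arrive as `(signature, fired_at)` pairs and that three occurrences on distinct dates are enough to call something recurring; both the shape of the input and the cutoff are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recurring(events, min_occurrences: int = 3):
    """Return {(signature, weekday, hour): sorted timestamps} for repeat slots."""
    slots = defaultdict(list)
    for signature, fired_at in events:
        slots[(signature, fired_at.weekday(), fired_at.hour)].append(fired_at)
    return {
        slot: sorted(times)
        for slot, times in slots.items()
        if len({t.date() for t in times}) >= min_occurrences
    }


latest = datetime(2026, 5, 12, 4, 0)  # a Tuesday at 4:00am
events = [("db-conn-timeout", latest - timedelta(weeks=i)) for i in range(8)]
events.append(("disk-full", datetime(2026, 5, 7, 13, 30)))  # a one-off

patterns = recurring(events)
```

Here `patterns` holds a single slot, `("db-conn-timeout", 1, 4)`, carrying all eight timestamps (weekday 1 is Tuesday in Python's numbering), while the one-off alert is left out. Hour buckets are coarse: an alert that fires at 3:59 one week and 4:00 the next lands in different slots, so a real implementation would want some tolerance around the boundary.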

Why detection has to come first

There's a tempting shortcut: bolt an LLM onto a search bar and call it AI observability. It's genuinely useful for *explaining* an incident. It can't find one that never surfaced, though. An assistant on top of search only sees what your detection layer already flagged. There's no error message to summarize for a service that went silent. No anomaly to explain in latency that never crossed a threshold. No "what changed" to ask about a deploy that broke something subtle.

Detection happens first. Explanation comes after. Get that order wrong and the AI just narrates the same incidents you were already going to triage — and stays blind to the ones in this guide.

We built Epok around that order. Point it at your telemetry (logs, metrics, traces, infrastructure, RUM) and it learns what normal looks like per service, then watches for *change*: new error fingerprints within minutes of a deploy, services going quiet across log, trace, and metric signals, latency drifting off its weekly baseline, cascades collapsed into one cited incident. No rules to write, no thresholds to tune. When something fires, it arrives with the related signals already grouped and each claim in the root cause linked to the exact log line, span, or metric behind it.

The incidents that cost you most aren't the ones that page. They're the ones that don't. Start a 14-day trial — every feature, no rules to write — at getepok.dev (verify current pricing at getepok.dev/pricing).

FAQ

What are "missed incidents" in observability?

Missed incidents are production failures that don't trip a configured threshold alert, so nothing pages. The common classes are silent services (a process dies and its signal just stops), slow cascades (an upstream fault that surfaces as symptoms in downstream services), brand-new errors (a message with no historical baseline), and latency drift (a slow climb that never crosses a fixed line). They're the failures static thresholds and dashboards are structurally unable to catch.

Why don't threshold alerts catch these incidents?

A threshold fires when a named metric crosses a fixed line. That works for failures you can predict and size in advance, but it can't catch absence (zero errors looks healthy), novelty (no baseline exists for an error that's never appeared), or drift (the number is climbing but hasn't crossed the line yet). Catching those requires comparing current behavior against a learned baseline of what's normal for each service — not a static limit.

What's the difference between detection-first and search-first observability?

Search-first tools store everything and hand you a query bar; you have to know what to look for and go hunting. Detection-first tools watch your signals continuously, learn each service's baseline, and tell you when something deviates — including absence and drift — before you think to ask. Search explains incidents you already found; detection finds the ones you didn't.

Can AI on top of log search find incidents that didn't alert?

Not on its own. An LLM layered over search can only summarize what your detection layer surfaced to it, so it inherits the same blind spots: a silent service has no log line to summarize, and drift that never crossed a threshold produces no anomaly to explain. The detection has to find the incident first; the AI is most useful for explaining it afterward.

How does Epok catch incidents that don't trip a threshold?

Epok learns a per-service, per-time-of-day baseline across logs, metrics, traces, infrastructure, and RUM, then flags deviations from it: new error fingerprints, services going quiet across multiple signals, latency drift off the weekly shape, and cascades grouped into a single incident. First alerts land within minutes; full anomaly coverage builds over the first few days as baselines learn your traffic. There are no rules to write or thresholds to tune.

