epok

The Incidents That Hide Between Alerts

Six classes of production failure that don't trip threshold alerts and don't show up in AI-summarized log feeds — but cost engineering teams real money every week. What Epok catches that the rest of your observability stack will quietly miss.

detection, incidents, production, observability, alerting

Every team running production has the same conversation at the post-mortem: "Why didn't anything fire?" The dashboards looked green. The alerts didn't page. The error rate was within tolerance. And yet the incident lasted four hours and cost the company a customer.

Modern observability has gotten very good at telling you what you already configured it to watch for. Latency above 500ms? It pages. Error rate over 5%? It pages. CPU above 80%? It pages. The problem is that the worst incidents almost never look like that. They live in the gaps between the things you thought to alert on.

Here are six incident classes we see in production every week. None of them trip traditional threshold alerts. Most of them don't show up in AI-summarized log feeds either, because the AI is looking at patterns that already exist, not the absence of normal behavior. All of them are exactly what Epok is built to catch.

1. The error message that has never appeared before

At 3:14am, a single line drops into the auth service logs: "connection pool exhausted after 30s waiting for slot." It appears twelve times in three minutes. Then it spreads. By morning, login traffic is silently failing for 4% of users.

Nobody had a threshold for this. Nobody could: the message had never existed in the codebase until a deploy three hours earlier. There was no historical baseline to compare against and no alert rule that matched. A noise-tolerant team would see twelve errors in three minutes and not even register it as worth investigating.

What Epok catches: a fingerprinted error message that has never appeared in your environment before — within five minutes of its first occurrence. Not a count threshold. Not a rule. The fact that this specific pattern has never existed before is the signal.

What it costs to miss: average detection time for new error classes via customer reports is over two hours. Average detection time via support ticket is over a day.
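To make the "never seen before" idea concrete, here is a minimal Python sketch of a first-seen check. It is not Epok's implementation; the masking rules, the `fingerprint` function, and the `NoveltyDetector` class are hypothetical names chosen for illustration. The sketch assumes that stripping numbers and IDs from a message is enough to recover a stable template.

```python
import hashlib
import re

# Hypothetical masking rules: replace the parts of a message that vary per request.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"\b0x[0-9a-f]+\b"), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]


def fingerprint(message: str) -> str:
    """Reduce a log message to a template and return a short hash of it."""
    template = message.lower()
    for pattern, token in _MASKS:
        template = pattern.sub(token, template)
    return hashlib.sha1(template.encode("utf-8")).hexdigest()[:12]


class NoveltyDetector:
    """Remembers every fingerprint it has seen and reports first occurrences."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def observe(self, message: str) -> bool:
        """Return True only the first time a message template appears."""
        fp = fingerprint(message)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


detector = NoveltyDetector()
for line in [
    "connection pool exhausted after 30s waiting for slot",
    "connection pool exhausted after 45s waiting for slot",
]:
    if detector.observe(line):
        print("new error class:", line)
```

Only the first line is reported, because both messages reduce to the same template. A production system would also need to persist the seen set and age out old fingerprints, which this sketch leaves out.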

2. The service that stopped logging

A background worker that processes Stripe webhook events runs out of memory just before midnight. The kernel's OOM killer terminates it with SIGKILL, so the application never gets a chance to write anything. There is no error log. The logs simply stop.

Six hours later, customer support starts getting messages about missing receipts. The worker hasn't processed an event since 23:58 the night before. Nobody got paged because no error fired. Zero errors is technically within tolerance.

What Epok catches: log streams that go quiet relative to their own historical cadence. The worker that normally produces 40 lines per minute is producing zero. That is the alert. Critically, this is calibrated per service per time-of-day — a batch job that runs hourly isn't flagged for the 59 minutes of expected silence between runs.

What it costs to miss: silent failures are the longest-running incident class we see. Average duration before detection is six hours. The same incidents detected by Epok average twelve minutes.
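The per-service, per-time-of-day comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Epok's algorithm: the `SilenceDetector` class, the hour-of-day buckets, the `min_samples` cutoff, and the 10% `floor_ratio` are all hypothetical choices. A job with short bursts inside an hour would need finer buckets than this.

```python
from collections import defaultdict
from statistics import median


class SilenceDetector:
    """Flags a stream whose line rate collapses versus its own history
    for the same hour of day."""

    def __init__(self, min_samples: int = 3, floor_ratio: float = 0.1) -> None:
        self.min_samples = min_samples  # history needed before judging
        self.floor_ratio = floor_ratio  # fraction of normal that counts as quiet
        self._history = defaultdict(list)  # (service, hour) -> lines/minute samples

    def record(self, service: str, hour: int, lines_per_minute: float) -> None:
        """Store one historical observation for this service and hour."""
        self._history[(service, hour)].append(lines_per_minute)

    def is_silent(self, service: str, hour: int, lines_per_minute: float) -> bool:
        """Return True when the current rate is far below the usual rate."""
        samples = self._history.get((service, hour), [])
        if len(samples) < self.min_samples:
            return False  # not enough history to judge
        expected = median(samples)
        if expected == 0:
            return False  # silence is normal for this service at this hour
        return lines_per_minute < expected * self.floor_ratio


detector = SilenceDetector()
for rate in (40, 42, 38):
    detector.record("webhook-worker", 0, rate)
print(detector.is_silent("webhook-worker", 0, 0))
```

Using the median keeps one unusually noisy night from skewing the baseline, and the zero-baseline guard is what stops a normally quiet window from being flagged.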

3. The latency that crept

Your checkout API runs at p99 80ms most days. Over the course of three hours, it drifts to 140ms. Then 200ms. Then 240ms. The fixed threshold on your alert is 500ms — set conservatively because nobody wanted to be paged every Friday afternoon.

By the time the threshold trips, you've lost two hours of conversion. Cart abandonment is already up. The slowdown is now severe enough that customers are filing tickets. You spend the next forty minutes correlating commits to figure out what changed.

What Epok catches: latency drift relative to what is normal for this service at this hour of the week. 240ms on Tuesday at 2pm is a five-standard-deviation event for your checkout API, even though it's nowhere near 500ms. The deploy that introduced the regression is automatically linked to the alert.

What it costs to miss: most teams discover latency regressions only when they're severe enough to look obviously broken. By then, the impact has already compounded.
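One simple way to express "unusual for this service at this hour of the week" is a z-score against a baseline collected for the same hour-of-week slot. The Python sketch below assumes that approach; Epok's actual model is not described in this post, and the function names, the sample values, and the threshold of five standard deviations (borrowed from the example above) are illustrative.

```python
import math
from datetime import datetime
from statistics import mean, stdev


def hour_of_week(ts: datetime) -> int:
    """Map a timestamp to a slot from 0 (Monday 00:00) to 167 (Sunday 23:00)."""
    return ts.weekday() * 24 + ts.hour


def drift_zscore(baseline_ms: list[float], observed_ms: float) -> float:
    """Standard deviations between the observation and the baseline mean."""
    if len(baseline_ms) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline_ms)
    sigma = stdev(baseline_ms)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is infinitely unusual.
        if observed_ms == mu:
            return 0.0
        return math.copysign(math.inf, observed_ms - mu)
    return (observed_ms - mu) / sigma


def is_drifting(baseline_ms: list[float], observed_ms: float,
                z_threshold: float = 5.0) -> bool:
    """Return True when latency is far above normal for this slot."""
    return drift_zscore(baseline_ms, observed_ms) >= z_threshold


# Made-up p99 samples (ms) from earlier weeks for the same hour-of-week slot.
baseline = [78, 80, 82, 79, 81]
print(is_drifting(baseline, 240))
```

With these made-up samples, 240ms is flagged even though it is less than half of a fixed 500ms threshold. In practice the baseline would be looked up by `hour_of_week` for each service.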

4. The five thousand errors that are really twelve problems

A deploy goes out at 4:47pm. Within ninety seconds, your error feed lights up. Five thousand error lines in the last five minutes. You start scrolling. The exception messages are slightly different on every line — different user IDs, different request IDs, different timestamps. You can't tell whether you're looking at one issue or fifty.

You spend the next twenty minutes copy-pasting representative messages into a doc, trying to dedupe them by hand. Half your team is doing the same exercise in parallel because the on-call channel has no shared understanding of what is actually broken.

What Epok catches: the underlying distinct patterns in the noise. Five thousand error lines collapse to twelve actual problems, each with a representative example, a count, a list of affected services, and a first-seen timestamp. You triage twelve things, not five thousand.

What it costs to miss: triage time during high-severity incidents. The team that spends thirty minutes deduping errors is the team that doesn't ship a fix for forty.
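Collapsing thousands of lines into a handful of patterns can be approximated by masking the tokens that vary and grouping on what remains. The following Python sketch assumes that any token containing a digit is variable data. That is a deliberately crude rule chosen for illustration, and the `cluster` function and its output shape are hypothetical rather than Epok's format.

```python
import re
from collections import Counter

# Assumption: any token containing a digit (IDs, durations, counts) is variable.
_VARIABLE = re.compile(r"\b[\w-]*\d[\w-]*\b")


def template_of(message: str) -> str:
    """Replace variable-looking tokens with a placeholder."""
    return _VARIABLE.sub("<*>", message)


def cluster(lines: list[tuple[str, str]]) -> list[dict]:
    """Group (service, message) pairs by template, most frequent first."""
    counts: Counter[str] = Counter()
    examples: dict[str, str] = {}
    services: dict[str, set[str]] = {}
    for service, message in lines:
        key = template_of(message)
        counts[key] += 1
        examples.setdefault(key, message)  # keep the first line as the example
        services.setdefault(key, set()).add(service)
    return [
        {
            "template": key,
            "count": count,
            "example": examples[key],
            "services": sorted(services[key]),
        }
        for key, count in counts.most_common()
    ]


lines = [
    ("checkout", "timeout after 30s for user 4821"),
    ("cart", "timeout after 31s for user 17"),
    ("cart", "null pointer in cart 9"),
]
for group in cluster(lines):
    print(group["count"], group["template"], group["services"])
```

Three lines become two groups here. Tracking a first-seen timestamp per template would be a small extension to the same loop.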

5. The cascade you didn't see coming

The auth service starts returning 5xx errors at a low rate — maybe 2% of requests. Seven downstream services that depend on auth start failing in cascading ways: checkout can't validate sessions, the user service can't refresh profiles, the notification service can't resolve recipient identities.

Each of these downstream services has its own alert. Seven separate pages fire within ninety seconds of each other. The on-call engineer now has seven alerts, seven Slack channels, and seven dashboards open, trying to figure out which one is the root and which are symptoms.

What Epok catches: the dependency between these failures. The seven downstream alerts are automatically merged into one incident, with auth identified as the upstream cause based on the timing pattern, the dependency map inferred from your logs, and the correlation between the failure signatures. One incident. One root. One investigation.

What it costs to miss: alert fatigue is the obvious cost. The less obvious one is the wrong mitigation — restarting a downstream service when the root is upstream often makes the cascade worse.
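The merging step can be illustrated with a small graph walk: among services alerting in the same window, a root is one with no alerting upstream dependency, and everything downstream of it is a symptom. The Python sketch below takes the dependency map as a given input (the post says Epok infers it from logs), ignores failure-signature correlation, and assumes the graph has no cycles. The function names and the 120-second window are hypothetical.

```python
def upstream_of(service: str, depends_on: dict[str, set[str]]) -> set[str]:
    """All direct and transitive dependencies of a service."""
    seen: set[str] = set()
    stack = list(depends_on.get(service, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return seen


def merge_alerts(alerts: list[tuple[str, float]],
                 depends_on: dict[str, set[str]],
                 window_s: float = 120.0) -> dict[str, list[str]]:
    """Group alerts near the first one into {root service: [symptom services]}.

    alerts is a list of (service, unix_timestamp) pairs.
    """
    if not alerts:
        return {}
    first = min(ts for _, ts in alerts)
    active = {svc for svc, ts in alerts if ts - first <= window_s}
    # A root is an alerting service with no alerting dependency upstream.
    incidents: dict[str, list[str]] = {
        svc: [] for svc in sorted(active)
        if not (upstream_of(svc, depends_on) & active)
    }
    # Every other alerting service is attached to each root it depends on.
    for svc in sorted(active):
        for root in incidents:
            if root in upstream_of(svc, depends_on):
                incidents[root].append(svc)
    return incidents


depends_on = {"checkout": {"auth"}, "user": {"auth"}, "notification": {"user"}}
alerts = [("auth", 0), ("checkout", 20), ("user", 35), ("notification", 60)]
print(merge_alerts(alerts, depends_on))
```

Four alerts become one incident rooted at `auth`. An unrelated service alerting in the same window would come out as its own root with no symptoms attached.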

6. The pattern that fires every Tuesday at 4am

An alert has been firing for eight weeks. Same error pattern, same service, every Tuesday morning at 4:00am, lasting about eight minutes before resolving itself. Nobody has investigated because it auto-resolves before anyone looks. It shows up as "resolved" in the morning incident review and gets dismissed as transient.

What Epok catches: the recurrence. This isn't a transient — this is a pattern. Same hour, same duration, same signature, eight Tuesdays in a row. It surfaces as a recurring incident with the full history attached, not eight separate transient blips that each looked unimportant on their own.

What it costs to miss: small recurring incidents are how serious problems hide. The Tuesday 4am pattern is almost always a scheduled job, a backup window, or a cron-driven side effect that nobody has connected to the symptom. Until it stops auto-resolving.
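Spotting a recurrence like this amounts to grouping past incidents by signature and time-of-week slot, then counting how many distinct weeks each slot appears in. The Python sketch below assumes incidents arrive as (signature, start time) pairs. The `recurring` function, the hourly slot size, and the `min_weeks` cutoff are illustrative choices, not Epok's.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recurring(incidents: list[tuple[str, datetime]],
              min_weeks: int = 3) -> dict[tuple[str, int, int], int]:
    """Return {(signature, weekday, hour): distinct weeks seen} for slots
    that recur in at least min_weeks different ISO weeks."""
    weeks_seen = defaultdict(set)
    for signature, started in incidents:
        iso = started.isocalendar()  # (ISO year, ISO week, ISO weekday)
        slot = (signature, started.weekday(), started.hour)
        weeks_seen[slot].add((iso[0], iso[1]))
    return {
        slot: len(weeks)
        for slot, weeks in weeks_seen.items()
        if len(weeks) >= min_weeks
    }


# Eight consecutive Tuesdays at 04:00, plus one unrelated one-off.
first = datetime(2024, 1, 2, 4, 0)
history = [("db-timeout", first + timedelta(weeks=i)) for i in range(8)]
history.append(("disk-full", datetime(2024, 1, 4, 15, 30)))
print(recurring(history))
```

The eight Tuesday incidents surface as one recurring slot (weekday 1 is Tuesday in Python's numbering), while the one-off is left out.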

Why this list doesn't get shorter with AI

There is a wave of observability products adding LLM layers on top of log search — type a question, get a summary. These are useful for explanation, but they don't change the underlying detection. The LLM only sees what gets surfaced to it. If your detection layer is still threshold-based alerting plus a search bar, the AI is summarizing the same set of incidents you were already going to triage. It is not finding incidents that didn't already fire an alert.

The six classes above are exactly the ones an AI assistant on top of search won't find for you. There is no error message to summarize for a service that went silent. There is no anomaly to explain in latency drift that didn't trip a threshold. There is no "what changed" to ask about a deploy that broke something subtle.

Detection has to happen first. The explanation comes after. That order matters.

What this looks like in practice

Point Epok at your logs. Within an hour, it starts learning what normal looks like for each of your services. Within five minutes of a deployment, it's already catching new error classes that have never appeared before. Within a day, it's calibrating per-service silence detection. Within a week, it has full weekly baselines for latency drift and volume anomalies.

You don't write rules. You don't tune thresholds. You don't pick which patterns matter. Detection runs continuously across every log stream you send. When something fires, it comes with the cluster of related signals already grouped — not seven alerts in seven channels, but one incident with the cascade attached.

Every detector described in this post is available in the 14-day trial. Full features, no credit card. Send logs in any standard format: Elasticsearch bulk, OTLP, Loki push, Fluent Bit, syslog, raw JSON. First alerts land within minutes; full anomaly coverage builds over the first 7 days as baselines learn your traffic.

The incidents that cost you the most aren't the ones that fire. They're the ones that hide between the alerts. The 14-day trial covers everything in this post — try it at app.getepok.dev.

Try Epok free. No credit card. First alerts in minutes; full baseline coverage at 7 days.

Every detector included. Root cause analysis on every incident. See what your logs are trying to tell you.
