Numbers, with the dataset.
Every benchmark is reproducible. Public datasets, published methodology, full setup on each card.
Loghub HDFS v1
The Loghub HDFS_v1 corpus is a standard reference dataset for log anomaly research, drawn from a Hadoop Distributed File System running in production at large scale. We replay the full 2M-line corpus at 18.6K events/second into a warm Epok tenant, meaning one whose statistical detectors have already learned a stable baseline.
Two anomalies are injected at known timestamps. All detectors run in parallel; we count an alert as a true positive only if it fires within the injection window with a fingerprint matching the seeded pattern. False positives are alerts that fire outside the injection window or on noise patterns.
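The counting rule above can be sketched in a few lines. This is a minimal illustration, not the published harness: the `Injection` and `Alert` types, the fingerprint strings, and the timestamps are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Injection:
    fingerprint: str  # seeded pattern identifier (hypothetical name)
    start: float      # injection window start, seconds from replay start
    end: float        # injection window end, seconds from replay start


@dataclass(frozen=True)
class Alert:
    fingerprint: str
    fired_at: float   # seconds from replay start


def score(alerts, injections):
    """Count an alert as a true positive only if it fires inside an
    injection window AND its fingerprint matches that injection."""
    true_positives = 0
    false_positives = 0
    detected = set()  # indices of injections caught at least once
    for alert in alerts:
        hit = next(
            (idx for idx, inj in enumerate(injections)
             if inj.start <= alert.fired_at <= inj.end
             and inj.fingerprint == alert.fingerprint),
            None,
        )
        if hit is None:
            false_positives += 1  # outside every window, or a noise pattern
        else:
            true_positives += 1
            detected.add(hit)
    precision = true_positives / len(alerts) if alerts else None
    recall = len(detected) / len(injections) if injections else None
    return {"tp": true_positives, "fp": false_positives,
            "precision": precision, "recall": recall}


# Illustrative data only: two seeded anomalies, three alerts.
injections = [
    Injection("block-corrupt", 100.0, 160.0),
    Injection("datanode-timeout", 400.0, 460.0),
]
alerts = [
    Alert("block-corrupt", 120.0),    # in window, matching fingerprint
    Alert("block-corrupt", 300.0),    # right fingerprint, outside the window
    Alert("heartbeat-noise", 410.0),  # in a window, but a noise pattern
]
result = score(alerts, injections)
```

In this toy run one of three alerts counts, and one of two injections is caught. Recall is computed per injection rather than per alert, so a detector that fires repeatedly on the same anomaly does not inflate it.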
OpenRCA Market
OpenRCA Market is a published microservices benchmark designed for root-cause analysis evaluation, with labeled ground truth for each injected fault. Unlike Loghub, this is a cold-start scenario — each case starts a fresh tenant, replays the trace, and measures whether Epok's detectors fire before the labeled symptom window closes.
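A per-case check for the cold-start scenario might look like the sketch below. It assumes latency is measured from the labeled fault start to the first alert; that definition, the function name, and the sample cases are assumptions for illustration, not the benchmark's actual harness.

```python
from statistics import median


def detection_latency_minutes(fault_start, first_alert, window_close):
    """Return latency in minutes if the first alert lands between the fault
    start and the close of the labeled symptom window, else None (a miss)."""
    if first_alert is None:
        return None
    if not (fault_start <= first_alert <= window_close):
        return None
    return (first_alert - fault_start) / 60.0


# Illustrative cases only: (fault_start, first_alert, window_close) in seconds.
cases = [
    (0.0, 240.0, 1800.0),   # detected after 4 minutes
    (0.0, None, 1800.0),    # no alert at all
    (0.0, 2000.0, 1800.0),  # alert fired after the window closed
    (0.0, 600.0, 1800.0),   # detected after 10 minutes
]
latencies = [detection_latency_minutes(*case) for case in cases]
detected = [m for m in latencies if m is not None]
recall = len(detected) / len(cases)
median_latency = median(detected)
```

Late alerts count as misses here, so the median is taken only over cases detected inside the window.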
Median detection latency across measured fault classes is 8.6 minutes, well inside the symptom window. Container termination cases hit 4/4 because silence detection requires no learned baseline: a service that should be logging and isn't is a fact, not a forecast.
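To show why silence detection needs no learned baseline, here is a minimal sketch. It assumes each service has a configured maximum quiet period; the function name, service names, and thresholds are hypothetical and do not describe Epok's implementation.

```python
def silent_services(last_seen, now, max_gap_seconds):
    """Return the services whose most recent log line is older than their
    allowed gap. A service that has never logged counts as silent."""
    return sorted(
        service
        for service, gap in max_gap_seconds.items()
        if now - last_seen.get(service, float("-inf")) > gap
    )


# Illustrative values only (epoch-style seconds).
last_seen = {"checkout": 1000.0, "cart": 1290.0}
max_gap_seconds = {"checkout": 120.0, "cart": 120.0, "payments": 60.0}
silent = silent_services(last_seen, now=1300.0, max_gap_seconds=max_gap_seconds)
```

The check is a plain comparison against a fixed threshold, with no model to train, which is why it works on a fresh tenant.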
Latency / retransmission cases · 0% recall. These fault classes produce no HTTP error signal in the source data, so log-shaped detection has nothing to fire on. We report this here because it's the honest answer; metric- and trace-shaped signals catch these and are a separate roadmap line.
Internal rule-pack regression suite
No statistical detector runs at 100% precision and 100% recall. A rule pack (a curated set of patterns encoding hard-won incident knowledge for a specific domain) can. Not because the rules are clever, but because they're deterministic and the corpus tells the truth.
Every change runs the regression suite. A pack that drops below 100/100 on any category fails the test and the change doesn't merge. A refactor that accidentally weakens an existing rule fails CI. A new rule that catches a new failure mode but breaks an existing case fails CI. The "100% precision and recall" claim on the marketing site stays honest because the test runs in 4 seconds on every change — not quarterly, not aspirationally.
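The merge gate described above reduces to a simple rule: any false positive or any false negative in any category fails the change. A hedged sketch, assuming per-category counts are already available; the result layout and category names are invented for the example.

```python
def failing_categories(results):
    """Return the categories that are below 100% precision (any false
    positive) or below 100% recall (any false negative)."""
    failing = []
    for category in sorted(results):
        counts = results[category]
        if counts["fp"] > 0 or counts["fn"] > 0:
            failing.append(category)
    return failing


# Illustrative counts only; category names are hypothetical.
results = {
    "ssh-brute-force": {"tp": 12, "fp": 0, "fn": 0},
    "disk-full": {"tp": 7, "fp": 0, "fn": 1},   # a weakened rule missed a case
    "oom-kill": {"tp": 5, "fp": 0, "fn": 0},
}
failing = failing_categories(results)
```

A CI job would exit non-zero whenever `failing` is non-empty, which is what blocks the merge.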
Read the full engineering write-up: Six Rule Packs at 100% Precision and 100% Recall →
Why precision is the budget, recall is the spec.
Precision is the floor.
A detector that fires on 90% of brute-force attacks but also fires twice a week on legitimate SSH from a sleepy laptop gets muted within a month. Once precision falls below that floor, recall is irrelevant: nobody is listening to the alerts.
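The arithmetic behind that example is worth a quick look. The sketch below assumes one real brute-force attack per month and a four-week month; both are illustrative assumptions, since the text gives only the 90% hit rate and the two false alerts per week.

```python
def monthly_precision(attacks_per_month, recall, false_alerts_per_week,
                      weeks_per_month=4.0):
    """Share of a month's alerts that correspond to real attacks."""
    true_alerts = attacks_per_month * recall
    false_alerts = false_alerts_per_week * weeks_per_month
    return true_alerts / (true_alerts + false_alerts)


# Assumed base rate: one real attack per month (not from the source).
precision = monthly_precision(attacks_per_month=1.0, recall=0.9,
                              false_alerts_per_week=2.0)
```

Under those assumptions roughly one alert in ten is real, which is the point at which on-call engineers stop reading them.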
Datasets are public.
Every dataset cited above is publicly available. Setup and methodology are published per benchmark. Every result is reproducible on a 4-vCPU runner. No closed corpora. No proprietary harness. No cherry-picked windows.
PRINCIPLE
If a number isn't reproducible by an outsider with public data on commodity hardware, we don't publish it.
Audit the numbers yourself.
Every benchmark card above lists the exact script under its "method" label. The datasets are public, the runners are commodity hardware, and the harness is open.
Want a number we haven't published? Ask us to run it →