BENCHMARKS

Numbers, with the dataset.

Every benchmark is reproducible. Public datasets, published methodology, full setup on each card.

94%

Precision

Loghub HDFS · 2M lines · warm tenant

88%

Recall

OpenRCA · packet-loss subset

100 / 100

Precision + Recall

6 rule packs · our regression suite

STATISTICAL PRECISION

Loghub HDFS v1

setup · Public Loghub HDFS_v1 (2M lines) replayed at 18.6K events/sec into a warm tenant (pre-seeded baseline).

DETECTORS ACTIVE

new_error · log_rate · error_rate · dependency_intelligence

PRECISION

94%

147 alerts · 8 false positives

DETECTION TIME

147 s

to first signal

EVENT RATE

18.6K/sec

sustained · single tenant

INJECTED FAULTS

2

all caught · 3 detectors fired

method · loghub-hdfs-v1 replay · ≈ 4 min · output: junit + csv

The Loghub HDFS_v1 corpus is a standard reference dataset for log anomaly research, drawn from a Hadoop Distributed File System running in production at large scale. We replay the full 2M-line corpus at 18.6K events/second into a warm Epok tenant, meaning the statistical detectors have already learned a stable baseline.

Two anomalies are injected at known timestamps. All detectors run in parallel; we count an alert as a true positive only if it fires within the injection window with a fingerprint matching the seeded pattern. False positives are alerts that fire outside the injection window or on noise patterns.
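The counting rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual harness: the `Alert` and `Injection` types and their field names are hypothetical, and the inclusive window bounds are a guess. The last line reproduces the card's arithmetic from 147 alerts and 8 false positives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Injection:
    start: float      # seconds since replay start (hypothetical field)
    end: float        # end of the injection window (hypothetical field)
    fingerprint: str  # identifier of the seeded pattern (hypothetical field)


@dataclass(frozen=True)
class Alert:
    fired_at: float   # seconds since replay start
    fingerprint: str


def is_true_positive(alert: Alert, injections: list[Injection]) -> bool:
    """True only if the alert fires inside an injection window
    AND its fingerprint matches that window's seeded pattern."""
    return any(
        inj.start <= alert.fired_at <= inj.end
        and alert.fingerprint == inj.fingerprint
        for inj in injections
    )


def precision(alerts: list[Alert], injections: list[Injection]) -> float:
    """Share of alerts that are true positives; 0.0 when nothing fired."""
    if not alerts:
        return 0.0
    true_positives = sum(is_true_positive(a, injections) for a in alerts)
    return true_positives / len(alerts)


# Card arithmetic: 147 alerts, 8 false positives, so 139 true positives.
card_precision = (147 - 8) / 147  # roughly 0.9456; the card reports 94%
```

Requiring both conditions means an alert with the right fingerprint at the wrong time, or the wrong fingerprint at the right time, counts against precision.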

ROOT-CAUSE RECALL

OpenRCA Market

setup · OpenRCA Market e-commerce microservices benchmark. Network-fault cases replayed against cold-start tenants.

DETECTORS ACTIVE

silence · new_error · golden_signals · dependency_intelligence

PACKET LOSS · RECALL

88%

22 of 25 cases caught

PACKET CORRUPTION · RECALL

55%

6 of 11 cases caught

CONTAINER TERMINATION

4 / 4

silence detector · 100%

FALSE POSITIVES

across all measured cases

method · openrca-market replay · ≈ 18 min · per-case csv output

OpenRCA Market is a published microservices benchmark designed for root-cause analysis evaluation, with labeled ground truth for each injected fault. Unlike Loghub, this is a cold-start scenario — each case starts a fresh tenant, replays the trace, and measures whether Epok's detectors fire before the labeled symptom window closes.
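The per-case check and the recall figures on the cards can be sketched as below. The function names are hypothetical, and treating an alert that lands exactly at the window close as caught is an assumption; the counts are the ones printed on the cards.

```python
from typing import Optional


def case_caught(first_alert_at: Optional[float], window_close: float) -> bool:
    """A case counts as caught if some detector fired no later than
    the close of the labeled symptom window (boundary is an assumption)."""
    return first_alert_at is not None and first_alert_at <= window_close


def recall(caught: int, total: int) -> float:
    """Fraction of labeled fault cases that were caught."""
    return caught / total


# (caught, total) per fault class, taken from the cards above.
card_counts = {
    "packet_loss": (22, 25),
    "packet_corruption": (6, 11),
    "container_termination": (4, 4),
}

card_recall = {name: recall(c, t) for name, (c, t) in card_counts.items()}
```

Rounded to whole percentages, these come out to the 88%, 55%, and 100% shown on the cards.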

Median detection latency across measured fault classes is 8.6 minutes, well inside the symptom window. Container termination cases hit 4/4 because silence detection requires no learned baseline: a service that should be logging and isn't is a fact, not a forecast.
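Why silence detection needs no baseline can be seen in a minimal sketch. This is not Epok's detector: the function name, the `max_gap_s` threshold, and the idea of a fixed list of expected services are assumptions made for illustration.

```python
def silent_services(
    last_seen: dict[str, float],
    expected: set[str],
    now: float,
    max_gap_s: float,
) -> list[str]:
    """Return expected services that have logged nothing in the last
    max_gap_s seconds. No history or model is needed, only the time
    of the most recent log line per service."""
    silent = []
    for service in sorted(expected):
        seen = last_seen.get(service)
        if seen is None or now - seen > max_gap_s:
            silent.append(service)
    return silent


# Hypothetical snapshot: "checkout" went quiet, "payment" never logged.
snapshot = {"cart": 100.0, "checkout": 30.0}
quiet = silent_services(snapshot, {"cart", "checkout", "payment"}, now=120.0, max_gap_s=60.0)
```

The check compares a timestamp against a threshold, so it works from the first minute of a cold-start tenant.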

WHAT WE DON'T CATCH ON THIS BENCHMARK

Latency / retransmission cases · 0% recall. These fault classes produce no HTTP error signal in the source data, so log-shaped detection has nothing to fire on. We report this here because it's the honest answer; metric- and trace-shaped signals catch these and are a separate roadmap line.

DETERMINISTIC GROUND TRUTH

Internal rule-pack regression suite

setup · Labeled corpus across six rule-pack categories. Positive cases must fire. Negative cases must stay silent. Both must pass.

DETECTORS ACTIVE

security · web · database · dependency · search · infrastructure

PRECISION

100%

across all 6 packs

RECALL

100%

across all 6 packs

CI GATE

every PR

regression fails the build

RUNTIME

4 s

runs on every change

method · rule-pack regression suite · runs on every pull request · regression fails CI

A statistical detector that runs at 100% precision and 100% recall doesn't exist. A rule pack (a curated set of patterns encoding hard-won incident knowledge for a specific domain) can. Not because the rules are clever, but because they're deterministic and the corpus tells the truth.

Every change runs the regression suite. A pack that drops below 100/100 on any category fails the test and the change doesn't merge. A refactor that accidentally weakens an existing rule fails CI. A new rule that catches a new failure mode but breaks an existing case fails CI. The "100% precision and recall" claim on the marketing site stays honest because the test runs in 4 seconds on every change — not quarterly, not aspirationally.
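A gate of this kind can be sketched in a few lines. Everything here is hypothetical: the regex, the sample log lines, and the `Case` type are made up for illustration and are not taken from Epok's rule packs or corpus.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    line: str
    should_fire: bool  # positive cases must fire, negative cases must stay silent


# Hypothetical rule, for illustration only.
RULE = re.compile(r"Failed password for (invalid user )?\S+ from \d+\.\d+\.\d+\.\d+")

CASES = [
    Case("sshd[812]: Failed password for root from 203.0.113.7 port 22 ssh2", True),
    Case("sshd[812]: Failed password for invalid user admin from 203.0.113.7 port 22", True),
    Case("sshd[812]: Accepted publickey for deploy from 198.51.100.4 port 22", False),
]


def evaluate(rule: re.Pattern, cases: list[Case]) -> tuple[float, float]:
    """Return (precision, recall) of a rule over a labeled corpus."""
    tp = sum(1 for c in cases if c.should_fire and rule.search(c.line))
    fp = sum(1 for c in cases if not c.should_fire and rule.search(c.line))
    fn = sum(1 for c in cases if c.should_fire and not rule.search(c.line))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def gate(rule: re.Pattern, cases: list[Case]) -> bool:
    """Pass only at exactly 100% precision and 100% recall."""
    precision, recall = evaluate(rule, cases)
    return precision == 1.0 and recall == 1.0
```

Because the rule is deterministic, the same corpus always yields the same verdict. A weakened pattern that misses one positive case, or a broadened one that fires on a negative case, flips `gate` to `False`.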

Read the full engineering write-up: Six Rule Packs at 100% Precision and 100% Recall →

04 · METHODOLOGY

Why precision is the budget, recall is the spec.

Precision is the floor.

A detector that fires on 90% of brute-force attacks but also fires twice a week on legitimate SSH from a sleepy laptop gets muted within a month. Once precision falls below that floor, recall is irrelevant: nobody is listening to the alerts.

Datasets are public.

Every dataset cited above is publicly available. Setup and methodology are published per benchmark. Every result is reproducible on a 4-vCPU runner. No closed corpora. No proprietary harness. No cherry-picked windows.

PRINCIPLE

If a number isn't reproducible by an outsider with public data on commodity hardware, we don't publish it.

METHODOLOGY · RAW DATA

Audit the numbers yourself.

Every benchmark card above lists the exact script on its "method ·" line. Datasets are public, runners are commodity, and the harness is open.

DATASETS

Public corpora only

Loghub HDFS_v1 (2M lines, LogPAI), Loghub BGL (4.7M lines), OpenRCA Market (70 microservice incidents). No closed corpora. No proprietary capture.

HARNESS

Open scripts in repo

Each script (loghub_precision_recall.py, openrca_market_replay.py, rule_packs PR baseline) lives in scripts/ in the public repo. Runnable on any 4-vCPU box.

REPRO

One-line invocation

Clone the repo, set a tenant key, run the script. No data prep, no proprietary loaders. Numbers match the cards above to within seed variance.

Want a number we haven't published? Ask us to run it →

Want to run a benchmark against your own logs?

Start 14-day trial →