epok
BENCHMARKS

Numbers, with the dataset.

Every benchmark is reproducible. Public datasets, published methodology, full setup on each card.

94% Precision · Loghub HDFS · 2M lines · warm tenant
88% Recall · OpenRCA · packet-loss subset
100 / 100 Precision + Recall · 6 rule packs · our regression suite
01
STATISTICAL PRECISION

Loghub HDFS v1

setup · Public Loghub HDFS_v1 (2M lines) replayed at 18.6K events/sec into a warm tenant (pre-seeded baseline).
DETECTORS ACTIVE
new_error · log_rate · error_rate · dependency_intelligence
PRECISION
94%
147 alerts · 8 false positives
DETECTION TIME
147 s
to first signal
EVENT RATE
18.6K/sec
sustained · single tenant
INJECTED FAULTS
2
all caught · 3 detectors fired
method · loghub-hdfs-v1 replay · ≈ 4 min · output: junit + csv

The Loghub HDFS_v1 corpus is a standard reference dataset for log anomaly research, drawn from a Hadoop Distributed File System running in production at large scale. We replay the full 2M-line corpus at 18.6K events/second into a warm Epok tenant, meaning the statistical detectors have already learned a stable baseline.

Two anomalies are injected at known timestamps. All detectors run in parallel; we count an alert as a true positive only if it fires within the injection window with a fingerprint matching the seeded pattern. False positives are alerts that fire outside the injection window or on noise patterns.
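
The counting rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark harness: the `Injection` and `Alert` types, the fingerprint strings, and the timestamps are all hypothetical, and the sketch assumes a fingerprint match is a simple string comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Injection:
    start: float        # injection window start, seconds from replay start
    end: float          # injection window end
    fingerprint: str    # identifier of the seeded pattern (hypothetical format)


@dataclass(frozen=True)
class Alert:
    fired_at: float     # seconds from replay start
    fingerprint: str


def is_true_positive(alert: Alert, injections: list[Injection]) -> bool:
    """True only if the alert fires inside an injection window AND its
    fingerprint matches that window's seeded pattern."""
    return any(
        inj.start <= alert.fired_at <= inj.end
        and alert.fingerprint == inj.fingerprint
        for inj in injections
    )


def precision(alerts: list[Alert], injections: list[Injection]) -> float:
    if not alerts:
        raise ValueError("precision is undefined with zero alerts")
    true_positives = sum(is_true_positive(a, injections) for a in alerts)
    return true_positives / len(alerts)


# Toy data, not benchmark results.
injections = [
    Injection(100.0, 160.0, "disk-full"),
    Injection(400.0, 460.0, "replica-timeout"),
]
alerts = [
    Alert(120.0, "disk-full"),        # in window, matching fingerprint: true positive
    Alert(410.0, "replica-timeout"),  # in window, matching fingerprint: true positive
    Alert(300.0, "disk-full"),        # outside every window: false positive
    Alert(130.0, "gc-pause"),         # in window but a noise pattern: false positive
]
print(precision(alerts, injections))  # 0.5
```

Requiring both conditions is what keeps an unrelated alert that happens to fire during an injection window from being counted as a catch.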

02
ROOT-CAUSE RECALL

OpenRCA Market

setup · OpenRCA Market e-commerce microservices benchmark. Network-fault cases replayed against cold-start tenants.
DETECTORS ACTIVE
silence · new_error · golden_signals · dependency_intelligence
PACKET LOSS · RECALL
88%
22 of 25 cases caught
PACKET CORRUPTION · RECALL
55%
6 of 11 cases caught
CONTAINER TERMINATION
4 / 4
silence detector · 100%
FALSE POSITIVES
0
across all measured cases
method · openrca-market replay · ≈ 18 min · per-case csv output

OpenRCA Market is a published microservices benchmark designed for root-cause analysis evaluation, with labeled ground truth for each injected fault. Unlike Loghub, this is a cold-start scenario — each case starts a fresh tenant, replays the trace, and measures whether Epok's detectors fire before the labeled symptom window closes.
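
As a rough sketch of how per-class recall follows from that rule, the Python below treats a case as caught only when a detector fired at or before the close of its labeled symptom window. The tuple layout, class names, and timings are assumptions made for illustration and do not reflect the harness's actual CSV schema.

```python
from collections import defaultdict


def recall_by_class(cases):
    """cases: iterable of (fault_class, detected_at, window_close) tuples.

    detected_at is None when no detector fired. A case is caught only if
    detection happened at or before the labeled symptom window closed.
    """
    caught = defaultdict(int)
    total = defaultdict(int)
    for fault_class, detected_at, window_close in cases:
        total[fault_class] += 1
        if detected_at is not None and detected_at <= window_close:
            caught[fault_class] += 1
    return {cls: caught[cls] / total[cls] for cls in total}


# Toy data, not benchmark results. Times are seconds from trace start.
cases = [
    ("packet_loss", 310.0, 900.0),
    ("packet_loss", 505.0, 900.0),
    ("packet_loss", None, 900.0),           # never detected
    ("packet_loss", 950.0, 900.0),          # detected after the window closed
    ("container_termination", 60.0, 600.0),
    ("container_termination", 75.0, 600.0),
]
print(recall_by_class(cases))
# {'packet_loss': 0.5, 'container_termination': 1.0}
```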

Median detection latency across measured fault classes is 8.6 minutes, well inside the symptom window. Container termination cases hit 4/4 because silence detection requires no learned baseline: a service that should be logging and isn't is a fact, not a forecast.
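
To illustrate why a silence check needs no learned baseline, here is a minimal Python sketch. It is not Epok's detector: the function name, the service names, and the 300-second gap threshold are all assumptions. The point is only that the check compares a last-seen timestamp against a fixed limit, with no statistics involved.

```python
def silent_services(last_seen, expected, now, max_gap_s=300.0):
    """Return the expected services with no log line in the last max_gap_s seconds.

    last_seen: dict mapping service name -> timestamp of its most recent log line
    expected:  set of services that should be logging
    max_gap_s: hypothetical fixed threshold, chosen for this example
    """
    silent = []
    for service in sorted(expected):
        seen = last_seen.get(service)
        if seen is None or now - seen > max_gap_s:
            silent.append(service)
    return silent


# Toy data: "checkout" stopped logging 400 s ago, "payments" never logged.
last_seen = {"cart": 990.0, "checkout": 600.0}
expected = {"cart", "checkout", "payments"}
print(silent_services(last_seen, expected, now=1000.0))
# ['checkout', 'payments']
```

Because the rule depends only on what is expected and what was last seen, it behaves the same on a fresh tenant as on a warm one.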

WHAT WE DON'T CATCH ON THIS BENCHMARK

Latency / retransmission cases · 0% recall. These fault classes produce no HTTP error signal in the source data, so log-shaped detection has nothing to fire on. We report this here because it's the honest answer; metric- and trace-shaped signals catch these and are a separate roadmap line.

03
DETERMINISTIC GROUND TRUTH

Internal rule-pack regression suite

setup · Labeled corpus across six rule-pack categories. Positive cases must fire. Negative cases must stay silent. Both must pass.
DETECTORS ACTIVE
security · web · database · dependency · search · infrastructure
PRECISION
100%
across all 6 packs
RECALL
100%
across all 6 packs
CI GATE
every PR
regression fails the build
RUNTIME
4 s
runs on every change
method · rule-pack regression suite · runs on every pull request · regression fails CI

A statistical detector that runs at 100% precision and 100% recall doesn't exist. A rule pack, a curated set of patterns encoding hard-won incident knowledge for a specific domain, can. Not because the rules are clever, but because they're deterministic and the corpus tells the truth.

Every change runs the regression suite. A pack that drops below 100/100 on any category fails the test and the change doesn't merge. A refactor that accidentally weakens an existing rule fails CI. A new rule that catches a new failure mode but breaks an existing case fails CI. The "100% precision and recall" claim on the marketing site stays honest because the test runs in 4 seconds on every change — not quarterly, not aspirationally.
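
The shape of such a gate can be sketched in Python. The packs, patterns, and log lines below are invented for illustration and are not Epok's rules; the sketch also assumes rules are plain regular expressions. What it shows is the two-sided check: every positive case must fire, every negative case must stay silent, and any mismatch fails the run.

```python
import re

# Hypothetical rule packs: pack name -> list of compiled patterns.
RULE_PACKS = {
    "security": [re.compile(r"Failed password for (invalid user )?\S+ from \S+")],
    "database": [re.compile(r"deadlock detected", re.IGNORECASE)],
}

# Hypothetical labeled corpus: (pack, log line, should_fire).
CORPUS = [
    ("security", "sshd[812]: Failed password for root from 10.0.0.7 port 2222 ssh2", True),
    ("security", "sshd[812]: Accepted publickey for deploy from 10.0.0.7 port 2222 ssh2", False),
    ("database", "ERROR: deadlock detected", True),
    ("database", "LOG: checkpoint complete", False),
]


def fires(pack, line):
    """True if any rule in the pack matches the line."""
    return any(pattern.search(line) for pattern in RULE_PACKS[pack])


def regression_failures(corpus):
    """Return every labeled case whose outcome differs from its label."""
    return [
        (pack, line)
        for pack, line, should_fire in corpus
        if fires(pack, line) != should_fire
    ]


failures = regression_failures(CORPUS)
# In CI, a failed assertion exits non-zero, which is what blocks the merge.
assert not failures, f"rule-pack regression: {failures}"
```

A missed positive and a spurious match on a negative both land in the same failure list, so a weakened rule and an over-broad new rule are caught by the same check.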

Read the full engineering write-up: Six Rule Packs at 100% Precision and 100% Recall →

04
METHODOLOGY

Why precision is the budget, recall is the spec.

Precision is the floor.

A detector that fires on 90% of brute-force attacks but also fires twice a week on legitimate SSH from a sleepy laptop gets muted within a month. Once precision drops below that floor, recall is irrelevant because nobody is listening to the alerts.

Datasets are public.

Every dataset cited above is publicly available. Setup and methodology are published per benchmark. Every result is reproducible on a 4-vCPU runner. No closed corpora. No proprietary harness. No cherry-picked windows.

PRINCIPLE

If a number isn't reproducible by an outsider with public data on commodity hardware, we don't publish it.

METHODOLOGY · RAW DATA

Audit the numbers yourself.

Every benchmark card above names the exact script on its "method ·" line. Datasets are public, runners are commodity, and the harness is open.

DATASETS

Public corpora only

Loghub HDFS_v1 (2M lines, LogPAI), Loghub BGL (4.7M lines), OpenRCA Market (70 microservice incidents). No closed corpora. No proprietary capture.

HARNESS

Open scripts in repo

Each script (loghub_precision_recall.py, openrca_market_replay.py, rule_packs PR baseline) lives in scripts/ in the public repo. Runnable on any 4-vCPU box.

REPRO

One-line invocation

Clone the repo, set a tenant key, run the script. No data prep, no proprietary loaders. Numbers match the cards above to within seed variance.

Want a number we haven't published? Ask us to run it →
