Benchmarks

The receipt wall, explained.

For each benchmark: what it tests, what the number means, and the number itself. No runtime LLM in the decision path. SIGNAL is PHAROS’s benchmark; VANTAGE is OMNIS’s code-audit receipt: benchmarks on the glass, products in the nav.

Primary seven

VANTAGE · 9/9

Tests: deterministic code audit — reject-first verification on sealed fixture repos (package hygiene, runtime danger, architecture shape, duplicate families, false-alarm discipline).

Means: all nine VANTAGE 2.0 fixture suites cleared · 15/15 expected findings recovered · 0 forbidden false-positive classes · 0 severity mismatches. Product = OMNIS / VANTAGE; this is the internal product-validation receipt, not a public held-out F1 claim.

SIGNAL · F1 0.639

Tests: pharmacovigilance adverse drug event (ADE) extraction from clinical narratives and reports.

Means: F1 is the harmonic mean of precision (0.712) and recall (0.580) on the sealed COSMIC task; median detection window 24.3 months. Product = PHAROS; SIGNAL is the benchmark receipt.
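The relationship between the three SIGNAL numbers can be checked by hand. The sketch below recomputes F1 from the precision and recall quoted above; the function name `f1_score` is illustrative and is not taken from the PHAROS codebase.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for SIGNAL above.
signal_f1 = f1_score(0.712, 0.580)
print(f"{signal_f1:.3f}")  # prints 0.639
```

The harmonic mean sits closer to the lower of the two inputs, which is why the score lands nearer to recall (0.580) than to precision (0.712).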

CITADEL · F1 0.616

Tests: financial-compliance entity extraction — corporate subsidiary hierarchy reconstruction from SEC Exhibit 21 filings.

Means: F1 on a 400-entity corpus with SHA-verified ground truth. Checkpoint arc E → E.2 documented. Corpus SHA-256 (E.2): a6a98dbb…dab99d81.
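A corpus seal of this kind can be checked locally with a standard SHA-256 digest. The following is a minimal sketch, assuming the corpus is available as a single file; the helper names and the file name in the usage comment are hypothetical. Because the digest on this page is abbreviated with "…", the sketch can only compare the visible prefix and suffix, which is a weaker check than comparing the full 64-character digest.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_seal(hex_digest: str, seal: str) -> bool:
    """Compare a digest to a seal.

    A seal containing '…' is abbreviated, so only its visible prefix
    and suffix are compared; a full seal must match exactly.
    """
    hex_digest, seal = hex_digest.lower(), seal.lower()
    if "…" in seal:
        prefix, _, suffix = seal.partition("…")
        return hex_digest.startswith(prefix) and hex_digest.endswith(suffix)
    return hex_digest == seal


# Hypothetical usage; "citadel_e2_corpus.jsonl" is an assumed file name.
# ok = matches_seal(sha256_of_file(Path("citadel_e2_corpus.jsonl")),
#                   "a6a98dbb…dab99d81")
```

Where the full digest is published alongside the corpus, pass it in unabbreviated so the comparison is exact.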

SENTINEL · 94.0%

Tests: security operations center alert triage — classify and prioritize SOC alerts with the HEIMDALL confidence gate.

Means: held-out accuracy on the HEIMDALL classifier. No LLM at runtime. Honest refusal counted in the methodology.

ORACLE · 51%

Tests: cross-domain factual verification against a sealed knowledge base — verified, refuted, or refused.

Means: 51% overall vs 31% / 25% always-confident baselines. Refusals on 67.5% of the 200-claim corpus counted correctly. Corpus SHA-256: cd5de198…1911ad2.
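To illustrate why an always-confident baseline can score low on a three-outcome task, here is a small sketch. It assumes each claim carries a gold label of "verified", "refuted", or "refused" and that a prediction counts as correct only when it matches that label; this scoring rule is an assumption for illustration, not the published ORACLE methodology, and the data is a made-up toy set.

```python
def accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of claims whose predicted outcome matches the gold outcome."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical toy labels, not drawn from the ORACLE corpus.
gold = ["verified", "refuted", "refused", "refused", "refused", "verified"]
system = ["verified", "refused", "refused", "refused", "refused", "verified"]

# An always-confident baseline never refuses, so under this scoring rule
# it is wrong on every claim whose gold outcome is "refused".
always_verified = ["verified"] * len(gold)

print(accuracy(gold, system))
print(accuracy(gold, always_verified))
```

Under this assumed rule, the more of the corpus that calls for refusal, the lower the ceiling for any system that always answers.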

LENS · 25×

Tests: deterministic semantic search over codebases — intent-based retrieval without embedding models at runtime.

Means: 25× improvement vs grep on intent queries (P@5 0.250, deterministic, on the published task).
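P@5 is precision over the top five results returned for a query. A minimal sketch of the metric follows; the function name, file paths, and relevance set are hypothetical and do not come from the LENS task.

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Share of the top-k ranked results that are relevant.

    The denominator is always k, so a result list shorter than k
    is penalized for the missing slots.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical ranking for one intent query, e.g. "where are sessions created?"
ranked = ["auth/session.py", "auth/token.py", "db/pool.py", "ui/login.py", "util/log.py"]
relevant = {"auth/session.py", "ui/login.py"}

print(precision_at_k(ranked, relevant))  # prints 0.4
```

A benchmark-level P@5 would typically be the mean of this value over all queries in the task.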

COMPASS · 15/15

Tests: document reading-level calibration — does output land within one tier of target?

Means: 15/15 within-one-tier on the sealed calibration set.
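The within-one-tier criterion can be sketched as a simple distance check. This assumes reading levels are encoded as integers on an ordered scale, which the page does not specify; the names and sample pairs below are illustrative only.

```python
def within_one_tier(predicted_tier: int, target_tier: int) -> bool:
    """True when the predicted tier is at most one step from the target."""
    return abs(predicted_tier - target_tier) <= 1


def calibration_score(pairs: list[tuple[int, int]]) -> tuple[int, int]:
    """Return (hits, total) for a list of (predicted, target) tier pairs."""
    hits = sum(within_one_tier(predicted, target) for predicted, target in pairs)
    return hits, len(pairs)


# Hypothetical (predicted, target) pairs, not the sealed calibration set.
sample = [(3, 3), (4, 5), (2, 4)]
hits, total = calibration_score(sample)
print(f"{hits}/{total}")  # prints 2/3
```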

Memory · RAVEN v1.1

MUNINN · F1 0.847

Tests: memory validation — contradiction detection, importance ranking, honest refusal in memory pipelines.

Means: validation F1 0.847, recall 0.921 on the published memory-validation corpus.

DECAY · 100%

Tests: per-class memory decay — facts, preferences, and identity claims age on different clocks.

Means: 100% decay-aware recall across 310 queries.

REFUSAL · 100%

Tests: structured memory refusal — five typed reasons with recommended actions and audit hashes.

Means: 100% precision on typed refusal taxonomy.

Posture

No runtime LLM in the decision path. Honest refusals are counted, not hidden. Reproduce from published corpus seals — not from our word.

