Benchmarks

The receipt wall, explained.

What each benchmark tests, what the number means, and the number itself. No runtime LLM in the decision path. SIGNAL is PHAROS’s benchmark; VANTAGE is OMNIS’s code-audit receipt: benchmarks on the glass, products in the nav.

Primary seven

VANTAGE 9/9

Tests: deterministic code audit via reject-first verification on sealed fixture repos (package hygiene, runtime danger, architecture shape, duplicate families, false-alarm discipline).

Means: all nine VANTAGE 2.0 fixture suites cleared · 15/15 expected findings recovered · 0 forbidden false-positive classes · 0 severity mismatches. Product = OMNIS / VANTAGE; this is the internal product-validation receipt, not a public held-out F1 claim.

SIGNAL F1 0.639

Tests: pharmacovigilance adverse drug event (ADE) extraction from clinical narratives and reports.

Means: harmonic mean of precision (0.712) and recall (0.580) on the sealed COSMIC task, with a 24.3-month median detection window. Product = PHAROS; SIGNAL is the benchmark receipt.
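The SIGNAL figure can be checked by hand from the precision and recall quoted above. A minimal sketch, assuming the standard F1 definition; `f1_score` is an illustrative helper name, not part of any PHAROS tooling:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for SIGNAL above.
signal_f1 = f1_score(0.712, 0.580)
print(f"{signal_f1:.3f}")  # prints 0.639
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the score sits closer to the recall than to the precision.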

CITADEL F1 0.616

Tests: financial-compliance entity extraction, specifically corporate subsidiary hierarchy reconstruction from SEC Exhibit 21 filings.

Means: F1 on a 400-entity corpus with SHA-verified ground truth. Checkpoint arc E → E.2 documented. Corpus SHA-256 (E.2): a6a98dbb…dab99d81.
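A seal printed in the shortened `prefix…suffix` form can still be compared against a local copy of a corpus. The sketch below assumes the corpus is hashed as one byte string; `matches_truncated_seal` is a hypothetical helper, and the demonstration uses the empty byte string (whose SHA-256 digest is well known) rather than any real corpus:

```python
import hashlib


def matches_truncated_seal(data: bytes, seal: str, gap: str = "…") -> bool:
    """Check a SHA-256 digest against a seal published as 'prefix…suffix'."""
    digest = hashlib.sha256(data).hexdigest()
    # With no gap marker present, the whole seal is treated as the prefix.
    prefix, _, suffix = seal.lower().partition(gap)
    return digest.startswith(prefix) and digest.endswith(suffix)


# Demonstration on the empty byte string.
print(matches_truncated_seal(b"", "e3b0c442…7852b855"))  # prints True
```

A shortened seal only verifies the characters that are shown, so comparing against a full published digest is the stronger check.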

SENTINEL 94.0%

Tests: security operations center (SOC) alert triage, classifying and prioritizing alerts with the HEIMDALL confidence gate.

Means: held-out accuracy on the HEIMDALL classifier. No LLM at runtime. Honest refusal counted in the methodology.

ORACLE 51%

Tests: cross-domain factual verification against a sealed knowledge base, with each claim verified, refuted, or refused.

Means: 51% overall vs 31% / 25% always-confident baselines. Refusals on 67.5% of the 200-claim corpus counted correctly. Corpus SHA-256: cd5de198…1911ad2.
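One way to see why an always-confident baseline scores poorly is to treat refusal as a third label alongside verified and refuted. The sketch below assumes that scoring scheme, which may differ from the actual ORACLE methodology; the claims are toy data, not drawn from the corpus:

```python
def three_way_accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of claims where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels for illustration only.
gold = ["refused", "verified", "refused", "refuted", "refused"]
always_verified = ["verified"] * len(gold)  # a baseline that never refuses
print(three_way_accuracy(gold, always_verified))  # prints 0.2
```

Under this scheme a system that never refuses cannot earn credit on any claim whose correct outcome is a refusal.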

LENS 25×

Tests: deterministic semantic search over codebases, meaning intent-based retrieval without embedding models at runtime.

Means: 25× improvement vs grep on intent queries (deterministic P@5 of 0.250 on the published task).
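P@5 is precision over the top five results for a query. A minimal sketch of the metric, assuming the usual definition with a fixed denominator of k; the file names are hypothetical:

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Share of the top-k results that are relevant (the denominator is always k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Toy query: one of the five returned files is relevant.
ranked = ["auth.py", "db.py", "cache.py", "retry.py", "log.py"]
relevant = {"retry.py", "backoff.py"}
print(precision_at_k(ranked, relevant))  # prints 0.2
```

A benchmark-level figure would typically be the mean of this value across all queries in the task.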

COMPASS 15/15

Tests: document reading-level calibration. Does the output land within one tier of the target?

Means: 15/15 within-one-tier on the sealed calibration set.
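The within-one-tier criterion can be sketched as a simple distance check. This assumes reading tiers are encoded as consecutive integers, which is an illustrative choice; both function names are hypothetical:

```python
def within_one_tier(target_tier: int, output_tier: int) -> bool:
    """True when the output lands on the target tier or an adjacent one."""
    return abs(target_tier - output_tier) <= 1


def calibration_score(pairs: list[tuple[int, int]]) -> str:
    """Format hits over total, e.g. '15/15', for (target, output) tier pairs."""
    hits = sum(within_one_tier(target, output) for target, output in pairs)
    return f"{hits}/{len(pairs)}"


# Toy pairs: an exact hit, an adjacent hit, and a three-tier miss.
print(calibration_score([(3, 3), (3, 4), (5, 2)]))  # prints 2/3
```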

Memory · RAVEN v1.1

MUNINN F1 0.847

Tests: memory validation, covering contradiction detection, importance ranking, and honest refusal in memory pipelines.

Means: validation F1 0.847, recall 0.921 on the published memory-validation corpus.

DECAY 100%

Tests: per-class memory decay, where facts, preferences, and identity claims age on different clocks.

Means: 100% decay-aware recall across 310 queries.
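"Different clocks" can be pictured as a separate half-life per memory class. The sketch below assumes exponential decay and made-up half-lives; neither the decay curve nor the values are stated for RAVEN:

```python
# Hypothetical half-lives in days; the real per-class clocks are not given here.
HALF_LIFE_DAYS = {"fact": 365.0, "preference": 90.0, "identity": 3650.0}


def decayed_weight(memory_class: str, age_days: float) -> float:
    """Exponential decay: the weight halves every half-life for the given class."""
    half_life = HALF_LIFE_DAYS[memory_class]
    return 0.5 ** (age_days / half_life)


# The same age yields different weights depending on the class.
print(decayed_weight("preference", 90.0))  # prints 0.5
print(decayed_weight("fact", 90.0) > decayed_weight("preference", 90.0))  # prints True
```

A decay-aware recall step could then rank or filter memories by this weight, so that a 90-day-old preference is trusted less than a 90-day-old fact.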

REFUSAL 100%

Tests: structured memory refusal, using five typed reasons with recommended actions and audit hashes.

Means: 100% precision on the typed refusal taxonomy.
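A typed refusal with a recommended action and an audit hash might be shaped roughly as follows. This is a sketch under stated assumptions: the five reason names, the field names, and the choice of SHA-256 over canonical JSON are all hypothetical, since the actual taxonomy is not listed above:

```python
import hashlib
import json
from dataclasses import dataclass
from enum import Enum


class RefusalReason(Enum):
    # Five hypothetical reason names standing in for the real taxonomy.
    CONTRADICTED = "contradicted"
    STALE = "stale"
    LOW_CONFIDENCE = "low_confidence"
    UNVERIFIABLE = "unverifiable"
    OUT_OF_SCOPE = "out_of_scope"


@dataclass(frozen=True)
class Refusal:
    reason: RefusalReason
    recommended_action: str
    query: str

    def audit_hash(self) -> str:
        """SHA-256 over a canonical JSON form, so identical refusals hash identically."""
        payload = json.dumps(
            {"reason": self.reason.value, "action": self.recommended_action, "query": self.query},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


refusal = Refusal(RefusalReason.STALE, "re-confirm with the user", "favorite editor?")
print(refusal.reason.value, len(refusal.audit_hash()))  # prints stale 64
```

Sorting the JSON keys keeps the hash stable regardless of field order, which is what makes it usable as an audit reference.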

Posture

No runtime LLM in the decision path. Honest refusals are counted, not hidden. Reproduce from published corpus seals, not from our word.