Submission harness · v0 scaffold · v1 in progress

Submit a reader.

Same evidence. Different readers. Same judge.

The Warrant Reader Leaderboard is a frozen-retrieval benchmark: every submission consumes the identical 500-question evidence file (R@5 = 96.2%), uses the canonical prompt template, and is judged by the same GPT-4o K=5 3-of-5 majority-vote protocol. Reader quality is the variable; retrieval, prompt, and judge are held fixed. The protocol is stable. The runner stubs and schema are below. The frozen retrieval artifact is in final QA — email for early access.

Schema: published · Prompt: published · Frozen artifact: pending · Judge service: pending · GitHub repo: pending

Five-step submission flow.

The CLI shape and JSONL format are stable: a v0 row written today will be wire-compatible with v1. The internal calls (reader backend, judge call) are stubs in the v0 runner; v1 wires them against the public OpenAI API and configurable HuggingFace / vLLM backends.
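
As a rough illustration of what wire-compatible means here (every field name below is hypothetical; example_submission_row.json and submission_schema.json under Public artifacts are the authoritative shapes), a reader-output JSONL row is one JSON object per question:

import json

# Hypothetical per-question row; real field names come from the published
# schema, not from this sketch.
row = {"qid": "q0042", "qtype": "temporal", "answer": "March 2021", "refused": False}
print(json.dumps(row))  # one object per line = one JSONL row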

01

Install & fetch artifacts

Clone the harness, install Python deps, and fetch the frozen retrieval artifact. The fetcher verifies SHA-256 against the published manifest before it lets the reader touch the bytes.

pip install -r runner/requirements.txt
python runner/fetch_artifacts.py
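
A minimal sketch of that integrity check, assuming a manifest field named "sha256" (the real field names ship with frozen_retrieval_manifest.json):

import hashlib
import json

def verify_artifact(artifact_path: str, manifest_path: str) -> None:
    # Stream in 1 MB chunks so the artifact never has to fit in one read.
    with open(manifest_path) as f:
        manifest = json.load(f)
    digest = hashlib.sha256()
    with open(artifact_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != manifest["sha256"]:
        raise ValueError(f"checksum mismatch: {artifact_path}")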
02

Run your reader

Point it at any HuggingFace model id, vLLM endpoint, or OpenAI / Anthropic / Mistral API. Reader URI schemes are documented in the runner.

python runner/run_reader.py \
  --reader hf://meta-llama/Llama-3.1-8B-Instruct \
  --artifact artifacts/frozen_retrieval_topK_500q.v1.jsonl \
  --prompt   artifacts/benchmark_prompt.v1.md \
  --out      submissions/llama-3.1-8b.jsonl
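
The scheme grammar lives in the runner docs; as a sketch only (scheme names other than hf:// are assumptions), dispatch on the reader URI can be as small as:

from urllib.parse import urlparse

def parse_reader_uri(uri: str) -> tuple[str, str]:
    # "hf://meta-llama/Llama-3.1-8B-Instruct" -> ("hf", "meta-llama/Llama-3.1-8B-Instruct")
    parsed = urlparse(uri)
    backend, target = parsed.scheme, parsed.netloc + parsed.path
    if backend not in {"hf", "vllm", "openai", "anthropic", "mistral"}:
        raise ValueError(f"unknown reader scheme: {backend}")
    return backend, target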
03

Judge with GPT-4o K=5

K=5 GPT-4o seeds, 3-of-5 majority vote. ~$3 per submission at current OpenAI list prices. Or ship pre-computed judge logs and we’ll verify them.

python runner/judge.py \
  --submission submissions/llama-3.1-8b.jsonl \
  --out        submissions/llama-3.1-8b.judged.jsonl
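
The vote itself is small enough to state in code. A sketch of the 3-of-5 aggregation, assuming each judge seed reduces to one boolean verdict per question:

def majority_vote(verdicts: list[bool], threshold: int = 3) -> bool:
    # K=5 seeds; a question counts as correct iff at least 3 verdicts agree.
    assert len(verdicts) == 5, "expected one verdict per judge seed"
    return sum(verdicts) >= threshold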
04

Score & emit a row

The aggregator computes overall accuracy, a 95% Wilson confidence interval, refusal rate, and a per-qtype breakdown. The output validates against submission_schema.json.

python runner/score.py \
  --judged submissions/llama-3.1-8b.judged.jsonl \
  --out    submissions/llama-3.1-8b.row.json
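
The Wilson interval is textbook statistics; a sketch of the 95% CI (z = 1.96), not necessarily the harness's exact code:

import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # Wilson score interval: better behaved than the normal approximation
    # near accuracy 0 or 1, and tight at n = 500.
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)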
05

Submit the row

Email the *.row.json plus judge logs to the maintainer. We re-run the judge on a 5% sample to verify, then ingest the row into the leaderboard manifest.

contact@manifoldmemory.ai
subject: Warrant Reader Leaderboard — submission

Public artifacts.

Everything the protocol needs is mirrored at manifoldmemory.ai/warrant-leaderboard/. The frozen retrieval JSONL and judge service are pending publication; the rest is live.

submission_schema.json
Live
JSON-Schema (draft 2020-12) for one leaderboard row. Every existing row validates against this schema.
benchmark_prompt.v1.md
Live
Canonical prompt template. Reader instructions are yours; the evidence-binding scaffold is fixed.
example_submission_row.json
Live
A worked example submission row. Useful for sanity-checking your scorer output before you email; a minimal validation sketch follows this list.
leaderboard.json
Live
Current standings as a single JSON manifest. Versioned, includes the frozen retrieval contract, R@5, and every row with full per-qtype breakdown.
README.md
Live
Protocol summary, track taxonomy, and the canonical CLI sequence.
runner/run_reader.py, judge.py, score.py
Live
CLI stubs. The argument shape and JSONL format are stable; the reader and judge backends are wired in v1.
frozen_retrieval_topK_500q.v1.jsonl
Pending
~12 MB JSONL: 500 questions, each with its top-10 chunks, qtype, and gold label. In final QA; email contact@manifoldmemory.ai for early access.
frozen_retrieval_manifest.json
Pending
Manifest with SHA-256, size, retrieval pipeline definition, and R@5 receipt. Ships alongside the JSONL.
Hosted judge endpoint
Pending
Optional: a maintainer-hosted GPT-4o K=5 judge service for submitters who don’t want to run the judge themselves. Ships with v1.
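
Rows can be sanity-checked locally before they are emailed. A minimal sketch assuming the jsonschema package (submission_schema.json is draft 2020-12, which Draft202012Validator targets):

import json
from jsonschema import Draft202012Validator

with open("submission_schema.json") as f:
    schema = json.load(f)
with open("submissions/llama-3.1-8b.row.json") as f:
    row = json.load(f)
Draft202012Validator(schema).validate(row)  # raises ValidationError on failure
print("row validates against submission_schema.json")
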
Read this first

This is a reader-only benchmark. Submissions cannot bring their own retrieval — that’s the entire point. If your system performs additional retrieval beyond the supplied 10 chunks, does multi-pass self-critique that hits an LLM more than twice per question, or makes external tool calls, it is out of scope for this leaderboard. Agent harnesses are a separate problem and we will not pretend the comparison is apples-to-apples.

Tracks.

Open-weights rows are ranked. Closed/API rows ship as a reference track — same retrieval contract, same judge, but reproducibility is limited to whatever the API serves on the run date.

canonical
Ranked
The number the maintainers stand behind publicly. Reserved; does not accept community submissions.
Maintainers only
experimental
Ranked
Routed or prompt variants. Allowed on the board but not the public headline. Must declare router and prompt deltas in notes.
Open for submissions
open-swap
Ranked
Same canonical pipeline, different open-weights reader. The natural slot for new HF model releases.
Open for submissions
retrieval-only
Ranked
The retrieved stack is handed straight to the reader. No qtype routing, no full-context fallback.
Open for submissions
no-retrieval
Ranked
Reader receives the full LME-S 110K-token haystack and zero retrieved chunks. Floor row.
Open for submissions
closed-reference
Reference
Closed-weights API reader on the identical retrieval contract. Not ranked; reproducibility limited to the run date.
Open for submissions
Early access
Want to be a v0 row before the harness ships publicly?

Email the maintainer with the reader you want to evaluate. We will share the frozen retrieval artifact, run the canonical prompt + judge, and ingest your row into the next manifest update. First five submissions get top-of-page acknowledgement.

Request early access →