Submission harness · v0 scaffold · v1 in progress

Submit a reader.

Same evidence. Different readers. Same judge.

The Warrant Reader Leaderboard is a frozen-retrieval benchmark: every submission consumes the identical 500-question evidence file (R@5 = 96.2%), uses the canonical prompt template, and is judged by the same GPT-4o K=5 3-of-5 majority-vote protocol. Reader quality is the variable; retrieval, prompt, and judge are held fixed. The protocol is stable. The runner stubs and schema are below. The frozen retrieval artifact is in final QA — email for early access.

Schema: published · Prompt: published · Frozen artifact: pending · Judge service: pending · GitHub repo: pending

Five-step submission flow.

The CLI shape and JSONL format are stable: a v0 row written today will be wire-compatible with v1. The internal calls (reader backend, judge call) are stubs in the v0 runner; v1 wires them against the public OpenAI API and configurable HuggingFace / vLLM backends.
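
As a rough illustration of what wire-compatible means here (every field name below is hypothetical; example_submission_row.json and submission_schema.json under Public artifacts are the authoritative shapes), a reader-output JSONL row is one JSON object per question:

import json

# Hypothetical per-question row; real field names come from the published
# schema, not from this sketch.
row = {"qid": "q0042", "qtype": "temporal", "answer": "March 2021", "refused": False}
print(json.dumps(row))  # one object per line = one JSONL row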

01

Install & fetch artifacts

Clone the harness, install Python deps, and fetch the frozen retrieval artifact. The fetcher verifies SHA-256 against the published manifest before it lets the reader touch the bytes.

pip install -r runner/requirements.txt
python runner/fetch_artifacts.py
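
A minimal sketch of that integrity check, assuming a manifest field named "sha256" (the real field names ship with frozen_retrieval_manifest.json):

import hashlib
import json

def verify_artifact(artifact_path: str, manifest_path: str) -> None:
    # Stream in 1 MB chunks so the artifact never has to fit in one read.
    with open(manifest_path) as f:
        manifest = json.load(f)
    digest = hashlib.sha256()
    with open(artifact_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != manifest["sha256"]:
        raise ValueError(f"checksum mismatch: {artifact_path}")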
02

Run your reader

Point it at any HuggingFace model id, vLLM endpoint, or OpenAI / Anthropic / Mistral API. Reader URI schemes are documented in the runner.

python runner/run_reader.py \
  --reader hf://meta-llama/Llama-3.1-8B-Instruct \
  --artifact artifacts/frozen_retrieval_topK_500q.v1.jsonl \
  --prompt   artifacts/benchmark_prompt.v1.md \
  --out      submissions/llama-3.1-8b.jsonl
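
The scheme grammar lives in the runner docs; as a sketch only (scheme names other than hf:// are assumptions), dispatch on the reader URI can be as small as:

from urllib.parse import urlparse

def parse_reader_uri(uri: str) -> tuple[str, str]:
    # "hf://meta-llama/Llama-3.1-8B-Instruct" -> ("hf", "meta-llama/Llama-3.1-8B-Instruct")
    parsed = urlparse(uri)
    backend, target = parsed.scheme, parsed.netloc + parsed.path
    if backend not in {"hf", "vllm", "openai", "anthropic", "mistral"}:
        raise ValueError(f"unknown reader scheme: {backend}")
    return backend, target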
03

Judge with GPT-4o K=5

K=5 GPT-4o seeds, 3-of-5 majority vote. ~$3 per submission at current OpenAI list prices. Or ship pre-computed judge logs and we’ll verify them.

python runner/judge.py \
  --submission submissions/llama-3.1-8b.jsonl \
  --out        submissions/llama-3.1-8b.judged.jsonl
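
The vote itself is small enough to state in code. A sketch of the 3-of-5 aggregation, assuming each judge seed reduces to one boolean verdict per question:

def majority_vote(verdicts: list[bool], threshold: int = 3) -> bool:
    # K=5 seeds; a question counts as correct iff at least 3 verdicts agree.
    assert len(verdicts) == 5, "expected one verdict per judge seed"
    return sum(verdicts) >= threshold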
04

Score & emit a row

The aggregator computes overall accuracy, a 95% Wilson confidence interval, refusal rate, and a per-qtype breakdown. The output validates against submission_schema.json.

python runner/score.py \
  --judged submissions/llama-3.1-8b.judged.jsonl \
  --out    submissions/llama-3.1-8b.row.json
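
The Wilson interval is textbook statistics; a sketch of the 95% CI (z = 1.96), not necessarily the harness's exact code:

import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # Wilson score interval: better behaved than the normal approximation
    # near accuracy 0 or 1, and tight at n = 500.
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)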
05

Submit the row

Email the *.row.json plus judge logs to the maintainer. We re-run the judge on a 5% sample to verify, then ingest the row into the leaderboard manifest.

contact@manifoldmemory.ai
subject: Warrant Reader Leaderboard — submission

Public artifacts.

Everything the protocol needs is mirrored at manifoldmemory.ai/warrant-leaderboard/. The frozen retrieval JSONL and judge service are pending publication; the rest is live.

submission_schema.json
Live
JSON-Schema (draft 2020-12) for one leaderboard row. Every existing row validates against this schema.
benchmark_prompt.v1.md
Live
Canonical prompt template. Reader instructions are yours; the evidence-binding scaffold is fixed.
example_submission_row.json
Live
A worked example submission row. Useful for sanity-checking your scorer output before you email; a minimal validation sketch follows this list.
leaderboard.json
Live
Current standings as a single JSON manifest. Versioned, includes the frozen retrieval contract, R@5, and every row with full per-qtype breakdown.
README.md
Live
Protocol summary, track taxonomy, and the canonical CLI sequence.
runner/run_reader.py, judge.py, score.py
Live
CLI stubs. The argument shape and JSONL format are stable; the reader and judge backends are wired in v1.
frozen_retrieval_topK_500q.v1.jsonl
Pending
~12 MB JSONL: 500 questions, each with its top-10 chunks, qtype, and gold label. In final QA; email contact@manifoldmemory.ai for early access.
frozen_retrieval_manifest.json
Pending
Manifest with SHA-256, size, retrieval pipeline definition, and R@5 receipt. Ships alongside the JSONL.
Hosted judge endpoint
Pending
Optional: a maintainer-hosted GPT-4o K=5 judge service for submitters who don’t want to run the judge themselves. Ships with v1.
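
Rows can be sanity-checked locally before they are emailed. A minimal sketch assuming the jsonschema package (submission_schema.json is draft 2020-12, which Draft202012Validator targets):

import json
from jsonschema import Draft202012Validator

with open("submission_schema.json") as f:
    schema = json.load(f)
with open("submissions/llama-3.1-8b.row.json") as f:
    row = json.load(f)
Draft202012Validator(schema).validate(row)  # raises ValidationError on failure
print("row validates against submission_schema.json")
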
Read this first

This is a reader-only benchmark. Submissions cannot bring their own retrieval — that’s the entire point. If your system performs additional retrieval beyond the supplied 10 chunks, does multi-pass self-critique that hits an LLM more than twice per question, or makes external tool calls, it is out of scope for this leaderboard. Agent harnesses are a separate problem and we will not pretend the comparison is apples-to-apples.

Tracks.

Open-weights rows are ranked. Closed/API rows ship as a reference track — same retrieval contract, same judge, but reproducibility is limited to whatever the API serves on the run date.

canonical
Ranked
The number the maintainers stand behind publicly. Reserved; does not accept community submissions.
Maintainers only
experimental
Ranked
Routed or prompt variants. Allowed on the board but not the public headline. Must declare router and prompt deltas in notes.
Open for submissions
open-swap
Ranked
Same canonical pipeline, different open-weights reader. The natural slot for new HF model releases.
Open for submissions
retrieval-only
Ranked
The retrieved stack is handed straight to the reader. No qtype routing, no full-context fallback.
Open for submissions
no-retrieval
Ranked
Reader receives the full LME-S 110K-token haystack and zero retrieved chunks. Floor row.
Open for submissions
closed-reference
Reference
Closed-weights API reader on the identical retrieval contract. Not ranked; reproducibility limited to the run date.
Open for submissions
Early access
Want to be a v0 row before the harness ships publicly?

Email the maintainer with the reader you want to evaluate. We will share the frozen retrieval artifact, run the canonical prompt + judge, and ingest your row into the next manifest update. First five submissions get top-of-page acknowledgement.

Request early access →