# Warrant Reader Leaderboard — Canonical Prompt v1

This is the prompt family every submission must use. Reader-side prompt engineering is allowed _within_ this template (the `<>` block is yours to fill), but the surrounding evidence-binding scaffold is fixed. The intent: rule out the gain from "they wrote a better prompt" and isolate "this reader handles the same evidence better."

## System

```
You are a careful reader. You will receive a user question and exactly 10
candidate evidence chunks retrieved from a private corpus.

Use ONLY the evidence chunks to answer. If the chunks do not contain enough
information to answer, refuse with the literal token: REFUSE. Do not
speculate. Do not draw on outside knowledge.

Cite the chunk indices you used as a JSON array on the final line:
e.g. [3, 7]. If you refuse, the cite array MUST be empty: [].

<>
```

## User

```
Question: {question}

Evidence (10 chunks, ranked by frozen-retrieval score):
[1] {chunk_1.text}
[2] {chunk_2.text}
...
[10] {chunk_10.text}

Answer the question using only these chunks. End with the cite array.
```

## Allowed `<>` extensions

- Output formatting hints (e.g. "Answer in one sentence.")
- Calibration cues (e.g. "If unsure, refuse.")
- Few-shot examples drawn from a held-out set (we will publish 5 SSA / 5 MSA / 5 TR / 5 KU / 5 MR examples that are _not_ in the 500-question test set)

## Disallowed

- Any system prompt that hints at the qtype of an individual question (qtype is the model's job to infer; the canonical row uses oracle qtype routing _explicitly disclosed_ — submissions are free to do the same as long as they declare it in the `notes` field of the submission row)
- External tool calls (web search, calculator, code execution)
- Multi-pass self-critique that hits an LLM more than twice per question (this benchmark is **single-pass readers**; agent harnesses are out of scope by design)
- Any retrieval beyond the supplied 10 chunks

## Why this is fixed

LongMemEval has been shown to be sensitive to small prompt deltas — +5 percentage points is achievable just by reformatting the evidence block. Holding the prompt fixed across submissions is what lets us claim the spread is reader-side, not prompting-side.

`benchmark_prompt.v1` is hashed into the submission row's `notes` field implicitly: if you change anything outside `<>`, the row will be flagged "non-canonical prompt" on the page.
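The User template is mechanical to assemble. A minimal sketch in Python, assuming chunks arrive as plain strings in frozen-retrieval rank order; `build_user_prompt` is illustrative, not the official harness:

```python
def build_user_prompt(question: str, chunks: list[str]) -> str:
    """Render the canonical User message: the question, then exactly 10
    evidence chunks numbered [1]..[10] in frozen-retrieval rank order."""
    if len(chunks) != 10:
        raise ValueError(f"expected exactly 10 chunks, got {len(chunks)}")
    evidence = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        f"Question: {question}\n\n"
        "Evidence (10 chunks, ranked by frozen-retrieval score):\n"
        f"{evidence}\n\n"
        "Answer the question using only these chunks. End with the cite array."
    )
```

Raising on a non-10 chunk count keeps the "exactly 10 candidate evidence chunks" promise in the System prompt honest; silently padding or truncating would make the scaffold lie to the reader.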
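The output contract (literal `REFUSE` token, JSON cite array on the final line, empty array on refusal) is easy to check automatically. A minimal sketch, assuming Python; `validate_reader_output` and its error strings are illustrative, not the leaderboard's actual validator:

```python
import json

REFUSE_TOKEN = "REFUSE"

def validate_reader_output(raw: str) -> tuple[bool, str]:
    """Check a reader's raw completion against the canonical output contract:
    the final line must be a JSON array of chunk indices in 1..10, and a
    refusal must carry an empty cite array."""
    lines = raw.strip().splitlines()
    if not lines:
        return False, "empty output"
    try:
        cites = json.loads(lines[-1])
    except json.JSONDecodeError:
        return False, "final line is not valid JSON"
    if not isinstance(cites, list) or not all(isinstance(i, int) for i in cites):
        return False, "cite array must be a JSON array of integers"
    if any(i < 1 or i > 10 for i in cites):
        return False, "cite indices must be in 1..10"
    answer = "\n".join(lines[:-1]).strip()
    if answer == REFUSE_TOKEN and cites:
        return False, "refusal must have an empty cite array"
    return True, "ok"
```

Note the check is deliberately permissive about the answer body itself; whether an answer must cite at least one chunk is not stated above, so this sketch does not enforce it.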
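One way the "hashed implicitly" check could work: cut the submitter's `<>` content back out of the assembled system prompt and hash what remains, so two submissions that differ only inside `<>` hash identically. A hedged sketch under that assumption; `scaffold_digest`, `is_canonical`, and the verbatim-substitution convention are all hypothetical, not the leaderboard's published mechanism:

```python
import hashlib

# The fixed scaffold, with the literal "<>" placeholder still in place.
SCAFFOLD = (
    "You are a careful reader. You will receive a user question and exactly 10 "
    "candidate evidence chunks retrieved from a private corpus.\n\n"
    "Use ONLY the evidence chunks to answer. If the chunks do not contain enough "
    "information to answer, refuse with the literal token: REFUSE. Do not "
    "speculate. Do not draw on outside knowledge.\n\n"
    "Cite the chunk indices you used as a JSON array on the final line: "
    "e.g. [3, 7]. If you refuse, the cite array MUST be empty: [].\n\n"
    "<>"
)

def scaffold_digest(filled_prompt: str, ext_block: str) -> str:
    """Restore the scaffold by replacing the submitter's declared <> content
    with the literal placeholder, then hash it. Assumes the extension was
    substituted verbatim for "<>" exactly once; an empty extension means the
    placeholder line was left as-is."""
    restored = filled_prompt.replace(ext_block, "<>", 1) if ext_block else filled_prompt
    return hashlib.sha256(restored.encode("utf-8")).hexdigest()

def is_canonical(filled_prompt: str, ext_block: str) -> bool:
    """True iff the prompt matches benchmark_prompt.v1 outside the <> block."""
    canonical = hashlib.sha256(SCAFFOLD.encode("utf-8")).hexdigest()
    return scaffold_digest(filled_prompt, ext_block) == canonical
```

Any edit outside `<>` (say, quietly changing "exactly 10" to "exactly 12") changes the restored scaffold and so the digest, which is exactly the "non-canonical prompt" flag described above.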