# Warrant Reader Leaderboard — Canonical Prompt v1

This is the prompt family every submission must use. Reader-side prompt engineering is allowed _within_ this template (the `<>` block is yours to fill), but the surrounding evidence-binding scaffold is fixed. The intent: rule out the gain from "they wrote a better prompt" and isolate "this reader handles the same evidence better."

## System

```
You are a careful reader. You will receive a user question and exactly 10
candidate evidence chunks retrieved from a private corpus.

Use ONLY the evidence chunks to answer. If the chunks do not contain enough
information to answer, refuse with the literal token: REFUSE. Do not
speculate. Do not draw on outside knowledge.

Cite the chunk indices you used as a JSON array on the final line:
e.g. [3, 7]. If you refuse, the cite array MUST be empty: [].

<>
```

## User

```
Question: {question}

Evidence (10 chunks, ranked by frozen-retrieval score):
[1] {chunk_1.text}
[2] {chunk_2.text}
...
[10] {chunk_10.text}

Answer the question using only these chunks. End with the cite array.
```

## Allowed `<>` extensions

- Output formatting hints (e.g. "Answer in one sentence.")
- Calibration cues (e.g. "If unsure, refuse.")
- Few-shot examples drawn from a held-out set (we will publish 5 SSA / 5 MSA / 5 TR / 5 KU / 5 MR examples that are _not_ in the 500-question test set)

## Disallowed

- Any system prompt that hints at the qtype of an individual question (qtype is the model's job to infer; the canonical row uses oracle qtype routing _explicitly disclosed_ — submissions are free to do the same as long as they declare it in the `notes` field of the submission row)
- External tool calls (web search, calculator, code execution)
- Multi-pass self-critique that hits an LLM more than twice per question (this benchmark is **single-pass readers**; agent harnesses are out of scope by design)
- Any retrieval beyond the supplied 10 chunks

## Why this is fixed

LongMemEval has been shown to be sensitive to small prompt deltas — +5 percentage points is achievable just by reformatting the evidence block. Holding the prompt fixed across submissions is what lets us claim the spread is reader-side, not prompting-side.

`benchmark_prompt.v1` is hashed into the submission row's `notes` field implicitly: if you change anything outside `<>`, the row will be flagged "non-canonical prompt" on the page.
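The User template is mechanical to assemble. A minimal sketch in Python, assuming chunks arrive as plain strings in frozen-retrieval rank order; `build_user_prompt` is illustrative, not the official harness:

```python
def build_user_prompt(question: str, chunks: list[str]) -> str:
    """Render the canonical User message: the question, then exactly 10
    evidence chunks numbered [1]..[10] in frozen-retrieval rank order."""
    if len(chunks) != 10:
        raise ValueError(f"expected exactly 10 chunks, got {len(chunks)}")
    evidence = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        f"Question: {question}\n\n"
        "Evidence (10 chunks, ranked by frozen-retrieval score):\n"
        f"{evidence}\n\n"
        "Answer the question using only these chunks. End with the cite array."
    )
```

Raising on a non-10 chunk count keeps the "exactly 10 candidate evidence chunks" promise in the System prompt honest; silently padding or truncating would make the scaffold lie to the reader.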
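The output contract (literal `REFUSE` token, JSON cite array on the final line, empty array on refusal) is easy to check automatically. A minimal sketch, assuming Python; `validate_reader_output` and its error strings are illustrative, not the leaderboard's actual validator:

```python
import json

REFUSE_TOKEN = "REFUSE"

def validate_reader_output(raw: str) -> tuple[bool, str]:
    """Check a reader's raw completion against the canonical output contract:
    the final line must be a JSON array of chunk indices in 1..10, and a
    refusal must carry an empty cite array."""
    lines = raw.strip().splitlines()
    if not lines:
        return False, "empty output"
    try:
        cites = json.loads(lines[-1])
    except json.JSONDecodeError:
        return False, "final line is not valid JSON"
    if not isinstance(cites, list) or not all(isinstance(i, int) for i in cites):
        return False, "cite array must be a JSON array of integers"
    if any(i < 1 or i > 10 for i in cites):
        return False, "cite indices must be in 1..10"
    answer = "\n".join(lines[:-1]).strip()
    if answer == REFUSE_TOKEN and cites:
        return False, "refusal must have an empty cite array"
    return True, "ok"
```

Note the check is deliberately permissive about the answer body itself; whether an answer must cite at least one chunk is not stated above, so this sketch does not enforce it.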
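One way the "hashed implicitly" check could work: cut the submitter's `<>` content back out of the assembled system prompt and hash what remains, so two submissions that differ only inside `<>` hash identically. A hedged sketch under that assumption; `scaffold_digest`, `is_canonical`, and the verbatim-substitution convention are all hypothetical, not the leaderboard's published mechanism:

```python
import hashlib

# The fixed scaffold, with the literal "<>" placeholder still in place.
SCAFFOLD = (
    "You are a careful reader. You will receive a user question and exactly 10 "
    "candidate evidence chunks retrieved from a private corpus.\n\n"
    "Use ONLY the evidence chunks to answer. If the chunks do not contain enough "
    "information to answer, refuse with the literal token: REFUSE. Do not "
    "speculate. Do not draw on outside knowledge.\n\n"
    "Cite the chunk indices you used as a JSON array on the final line: "
    "e.g. [3, 7]. If you refuse, the cite array MUST be empty: [].\n\n"
    "<>"
)

def scaffold_digest(filled_prompt: str, ext_block: str) -> str:
    """Restore the scaffold by replacing the submitter's declared <> content
    with the literal placeholder, then hash it. Assumes the extension was
    substituted verbatim for "<>" exactly once; an empty extension means the
    placeholder line was left as-is."""
    restored = filled_prompt.replace(ext_block, "<>", 1) if ext_block else filled_prompt
    return hashlib.sha256(restored.encode("utf-8")).hexdigest()

def is_canonical(filled_prompt: str, ext_block: str) -> bool:
    """True iff the prompt matches benchmark_prompt.v1 outside the <> block."""
    canonical = hashlib.sha256(SCAFFOLD.encode("utf-8")).hexdigest()
    return scaffold_digest(filled_prompt, ext_block) == canonical
```

Any edit outside `<>` (say, quietly changing "exactly 10" to "exactly 12") changes the restored scaffold and so the digest, which is exactly the "non-canonical prompt" flag described above.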