Research continuity brief · audited 2026-04-25 against recovered journal · v1.0

ManifoldMemory / Warrant

A self-contained continuity brief: scope, scientific arc, production architecture, current maturity, and what a continuation team should do next.

ManifoldMemory is a research program investigating whether natural-language memory can be represented and retrieved as geometry inside frozen autoencoder latent manifolds, rather than as raw text context or contrastively trained embeddings. Warrant is the productized private-corpus retrieval system built from that research.

The core claim is not that this replaces LLMs or beats every retriever. It is narrower and stronger: a reconstruction-autoencoder latent manifold can behave like a content-addressable memory for doc-anchored QA at million-candidate scale, and a small latent-native processor/reranker can navigate that manifold in ways complementary to standard dense retrieval.


Three scientific steps, one of them a falsification.

Each step is a paper. The first is a negative result that clears the path; the second discovers the substrate; the third engineers the navigator.
Step 1 · Negative result

Retrofitting compressed memory onto pretrained LLMs fails.

The Wrong-Latent Test asks: when a model appears to answer from compressed memory, is it actually reading the compressed latents, or answering from parametric knowledge? Two diagnostics probe this: a wrong-latent swap (replace the correct latent with an unrelated document's; performance should collapse if reading is real) and gate-trajectory analysis (the cross-attention gate should drift if the pathway is useful).

Across 46 experiments, covering 4 interface families instantiated in 18 configurations, 0.5B–72B reader scales, 3 encoder training regimes, and trainable-parameter counts spanning 17.7M–10.5B, every tested retrofit failed both diagnostics. The wrong-latent gap was statistically indistinguishable from zero (p > 0.6, n = 80); the gate parameter did not move.
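The paired structure of the diagnostic can be sketched in a few lines. Everything below is illustrative: the function name and the per-question scores are hypothetical, not taken from the actual harness.

```python
import math
import statistics

def wrong_latent_gap(correct_scores, swapped_scores):
    """Paired wrong-latent diagnostic: per-question score with the correct
    latent vs. with an unrelated document's latent swapped in. Returns the
    mean gap and a paired t statistic; a gap statistically indistinguishable
    from zero (as in all tested retrofits) means the model is not actually
    reading the latent."""
    diffs = [c - s for c, s in zip(correct_scores, swapped_scores)]
    mean_gap = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    t = mean_gap / (sd / math.sqrt(len(diffs))) if sd > 0 else 0.0
    return mean_gap, t

# Hypothetical per-question scores: the retrofit answers about equally well
# with the wrong latent, so the gap is ~0 and t is small.
correct = [0.62, 0.58, 0.61, 0.60, 0.59, 0.63]
swapped = [0.61, 0.59, 0.60, 0.61, 0.58, 0.62]
gap, t = wrong_latent_gap(correct, swapped)
```

A retrofit that genuinely reads its latents would show a large positive gap and a large t; the failure mode reported above is the opposite.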

Conclusion — not that compressed memory is impossible, but that pretrained-LLM retrofits do not learn to read opaque latents under the tested regimes. A different architecture is needed: a processor native to the latent space, not a pretrained language model trying to treat latents as a second language. Also: 49 % of questions in a standard document-QA pool were answerable from the LLM's own weights, motivating a contamination filter.
Step 2 · Core mechanism

The manifold itself is the memory.

The Manifold Is the Memory tests whether content-addressable memory can be realized by geometric navigation inside a frozen reconstruction-autoencoder latent manifold, without contrastive IR supervision. Architecture: a frozen 92M-parameter text autoencoder + a 30M-parameter latent-native processor trained from scratch on 197,578 GitHub-corpus QA pairs. The processor is not a text-generating LLM — it ingests document and question latents and outputs answer latents.

On 1,247,342 real in-distribution distractors: P@1 = 20.5 %, MRR = 0.250, median rank 101, lift-over-random 255,705×. From 197K → 1.25M candidates, P@1 drops only 0.4 pp. A max-scale audit at 3,719,845 real distractors shows P@1 degrading only 2.1 pp (0.3105 → 0.2895) over a 19× pool growth.
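These headline metrics follow directly from the gold document's rank per query. A minimal sketch (the example ranks are illustrative, not the real eval data; only the lift arithmetic uses the paper's numbers):

```python
def retrieval_metrics(gold_ranks, pool_size):
    """P@1, MRR, and lift-over-random from 1-indexed gold ranks.
    Random P@1 over a pool of N candidates is 1/N, so
    lift = observed P@1 / (1/N) = P@1 * N."""
    n = len(gold_ranks)
    p_at_1 = sum(1 for r in gold_ranks if r == 1) / n
    mrr = sum(1.0 / r for r in gold_ranks) / n
    lift = p_at_1 * pool_size
    return p_at_1, mrr, lift

# Illustrative ranks over a toy pool.
p1, mrr, lift = retrieval_metrics([1, 2, 4], pool_size=1000)

# The paper's lift: P@1 = 0.205 over 1,247,343 candidates
# (gold + 1,247,342 distractors) → ~255,705×.
paper_lift = 0.205 * 1_247_343
```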

Mechanism → scale link — retraining the autoencoder with a noise-perturbed reconstruction attractor objective sharpens the doc/question paired cosine-similarity gap from +0.095 → +0.164 (+73 %; Cohen's d = 0.80), and that mechanism improvement transfers to retrieval: at 197K, apples-to-apples P@1 rises 20.9 % → 28.8 % (+37.8 % relative).
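The relative improvements quoted above are plain ratio arithmetic; worked out with the numbers from the text:

```python
# Cosine-gap sharpening from attractor retraining: +0.095 → +0.164.
gap_rel = (0.164 - 0.095) / 0.095        # ≈ +0.73, i.e. the quoted +73 %

# Transfer to retrieval at the 197K pool: P@1 20.9 % → 28.8 %.
p1_rel = (0.288 - 0.209) / 0.209         # ≈ +0.378, i.e. +37.8 % relative
```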
Step 3 · The navigator

Engineering the processor at commercial scale.

Navigating the Manifold asks: given a frozen manifold that behaves like memory, what processor should navigate it? Three axes: processor architecture, training curriculum, autoencoder objective. A 30M Perceiver-IO processor strictly outperforms a parameter-matched dense transformer; scaling from 34M → 95M yields flat P@1 — the retrieval ceiling is set by the frozen manifold, not the processor.

The strongest training result is MixK curriculum (per-item K ∈ {1, 2, 4, 8}). It repairs a 26× long-context collapse: K=15 P@1 0.010 → 0.138 (14× lift), and improves single-document retrieval by +34.5 % at a 197K pool.
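A minimal sketch of the mixed-K batch construction, assuming the obvious setup (the paper specifies only per-item K ∈ {1, 2, 4, 8}; the function name, distractor sampling, and labeling scheme here are hypothetical):

```python
import random

def mixk_item(gold_doc, distractor_pool, rng, ks=(1, 2, 4, 8)):
    """Build one training item: the gold document plus K-1 sampled
    distractors, shuffled, with the gold index as the target. Varying K
    per item (rather than fixing it per run) is the curriculum change
    that repairs the long-context collapse at K=15."""
    k = rng.choice(ks)
    docs = [gold_doc] + rng.sample(distractor_pool, k - 1)
    rng.shuffle(docs)
    return docs, docs.index(gold_doc)

rng = random.Random(0)
docs, target = mixk_item("gold", [f"d{i}" for i in range(100)], rng)
```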

Critical negative result — good ideas do not compose. Hard-NCE alone improves K=1 retrieval by +11.9 %, but combined with MixK, full-pool P@1 collapses 0.351 → 0.118. CLOOB sharpens within-modality separation by 2.86× but collapses the cross-modal question–passage margin to +0.100, too low for 7.47M-scale distractor noise. Sharpness is not a single scalar: within-modality and cross-modal separation can move in opposite directions.

The oracle correction and why it matters.

Trust signal: a methodology audit corrected a major published claim before any external party found it. The architecture moved with the correction.

Earlier versions reported “MixK over full pool” numbers suggesting the latent processor could serve as a first-stage retriever. A later methodology audit found the harness was accidentally conditioned on the gold document latent at retrieval time. After removing the oracle, non-oracle first-stage P@1 collapses to 0.0430 at 7.77M — on par with NDN mean-pooling. The paper explicitly revises the claim: those original numbers are an oracle bi-encoder upper bound, not a shippable first-stage retriever.

The correction changes the product architecture. MixK is not the first-stage retriever. MixK is a reranker / navigator over a candidate set. The production-realistic system becomes:

BGE-large ∪ QNDN v0  →  MixK  →  evidence-bounded reader / refusal layer
Standard dense retrieval and a latent-native projection head supply candidates from independent geometries. A 30M Perceiver reranker compresses that candidate set into top-k evidence. A reader produces a cited answer or a refusal. No oracle. No conditioning leaks. Every stage independently auditable.
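The union stage itself is simple. A sketch of min-rank union with dedup (the function name is illustrative, not from the codebase):

```python
def min_rank_union(bge_ranked, qndn_ranked):
    """Merge two ranked candidate lists: each doc id keeps the best
    (minimum) rank it achieved in either retriever, and the union is
    re-sorted by that rank. Duplicates collapse to one entry, so deep-K
    QNDN hits survive alongside BGE's sharp head."""
    best = {}
    for ranked in (bge_ranked, qndn_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id not in best or rank < best[doc_id]:
                best[doc_id] = rank
    return sorted(best, key=best.get)

union = min_rank_union(["a", "b", "c"], ["c", "d", "a"])
```

The reranker then sees every candidate once, at the best rank either geometry assigned it.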
Receipts: PAPER3_FINAL.md §1 contribution 4 · recovered-journal Phase 80-B (oracle bi-encoder revision) · non-oracle P@1 = 0.0430 @ 7.77M (5-test matrix, Q=2000)

Current production architecture.

Designed for private, regulated corpora that cannot be sent to external APIs. Each stage is an independent contract.
Stage 1a
BGE-large-en-v1.5
Standard dense retrieval. Strong at rank 1 (sharp-head); known weak coverage at deep K.
Stage 1b
QNDN v0 · 13.1M-param QPhead
Cross-attention head producing a single 512-d query vector over the attractor autoencoder. Weak at P@1; shallow-tail — stronger than BGE at P@1000 / P@10000.
⇓ min-rank union, dedup
Stage 2
MixK reranker
30M-parameter Perceiver-IO trained with mixed-K curriculum. Reranks the union into top-10 evidence chunks. Compresses deep candidate recall into shallow precision.
Stage 3
Evidence-bounded reader
Open reader (Gemma / Qwen / Llama family). Produces a cited answer or an explicit refusal. Refusal is a first-class product state, not a prompt trick.

Navigating the Manifold reports the shipping pipeline as BGE-large ∪ QNDN v0 → MixK, with stage-2 P@1 lifted from ~0.18 to ~0.34 and P@10 from ~0.20 to ~0.47 over the deployed BGE-small → MixK baseline (+83 % / +133 % relative). It states that no single retriever in the inventory matches the union.

QNDN is not a better BGE. It's an orthogonal BGE.

BGE-large dominates at shallow ranks. The retrieval flip happens at deep K — QNDN v0 holds gold passages where BGE drops them. The union is what makes the product.

Stage-1 native recall · 3.52 M corpus

Same 2000-query harness, same gold pool, same encoder versions. Numbers from the recovered Phase 86 union-recall index (`final_matrix_3m5_k10000.md`).

| Metric @ 3.52 M | BGE-large | QNDN v0 | Flip? |
|---|---|---|---|
| P@1 | 0.149 | 0.068 | |
| P@10 | 0.224 | 0.167 | |
| P@100 | 0.302 | 0.274 | |
| P@1000 | 0.395 | 0.439 | QNDN +11 % |
| P@10000 | 0.510 | 0.645 | QNDN +27 % |
// QNDN v0 surfaces relevant evidence in a region of the manifold where BGE-large has already given up. The union therefore has higher candidate coverage than either retriever alone, before any reranking.

Survives at 7.52 M corpus

Same pattern under a 19× corpus growth (`final_matrix_7m52_k10000.md`).

| Metric @ 7.52 M | BGE-large | QNDN v0 | Flip? |
|---|---|---|---|
| P@1000 | 0.393 | 0.421 | QNDN +7 % |
| P@10000 | 0.501 | 0.628 | QNDN +25 % |
// At 3.52 M, BGE alone misses 49 % of gold passages at K=10,000; BGE ∪ QNDN v0 misses 18 %. A 31 pp recall improvement before reranking — that's the room MixK then concentrates into top-10.
QNDN does not need to win at P@1. It needs to hold gold at K=1000+ so the union can hand it to MixK.
The product is not “a better embedding”. It is two retrievers with different geometries, fed into a reranker trained to concentrate deep-K coverage into shallow precision. Replacing either component breaks the contract.
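The coverage arithmetic behind that claim is direct (recall@10000 numbers from the Phase 86 matrix):

```python
# Recall@10000 at the 3.52M corpus.
bge_recall = 0.5095      # BGE-large alone → misses ~49 % of gold passages
union_recall = 0.822     # BGE ∪ QNDN v0  → misses ~18 %

# Extra recall the union hands to the MixK reranker, in percentage points.
lift_pp = (union_recall - bge_recall) * 100   # ≈ +31.25 pp
```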

LongMemEval-S: structured retrieval beats raw context.

500 questions, GPT-4o judge, K=5 multi-seed 3-of-5 majority vote. Reader held constant; only the retrieval / routing layer changes.
Retrieval contract
96.20%
R@5 — gold fragment in MixK top-5 on 481/500 questions. Retrieval is not the bottleneck.
End-to-end QA
70.0%
Gemma-4-26B-A4B hybrid (oracle qtype router → stack or 110K-naked). Canonical baseline.
Stack-only
63.8%
Same reader, retrieval-only path. +6.2 pp lift from routing SSA to a full-context reader.
Naked 110K
53.2%
Same reader, no retrieval, full haystack in context. Stack +10.6 pp over naked even with reader unchanged.
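The 3-of-5 majority-vote scoring can be sketched as follows (the source specifies a GPT-4o judge, 5 seeds, and a majority at 3; the function names and verdicts below are hypothetical):

```python
def majority_correct(judge_votes, threshold=3):
    """A question counts as correct iff at least `threshold` of the
    per-seed judge verdicts are True (3-of-5 in the eval above)."""
    return sum(judge_votes) >= threshold

def accuracy(all_votes):
    return sum(majority_correct(v) for v in all_votes) / len(all_votes)

# Hypothetical verdicts for 4 questions across 5 seeds each.
acc = accuracy([
    [True, True, True, False, False],    # 3/5 → correct
    [True, False, False, False, True],   # 2/5 → incorrect
    [True] * 5,                          # 5/5 → correct
    [False] * 5,                         # 0/5 → incorrect
])
```

Multi-seed majority voting damps single-seed judge noise, which matters when comparing readers only a few points apart.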

The honest reading is not “SOTA memory”. It is: structured context delivery substantially improves a modest open-weight reader under a private retrieval setup. Public phrasing: “Warrant improves self-hosted open-reader memory performance by routing evidence, not by relying on a frontier full-context reader.”

Refusal as a first-class product state.

Three explicit answer states; the system prompt is contractual, not cosmetic.
State A

Answered with citations.

The reader produced a cited claim grounded in retrieved evidence. Every citation maps back to a source fragment in the corpus.

State B

Refused — evidence absent.

The candidate set does not contain a fragment that answers the question. The reader is contractually required to refuse rather than guess.

State C

Refused — confidence floor.

Evidence exists but does not clear the calibrated confidence floor. Refusal is the safe default in regulated verticals where confident-wrong is a liability event.
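A minimal sketch of the three-state contract (the state names come from the text; the function signature, fields, and the 0.7 floor are hypothetical placeholders for the calibrated threshold):

```python
from enum import Enum

class AnswerState(Enum):
    ANSWERED = "A"             # cited answer grounded in retrieved evidence
    REFUSED_NO_EVIDENCE = "B"  # no supporting fragment in the candidate set
    REFUSED_LOW_CONF = "C"     # evidence present but below the floor

def answer_state(evidence, confidence, floor=0.7):
    """Map reader output to one of the three contractual states.
    `evidence` is the list of cited source fragments; `confidence` is a
    calibrated score; `floor` is a stand-in for the calibrated threshold."""
    if not evidence:
        return AnswerState.REFUSED_NO_EVIDENCE
    if confidence < floor:
        return AnswerState.REFUSED_LOW_CONF
    return AnswerState.ANSWERED
```

Making the state explicit (rather than hoping the prompt induces refusal) is what makes each refusal auditable downstream.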

This part is commercially strong but scientifically less mature than the retrieval work. Current audits are promising but small. A continuation team should commission larger adversarial refusal benchmarks before pushing this as a SOTA refusal claim.

What this is, and what this is not.

For continuity, scope discipline is critical. The strongest papers explicitly state these boundaries; the program survives because it does not overclaim.

What has been shown scientifically

  • Pretrained-LLM latent-memory retrofits fail under wrong-latent and gate-trajectory diagnostics (46 experiments).
  • A frozen reconstruction-autoencoder manifold supports million-scale doc-anchored QA retrieval with shallow degradation (3.72M real distractors, −2.1 pp from 197K).
  • Attractor training sharpens the manifold and improves retrieval — a mechanism → scale causal link (cos-gap +0.095 → +0.164; d = 0.80).
  • Processor size is not the main lever; curriculum and manifold geometry matter more (34M → 95M flat).
  • MixK curriculum improves both long-context and K=1 retrieval (14× OOD lift; +34.5 % in-distribution).
  • Some good objectives are antagonistic; within-modality sharpness can damage cross-modal alignment.
  • QNDN v0 provides complementary deep-K geometry relative to BGE-large.

What has not been claimed (scope boundary)

  • A new foundation model.
  • A replacement for all vector databases.
  • SOTA general retrieval.
  • BEIR zero-shot transfer (the second paper reports BEIR zero-shot failure as a pre-registered boundary).
  • That QNDN beats BGE at P@1 (it doesn't).
  • That compressed-latent retrofits onto pretrained LLMs work (the first paper says they don't).
  • That free-form generation from latents is solved.
  • That language-free reasoning has been fully demonstrated.

What has been shown commercially

  • A private-corpus retrieval appliance can be built around the stack.
  • The product does not require external API calls.
  • The system can plug into an existing dense retriever and an existing open reader.
  • The novel parts are the complementary latent retriever + MixK reranker + refusal contract.
  • Strongest value proposition: regulated corpora with no data egress.

Open product gaps (in flight)

  • Enterprise ACL-aware retrieval not yet fully specified.
  • Incremental indexing and deletion propagation need product engineering.
  • SOC2 / security packaging required for regulated buyers.
  • Cost claims need normalization against modern vendor pricing.
  • Refusal claims need larger adversarial audits.
  • Buyer education is nontrivial: not “another vector DB,” not a full LLM platform.

Current maturity, by axis.

Honest internal assessment. Overall: serious / publishable, with one major open replication question and a partial enterprise-readiness gap.
| Area | Maturity | Score |
|---|---|---|
| Wrong-latent diagnostic | High | 7.5 / 10 |
| Manifold-as-memory mechanism | Strong internal evidence | 8.0 / 10 |
| Hopfield / MCHN empirical bridge | Promising, needs replication | 7.8 / 10 |
| MixK curriculum / reranker | Strong applied result | 7.8 / 10 |
| QNDN standalone retriever | Weak at shallow K | 5.8 / 10 |
| QNDN as BGE complement | Strong | 8.5 / 10 |
| Warrant product wedge | Very strong | 9.0 / 10 |
| General retrieval SOTA proof | Incomplete | 6.0 / 10 |
| Enterprise readiness | Partial | 6.5 / 10 |
| Overall research program | Serious / publishable | 8.0 / 10 |

Five intersection points, plus the framing question.

The research is not “bigger LLM beats benchmark.” It addresses memory substrate, neural compression, latent computation, Hopfield-on-real-prose, and private deployment.
/01

Efficient memory for long-horizon agents.

Instead of dumping giant context windows into a reader, retrieve compact evidence from a learned memory geometry.

/02

Neural compressed representations.

The work directly tests whether compressed latents can be read, and where retrofit approaches fail.

/03

Continuous reasoning / latent computation.

Failed retrofit + successful latent-native processor jointly suggest models may need to be trained inside the representation space, not adapted to it later.

/04

Hopfield-style memory in real language manifolds.

Bridges theoretical Modern Continuous Hopfield predictions with measured behavior on million-scale natural-language corpora.

/05

Private, auditable AI systems.

Warrant is not just a benchmark artifact. It is a deployable retrieval primitive for settings where data cannot leave the customer boundary.

/06

The framing question.

What is the right memory substrate for long-horizon agents when raw context is too expensive, retrieval embeddings are incomplete, and private corpora cannot be externalized?

What a continuation team should do next.

Seven workstreams. Reproduction first; then modern bakeoff; then domain transfer; then leakage audit; then QNDN extensions; then manifold scaling; then the high-risk co-evolved latent-native model.
A · reproduction

Reproduce the three core papers.

Independent reruns of Wrong-Latent Test, Manifold Is the Memory, and Navigating the Manifold on new datasets / readers / pools. Confirm degradation curves, attractor cos-gap improvement, Perceiver>dense at matched params, MixK curriculum lift, Hard-NCE/MixK antagonism, QNDN/BGE deep-K complementarity.

B · bakeoff

Modern baseline bakeoff.

BGE / E5 / Qwen embeddings & rerankers / ColBERTv2 / SPLADE / BM25+dense+RRF / BGE-reranker-large / Cohere & Voyage rerankers (where deployable). Fair comparison: best modern hybrid stack vs BGE ∪ QNDN → MixK → same reader/refusal.

C · transfer

Domain transfer.

Move beyond GitHub-MD: legal contracts, medical guidelines, scientific papers, security reports, enterprise KBs, multilingual corpora, code repos, long-running chat / agent memory. Question: does the manifold-memory behavior survive outside GitHub-MD?

D · leakage audit

Audit leakage and duplicates.

Exact text overlap, near-duplicates, generated-QA leakage, same-document chunk collisions, train/test document family overlap, memorized templatic answers, parametric contamination in reader-based eval. The credibility rests on mechanism, not benchmark gaming.

E · extend QNDN

Extend the complementary retriever.

Improve P@1000 / P@10000 without breaking complementarity. Train QNDN on non-GitHub corpora. Multi-vector QNDN (not just single 512-d). Test complementarity vs ColBERT-style dense. Whether QNDN adds recall beyond BM25+sparse+dense hybrids.

F · scale the manifold

Sweep autoencoder size / objective / corpus.

92M → 225M → 500M+ autoencoders. 32×512 vs larger latent slots. Reconstruction-only vs attractor vs contrastive vs hybrid. Domain-specific vs mixed-domain. Latent dim and slot count. Q/D asymmetry. Warning: scaling one component blindly does not reliably help.

G · nuclear option

Co-evolved latent-native models.

Train a model from initialization that only ever consumes compressed latent representations. Expensive but the most philosophically important continuation if the goal is latent-native reasoning rather than retrieval. The Wrong-Latent Test rules out the cheap path; this is the remaining high-risk / high-reward path.

What could break this story.

Recorded explicitly so a continuation team starts from the failure modes the principal already sees.

Scientific risks (internal)

  • Strongest results are internal; independent reproduction is the load-bearing next step.
  • Main retrieval evidence is within-distribution (GitHub-MD-derived QA).
  • Hopfield interpretation may be overfit to observed behavior; real test is whether the seven pillars survive replication.
  • Modern retrieval baselines are incomplete (no head-to-head against late-2025 retriever stacks).
  • The system is not BEIR-general; the second paper says so.
  • Latent generation is not solved.
  • Some metrics depend on generated QA pairs (LLM-synthesized at scale).

Product risks (enterprise)

  • Enterprise ACL-aware retrieval not fully specified.
  • Incremental indexing and deletion propagation need product engineering.
  • SOC2 / security packaging required for regulated buyers.
  • Cost claims must be normalized against modern vendor pricing.
  • Refusal claims need larger adversarial audits.
  • Buyer education is nontrivial: not another vector DB, but also not a full LLM platform.

What to say, and what not to say.

Calibrated language. Two true sentences instead of one false one.
scientific positioning

ManifoldMemory is a narrow research program showing that frozen reconstruction-autoencoder latent spaces can support content-addressable doc-anchored QA retrieval at million-candidate scale, with Hopfield-like geometric signatures and a measurable mechanism-to-scale causal link.

product positioning

Warrant is a private, no-egress retrieval appliance for regulated corpora. It combines a standard dense retriever with a latent-native complementary retriever, unions their candidates, reranks them through a small MixK navigator, and feeds cited evidence to an open reader with explicit refusal states.

Say (accurate & impressive)

  • “We found a complementary retrieval geometry that standard dense retrieval misses, and we built a practical union / rerank / refusal system around it.”
  • “Million-scale doc-anchored QA retrieval with shallow degradation, with seven Hopfield-consistent geometric pillars.”
  • “Warrant improves self-hosted open-reader memory performance by routing evidence, not by relying on a frontier full-context reader.”

Do not say (unsupported)

  • “We beat all retrieval systems.”
  • “QNDN beats BGE at rank 1.” (It doesn't, by design.)
  • “Compressed-latent retrofits onto pretrained LLMs work.” (The first paper falsifies this.)
  • “Latent-native reasoning is solved.”

Continuity summary.

The compressed version a continuation team can read in 90 seconds and act on.

ManifoldMemory began as an attempt to compress language into dense latent memory and ask whether a model could reason over those latents instead of raw tokens. The first research phase falsified the obvious retrofit path: pretrained LLMs with cross-attention or prefix adapters did not read compressed latents, as shown by a paired wrong-latent diagnostic and gate-trajectory analysis across 46 experiments. The second phase discovered a more viable substrate: a frozen reconstruction-autoencoder latent manifold, navigated by a small latent-native processor, exhibits measurable content-addressable memory behavior for doc-anchored QA at million-candidate scale. The third phase turned that mechanism into a deployable retrieval architecture: a 30M Perceiver/MixK navigator, trained with a mixed-K curriculum, reranks candidate unions from standard dense retrieval and a latent-native QNDN retriever.

The central empirical discovery is that QNDN is not a better BGE-style retriever at rank 1. It is a different retrieval geometry. BGE-large is sharp-headed (strong at P@1 and shallow ranks); QNDN v0 is shallow-tailed (weak at P@1, but better at deep-K recall). Their union creates substantially higher candidate coverage than either alone; MixK then concentrates that coverage into top-k evidence for a reader. This makes the product commercially meaningful: Warrant can improve private-corpus evidence retrieval without asking customers to send data to external APIs.

The current project should be treated as a serious narrow-breakthrough candidate, not as a solved general retrieval system. Its strongest scientific claim is “retrieval as manifold geometry”; its strongest product claim is “private, auditable, evidence-first retrieval for regulated corpora.” The next step is independent reproduction, modern baseline bakeoff, cross-domain validation, and — if pursuing the deepest version — co-evolved latent-native models trained from scratch to operate inside the manifold rather than retrofitted onto language-token priors.

// Receipts & sources

  • [wrong-latent] paper_wlt_v6_draft.md — 46 experiments, 4 interface families × 18 configs, 0.5B–72B, 17.7M–10.5B trainable params, 49 % LLM-answerable, gate-drift < 0.03, p > 0.6 / n = 80.
  • [manifold-is-memory] PAPER2_FINAL.md + recovered journal Phase 53–57 — 197,578 QA pairs, 92M autoencoder, 30M processor, 1,247,342 distractors @ P@1=0.205 / MRR=0.250 / lift 255,705×, 3,719,845 distractors @ P@1=0.2895 (−2.1 pp from 197K), attractor cos-gap +0.095 → +0.164 (Cohen's d=0.80).
  • [navigating] PAPER3_FINAL.md + journal Phase 56-A / 58-alt / 80-B / 86 — Perc-30M>dense, 34M → 95M flat, MixK K∈{1,2,4,8} K=15 0.010 → 0.138 (14×), MixK +34.5 % @ 197K, Hard-NCE +11.9 % solo / 0.351 → 0.118 combined collapse, CLOOB 2.86× sharpening + cross-modal antagonism, non-oracle P@1 = 0.0430 @ 7.77M, QPhead 13.1M params, BGE-large ∪ QNDN v0 → MixK stage-2 P@1 0.34 / P@10 0.47.
  • [deep-K] _remote_pulls/phase86_k10000/final_matrix_3m5_k10000.md + final_matrix_7m52_k10000.md — full P@1 / P@10 / P@100 / P@1000 / P@10000 matrix; BGE@10000 = 0.5095, BGE ∪ v0@10000 = 0.822, recall lift +31.25 pp.
  • [lme-s] Recovered journal Phase 91 / 95-G–K — R@5 = 96.20 % (481/500); Gemma-4-26B-A4B Hybrid 70.0 % / Stack-only 63.8 % / Naked-110K 53.2 % under K=5, 3-of-5 GPT-4o judge MV. Hybrid uses oracle qtype routing for the published number; production-equivalent learned classifier pending (Δ expected 1–3 pp).
  • [live] manifoldmemory_site/index.html — live product page mirrors every number above (deep-K block, R@5, 70.0 % Hybrid, oracle-qtype disclosure).