Research continuity brief · audited 2026-04-25 against recovered journal · v1.0

ManifoldMemory / Warrant

A self-contained continuity brief: scope, scientific arc, production architecture, current maturity, and what a continuation team should do next.

ManifoldMemory is a research program investigating whether natural-language memory can be represented and retrieved as geometry inside frozen autoencoder latent manifolds, rather than as raw text context or contrastively trained embeddings. Warrant is the productized private-corpus retrieval system built from that research.

The core claim is not that this replaces LLMs or beats every retriever. It is narrower and stronger: a reconstruction-autoencoder latent manifold can behave like a content-addressable memory for doc-anchored QA at million-candidate scale, and a small latent-native processor/reranker can navigate that manifold in ways complementary to standard dense retrieval.


Three scientific steps, one of them a falsification.

Each step is a paper. The first is a negative result that clears the path; the second discovers the substrate; the third engineers the navigator.
Step 1 · Negative result

Retrofitting compressed memory onto pretrained LLMs fails.

The Wrong-Latent Test asks: when a model appears to answer from compressed memory, is it actually reading the compressed latents, or answering from parametric knowledge? Two diagnostics probe this: a wrong-latent swap (replace the correct latent with an unrelated document's; performance should collapse if reading is real) and gate-trajectory analysis (the cross-attention gate should drift if the pathway is useful).

Across 46 experiments, covering 4 interface families instantiated in 18 configurations, 0.5B–72B reader scales, 3 encoder training regimes, and trainable-parameter counts spanning 17.7M–10.5B, every tested retrofit failed both diagnostics. The wrong-latent gap was statistically indistinguishable from zero (p > 0.6, n = 80); the gate parameter did not move.
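The paired structure of the diagnostic can be sketched in a few lines. Everything below is illustrative: the function name and the per-question scores are hypothetical, not taken from the actual harness.

```python
import math
import statistics

def wrong_latent_gap(correct_scores, swapped_scores):
    """Paired wrong-latent diagnostic: per-question score with the correct
    latent vs. with an unrelated document's latent swapped in. Returns the
    mean gap and a paired t statistic; a gap statistically indistinguishable
    from zero (as in all tested retrofits) means the model is not actually
    reading the latent."""
    diffs = [c - s for c, s in zip(correct_scores, swapped_scores)]
    mean_gap = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    t = mean_gap / (sd / math.sqrt(len(diffs))) if sd > 0 else 0.0
    return mean_gap, t

# Hypothetical per-question scores: the retrofit answers about equally well
# with the wrong latent, so the gap is ~0 and t is small.
correct = [0.62, 0.58, 0.61, 0.60, 0.59, 0.63]
swapped = [0.61, 0.59, 0.60, 0.61, 0.58, 0.62]
gap, t = wrong_latent_gap(correct, swapped)
```

A retrofit that genuinely reads its latents would show a large positive gap and a large t; the failure mode reported above is the opposite.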

Conclusion — not that compressed memory is impossible, but that pretrained-LLM retrofits do not learn to read opaque latents under the tested regimes. A different architecture is needed: a processor native to the latent space, not a pretrained language model trying to treat latents as a second language. Also: 49 % of questions in a standard document-QA pool were answerable from the LLM's own weights, motivating a contamination filter.
Step 2 · Core mechanism

The manifold itself is the memory.

The Manifold Is the Memory tests whether content-addressable memory can be realized by geometric navigation inside a frozen reconstruction-autoencoder latent manifold, without contrastive IR supervision. Architecture: a frozen 92M-parameter text autoencoder + a 30M-parameter latent-native processor trained from scratch on 197,578 GitHub-corpus QA pairs. The processor is not a text-generating LLM — it ingests document and question latents and outputs answer latents.

On 1,247,342 real in-distribution distractors: P@1 = 20.5 %, MRR = 0.250, median rank 101, lift-over-random 255,705×. From 197K → 1.25M candidates, P@1 drops only 0.4 pp. A max-scale audit at 3,719,845 real distractors shows P@1 degrading only 2.1 pp (0.3105 → 0.2895) over a 19× pool growth.
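These headline metrics follow directly from the gold document's rank per query. A minimal sketch (the example ranks are illustrative, not the real eval data; only the lift arithmetic uses the paper's numbers):

```python
def retrieval_metrics(gold_ranks, pool_size):
    """P@1, MRR, and lift-over-random from 1-indexed gold ranks.
    Random P@1 over a pool of N candidates is 1/N, so
    lift = observed P@1 / (1/N) = P@1 * N."""
    n = len(gold_ranks)
    p_at_1 = sum(1 for r in gold_ranks if r == 1) / n
    mrr = sum(1.0 / r for r in gold_ranks) / n
    lift = p_at_1 * pool_size
    return p_at_1, mrr, lift

# Illustrative ranks over a toy pool.
p1, mrr, lift = retrieval_metrics([1, 2, 4], pool_size=1000)

# The paper's lift: P@1 = 0.205 over 1,247,343 candidates
# (gold + 1,247,342 distractors) → ~255,705×.
paper_lift = 0.205 * 1_247_343
```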

Mechanism → scale link — retraining the autoencoder with a noise-perturbed reconstruction attractor objective sharpens the doc/question paired cosine-similarity gap from +0.095 → +0.164 (+73 %; Cohen's d = 0.80), and that mechanism improvement transfers to retrieval: at 197K, apples-to-apples P@1 rises 20.9 % → 28.8 % (+37.8 % relative).
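The relative improvements quoted above are plain ratio arithmetic; worked out with the numbers from the text:

```python
# Cosine-gap sharpening from attractor retraining: +0.095 → +0.164.
gap_rel = (0.164 - 0.095) / 0.095        # ≈ +0.73, i.e. the quoted +73 %

# Transfer to retrieval at the 197K pool: P@1 20.9 % → 28.8 %.
p1_rel = (0.288 - 0.209) / 0.209         # ≈ +0.378, i.e. +37.8 % relative
```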
Step 3 · The navigator

Engineering the processor at commercial scale.

Navigating the Manifold asks: given a frozen manifold that behaves like memory, what processor should navigate it? Three axes: processor architecture, training curriculum, autoencoder objective. A 30M Perceiver-IO processor strictly outperforms a parameter-matched dense transformer; scaling from 34M → 95M yields flat P@1 — the retrieval ceiling is set by the frozen manifold, not the processor.

The strongest training result is MixK curriculum (per-item K ∈ {1, 2, 4, 8}). It repairs a 26× long-context collapse: K=15 P@1 0.010 → 0.138 (14× lift), and improves single-document retrieval by +34.5 % at a 197K pool.
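A minimal sketch of the mixed-K batch construction, assuming the obvious setup (the paper specifies only per-item K ∈ {1, 2, 4, 8}; the function name, distractor sampling, and labeling scheme here are hypothetical):

```python
import random

def mixk_item(gold_doc, distractor_pool, rng, ks=(1, 2, 4, 8)):
    """Build one training item: the gold document plus K-1 sampled
    distractors, shuffled, with the gold index as the target. Varying K
    per item (rather than fixing it per run) is the curriculum change
    that repairs the long-context collapse at K=15."""
    k = rng.choice(ks)
    docs = [gold_doc] + rng.sample(distractor_pool, k - 1)
    rng.shuffle(docs)
    return docs, docs.index(gold_doc)

rng = random.Random(0)
docs, target = mixk_item("gold", [f"d{i}" for i in range(100)], rng)
```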

Critical negative result — good ideas do not compose. Hard-NCE alone improves K=1 retrieval by +11.9 %, but combined with MixK, full-pool P@1 collapses 0.351 → 0.118. CLOOB sharpens within-modality separation by 2.86× but collapses the cross-modal question–passage margin to +0.100, too low for 7.47M-scale distractor noise. Sharpness is not a single scalar: within-modality and cross-modal separation can move in opposite directions.

The oracle correction and why it matters.

Trust signal: a methodology audit corrected a major published claim before any external party found it. The architecture moved with the correction.

Earlier versions reported “MixK over full pool” numbers suggesting the latent processor could serve as a first-stage retriever. A later methodology audit found the harness was accidentally conditioned on the gold document latent at retrieval time. After removing the oracle, non-oracle first-stage P@1 collapses to 0.0430 at 7.77M — on par with NDN mean-pooling. The paper explicitly revises the claim: those original numbers are an oracle bi-encoder upper bound, not a shippable first-stage retriever.

The correction changes the product architecture. MixK is not the first-stage retriever. MixK is a reranker / navigator over a candidate set. The production-realistic system becomes:

BGE-large ∪ QNDN v0  →  MixK  →  evidence-bounded reader / refusal layer
Standard dense retrieval and a latent-native projection head supply candidates from independent geometries. A 30M Perceiver reranker compresses that candidate set into top-k evidence. A reader produces a cited answer or a refusal. No oracle. No conditioning leaks. Every stage independently auditable.
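The union stage itself is simple. A sketch of min-rank union with dedup (the function name is illustrative, not from the codebase):

```python
def min_rank_union(bge_ranked, qndn_ranked):
    """Merge two ranked candidate lists: each doc id keeps the best
    (minimum) rank it achieved in either retriever, and the union is
    re-sorted by that rank. Duplicates collapse to one entry, so deep-K
    QNDN hits survive alongside BGE's sharp head."""
    best = {}
    for ranked in (bge_ranked, qndn_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id not in best or rank < best[doc_id]:
                best[doc_id] = rank
    return sorted(best, key=best.get)

union = min_rank_union(["a", "b", "c"], ["c", "d", "a"])
```

The reranker then sees every candidate once, at the best rank either geometry assigned it.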
Receipts: PAPER3_FINAL.md §1 contribution 4 · recovered-journal Phase 80-B (oracle bi-encoder revision) · non-oracle P@1 = 0.0430 @ 7.77M (5-test matrix, Q=2000)

Current production architecture.

Designed for private, regulated corpora that cannot be sent to external APIs. Each stage is an independent contract.
Stage 1a
BGE-large-en-v1.5
Standard dense retrieval. Strong at rank 1 (sharp-head); known weak coverage at deep K.
Stage 1b
QNDN v0 · 13.1M-param QPhead
Cross-attention head producing a single 512-d query vector over the attractor autoencoder. Weak at P@1; shallow-tail — stronger than BGE at P@1000 / P@10000.
⇓ min-rank union, dedup
Stage 2
MixK reranker
30M-parameter Perceiver-IO trained with mixed-K curriculum. Reranks the union into top-10 evidence chunks. Compresses deep candidate recall into shallow precision.
Stage 3
Evidence-bounded reader
Open reader (Gemma / Qwen / Llama family). Produces a cited answer or an explicit refusal. Refusal is a first-class product state, not a prompt trick.

Navigating the Manifold reports the shipping pipeline as BGE-large ∪ QNDN v0 → MixK, with stage-2 P@1 lifted from ~0.18 to ~0.34 and P@10 from ~0.20 to ~0.47 over the deployed BGE-small → MixK baseline (+83 % / +133 % relative). It states that no single retriever in the inventory matches the union.

QNDN is not a better BGE. It's an orthogonal BGE.

BGE-large dominates at shallow ranks. The retrieval flip happens at deep K — QNDN v0 holds gold passages where BGE drops them. The union is what makes the product.

Stage-1 native recall · 3.52 M corpus

Same 2000-query harness, same gold pool, same encoder versions. Numbers from the recovered Phase 86 union-recall index (`final_matrix_3m5_k10000.md`).

| Metric @ 3.52 M | BGE-large | QNDN v0 | Flip? |
|---|---|---|---|
| P@1 | 0.149 | 0.068 | |
| P@10 | 0.224 | 0.167 | |
| P@100 | 0.302 | 0.274 | |
| P@1000 | 0.395 | 0.439 | QNDN +11 % |
| P@10000 | 0.510 | 0.645 | QNDN +27 % |
// QNDN v0 surfaces relevant evidence in a region of the manifold where BGE-large has already given up. The union therefore has higher candidate coverage than either retriever alone, before any reranking.

Survives at 7.52 M corpus

Same pattern under a 19× corpus growth (`final_matrix_7m52_k10000.md`).

| Metric @ 7.52 M | BGE-large | QNDN v0 | Flip? |
|---|---|---|---|
| P@1000 | 0.393 | 0.421 | QNDN +7 % |
| P@10000 | 0.501 | 0.628 | QNDN +25 % |
// At 3.52 M, BGE alone misses 49 % of gold passages at K=10,000; BGE ∪ QNDN v0 misses 18 %. A 31 pp recall improvement before reranking — that's the room MixK then concentrates into top-10.
QNDN does not need to win at P@1. It needs to hold gold at K=1000+ so the union can hand it to MixK.
The product is not “a better embedding”. It is two retrievers with different geometries, fed into a reranker trained to concentrate deep-K coverage into shallow precision. Replacing either component breaks the contract.
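The coverage arithmetic behind that claim is direct (recall@10000 numbers from the Phase 86 matrix):

```python
# Recall@10000 at the 3.52M corpus.
bge_recall = 0.5095      # BGE-large alone → misses ~49 % of gold passages
union_recall = 0.822     # BGE ∪ QNDN v0  → misses ~18 %

# Extra recall the union hands to the MixK reranker, in percentage points.
lift_pp = (union_recall - bge_recall) * 100   # ≈ +31.25 pp
```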

LongMemEval-S: structured retrieval beats raw context.

500 questions, GPT-4o judge, K=5 multi-seed 3-of-5 majority vote. Reader held constant; only the retrieval / routing layer changes.
Retrieval contract
96.20%
R@5 — gold fragment in MixK top-5 on 481/500 questions. Retrieval is not the bottleneck.
End-to-end QA
70.0%
Gemma-4-26B-A4B hybrid (oracle qtype router → stack or 110K-naked). Canonical baseline.
Stack-only
63.8%
Same reader, retrieval-only path. +6.2 pp lift from routing SSA to a full-context reader.
Naked 110K
53.2%
Same reader, no retrieval, full haystack in context. Stack +10.6 pp over naked even with reader unchanged.
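The 3-of-5 majority-vote scoring can be sketched as follows (the source specifies a GPT-4o judge, 5 seeds, and a majority at 3; the function names and verdicts below are hypothetical):

```python
def majority_correct(judge_votes, threshold=3):
    """A question counts as correct iff at least `threshold` of the
    per-seed judge verdicts are True (3-of-5 in the eval above)."""
    return sum(judge_votes) >= threshold

def accuracy(all_votes):
    return sum(majority_correct(v) for v in all_votes) / len(all_votes)

# Hypothetical verdicts for 4 questions across 5 seeds each.
acc = accuracy([
    [True, True, True, False, False],    # 3/5 → correct
    [True, False, False, False, True],   # 2/5 → incorrect
    [True] * 5,                          # 5/5 → correct
    [False] * 5,                         # 0/5 → incorrect
])
```

Multi-seed majority voting damps single-seed judge noise, which matters when comparing readers only a few points apart.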

The honest reading is not “SOTA memory”. It is: structured context delivery substantially improves a modest open-weight reader under a private retrieval setup. Public phrasing: “Warrant improves self-hosted open-reader memory performance by routing evidence, not by relying on a frontier full-context reader.”

Refusal as a first-class product state.

Three explicit answer states; the system prompt is contractual, not cosmetic.
State A

Answered with citations.

The reader produced a cited claim grounded in retrieved evidence. Every citation maps back to a source fragment in the corpus.

State B

Refused — evidence absent.

The candidate set does not contain a fragment that answers the question. The reader is contractually required to refuse rather than guess.

State C

Refused — confidence floor.

Evidence exists but does not clear the calibrated confidence floor. Refusal is the safe default in regulated verticals where confident-wrong is a liability event.
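A minimal sketch of the three-state contract (the state names come from the text; the function signature, fields, and the 0.7 floor are hypothetical placeholders for the calibrated threshold):

```python
from enum import Enum

class AnswerState(Enum):
    ANSWERED = "A"             # cited answer grounded in retrieved evidence
    REFUSED_NO_EVIDENCE = "B"  # no supporting fragment in the candidate set
    REFUSED_LOW_CONF = "C"     # evidence present but below the floor

def answer_state(evidence, confidence, floor=0.7):
    """Map reader output to one of the three contractual states.
    `evidence` is the list of cited source fragments; `confidence` is a
    calibrated score; `floor` is a stand-in for the calibrated threshold."""
    if not evidence:
        return AnswerState.REFUSED_NO_EVIDENCE
    if confidence < floor:
        return AnswerState.REFUSED_LOW_CONF
    return AnswerState.ANSWERED
```

Making the state explicit (rather than hoping the prompt induces refusal) is what makes each refusal auditable downstream.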

This part is commercially strong but scientifically less mature than the retrieval work. Current audits are promising but small. A continuation team should commission larger adversarial refusal benchmarks before pushing this as a SOTA refusal claim.

What this is, and what this is not.

For continuity, scope discipline is critical. The strongest papers explicitly state these boundaries; the program survives because it does not overclaim.

What has been shown scientifically

  • Pretrained-LLM latent-memory retrofits fail under wrong-latent and gate-trajectory diagnostics (46 experiments).
  • A frozen reconstruction-autoencoder manifold supports million-scale doc-anchored QA retrieval with shallow degradation (3.72M real distractors, −2.1 pp from 197K).
  • Attractor training sharpens the manifold and improves retrieval — a mechanism → scale causal link (cos-gap +0.095 → +0.164; d = 0.80).
  • Processor size is not the main lever; curriculum and manifold geometry matter more (34M → 95M flat).
  • MixK curriculum improves both long-context and K=1 retrieval (14× OOD lift; +34.5 % in-distribution).
  • Some good objectives are antagonistic; within-modality sharpness can damage cross-modal alignment.
  • QNDN v0 provides complementary deep-K geometry relative to BGE-large.

What has not been claimed (scope boundary)

  • A new foundation model.
  • A replacement for all vector databases.
  • SOTA general retrieval.
  • BEIR zero-shot transfer (the second paper reports BEIR zero-shot failure as a pre-registered boundary).
  • That QNDN beats BGE at P@1 (it doesn't).
  • That compressed-latent retrofits onto pretrained LLMs work (the first paper says they don't).
  • That free-form generation from latents is solved.
  • That language-free reasoning has been fully demonstrated.

What has been shown commercially

  • A private-corpus retrieval appliance can be built around the stack.
  • The product does not require external API calls.
  • The system can plug into an existing dense retriever and an existing open reader.
  • The novel parts are the complementary latent retriever + MixK reranker + refusal contract.
  • Strongest value proposition: regulated corpora with no data egress.

Open product gaps (in flight)

  • Enterprise ACL-aware retrieval not yet fully specified.
  • Incremental indexing and deletion propagation need product engineering.
  • SOC2 / security packaging required for regulated buyers.
  • Cost claims need normalization against modern vendor pricing.
  • Refusal claims need larger adversarial audits.
  • Buyer education is nontrivial: not “another vector DB,” not a full LLM platform.

Current maturity, by axis.

Honest internal assessment. Overall: serious / publishable, with one major open replication question and a partial enterprise-readiness gap.
| Area | Maturity | Score |
|---|---|---|
| Wrong-latent diagnostic | High | 7.5 / 10 |
| Manifold-as-memory mechanism | Strong internal evidence | 8.0 / 10 |
| Hopfield / MCHN empirical bridge | Promising, needs replication | 7.8 / 10 |
| MixK curriculum / reranker | Strong applied result | 7.8 / 10 |
| QNDN standalone retriever | Weak at shallow K | 5.8 / 10 |
| QNDN as BGE complement | Strong | 8.5 / 10 |
| Warrant product wedge | Very strong | 9.0 / 10 |
| General retrieval SOTA proof | Incomplete | 6.0 / 10 |
| Enterprise readiness | Partial | 6.5 / 10 |
| Overall research program | Serious / publishable | 8.0 / 10 |

Five intersection points, plus the framing question.

The research is not “bigger LLM beats benchmark.” It addresses memory substrate, neural compression, latent computation, Hopfield-on-real-prose, and private deployment.
/01

Efficient memory for long-horizon agents.

Instead of dumping giant context windows into a reader, retrieve compact evidence from a learned memory geometry.

/02

Neural compressed representations.

The work directly tests whether compressed latents can be read, and where retrofit approaches fail.

/03

Continuous reasoning / latent computation.

Failed retrofit + successful latent-native processor jointly suggest models may need to be trained inside the representation space, not adapted to it later.

/04

Hopfield-style memory in real language manifolds.

Bridges theoretical Modern Continuous Hopfield predictions with measured behavior on million-scale natural-language corpora.

/05

Private, auditable AI systems.

Warrant is not just a benchmark artifact. It is a deployable retrieval primitive for settings where data cannot leave the customer boundary.

/06

The framing question.

What is the right memory substrate for long-horizon agents when raw context is too expensive, retrieval embeddings are incomplete, and private corpora cannot be externalized?

What a continuation team should do next.

Seven workstreams. Reproduction first; then modern bakeoff; then domain transfer; then leakage audit; then QNDN extensions; then manifold scaling; then the high-risk co-evolved latent-native model.
A · reproduction

Reproduce the three core papers.

Independent reruns of Wrong-Latent Test, Manifold Is the Memory, and Navigating the Manifold on new datasets / readers / pools. Confirm degradation curves, attractor cos-gap improvement, Perceiver>dense at matched params, MixK curriculum lift, Hard-NCE/MixK antagonism, QNDN/BGE deep-K complementarity.

B · bakeoff

Modern baseline bakeoff.

BGE / E5 / Qwen embeddings & rerankers / ColBERTv2 / SPLADE / BM25+dense+RRF / BGE-reranker-large / Cohere & Voyage rerankers (where deployable). Fair comparison: best modern hybrid stack vs BGE ∪ QNDN → MixK → same reader/refusal.

C · transfer

Domain transfer.

Move beyond GitHub-MD: legal contracts, medical guidelines, scientific papers, security reports, enterprise KBs, multilingual corpora, code repos, long-running chat / agent memory. Question: does the manifold-memory behavior survive outside GitHub-MD?

D · leakage audit

Audit leakage and duplicates.

Exact text overlap, near-duplicates, generated-QA leakage, same-document chunk collisions, train/test document family overlap, memorized templatic answers, parametric contamination in reader-based eval. The credibility rests on mechanism, not benchmark gaming.

E · extend QNDN

Extend the complementary retriever.

Improve P@1000 / P@10000 without breaking complementarity. Train QNDN on non-GitHub corpora. Multi-vector QNDN (not just single 512-d). Test complementarity vs ColBERT-style dense. Whether QNDN adds recall beyond BM25+sparse+dense hybrids.

F · scale the manifold

Sweep autoencoder size / objective / corpus.

92M → 225M → 500M+ autoencoders. 32×512 vs larger latent slots. Reconstruction-only vs attractor vs contrastive vs hybrid. Domain-specific vs mixed-domain. Latent dim and slot count. Q/D asymmetry. Warning: scaling one component blindly does not reliably help.

G · nuclear option

Co-evolved latent-native models.

Train a model from initialization that only ever consumes compressed latent representations. Expensive but the most philosophically important continuation if the goal is latent-native reasoning rather than retrieval. The Wrong-Latent Test rules out the cheap path; this is the remaining high-risk / high-reward path.

What could break this story.

Recorded explicitly so a continuation team starts from the failure modes the principal already sees.

Scientific risks (internal)

  • Strongest results are internal; independent reproduction is the load-bearing next step.
  • Main retrieval evidence is within-distribution (GitHub-MD-derived QA).
  • Hopfield interpretation may be overfit to observed behavior; real test is whether the seven pillars survive replication.
  • Modern retrieval baselines are incomplete (no head-to-head against late-2025 retriever stacks).
  • The system is not BEIR-general; the second paper says so.
  • Latent generation is not solved.
  • Some metrics depend on generated QA pairs (LLM-synthesized at scale).

Product risks (enterprise)

  • Enterprise ACL-aware retrieval not fully specified.
  • Incremental indexing and deletion propagation need product engineering.
  • SOC2 / security packaging required for regulated buyers.
  • Cost claims must be normalized against modern vendor pricing.
  • Refusal claims need larger adversarial audits.
  • Buyer education is nontrivial: not another vector DB, but also not a full LLM platform.

What to say, and what not to say.

Calibrated language. Two true sentences instead of one false one.
scientific positioning

ManifoldMemory is a narrow research program showing that frozen reconstruction-autoencoder latent spaces can support content-addressable doc-anchored QA retrieval at million-candidate scale, with Hopfield-like geometric signatures and a measurable mechanism-to-scale causal link.

product positioning

Warrant is a private, no-egress retrieval appliance for regulated corpora. It combines a standard dense retriever with a latent-native complementary retriever, unions their candidates, reranks them through a small MixK navigator, and feeds cited evidence to an open reader with explicit refusal states.

Say (accurate & impressive)

  • “We found a complementary retrieval geometry that standard dense retrieval misses, and we built a practical union / rerank / refusal system around it.”
  • “Million-scale doc-anchored QA retrieval with shallow degradation, with seven Hopfield-consistent geometric pillars.”
  • “Warrant improves self-hosted open-reader memory performance by routing evidence, not by relying on a frontier full-context reader.”

Do not say (unsupported)

  • “We beat all retrieval systems.”
  • “QNDN beats BGE at rank 1.” (It doesn't, by design.)
  • “Compressed-latent retrofits onto pretrained LLMs work.” (The first paper falsifies this.)
  • “Latent-native reasoning is solved.”

Continuity summary.

The compressed version a continuation team can read in 90 seconds and act on.

ManifoldMemory began as an attempt to compress language into dense latent memory and ask whether a model could reason over those latents instead of raw tokens. The first research phase falsified the obvious retrofit path: pretrained LLMs with cross-attention or prefix adapters did not read compressed latents, as shown by a paired wrong-latent diagnostic and gate-trajectory analysis across 46 experiments. The second phase discovered a more viable substrate: a frozen reconstruction-autoencoder latent manifold, navigated by a small latent-native processor, exhibits measurable content-addressable memory behavior for doc-anchored QA at million-candidate scale. The third phase turned that mechanism into a deployable retrieval architecture: a 30M Perceiver/MixK navigator, trained with a mixed-K curriculum, reranks candidate unions from standard dense retrieval and a latent-native QNDN retriever.

The central empirical discovery is that QNDN is not a better BGE-style retriever at rank 1. It is a different retrieval geometry. BGE-large is sharp-headed (strong at P@1 and shallow ranks); QNDN v0 is shallow-tailed (weak at P@1, but better at deep-K recall). Their union creates substantially higher candidate coverage than either alone; MixK then concentrates that coverage into top-k evidence for a reader. This makes the product commercially meaningful: Warrant can improve private-corpus evidence retrieval without asking customers to send data to external APIs.

The current project should be treated as a serious narrow-breakthrough candidate, not as a solved general retrieval system. Its strongest scientific claim is “retrieval as manifold geometry”; its strongest product claim is “private, auditable, evidence-first retrieval for regulated corpora.” The next step is independent reproduction, modern baseline bakeoff, cross-domain validation, and — if pursuing the deepest version — co-evolved latent-native models trained from scratch to operate inside the manifold rather than retrofitted onto language-token priors.

// Receipts & sources

  • [wrong-latent] paper_wlt_v6_draft.md — 46 experiments, 4 interface families × 18 configs, 0.5B–72B, 17.7M–10.5B trainable params, 49 % LLM-answerable, gate-drift < 0.03, p > 0.6 / n = 80.
  • [manifold-is-memory] PAPER2_FINAL.md + recovered journal Phase 53–57 — 197,578 QA pairs, 92M autoencoder, 30M processor, 1,247,342 distractors @ P@1=0.205 / MRR=0.250 / lift 255,705×, 3,719,845 distractors @ P@1=0.2895 (−2.1 pp from 197K), attractor cos-gap +0.095 → +0.164 (Cohen's d=0.80).
  • [navigating] PAPER3_FINAL.md + journal Phase 56-A / 58-alt / 80-B / 86 — Perc-30M>dense, 34M → 95M flat, MixK K∈{1,2,4,8} K=15 0.010 → 0.138 (14×), MixK +34.5 % @ 197K, Hard-NCE +11.9 % solo / 0.351 → 0.118 combined collapse, CLOOB 2.86× sharpening + cross-modal antagonism, non-oracle P@1 = 0.0430 @ 7.77M, QPhead 13.1M params, BGE-large ∪ QNDN v0 → MixK stage-2 P@1 0.34 / P@10 0.47.
  • [deep-K] _remote_pulls/phase86_k10000/final_matrix_3m5_k10000.md + final_matrix_7m52_k10000.md — full P@1 / P@10 / P@100 / P@1000 / P@10000 matrix; BGE@10000 = 0.5095, BGE ∪ v0@10000 = 0.822, recall lift +31.25 pp.
  • [lme-s] Recovered journal Phase 91 / 95-G–K — R@5 = 96.20 % (481/500); Gemma-4-26B-A4B Hybrid 70.0 % / Stack-only 63.8 % / Naked-110K 53.2 % under K=5, 3-of-5 GPT-4o judge MV. Hybrid uses oracle qtype routing for the published number; production-equivalent learned classifier pending (Δ expected 1–3 pp).
  • [live] manifoldmemory_site/index.html — live product page mirrors every number above (deep-K block, R@5, 70.0 % Hybrid, oracle-qtype disclosure).