Retrieval, measured.
ManifoldMemory is an independent research lab working on the mathematics and engineering of retrieval at scale: how to find the right fragment in a million-document private archive, how to make a language model refuse when the evidence isn't there, and how to deploy both inside infrastructure that cannot make external API calls. Our research ships as Warrant, an evidence-first retrieval primitive for regulated enterprises.
A small research group. One hard problem.
Retrieval quality degrades as archives grow. Dense retrievers that work at 100 K documents lose double-digit accuracy at 10 M. The lab exists to understand why, fix it at the mathematical level, and ship the fix as a deployable product. We publish our measurements; we don't publish hype.
Every number on this site has a receipt.
If a metric doesn't have a pipeline, a pool definition, and a reproducible evaluation script behind it, it doesn't appear. We maintain a ~25,000-line experiment journal where every claim on every page is cross-referenced to raw data.
The things that didn't work are part of the product.
Four large programs of work were walked away from after honest measurement: a 7× bigger encoder, a 6.7× bigger reranker, an "obviously better" training objective that composed antagonistically, and a generation-from-latents track that mode-collapsed under every loss we tried. Each closure is documented.
Parameter count is not the moat.
Our production reranker is 30 M parameters. We tested 200 M and 500 M variants; they lost. The scaling advantage is in the training objective and manifold geometry, not in how many floats the model has.
Private-corpus is the design constraint, not an afterthought.
Every component runs on-premise on a single commodity GPU. No external API, no shared index, no data egress. The threat model is a regulated customer who literally cannot send data to a managed service, and the architecture reflects that.
Four open questions. Measured answers.
The lab operates on four active research threads. Each one has a measurable question, a reproducible experimental protocol, and a set of findings that feed directly into the product.
Hopfield retrieval on real natural language
Modern Hopfield networks (Ramsauer et al., 2020) predict an exponential-in-Δ² storage capacity and a β-regime phase transition in retrieval dynamics. Both results had previously been demonstrated mostly on synthetic data. We measured them on a learned 7.47 M-row natural-language manifold.
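The retrieval dynamics in question can be sketched in a few lines. This is a minimal toy illustration of the standard modern-Hopfield update xi ← Xᵀ softmax(β·X·xi), not the lab's implementation; the pattern matrix, noise level, and β value are illustrative.

```python
import numpy as np

def hopfield_retrieve(X, xi, beta=10.0, steps=1):
    """Modern Hopfield update: xi <- X^T softmax(beta * X xi),
    where rows of X are the stored (unit-norm) patterns."""
    for _ in range(steps):
        logits = beta * (X @ xi)       # similarity of the cue to each stored row
        logits -= logits.max()         # numerical stability before exponentiating
        p = np.exp(logits)
        p /= p.sum()
        xi = X.T @ p                   # convex combination of stored patterns
    return xi / np.linalg.norm(xi)

# Toy demo: three well-separated unit patterns; a noisy cue for pattern 1
# converges back to it when beta is in the sharp-retrieval regime.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)
cue = X[1] + 0.1 * rng.normal(size=64)
out = hopfield_retrieve(X, cue / np.linalg.norm(cue), beta=10.0, steps=2)
```

At low β the softmax flattens and the update returns a blend of patterns rather than a single memory; the β-regime transition is exactly the crossover between those two behaviours.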
When two good ideas fight each other
The naive assumption in retrieval research is that orthogonal interventions compose additively. We repeatedly find they don't. A sharper within-modality manifold, combined with a mixed-K curriculum, collapses P@1 by 60%. A CLOOB-sharpened substrate, combined with a cross-modal projection head, retrieves 1.9× worse than the un-sharpened substrate does.
The geometry of retriever failure modes
At 1 M+ documents, different retrievers fail in different shapes. A sharp-head retriever lands the gold answer at rank 1 or not at all. A shallow-tail retriever reliably lands it near rank 50 but rarely at rank 1. Reporting a single P@K number hides the shape. We characterise retrievers across the full depth curve.
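The depth-curve characterisation reduces to computing success-at-K over a range of K rather than a single cutoff. A minimal sketch, with two hypothetical rank distributions standing in for the sharp-head and shallow-tail failure shapes described above:

```python
import numpy as np

def success_at_depth(ranks, depths=(1, 5, 10, 50, 100, 1000)):
    """Fraction of queries whose gold document appears at rank <= K,
    for each K in `depths`. `ranks` holds the gold rank per query
    (1-based; use np.inf when the gold document was never retrieved)."""
    ranks = np.asarray(ranks, dtype=float)
    return {k: float(np.mean(ranks <= k)) for k in depths}

# Two hypothetical retrievers with the same headline P@1000 story hidden
# behind opposite shapes: rank 1 or nowhere, vs. reliably near rank 50.
sharp_head   = [1, 1, 1, np.inf, np.inf, np.inf]
shallow_tail = [40, 55, 48, 52, 60, np.inf]
```

Plotting the full curve for both makes the shape difference visible where any single P@K number would hide it.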
The LLM contract for evidence-bounded corpora
A language model grounded on retrieved evidence should answer when the evidence is sufficient, cite when it answers, and refuse otherwise. Most deployed RAG systems fail the third clause. We treat refusal as a first-class primitive with its own audit suite, including a cross-encoder-style similarity floor and a distinct UX for three outcome states.
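The three-outcome contract can be made concrete as a gate over reranker scores. This is an illustrative sketch only: the threshold values and field names are placeholders, not Warrant's production logic.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    state: str          # "answer", "partial", or "refuse"
    citations: list     # passage ids cited when answering

def evidence_gate(scored_passages, answer_floor=0.62, partial_floor=0.45):
    """Map (passage_id, similarity_score) pairs to one of three outcome
    states. Thresholds are illustrative placeholders."""
    cited = [pid for pid, s in scored_passages if s >= answer_floor]
    if cited:
        return Verdict("answer", cited)       # sufficient evidence: answer + cite
    if any(s >= partial_floor for _, s in scored_passages):
        return Verdict("partial", [])         # thin evidence: hedge, don't assert
    return Verdict("refuse", [])              # no sufficient evidence: refuse
```

The point of the sketch is structural: refusal is a returned state with its own downstream UX, not the absence of an answer.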
Field-level contributions, honestly scoped.
Quantities we have measured that, to our knowledge, had not previously been measured on a learned natural-language manifold at this scale. Each has a file path and a reproducible evaluation behind it. None has been peer-reviewed yet; that's the gate between our journal and a submittable paper.
Minimum pairwise cosine separation Δ on a 7.47 M-row NL manifold, before and after CLOOB continued training: 0.051 → 0.146. First direct measurement of the CLOOB → Hopfield capacity bridge at million-row scale on real prose.
Continuation-retrieval P@1 traces a clean metastable-regime optimum at β ≈ 10 on the sharpened manifold. The transition is absent from the un-sharpened manifold under identical evaluation. Present-with / absent-without causal cut.
One step of iterative Hopfield refinement: ΔP@1 = −0.021 on the soft manifold, +0.076 on the sharp one. Same operator, opposite effect — the in-regime / out-of-regime cut the theory predicts.
QPhead trained on the CLOOB-sharpened substrate retrieves 1.9× worse at native P@1 than the same head on the un-sharpened substrate. Within-modality sharpening is measurably antagonistic to cross-modal alignment.
197 K → 3.72 M real distractors. Off-the-shelf dense P@1 drops 21.1%; our stack drops 6.4%. Median rank of the correct answer: 14 vs 16,197. 13.4× shallower rank degradation.
Two complementary first-stage retrievers (sharp-head + shallow-tail) reranked over their union lift P@1 by +83% over the best single-retriever baseline at 3.52 M. Top-1000 union coverage: 65.1% vs 39.5% / 43.9% alone.
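The union-then-rerank step behind that last number is simple to state. A minimal sketch, assuming ranked id lists from two first-stage runs and a `rerank_score` callable standing in for a cross-encoder reranker (all names here are illustrative, not the lab's code):

```python
def union_rerank(run_a, run_b, rerank_score, k=1000):
    """Pool the top-k candidate ids from two first-stage retrievers,
    de-duplicate, and order the union by a second-stage scoring function."""
    pool = dict.fromkeys(run_a[:k])          # dict preserves order and dedupes
    pool.update(dict.fromkeys(run_b[:k]))
    return sorted(pool, key=rerank_score, reverse=True)
```

The coverage gain comes entirely from the union step; the reranker only converts coverage into rank-1 hits.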
Journal-first, measured always, small where possible.
The lab's operating mode isn't a publication pipeline; it's a measurement pipeline. Experiments run nightly, results go into a single versioned journal, and the product inherits only what survives ablation.
What the lab does (active)
- Builds and measures retrieval primitives at up to 7.5 M real documents on one- or two-GPU workstations.
- Maintains a strict apples-to-apples benchmarking hygiene: matched pools, matched queries, matched evaluators across every head-to-head.
- Ships production-grade pipelines (Warrant) on the same hardware researchers prototype on — no ops gap between research and deploy.
- Publishes negative results with the same rigour as positive ones; the pinned "do-not-chase" list is public in our internal journal and summarised on this site.
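The matched-pool hygiene above is enforceable mechanically: fingerprint the exact pool, query set, and evaluator version, and refuse to compare runs whose fingerprints differ. A minimal sketch; the field names are illustrative, not the journal's actual schema.

```python
import hashlib
import json

def pool_fingerprint(doc_ids, query_ids, evaluator_version):
    """Hash the exact distractor pool, query set, and evaluator version so
    two head-to-head runs can assert they evaluated under identical
    conditions. Sorting makes the hash order-independent."""
    payload = json.dumps(
        {"docs": sorted(doc_ids),
         "queries": sorted(query_ids),
         "evaluator": evaluator_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Two runs are apples-to-apples only if their fingerprints match exactly.
```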
What the lab explicitly does not do (out of scope)
- Train foundation models. We build retrieval primitives that sit in front of any reader. The reader is the customer's choice.
- Chase benchmark leaderboards for their own sake. LongMemEval is a public receipt, not a goal; MTEB / BEIR replication is future work, not a current optimisation target.
- Serve web-scale open-domain search. That moat is built on $100 B of R&D spend; we cannot compete on its axes, and we don't try.
- Ship a managed cloud API. The product is deployable on commodity hardware inside the customer's infrastructure. That is the whole design constraint.
The lab is open to technical collaboration.
Researchers interested in Hopfield-on-natural-language, retrieval scaling, or evidence-first LLM contracts can write to the lab directly. We share reproducible pipelines for any published number on this site, typically under an NDA or academic-collaboration agreement. Commercial evaluation requests should use the Warrant page below.
Response time: typically 2–5 business days.