Databricks published a framework on April 10 arguing that the real bottleneck for enterprise AI agents is not model size or context window length. It is memory: the persistent store of past conversations, user feedback, interaction trajectories, and business context that an agent can retrieve at inference time.
The company calls this “memory scaling,” a property whereby agent performance improves as external memory grows. Its experimental results on Databricks Genie Spaces back the claim with concrete numbers.
Three Scaling Axes, One Clear Winner
Databricks distinguishes memory scaling from two established approaches, according to its blog post:
Parametric scaling (bigger models) delivers diminishing returns at increasing cost.

Context-window scaling (longer prompts) increases latency, raises compute costs, and degrades reasoning as irrelevant tokens compete for attention.

Memory scaling relies on selective retrieval of high-signal information from persistent stores, keeping latency and cost down while surfacing only what the agent needs.
The critical insight: more memory does not automatically help. Low-quality traces can teach wrong lessons, and retrieval gets harder as stores grow. The question is whether agents can use larger memories productively.
The Experiments
Databricks tested its MemAlign framework on Genie Spaces, a natural-language interface where business users ask data questions in English and get SQL answers.
Labeled data scaling: Starting from near zero, accuracy climbed steadily to 70% as annotated examples were added to memory, surpassing the expert-curated baseline (manually written schemas, domain rules, few-shot examples) by roughly 5 percentage points. Average reasoning steps dropped from 20 to 5. Agents stopped exploring databases from scratch once they could retrieve relevant context directly, per Databricks.
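The post doesn’t publish MemAlign’s retrieval internals, but the mechanism it describes maps onto standard embedding-based few-shot retrieval. A minimal sketch under that assumption; the embed() stand-in and in-memory store are illustrative, not Databricks’ implementation:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: hash tokens into a bag-of-words vector.
    A real system would call a sentence-embedding model here."""
    v = np.zeros(256)
    for tok in text.lower().split():
        v[hash(tok) % 256] += 1.0
    return v

# Episodic memory: annotated (question, SQL) pairs, embedded once at write time.
memory: list[dict] = []

def remember(question: str, sql: str) -> None:
    memory.append({"question": question, "sql": sql, "vec": embed(question)})

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(user_question: str, k: int = 3) -> list[dict]:
    """Selective retrieval: only the k highest-signal examples enter the
    prompt, so context stays small however large the store grows."""
    q = embed(user_question)
    return sorted(memory, key=lambda m: cosine(q, m["vec"]), reverse=True)[:k]

def build_prompt(user_question: str) -> str:
    """Few-shot assembly: retrieved examples replace from-scratch schema
    exploration, which is where the reported step-count drop comes from."""
    shots = "\n\n".join(f"Q: {m['question']}\nSQL: {m['sql']}"
                        for m in retrieve(user_question))
    return f"{shots}\n\nQ: {user_question}\nSQL:"
```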
Unlabeled user logs: Feeding agents raw conversation logs (no gold answers, filtered only by an automated LLM judge) produced accuracy above 50%, surpassing the expert baseline of 33% after just 62 log records. Reasoning steps fell from 19 to roughly 4.3. The takeaway: uncurated user interactions can substitute for hand-engineered domain instructions when filtered appropriately.
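Databricks doesn’t detail the judge, so the prompt and admission rule in this sketch are assumptions; it only shows where such a gate would sit on the memory write path:

```python
import json
from typing import Callable

# Judge criteria here are an assumption for illustration; the post doesn't
# publish them.
JUDGE_PROMPT = """You are auditing one record from an analytics agent's logs.
Question: {question}
Agent SQL: {sql}
User follow-up: {followup}
Did the interaction appear to succeed? Answer as JSON: {{"keep": true or false}}"""

def admit_log_record(record: dict, memory: list[dict],
                     llm: Callable[[str], str]) -> bool:
    """Admit a raw, unlabeled log record into memory only if the LLM judge
    scores it as a likely-successful interaction. No gold answer required."""
    verdict = json.loads(llm(JUDGE_PROMPT.format(**record)))
    if verdict.get("keep"):
        memory.append({"question": record["question"], "sql": record["sql"]})
    return bool(verdict.get("keep"))
```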
Organizational knowledge stores: Pre-computing organizational assets (table schemas, dashboard queries, glossaries) into structured memory delivered a roughly 10% accuracy improvement on data research benchmarks. Gains concentrated on questions requiring vocabulary bridging, table joins, and column-level knowledge.
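What “pre-computing organizational assets into structured memory” could look like, as a sketch; the record fields and bootstrap() helper are illustrative, not Genie’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeRecord:
    """One precomputed organizational-memory entry; field names are illustrative."""
    kind: str                  # "schema" | "dashboard_query" | "glossary"
    subject: str               # table name or business term
    content: str               # column list, canonical SQL, or definition
    aliases: list[str] = field(default_factory=list)  # for vocabulary bridging

def bootstrap(catalog: dict[str, dict[str, str]],
              glossary: dict[str, str]) -> list[KnowledgeRecord]:
    """Precompute schemas and glossary terms into memory before any query runs,
    so joins and column choices are retrieved rather than rediscovered."""
    records = [
        KnowledgeRecord("schema", table,
                        ", ".join(f"{col} {typ}" for col, typ in cols.items()))
        for table, cols in catalog.items()
    ]
    records += [
        KnowledgeRecord("glossary", term, definition)
        for term, definition in glossary.items()
    ]
    return records
```

A glossary record is what would let an agent bridge a business term like “churn” to the right fact table and column, which is where Databricks says the gains concentrated.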
Infrastructure Requirements
Databricks outlines what production memory scaling demands: not a simple vector database but unified storage that supports structured queries, full-text search, and vector similarity in a single engine. The post points to PostgreSQL-based systems such as Lakebase, built on Neon’s serverless infrastructure.
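Stock PostgreSQL can already serve all three access paths, which is presumably the appeal. A sketch of a hybrid query using built-in full-text search plus the pgvector extension; the agent_memory table and its columns are invented for illustration, not Lakebase’s actual schema:

```python
# One PostgreSQL query combining all three access paths. Assumes the pgvector
# extension is installed; qvec is the query embedding serialized as '[...]'.
HYBRID_QUERY = """
SELECT id,
       content,
       ts_rank(to_tsvector('english', content),
               plainto_tsquery('english', %(q)s))  AS text_score,
       1 - (embedding <=> %(qvec)s::vector)        AS vector_score
FROM   agent_memory
WHERE  workspace_id = %(workspace)s                    -- structured filter
  AND  to_tsvector('english', content)
       @@ plainto_tsquery('english', %(q)s)            -- full-text search
ORDER  BY embedding <=> %(qvec)s::vector               -- vector similarity
LIMIT  10;
"""
```

One round trip returns a structured filter, a text-rank score, and a vector-similarity score together, which is the single-engine property the post emphasizes.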
Three operational layers are required: bootstrapping (initial knowledge loading), distillation (converting raw episodic memories into generalized semantic patterns), and consolidation (deduplication, pruning, conflict resolution). On top of those sits governance: identity-aware access controls, row-level security, column masking, data lineage, and audit trails.
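Of the three layers, consolidation is the most mechanical to sketch. A minimal pass, assuming each record carries an embedding and a timestamp; the duplicate threshold is an invented number, not a Databricks parameter:

```python
import numpy as np

def consolidate(memories: list[dict], dup_threshold: float = 0.95) -> list[dict]:
    """Consolidation pass: drop near-duplicate records, and when two records
    cover the same ground, keep the more recently updated one."""
    kept: list[dict] = []
    # Newest first, so the survivor of any duplicate pair is the fresh record.
    for m in sorted(memories, key=lambda r: r["updated_at"], reverse=True):
        is_duplicate = any(
            float(np.dot(m["vec"], k["vec"])
                  / (np.linalg.norm(m["vec"]) * np.linalg.norm(k["vec"]) + 1e-9))
            > dup_threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(m)
    return kept
```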
The Convergence Thesis
The strategic argument is straightforward: as foundation models converge in capability, the differentiator for enterprise agents shifts to what memory they have accumulated rather than which model they call. A smaller model with a rich memory store could outperform a larger model with less memory, making memory infrastructure investment potentially more valuable than model parameter scaling, according to Databricks.
This positions memory as a new infrastructure market. The companies building memory management tooling, retrieval pipelines, and governance systems for persistent agent stores may capture more enterprise value than those competing on model benchmarks alone.