MemRL outperforms RAG on complex agent benchmarks without fine-tuning



A new technique developed by researchers at Shanghai Jiao Tong University and other institutions enables large language model (LLM) agents to learn new skills without requiring expensive fine-tuning.

The researchers propose MemRL, a framework that gives agents episodic memory: the ability to retrieve past experiences and use them to craft solutions for unseen tasks. MemRL allows agents to use environmental feedback to continuously refine their problem-solving strategies.

MemRL is part of a broader push by the research community to develop continuous learning capabilities for AI applications. In experiments on key industry benchmarks, the framework outperformed baselines such as retrieval-augmented generation (RAG) and other memory-organization methods, especially in complex environments that require exploration and experimentation. This suggests that MemRL can be a critical component for building AI applications that must operate in dynamic real-world settings where requirements and tasks are constantly shifting.

The stability-plasticity dilemma

One of the central challenges in deploying agent applications is adapting the underlying model to new knowledge and tasks after the initial training phase. Current methods generally fall into two categories: parametric methods, such as fine-tuning, and non-parametric approaches, such as RAG. Both come with significant trade-offs.

Fine-tuning, while effective at baking in new information, is expensive and computationally slow. More critically, it often leads to catastrophic forgetting, a phenomenon where newly acquired knowledge overwrites previously learned information, degrading the model's overall performance.

In contrast, non-parametric methods such as RAG are passive: they retrieve information based purely on semantic similarity (typically via vector embeddings), without evaluating how useful that information actually is for the query at hand. This approach assumes that "similar means useful," an assumption that often breaks down in complex reasoning tasks.

The researchers argue that human intelligence solves this problem by maintaining a "delicate balance between the robustness of cognitive reasoning and the plasticity of episodic memory." In the human brain, stable reasoning (associated with the cortex) is separated from dynamic episodic memory. This allows people to adapt to new tasks without "rewiring neural circuitry" (the rough equivalent of fine-tuning a model).

Within the MemRL framework

Inspired by this interplay of episodic memory and cognitive reasoning, MemRL is designed so that an agent can continuously improve its performance after deployment without compromising the stability of its backbone LLM. Instead of changing model parameters, the framework offloads adaptation to an external, self-evolving memory structure.

In this architecture, the LLM's parameters remain completely frozen. The model effectively serves as the "cortex," responsible for overall reasoning, logic, and code generation, but not for storing the specific successes and failures encountered after deployment. This division preserves stable cognitive reasoning and prevents catastrophic forgetting.

To handle adaptation, MemRL maintains a dynamic episodic memory component. Instead of storing plain-text documents with static embeddings, as is common with RAG, MemRL organizes memory into "purpose-experience-utility" triplets. Each triplet consists of the user's question (the purpose), the specific solution path or actions taken (the experience), and a score, known as a Q-value, that represents how successful that experience has proven in the past (the utility).
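As a rough illustration, and not the paper's actual schema (the field names and types here are assumptions), such a triplet could be modeled as:

```python
from dataclasses import dataclass

# Hypothetical sketch of a "purpose-experience-utility" triplet as described
# in the article; not taken from the MemRL codebase.
@dataclass
class MemoryTriplet:
    purpose: str     # the user's question or task goal
    experience: str  # the solution path or actions the agent took
    q_value: float   # utility: how successful this experience has been so far
```

A memory bank is then just a collection of these records, which is what makes it compatible with ordinary storage backends.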

Importantly for enterprise architects, this new data structure does not require ripping out existing infrastructure. "MemRL is designed to be a ‘drop-in’ replacement for the retrieval layer of existing technology stacks and is compatible with various vector databases," Muning Wen, a co-author of the paper and PhD candidate at Shanghai Jiao Tong University, told VentureBeat. "The existence and updating of ‘Q-Value’ is only for better evaluation and management of dynamic data… and is independent of the storage format."

This utility score is the key difference from classic RAG systems. During inference, MemRL agents use a two-phase retrieval mechanism: first, the system identifies memories that are semantically close to the query to ensure relevance; then it re-ranks those candidates by their Q-values, effectively prioritizing proven strategies.
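A minimal sketch of such a two-phase lookup, assuming a plain list of memory dicts and toy embedding vectors (in practice the first phase would be served by a vector database):

```python
import math

def cosine(a, b):
    # Semantic similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_phase_retrieve(query_emb, memories, k_semantic=4, k_final=2):
    # Phase 1: keep the memories most semantically similar to the query.
    candidates = sorted(
        memories,
        key=lambda m: cosine(query_emb, m["embedding"]),
        reverse=True,
    )[:k_semantic]
    # Phase 2: re-rank the survivors by learned utility (Q-value),
    # so a proven strategy beats a merely similar one.
    return sorted(candidates, key=lambda m: m["q_value"], reverse=True)[:k_final]
```

The point of the second sort is that among equally relevant memories, the one that actually worked before wins.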

The framework incorporates reinforcement learning directly into the memory retrieval process. When an agent tries a solution and receives environmental feedback (i.e., success or failure), it updates the Q-value of the retrieved memory. This creates a closed feedback loop: over time, the agent learns to ignore distractor memories and prioritize high-value strategies without retraining the underlying LLM.
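The article does not give the exact update rule; a standard tabular-style update that nudges the stored utility toward the observed outcome would look like this (`alpha` is an assumed learning rate):

```python
def update_q_value(q_old, reward, alpha=0.1):
    # Move the stored utility toward the observed outcome
    # (e.g., reward = 1.0 for success, 0.0 for failure).
    # alpha controls how quickly old estimates are overwritten.
    return q_old + alpha * (reward - q_old)
```

Repeated successes push a memory's Q-value toward 1.0; repeated failures push it toward 0.0, which is what lets the retriever gradually suppress distractors.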

While adding a reinforcement learning step may seem like it adds significant latency, Wen says the computational overhead is minimal. "Our Q-value calculations are performed entirely on the CPU," he said.

MemRL also supports continuous learning at runtime. When the agent encounters a new scenario, the system uses the frozen LLM to summarize the new trajectory and adds it to the memory bank as a new triplet. This allows the agent to expand its knowledge base dynamically as it interacts with the world.
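In sketch form, this ingestion step is just appending a fresh triplet; the starting utility here is an assumption (the paper may initialize it differently):

```python
def add_experience(memory_bank, purpose, trajectory_summary, initial_q=0.5):
    # A newly summarized trajectory enters the bank as a fresh triplet with a
    # neutral starting utility; later environmental feedback raises or lowers
    # its Q-value via the update loop described above.
    memory_bank.append(
        {"purpose": purpose, "experience": trajectory_summary, "q_value": initial_q}
    )
    return memory_bank
```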

It should be noted that automating value assignment carries a risk: if the system mistakenly rewards a bad interaction, the agent will learn the wrong lesson. Wen acknowledged this memory-poisoning risk but noted that, unlike black-box neural networks, MemRL remains transparent and auditable. "If a negative interaction is misclassified as a positive example… it can spread more widely," Wen said. "However … we can easily fix this by removing the contaminated data from the memory bank or resetting their Q-values."
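Because memory is explicit data rather than weights, the cleanup Wen describes can be an ordinary data operation. A hypothetical repair pass (the `is_poisoned` check is a caller-supplied audit function, not part of MemRL) might look like:

```python
def repair_memory(memory_bank, is_poisoned, reset_q=None):
    # Audit pass over an explicit memory bank: drop entries flagged as
    # contaminated, or, if reset_q is given, keep them but reset their Q-values.
    if reset_q is not None:
        for m in memory_bank:
            if is_poisoned(m):
                m["q_value"] = reset_q
        return memory_bank
    return [m for m in memory_bank if not is_poisoned(m)]
```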

How MemRL performs

The researchers evaluated MemRL against multiple baselines on four different benchmarks: BigCodeBench (code generation), ALFWorld (embodied navigation), Lifelong Agent Bench (OS and database interaction), and Humanity’s Last Exam (complex multidisciplinary reasoning).

Results showed that MemRL consistently outperformed baselines in both runtime learning (improvement during the session) and transfer learning (generalization to unseen tasks).

The advantages of this value-aware retrieval mechanism are particularly pronounced in exploration-heavy environments such as ALFWorld. In this benchmark, which requires agents to navigate and interact in a simulated home environment, MemRL achieved a relative improvement of approximately 56% over MemP, another agentic memory framework. The researchers found that the reinforcement learning component effectively motivates the agent to explore and discover solutions for complex tasks that similarity-based retrieval methods often fail to solve.

When the memory bank was frozen and tested on held-out task sets to measure generalization, MemRL achieved the highest accuracy across the benchmarks. For example, on Lifelong Agent Bench, it significantly outperformed the standard RAG baseline on OS tasks. This shows that the system does not simply memorize training data; it effectively filters out low-value memories and retains high-utility experiences that generalize to new situations.

The broader picture: self-evolving agents

MemRL fits within a growing body of research on Memory-Based Markov Decision Processes (M-MDP), a formulation that positions memory retrieval as an active decision-making step rather than a passive search function. By treating retrieval as an action that can be optimized through reinforcement learning, frameworks such as MemRL, and similar methods such as Memento, are paving the way for more autonomous systems.

For enterprise AI, this shift is significant. It points to a future where agents can be deployed on a general-purpose LLM and then rapidly adapt to company-specific workflows, proprietary databases, and unique problem sets through interaction alone. The key change is toward frameworks that treat applications as dynamic environments the agent can learn from.

These evolving capabilities will allow organizations to maintain consistent, high-performance agents that evolve along with their business needs, solving the problem of stale models without incurring the prohibitive costs of constant retraining.

It marks a transition in how we value data. "In a future where static data is about to be exhausted, the interaction experience generated by each intelligent agent during its lifetime will be the new fuel," Wen said.


