MIT’s new fine-tuning method allows LLMs to learn new skills without losing old ones



When businesses fine-tune LLMs for new tasks, they run the risk of breaking what the model already knows. This forces companies to maintain a separate model for each skill.

Researchers at MIT, the Improbable AI Lab and ETH Zurich have developed a new technique that enables large language models to learn new skills and knowledge without forgetting their previous capabilities.

Their technique, called self-distillation fine-tuning (SDFT), allows models to learn directly from demonstrations and from their own attempts by exploiting the inherent in-context learning abilities of modern LLMs. Experiments show that SDFT consistently outperforms traditional supervised fine-tuning (SFT) while sidestepping the limitations of reinforcement learning algorithms.

For business applications, the method enables a model to accumulate skills over time without degrading its performance on previous tasks. This offers a potential path for building AI agents that can adapt to dynamic business environments, absorbing new proprietary knowledge and skills as needed without requiring expensive retraining cycles or losing their general reasoning abilities.

The challenge of continuous learning

Once an LLM is trained and deployed, it remains static. It does not update its parameters to acquire new skills, internalize new knowledge, or improve from experience. To build truly adaptive AI, the industry must address "continuous learning," which would allow systems to accumulate knowledge just as people do throughout their careers.

The most effective way for models to learn is through "on-policy learning." In this method, the model learns from data it generates itself, which allows it to correct its own errors and reasoning processes. This contrasts with learning by simply imitating a static dataset. Without on-policy learning, models are prone to "catastrophic forgetting," a phenomenon where learning a new task causes the model to lose its previous knowledge and its ability to perform earlier tasks.

However, on-policy learning typically requires reinforcement learning (RL), which depends on a clear reward function to score model outputs. This works well for problems with verifiable outcomes, such as math and coding. But in many real-world business scenarios (for example, writing a legal brief or summarizing a meeting), it is difficult or impossible to define a mathematical reward function.

RL methods also often fail when trying to teach a model new information, such as a specific company protocol or a new product line. As Idan Shenfeld, a doctoral student at MIT and co-author of the paper, told VentureBeat, "No matter how many times the base model tries, it cannot generate correct answers for a subject without knowledge of it," meaning it can't get a positive reward signal to learn from.

The standard alternative is supervised fine-tuning (SFT), where the model is trained on a fixed dataset of expert demonstrations. While SFT provides clear ground truth, it is inherently "off-policy." Because the model only imitates the data instead of learning from its own attempts, it often fails to generalize to out-of-distribution examples and suffers heavily from catastrophic forgetting.
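To make the distinction concrete, here is a minimal sketch of a standard SFT step in PyTorch/Hugging Face style (the function and setup are illustrative assumptions, not code from the paper). The loss is computed only against the fixed expert text, so the model never trains on, or corrects, anything it generated itself:

```python
def sft_step(model, tokenizer, prompt, demonstration, optimizer):
    """One supervised fine-tuning step: imitate a fixed expert demonstration.

    The gradient flows only through the static expert text, never through
    anything the model generated itself -- which is what makes SFT
    "off-policy" and prone to catastrophic forgetting.
    """
    # Concatenate the prompt and the expert answer, then compute the standard
    # next-token cross-entropy loss (in practice the prompt tokens are often
    # masked out of the loss; omitted here for brevity).
    ids = tokenizer(prompt + demonstration, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```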

SDFT seeks to bridge this gap: enabling the benefits of on-policy learning using only prerecorded demonstrations, without the need for a reward function.

How SDFT works

SDFT solves this problem through "distillation," a process in which a student model learns to imitate a teacher. The researchers' insight is to use the model's own in-context learning (ICL) capabilities to create a feedback loop within a single model.

In-context learning is when an LLM is given a new task along with one or more demonstrations of how to solve similar problems directly in its prompt. Most advanced LLMs can use such ICL examples to solve new problems without any parameter updates.
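As a simple illustration (the prompt below is invented, not taken from the paper), ICL amounts to packing a worked example into the prompt itself:

```python
# A hypothetical few-shot prompt: the frozen model sees a worked example in
# its context window and is asked to solve a new, similar problem. No
# parameters are updated -- the adaptation happens entirely in the prompt.
icl_prompt = """Example:
Question: A patient taking drug X develops symptom Y. What is the likely cause?
Expert answer: Drug X inhibits enzyme Z, and symptom Y is a known downstream effect.

Now solve:
Question: A patient taking drug A develops symptom B. What is the likely cause?
Answer:"""

# answer = frozen_model.generate(icl_prompt)  # no gradient updates involved
```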

During each training cycle, SDFT uses the model in two roles:

The teacher: A frozen version of the model is fed the question together with the expert demonstrations. Using ICL, the teacher produces the correct answer and the reasoning needed to arrive at it.

The student: This version sees only the question, simulating a real-world deployment scenario where no answer key is available.

When the student generates an answer, the teacher, which has access to the expert demonstrations, provides feedback. The student then updates its parameters to move closer to the teacher's distribution.

This process effectively creates an on-policy learning loop that combines elements of SFT and RL. The supervision does not come from a static dataset but from the model's own interactions and outputs, which allows the model to correct its own reasoning paths without requiring an external reward signal. It also works for injecting new knowledge, something RL struggles to do.
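Put together, one training step looks roughly like the sketch below. This is a simplified, hypothetical PyTorch/Hugging Face-style rendering with made-up helper names, not the authors' released code, and the exact loss and sampling details in the paper may differ: the student rolls out an answer from the bare question, the frozen teacher scores that same answer with the demonstration in its context, and the student is pulled toward the teacher's token distribution.

```python
import torch
import torch.nn.functional as F

def sdft_step(student, teacher, tokenizer, question, demonstration, optimizer):
    """One simplified SDFT-style update (illustrative sketch only).

    `teacher` is assumed to be a frozen copy of the same model; it sees the
    expert demonstration in its context (in-context learning), while the
    trainable `student` sees only the question. Training the student on its
    OWN rollout is what makes the update on-policy.
    """
    # 1. Student generates an answer from the bare question (on-policy rollout).
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    with torch.no_grad():
        rollout = student.generate(q_ids, max_new_tokens=128)
    answer_ids = rollout[:, q_ids.shape[1]:]

    # 2. Teacher scores the same answer, with the demonstration prepended so
    #    it can exploit in-context learning.
    ctx_ids = tokenizer(demonstration + "\n" + question,
                        return_tensors="pt").input_ids
    with torch.no_grad():
        teacher_logits = teacher(
            torch.cat([ctx_ids, answer_ids], dim=1)
        ).logits[:, ctx_ids.shape[1]:, :]

    # 3. Student is trained to match the teacher's distribution over the
    #    answer tokens of its own rollout -- a distillation loss, no reward.
    student_logits = student(rollout).logits[:, q_ids.shape[1]:, :]
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the loss compares two token distributions over the student's own samples rather than a scalar reward, the same loop applies to subjective tasks where an RL reward function would be hard to define.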

SDFT in action

To validate the method, the researchers tested SDFT using the open-weights Qwen 2.5 model on three complex, business-relevant skills: scientific Q&A, software tool use, and medical reasoning.

The results show that SDFT learns new tasks more effectively than standard methods. In the Science Q&A benchmark, the SDFT model achieved 70.2% accuracy, compared to 66.2% for the standard SFT method.

More important for business adoption is the effect on catastrophic forgetting. When the standard SFT model learned the scientific task, its ability to answer general questions (such as logic or humanities questions) collapsed. In contrast, the SDFT model improved on the scientific task while holding its "prior tasks" score steady at 64.5%. This stability suggests that companies can specialize models for specific departments (for example, HR or legal) without compromising the model's basic common sense or reasoning ability.

The team also simulated a knowledge-injection scenario, creating a fictional "2025 Natural Disasters" dataset to teach the model new facts. They then tested the model with indirect reasoning questions, such as "Due to the floods in 2025, which countries are most likely to need humanitarian assistance?"

Standard SFT produced a model that memorized the facts but struggled to apply them in reasoning scenarios. The SDFT model, which internalized the logic during training, answered 98% of the same questions correctly.

Finally, the researchers ran a sequential-learning experiment, training the model on the scientific, tool-use, and medical tasks one after another. While the standard model's performance oscillated, losing previous skills as it learned new ones, the SDFT model successfully accumulated all three skills without regression.

This capability addresses a key pain point for businesses that currently manage a "model zoo" of separate adapters for different tasks.

"We offer the ability to maintain only one model for all company needs," Shenfeld said. This consolidation "can lead to a large reduction in inference costs" because organizations do not need to host multiple models simultaneously.

SDFT limitations and availability

The code for SDFT is available on GitHub and is ready to be integrated into existing model training workflows.

"The SDFT pipeline is more similar to the RL pipeline as it requires online response generation during training," Shenfeld said. They are working with Hugging Face to integrate SDFT with the latter Learn Transformer Stabilization (TRL) library, he added, announcing that a pull request is now open for developers who want to test the integration.

For teams considering SDFT, the practical tradeoffs come down to model size and compute. The technique requires models with strong enough in-context learning to act as their own teachers – currently around 4 billion parameters with newer architectures like Qwen 3, although Shenfeld expects 1-billion-parameter models to work soon. It also requires approximately 2.5 times the compute of standard fine-tuning, but it is best suited for organizations that need one model to accumulate many skills over time, especially in domains where defining a reward function for reinforcement learning is difficult or impossible.

Although effective, the approach comes with computational tradeoffs. SDFT is approximately four times slower and requires 2.5 times more compute (FLOPs) than standard fine-tuning because the model must actively generate its own responses ("rollouts") during training to compare against the teacher. However, the researchers note that because the model retains knowledge better, organizations can avoid the expensive multi-stage retraining processes often required to repair models that suffer from catastrophic forgetting.

The technique also relies on the underlying model being large enough to benefit from in-context learning. The paper notes that smaller models (e.g., 3 billion parameters) initially struggled because they lack the "intelligence" to act as their own teachers.

However, Shenfeld says the rapid development of smaller models is changing this dynamic. "The Qwen 2.5 3B models are very weak, but in some experiments we have done today, we found that the Qwen 3 4B model is strong enough," he said. "I see a future where even 1B models have enough ICL capabilities to support SDFT."

Ultimately, the goal is to move beyond static snapshots to systems that evolve through use.

"Lifelong learning, with the ability to extract learning signals from unstructured user interaction… will lead to models that continue to improve over time," said Shenfeld.

"Consider the fact that most computation around the world goes to inference instead of training. We need to find ways to use this computation to improve our models."


