Hugging Face shows how test-time scaling can help small language models punch above their weight.




In a new case study, Hugging Face researchers show how small language models (SLMs) can be configured to outperform much larger models. Their findings show that a Llama 3 model with 3B parameters can outperform the 70B version of the model on complex math problems.

Hugging Face has fully documented the entire process and provides a roadmap for enterprises that want to create their own customized reasoning models.

Image source: Hugging Face

Scaling test-time compute

The work is inspired by OpenAI o1, which uses extra “thinking” to solve complex problems in math, coding and reasoning.

The key idea behind models like o1 is scaling “test-time compute,” which effectively means using more compute cycles during inference to test and verify different answers and reasoning paths before producing the final answer. Scaling test-time compute is especially useful when there is not enough memory to run a large model.

Since o1 is a private model and OpenAI has remained tight-lipped about its internal workings, researchers have speculated about how it works and tried to reverse-engineer the process. There are already several open alternatives to o1.

Hugging Face’s work builds on a DeepMind study released in August, which investigates the tradeoffs between inference-time and pre-training compute. The study provides comprehensive guidelines on how to balance training and inference compute to get the best results for a given budget.

In addition to using more inference-time compute, the success of the technique hinges on two key components: a reward model that evaluates the SLM’s answers, and a search algorithm that optimizes the path the model takes to refine those answers.

Image source: Hugging Face

Different reasoning algorithms

The simplest way to use test-time scaling is “majority voting,” in which the same prompt is sent to the model multiple times and the most common answer is chosen. Majority voting can be useful on simple problems, but its gains quickly plateau on complex reasoning problems or on tasks where errors are consistent across generations.
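The voting step itself is simple to implement. The sketch below is a minimal illustration, not Hugging Face’s actual code; the sampled answers are made up for the example:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most frequent final answer across multiple generations."""
    counts = Counter(answers)
    winner, _ = counts.most_common(1)[0]
    return winner

# Hypothetical final answers extracted from five samples of the same prompt:
samples = ["42", "41", "42", "42", "7"]
print(majority_vote(samples))  # -> "42"
```

Note that the vote is over final answers only; if the model makes the same mistake in most generations, majority voting simply ratifies that mistake.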

A more advanced method is “Best-of-N.” In this technique, the SLM generates multiple answers, but instead of majority voting, a reward model is used to evaluate the answers and select the best one. “Weighted Best-of-N,” a more nuanced version of this method, factors in consistency to choose answers that are both confident and occur more frequently than others.

The researchers used a “process reward model” (PRM) that scores the SLM’s output not only on the final answer but on the multiple stages it goes through to reach it. Their experiments show that Weighted Best-of-N and PRMs bring the Llama-3.2 1B close to the level of the Llama-3.2 8B on the difficult MATH-500 benchmark.
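The weighted variant can be sketched in a few lines. The function and the (answer, score) pairs below are illustrative assumptions, with the PRM’s per-answer scores stubbed in as plain floats:

```python
from collections import defaultdict

def weighted_best_of_n(candidates: list[tuple[str, float]]) -> str:
    """candidates: (final_answer, reward_score) pairs from N generations.

    Plain Best-of-N would return the single highest-scored answer; the
    weighted variant sums reward scores over identical answers, favoring
    answers that are both high-scoring and frequent.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

# Hypothetical (answer, PRM score) pairs from four generations:
cands = [("x=3", 0.9), ("x=5", 0.95), ("x=3", 0.8), ("x=3", 0.7)]
print(weighted_best_of_n(cands))  # "x=3" (total 2.4) beats "x=5" (0.95)
```

Here plain Best-of-N would have picked “x=5,” the single highest-scored answer; weighting by frequency overrules the one-off outlier.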

Image source: Hugging Face

To further improve the model’s performance, the researchers added search algorithms to the model’s reasoning process. Instead of generating the answer in one pass, they used “beam search,” an algorithm that guides the model’s answer process step by step.

At each step, the SLM generates multiple partial answers. The search algorithm uses the reward model to evaluate the answers and selects a subset worth exploring further. The process is repeated until the model exhausts its inference budget or reaches the correct answer. This way, the inference budget can be narrowed to focus on the most promising answers.
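The loop described above can be sketched as follows. This is a generic beam search skeleton under assumed interfaces, not Hugging Face’s implementation: `step_fn` stands in for the SLM’s partial-answer sampler and `score_fn` for the PRM.

```python
def beam_search(prompt, step_fn, score_fn, beam_width=4, expand=2, max_steps=5):
    """Guide generation step by step, keeping only promising partial paths.

    step_fn(path)  -> candidate next steps (stands in for the SLM sampler)
    score_fn(path) -> quality score for a partial path (stands in for the PRM)
    Both callables are illustrative placeholders.
    """
    beams = [[prompt]]  # each beam is one partial reasoning path
    for _ in range(max_steps):
        candidates = []
        for path in beams:
            # Expand each surviving path with a few candidate next steps.
            for step in step_fn(path)[:expand]:
                candidates.append(path + [step])
        # Prune: keep only the beam_width highest-scoring partial paths.
        candidates.sort(key=score_fn, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # best path found within the step budget

# Toy demo: the "model" offers steps "a" or "b"; the "PRM" rewards "a"s.
best = beam_search("start", lambda p: ["a", "b"],
                   lambda p: p.count("a"), beam_width=2, max_steps=3)
```

The pruning step is where the inference budget is concentrated: low-scoring branches are abandoned early rather than being generated to completion.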

The researchers found that while beam search improves model performance on complex problems, it tends to underperform other techniques on simple problems. To address this challenge, they added two more elements to their inference strategy.

First is Diverse Verifier Tree Search (DVTS), a variant of beam search that ensures the SLM does not get stuck in wrong reasoning paths and diversifies its answer branches. Second, they developed a “compute-optimal scaling strategy,” as suggested in the DeepMind paper, which dynamically selects the best test-time scaling strategy based on the difficulty of the input problem.
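A compute-optimal policy of this kind amounts to a dispatch on estimated problem difficulty. The thresholds and strategy names below are illustrative assumptions, not the actual policy from the DeepMind paper:

```python
def compute_optimal_strategy(difficulty: float) -> str:
    """Pick a test-time scaling strategy from an estimated difficulty in [0, 1].

    The 0.3 / 0.7 cutoffs and the strategy names are hypothetical; a real
    policy would be fit empirically per model and compute budget.
    """
    if difficulty < 0.3:
        return "best_of_n"    # easy: independent samples are enough
    elif difficulty < 0.7:
        return "beam_search"  # medium: guided step-by-step search pays off
    else:
        return "dvts"         # hard: diversify branches to escape bad paths
```

The point of the dispatch is to avoid paying for expensive search on problems where cheap sampling already gets the right answer.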

The combination of these techniques enables the Llama-3.2 1B to punch above its weight and outperform the 8B model by a significant margin. The researchers also found that the strategy was scalable: when applied to the Llama-3.2 3B, it was able to outperform the much larger 70B model.

Not a perfect solution yet

Scaling test-time compute changes the dynamics of model costs. Enterprises now have the ability to choose where to allocate their compute resources. For example, if you are short on memory or can tolerate slower response times, you can use a small model and spend more inference-time cycles to generate more accurate answers.

However, test-time scaling also has its limits. For example, in the Hugging Face experiments, the researchers used a specially trained Llama-3.1-8B model as the PRM, which requires running two models in parallel (even though this is still far more resource-efficient than the 70B model). The researchers acknowledge that the holy grail of test-time scaling is “self-verification,” where the original model verifies its own answer rather than relying on an external verifier. This remains an open area of research.

The test-time scaling technique presented in this study is also limited to problems where the answer can be clearly evaluated, such as coding and math. Creating reward models and verifiers for subjective tasks such as creative writing and product design requires further research.

What is clear, however, is that test-time scaling has generated a lot of interest and activity, and we can expect more tools and techniques to emerge in the coming months. Enterprises would be wise to keep an eye on how the landscape evolves.


