
Nvidia researchers have developed a technique that can cut the memory cost of large reasoning language models by up to eight times. Their technique, called Dynamic Memory Sparsification (DMS), compresses the key-value (KV) cache, the temporary memory LLMs build up as they process prompts and reason through problems and documents.
While researchers have proposed various methods of compressing this cache in the past, most have struggled to do so without degrading the model’s intelligence. Nvidia’s approach manages to discard most of the cache while maintaining (and in some cases improving) the reasoning capabilities of the model.
Experiments show that DMS lets LLMs "think" longer and explore more solutions without the usual speed penalty or memory cost.
The bottleneck of reasoning
LLMs improve their performance on complex tasks by generating "chain-of-thought" tokens, effectively writing out their reasoning steps before arriving at a final answer. Inference-time scaling techniques exploit this by giving the model a larger budget to generate these thinking tokens or to explore multiple candidate reasoning paths in parallel.
However, this improved reasoning comes with a significant computational cost. As the model generates more tokens, it builds a KV cache.
For real-world applications, the KV cache is a major bottleneck. As the reasoning chain grows, the cache grows linearly with it, consuming more and more GPU memory. This forces the hardware to spend more time reading data from memory than actually computing, which slows generation and increases latency. It also limits how many users a system can serve at once, since running out of VRAM causes the system to crash or slow to a crawl.
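To make the linear growth concrete, here is a rough back-of-envelope calculation of KV cache size. The model shapes below (32 layers, 8 KV heads, head dimension 128, 16-bit precision) are illustrative assumptions for an 8B-class model with grouped-query attention, not figures from Nvidia's paper.

```python
# Rough KV cache sizing; all model shapes are assumed values for illustration.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Keys + values, across every layer, for one sequence, in fp16/bf16."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

per_token = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1)
print(f"{per_token / 1024:.0f} KiB per generated token")  # 128 KiB

# A 32k-token reasoning trace served to 64 concurrent users:
total = kv_cache_bytes(32, 8, 128, seq_len=32_768) * 64
print(f"{total / 2**30:.0f} GiB of KV cache")  # 256 GiB
```

At those assumed shapes, the cache alone dwarfs a single GPU's memory long before the model weights are even counted, which is why eviction and compression matter.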
Nvidia's researchers frame this not only as a technical hurdle but as a fundamental economic question for businesses.
"The question is not just about the amount of hardware; it’s about whether your infrastructure processes 100 threads of logic or 800 threads at the same cost," Piotr Nawrot, Senior Deep Learning Engineer at Nvidia, told VentureBeat.
Previous attempts to solve this have relied on heuristics-based methods. These use rigid rules, such as a "sliding window" that keeps only the most recent tokens and discards the rest. While this reduces memory usage, it often forces the model to throw away information it still needs to solve the problem, which lowers the accuracy of the output.
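As a minimal sketch of that sliding-window behavior (illustrative only, not any particular inference engine's implementation; real engines track keys and values per layer and per attention head):

```python
from collections import deque

class SlidingWindowKVCache:
    """Toy sliding-window cache: keeps only the most recent window_size tokens."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.entries = deque()  # each entry: (token_id, key, value)

    def append(self, token_id, key, value):
        self.entries.append((token_id, key, value))
        if len(self.entries) > self.window_size:
            # The oldest token is dropped regardless of how important it is
            # for the rest of the reasoning chain; this blind eviction is
            # exactly what hurts accuracy.
            self.entries.popleft()
```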
"Standard eviction methods attempt to select old and unused tokens for eviction using heuristics," the researchers said. "They simplify the problem, hoping that if they approximate the internal mechanics of the model, the answer will remain correct."
Other solutions use paging to offload unused parts of the KV cache to slower memory, but the constant data swapping introduces latency overhead that makes real-time applications sluggish.
Dynamic memory sparsification
DMS takes a different approach: it retrofits existing LLMs to manage their own memory intelligently. Instead of applying a fixed rule for what to remove, DMS trains the model itself to decide which tokens will matter for future reasoning and which are disposable.
"It doesn’t just guess at the importance; it learns a policy that implicitly preserves the final distribution of the model’s output," Nawrot said.
The process transforms a standard, pre-trained LLM such as Llama 3 or Qwen3 into a self-compressing model. Crucially, it does not require training the model from scratch, which would be prohibitively expensive. Instead, DMS repurposes existing neurons within the model's attention layers to output a "keep" or "evict" signal for each token.
For teams concerned about the complexity of retrofitting, the researchers note that the process is designed to be lightweight. "To improve the efficiency of this process, the model weights can be frozen, making the process similar to Low-Rank Adaptation (LoRA)," Nawrot said. This means a standard enterprise model like Qwen3-8B "can be retrofitted with DMS within hours on a DGX H100."
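Based on that description, the core mechanism can be pictured as a small trainable head that reads each token's hidden state and emits a keep-or-evict decision while the rest of the network stays frozen. The module below is a simplified sketch under those assumptions; the class name, shapes, and the straight-through trick are illustrative choices, not Nvidia's implementation.

```python
import torch
import torch.nn as nn

class EvictionGate(nn.Module):
    """Illustrative per-token keep/evict head; a sketch, not DMS itself."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # A tiny trainable projection. The base model's weights can stay
        # frozen, which is what makes the retrofit LoRA-like in cost.
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        keep_prob = torch.sigmoid(self.proj(hidden_states)).squeeze(-1)
        hard = (keep_prob > 0.5).float()  # 1 = keep in the KV cache, 0 = evict
        if self.training:
            # Straight-through estimator: hard decisions in the forward pass,
            # gradients flow through the soft probabilities during retrofitting.
            return hard + keep_prob - keep_prob.detach()
        return hard
```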
One of DMS's key features is a mechanism called "delayed eviction." In standard sparsification, a token judged unimportant is deleted immediately. This is risky because the model may still need that token briefly to integrate its context into the current state.
DMS mitigates this by flagging a token for eviction but keeping it accessible for a short time (e.g., a few hundred steps). This delay allows the model to extract any remaining useful information from the token and merge it into the current context before the token is deleted from the KV cache.
“The mechanism of ‘delayed eviction’ is important because not all tokens are simply ‘important’ (keep forever) or ‘useless’ (delete immediately). Many fall in between: they carry some information, but not enough to justify occupying an entire memory slot,” said Nawrot. “This is where the redundancy lies. By keeping these tokens in a local window for a short period before eviction, we allow the model to attend to them and distribute their information to future tokens.”
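As a toy illustration of that behavior (the window length and data layout below are assumed values for the sketch, not the paper's settings), a flagged token lingers for a fixed number of steps before it is actually removed:

```python
class DelayedEvictionCache:
    """Toy model of delayed eviction: flagged tokens linger briefly before removal."""

    def __init__(self, delay_window: int = 256):  # window length is an assumed value
        self.delay_window = delay_window
        self.entries = []  # each entry: {"kv": ..., "flagged_at": step or None}

    def step(self, current_step: int, new_kv, evict_flag: bool):
        self.entries.append(
            {"kv": new_kv, "flagged_at": current_step if evict_flag else None}
        )
        # A flagged token stays visible to attention until the delay window
        # has passed, giving later tokens a chance to absorb its information.
        self.entries = [
            e for e in self.entries
            if e["flagged_at"] is None
            or current_step - e["flagged_at"] <= self.delay_window
        ]
```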
The researchers found this retrofitting process to be very efficient. They can equip a pre-trained LLM with DMS in 1,000 training steps, a fraction of the computation required for the original training. The resulting models use standard kernels and can be dropped directly into existing high-performance inference stacks without custom hardware or complex software rewriting.
DMS in action
To validate the technique, the researchers applied DMS to several reasoning models, including the Qwen-R1 series (distilled from DeepSeek R1) and Llama 3.2, and tested it on difficult benchmarks such as AIME 24 (mathematics), GPQA Diamond (science), and LiveCodeBench (coding).
The results show that DMS effectively pushes the Pareto frontier, the optimal trade-off between cost and performance. On the AIME 24 math benchmark, a Qwen-R1 32B model with DMS scored 12.0 points higher than a standard model when constrained to the same memory bandwidth budget. By compressing the cache, the model can "think" deeper and wider than a standard model can within the same memory and compute budget.
Perhaps most surprisingly, DMS defies the conventional wisdom that compression damages long-context understanding. On "needle-in-a-haystack" tests, which measure a model's ability to find a specific piece of information buried in a large document, DMS variants actually outperformed conventional models. By actively managing its memory instead of passively accumulating noise, the model maintains a cleaner, more useful context.
For enterprise infrastructure, these efficiency gains translate directly into throughput and hardware savings. Because the KV cache is smaller, the GPU spends less time fetching data from memory, which reduces latency for users. In tests with the Qwen3-8B model, DMS matched the accuracy of the vanilla model while delivering up to 5x higher throughput. In practice, that means a single server can handle five times as many customer queries per second without losing quality.
The future of memory
Nvidia has released DMS as part of its KVPress library. As for how businesses can get started with DMS, Nawrot emphasized that the barrier to entry is low. "The ‘minimum viable infrastructure’ is standard Hugging Face pipelines — no custom CUDA kernels are required," Nawrot said, noting that the code is fully compatible with standard FlashAttention.
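For teams that want to try it, the general KVPress workflow wraps a Hugging Face pipeline with a "press" object that compresses the KV cache. The snippet below is a hedged sketch of that pattern based on the library's documented examples as I understand them; the pipeline task name, the press class (a built-in KVPress press used here as a stand-in), the model choice, and the arguments are assumptions that should be verified against the KVPress README before use.

```python
# Hedged sketch of the KVPress usage pattern; verify names and arguments
# against https://github.com/NVIDIA/kvpress before relying on them.
from transformers import pipeline
from kvpress import ExpectedAttentionPress  # stand-in press, swap for the one you need

pipe = pipeline(
    "kv-press-text-generation",   # custom pipeline task registered by kvpress
    model="Qwen/Qwen3-8B",        # assumed model choice for illustration
    device_map="auto",
)

context = "A long document or reasoning transcript goes here..."
question = "What is the key finding?"

press = ExpectedAttentionPress(compression_ratio=0.5)  # keep roughly half the cache
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```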
Looking ahead, the team sees DMS as part of a larger shift in which memory management becomes a separate, intelligent layer in the AI stack. Nawrot also confirmed that DMS is "fully compatible" with newer architectures such as Multi-head Latent Attention (MLA), used in DeepSeek models, suggesting that combining these methods could yield even greater efficiency gains.
As businesses move from simple chatbots to complex agentic systems that require advanced reasoning, the cost of inference becomes a primary concern. Techniques such as DMS offer a path to scaling these capabilities sustainably.
"We’ve barely scratched the surface of what’s possible," Nawrot said, "and we expect inference-time scaling to improve further."





