
Lowering the cost of inference usually takes a combination of hardware and software. A new analysis released Thursday by Nvidia details how four leading inference providers reported 4x to 10x cost reductions per token.
The reductions were achieved by running open-source models on Nvidia’s Blackwell platform. Production deployment data from Baseten, DeepInfra, Fireworks AI and Together AI show significant cost improvements across healthcare, gaming, agentic chat, and customer service as businesses scale AI from pilot projects to millions of users.
The 4x to 10x cost reductions reported by inference providers require pairing Blackwell hardware with two other elements: optimized software stacks and a transition from proprietary models to open-source models that now approach frontier-level intelligence. Hardware improvements alone delivered 2x gains in some deployments, according to the analysis. Achieving larger reductions requires adopting low-precision formats such as NVFP4 and moving away from closed-source APIs that charge premium rates.
The economics are counterintuitive: reducing inference costs means investing in higher-performance infrastructure, because improvements in throughput translate directly into a lower cost per token.
"Performance is what lowers the cost of inference," Dion Harris, senior director of HPC and AI hyperscaler solutions at Nvidia, told VentureBeat in an exclusive interview. "What we see in the inference is that throughput literally translates into real dollar value and lowers costs."
Production deployments show 4x to 10x cost reduction
Nvidia detailed four customer deployments in a blog post showing how the combination of Blackwell infrastructure, optimized software stacks and open-source models delivers cost reductions across a range of industry workloads. The case studies cover high-volume applications where the economics of inference directly determine business viability.
Sully.ai has cut healthcare AI inference costs by 90% (a 10x reduction) while improving response times by 65% by switching from proprietary models to open-source models running on Baseten’s Blackwell-powered platform, according to Nvidia. The company has given doctors back more than 30 million minutes by automating medical coding and note-taking tasks that previously required manual data entry.
Nvidia also reported that Latitude has cut inference costs by 4x for its AI Dungeon platform by running multiple mixture-of-experts (MoE) models on DeepInfra’s Blackwell deployment. The cost per million tokens dropped from 20 cents on Nvidia’s previous Hopper platform to 10 cents on Blackwell, then to 5 cents after adopting Blackwell’s native NVFP4 low-precision format. Hardware alone provided a 2x improvement; reaching 4x required the precision format conversion.
Sentient Foundation achieved 25% to 50% better cost efficiency for its agentic chat platform using Fireworks AI’s Blackwell-optimized inference stack, according to Nvidia. The platform orchestrated complex multi-agent workflows and processed 5.6 million queries a week during its viral launch while maintaining low latency.
Nvidia said that Decagon has seen a 6x reduction in cost per inquiry for AI-powered voice customer support by running its multi-model stack on Together AI’s Blackwell infrastructure. Response times remain below 400 milliseconds even when processing thousands of tokens per query, which is critical for voice interactions where delays cause users to hang up or lose confidence.
Technical factors driving 4x vs. 10x improvement
The range from 4x to 10x cost reduction across deployments reflects different combinations of technical optimizations rather than hardware differences. Three factors emerged as the main drivers: precision format adoption, model architecture choices, and software stack integration.
Precision formats show the most obvious effect, and the Latitude case illustrates it directly. Switching from Hopper to Blackwell delivered a 2x cost reduction through hardware improvements alone; adopting NVFP4, Blackwell’s native low-precision format, doubled the improvement to 4x overall. NVFP4 reduces the number of bits required to represent model weights and activations, allowing more computation per GPU cycle while maintaining accuracy. The format works well for MoE models, where only a subset of the model is active for each inference request.
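The memory arithmetic explains much of the effect. The sketch below uses a generic 70-billion-parameter model and raw bit widths; NVFP4’s actual on-device layout includes scaling metadata not modeled here, so treat it only as an approximation of why fewer bits per parameter mean more tokens per GPU cycle.

# Approximate weight footprint at different precisions for a hypothetical 70B-parameter model.
# Moving less memory per generated token is a large part of why low-precision formats raise
# throughput; NVFP4-specific details (block scaling, metadata) are deliberately omitted.

def weight_footprint_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for label, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{label}: ~{weight_footprint_gb(70, bits):.0f} GB of weights")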
Model architecture also matters. MoE models, which activate different specialized sub-models depending on the input, benefit from Blackwell’s NVLink fabric, which enables fast communication between experts. "Talking to the experts in that fabric NVLink allows you to reason quickly," Harris said. Dense models, which activate all parameters for every inference, cannot exploit this architecture as effectively.
Software stack integration creates additional performance deltas. Harris said Nvidia’s co-design approach, in which Blackwell hardware, the NVL72 scale-up architecture and software such as Dynamo and TensorRT-LLM are optimized together, also makes a difference. Baseten’s deployment for Sully.ai uses this integrated stack, combining NVFP4, TensorRT-LLM and Dynamo to achieve the 10x cost reduction. Providers running alternative frameworks such as vLLM may see smaller gains.
Workload characteristics matter as well. Reasoning models benefit particularly on Blackwell because they generate more tokens to reach better answers. The platform’s ability to process these extended token sequences efficiently through disaggregated serving, in which context prefill and token generation are handled separately, makes reasoning workloads cost-effective.
Teams sizing up potential cost reductions should weigh their workload profiles against these factors. High-volume token generation workloads using mixture-of-experts models on an integrated Blackwell software stack approach the 10x range; lower-volume workloads using dense models on alternative frameworks land closer to 4x.
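Read as a rule of thumb, the case-study figures sort out roughly as in the sketch below. The function simply encodes the multipliers quoted in this article; it is a reading aid, not a sizing tool, and real results depend on the workload.

# Rule-of-thumb estimator built only from the multipliers quoted above:
# ~2x from the hardware move alone, ~4x once NVFP4 is adopted, and the 6x-10x
# range when MoE models, high token volume and an integrated stack combine.

def expected_reduction(uses_nvfp4: bool, moe_model: bool,
                       integrated_stack: bool, high_token_volume: bool) -> str:
    if not uses_nvfp4:
        return "~2x (hardware upgrade only)"
    if moe_model and integrated_stack and high_token_volume:
        return "the 6x-10x range seen in the case studies"
    return "~4x (hardware plus precision format)"

print(expected_reduction(uses_nvfp4=True, moe_model=True,
                         integrated_stack=True, high_token_volume=True))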
What teams should try before migrating
While these case studies focus on Nvidia Blackwell deployments, businesses have many avenues for reducing inference costs. AMD’s MI300 series, Google TPUs and specialized inference accelerators from Groq and Cerebras offer alternative architectures, and cloud providers continue to optimize their inference services. The question is not whether Blackwell is the only option, but whether a specific combination of hardware, software and models fits a particular workload’s requirements.
Enterprises considering Blackwell-based inference should start by calculating whether their workloads justify infrastructure changes.
"Businesses need to work back from their workloads and use case and cost constraints," Shruti Koparkar, AI product marketing at Nvidia, told VentureBeat.
The deployments that achieved 6x to 10x improvements all involved high-volume, latency-sensitive applications that processed millions of requests per month. Teams running lower volumes or applications with latency budgets exceeding one second should explore software optimization or model changes before considering an infrastructure upgrade.
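A back-of-the-envelope break-even check makes that threshold concrete. Every figure below is a placeholder to be replaced with a team’s own token volume, current pricing and estimated migration effort.

# Break-even sketch for deciding whether volume justifies an infrastructure change.
# All inputs are hypothetical assumptions, not real prices or migration costs.

def monthly_savings(tokens_per_month: float, cost_per_million_now: float,
                    reduction_factor: float) -> float:
    current = tokens_per_month / 1e6 * cost_per_million_now
    return current - current / reduction_factor

def payback_months(migration_cost: float, savings_per_month: float) -> float:
    return float("inf") if savings_per_month <= 0 else migration_cost / savings_per_month

savings = monthly_savings(tokens_per_month=2e11,      # 200B tokens/month (hypothetical)
                          cost_per_million_now=0.20,  # $0.20 per million tokens (hypothetical)
                          reduction_factor=4)
print(f"estimated savings: ${savings:,.0f}/month")
print(f"payback on a $50,000 migration effort: {payback_months(50_000, savings):.1f} months")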
Testing matters more than provider specifications. Koparkar emphasized that while providers publish throughput and latency metrics, those figures represent optimal conditions.
"If it’s a very latency-sensitive workload, they can try a couple of providers and see which one meets the minimum they need while keeping the cost down," he said. Teams should run actual production workloads across multiple Blackwell providers to measure actual performance under their specific usage patterns and traffic spikes rather than relying on published benchmarks.
The staged approach used by Latitude provides a model for evaluation. The company first switched to Blackwell hardware and measured a 2x improvement, then adopted the NVFP4 format to reach a 4x total reduction. Teams currently on Hopper or other infrastructure can test whether precision format changes and software optimizations on existing hardware yield meaningful savings before committing to a full migration. Running open-source models on existing infrastructure can provide half of the potential cost reduction without new hardware investment.
Choosing a provider requires understanding differences in the software stack. While many providers offer Blackwell infrastructure, their software implementations vary: some run Nvidia’s integrated stack with Dynamo and TensorRT-LLM, while others use frameworks such as vLLM. Harris acknowledged that a performance delta exists between these configurations. Teams should evaluate what each provider actually runs and how it matches their workload requirements rather than assuming all Blackwell deployments are created equal.
The economic equation goes beyond the cost per token. Specialist inference providers such as Baseten, DeepInfra, Fireworks and Together offer optimized deployments but require managing additional vendor relationships. Managed services from AWS, Azure or Google Cloud may have a higher cost per token but lower operational complexity. Teams should calculate total costs including operational overhead, not just inference pricing, to determine which approach provides better economics for their particular situation.
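One way to frame that comparison is a simple total-cost calculation. The per-token prices and overhead figures below are invented placeholders, meant only to show that the cheaper token is not automatically the cheaper option.

# Total monthly cost = token spend plus operational overhead (vendor management,
# on-call, integration work). All numbers are hypothetical.

def monthly_total_cost(tokens_per_month: float, price_per_million: float,
                       ops_overhead_per_month: float) -> float:
    return tokens_per_month / 1e6 * price_per_million + ops_overhead_per_month

specialist = monthly_total_cost(2e11, price_per_million=0.05, ops_overhead_per_month=15_000)
managed = monthly_total_cost(2e11, price_per_million=0.09, ops_overhead_per_month=4_000)

print(f"specialist provider:   ${specialist:,.0f}/month")
print(f"managed cloud service: ${managed:,.0f}/month")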







