DeepSeek-V3, ultra-large open-source AI, surpasses Llama and Qwen at launch




Chinese AI startup DeepSeek, known for challenging leading AI vendors with new open source technologies, today released a new ultra-large model: DeepSeek-V3.

Available through Hugging Face under the company’s license agreement, the new model has 671B parameters but uses a mixture-of-experts architecture to activate only selected parameters, in order to handle given tasks accurately and efficiently. According to benchmarks shared by DeepSeek, the offering is already topping the charts, outperforming leading open-source models, including Meta’s Llama 3.1-405B, and closely matching the performance of closed models from Anthropic and OpenAI.

The release marks another major development closing the gap between closed-source and open-source AI. Ultimately, DeepSeek, which started as an offshoot of Chinese quantitative hedge fund High-Flyer Capital Management, hopes these developments will pave the way for artificial general intelligence (AGI), where models will have the ability to understand or learn any intellectual task that a human can perform.

What does DeepSeek-V3 bring to the table?

Like its predecessor DeepSeek-V2, the new ultra-large model uses the same basic architecture revolving around multi-head latent attention (MLA) and DeepSeekMoE. This approach ensures it maintains efficient training and inference, with specialized and shared “experts” (individual, smaller neural networks within the larger model) activating only 37B of the 671B parameters for each token.
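To make the “activate only a fraction of the parameters” idea concrete, here is a minimal, purely illustrative mixture-of-experts layer; it is not DeepSeek’s code, and the sizes, names and top-k routing rule are assumptions chosen for clarity. The point is simply that a router picks a small subset of experts per token, so most of the layer’s parameters sit idle on any given token.

```python
# Illustrative MoE routing sketch (hypothetical sizes; not DeepSeek's implementation).
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)          # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                     # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)               # routing probabilities
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # keep only k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_scores[mask, slot:slot+1] * expert(x[mask])
        return out

x = torch.randn(5, 64)
print(TinyMoELayer()(x).shape)   # torch.Size([5, 64]); only 2 of 8 experts ran per token
```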

While the basic architecture ensures strong performance for DeepSeek-V3, the company has also debuted two innovations to push the bar further.

The first is an auxiliary loss-free load-balancing strategy. It dynamically monitors and adjusts the load on experts to utilize them in a balanced way without compromising overall model performance. The second is multi-token prediction (MTP), which allows the model to predict multiple future tokens simultaneously. This innovation not only improves training efficiency but also enables the model to perform three times faster, generating 60 tokens per second. A rough sketch of the load-balancing idea follows.
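The sketch below is a hedged toy version of the load-balancing idea only: rather than adding a balancing loss term to training, each expert carries a bias that is nudged down when it is over-used and up when it is under-used, and that bias influences only which experts get selected. The update rule, step size and simulated scores here are assumptions for illustration, not DeepSeek’s implementation.

```python
# Toy simulation of auxiliary-loss-free load balancing via per-expert routing biases.
import numpy as np

n_experts, k, step = 8, 2, 0.01
bias = np.zeros(n_experts)                       # routing bias, adjusted online

def route(scores):
    """Pick top-k experts using biased scores; the bias affects selection only."""
    chosen = np.argsort(scores + bias)[-k:]
    return chosen

skew = np.linspace(0.0, 0.5, n_experts)          # pretend some experts are "naturally" favored
for _ in range(2000):                            # simulated routing steps
    scores = np.random.rand(n_experts) + skew
    chosen = route(scores)
    load = np.zeros(n_experts)
    load[chosen] = 1.0
    # push bias down for experts picked more than average, up for the rest
    bias -= step * (load - load.mean())

print(np.round(bias, 3))                         # biases settle so experts share the load
```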

“During pre-training, we trained DeepSeek-V3 on 14.8T high-quality and diverse tokens…Next, we conducted a two-stage context extension for DeepSeek-V3,” the company wrote in a technical paper detailing the new model. “In the first stage, the maximum context length was extended to 32K, and in the second stage, it was further increased to 128K. After this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), on the DeepSeek-V3 base model to adapt it to human preferences and further unlock its potential. During the post-training phase, we distilled the reasoning ability from the DeepSeek-R1 series of models, and in the meantime carefully maintained a balance between model accuracy and generation length.”

In particular, during the training phase, DeepSeek used several hardware and algorithmic optimizations, including an FP8 mixed-precision training framework and the DualPipe algorithm for pipeline parallelism, to cut down on the costs of the process.
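As a rough intuition for why FP8 mixed precision saves cost, the toy sketch below shows the scale-and-clip trick generally used with narrow 8-bit float formats: tensors are rescaled to fit the format’s limited dynamic range before the matmul and rescaled afterwards. This is not DeepSeek’s framework; float16 stands in for FP8, and the E4M3 range constant and helper name are assumptions.

```python
# Toy illustration of per-tensor scaling for low-precision matmuls (not DeepSeek's code).
import numpy as np

E4M3_MAX = 448.0                                   # approximate FP8 E4M3 dynamic range

def scale_to_low_precision(x):
    scale = np.abs(x).max() / E4M3_MAX             # per-tensor scaling factor
    x_low = np.clip(x / scale, -E4M3_MAX, E4M3_MAX).astype(np.float16)  # stand-in for FP8
    return x_low, scale

a = np.random.randn(128, 256).astype(np.float32) * 50
b = np.random.randn(256, 64).astype(np.float32) * 50

a_low, sa = scale_to_low_precision(a)
b_low, sb = scale_to_low_precision(b)
c = (a_low.astype(np.float32) @ b_low.astype(np.float32)) * (sa * sb)   # rescale the result

print(np.abs(c - a @ b).max() / np.abs(a @ b).max())  # small relative error despite low precision
```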

In total, it claims to have completed the entire training of DeepSeek-V3 in about 2788K H800 GPU hours, or about $5.57 million, assuming a rental price of $2 per GPU hour. This is far less than the hundreds of millions of dollars typically spent on pre-training large language models.
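The headline dollar figure follows directly from the numbers DeepSeek cites; a quick check:

```python
# Sanity check of the training-cost figure, using the numbers quoted in the article.
gpu_hours = 2_788_000          # ~2788K H800 GPU hours reported by DeepSeek
price_per_hour = 2.00          # assumed rental price of $2 per GPU hour
print(f"${gpu_hours * price_per_hour:,.0f}")   # $5,576,000 -- about $5.57 million
```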

Llama-3.1, for example, is estimated to have been trained with an investment of over $500 million.

The most robust open-source model available today

Despite the economical training, DeepSeek-V3 has emerged as the strongest open-source model on the market.

The company ran several benchmarks to compare the performance of the AI and noted that it convincingly outperforms leading open models, including Llama-3.1-405B and Qwen 2.5-72B. It even outperforms closed-source GPT-4o on most benchmarks, except the English-focused SimpleQA and FRAMES, where the OpenAI model sits ahead with scores of 38.2 and 80.5 (vs 24.9 and 73.3), respectively.

In particular, DeepSeek-V3’s performance stands out on the Chinese and math-centric benchmarks, where it scored better than all counterparts. On the Math-500 test, it scored 90.2, with Qwen’s score of 80 the next best.

The only model that managed to challenge DeepSeek-V3 was Anthropic’s Claude 3.5 Sonnet, which surpassed it with higher scores on MMLU-Pro, IF-Eval, GPQA-Diamond, SWE Verified and Aider-Edit.

The work shows that open source is closing in on closed-source models, promising nearly equivalent performance across different tasks. The development of such systems is good for the industry, as it can reduce the chances of one big AI player ruling the game. It also gives enterprises more options to choose from and work with while orchestrating their stacks.

Currently, the code for DeepSeek-V3 is available via GitHub under the MIT license, while the model is provided under the company’s model license. Enterprises can also test the new model via DeepSeek Chat, a ChatGPT-like platform, and access the API for commercial use. DeepSeek is providing the API at the same price as DeepSeek-V2 until February 8. After that, it will charge $0.27/million input tokens ($0.07/million tokens with cache hits) and $1.10/million output tokens.
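For a sense of what those post-promotional rates mean in practice, here is a small cost calculation using the prices quoted above; the request volumes are made-up example numbers, not anything published by DeepSeek.

```python
# Illustrative API cost for a hypothetical workload at the quoted per-million-token prices.
input_tokens  = 10_000_000     # 10M input tokens (assuming no cache hits, at $0.27/M)
output_tokens = 2_000_000      # 2M output tokens (at $1.10/M)

cost = (input_tokens / 1_000_000) * 0.27 + (output_tokens / 1_000_000) * 1.10
print(f"${cost:.2f}")          # $4.90 for this hypothetical workload
```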


