OpenAI confirms new frontier models o3 and o3-mini




OpenAI is gradually inviting selected users to test a new set of reasoning models named o3 and o3-mini, successors to the o1 and o1-mini models that only went into full release earlier this month.

OpenAI o3, named to avoid trademark conflicts with telecom provider O2 and because CEO Sam Altman says the company “has a tradition of being really bad with names,” was announced on the final day of the company’s “12 Days of OpenAI” livestream event.

Altman said the two new models will first be released to selected third-party researchers for safety testing, with o3-mini expected by the end of January 2025 and o3 “shortly thereafter.”

“We see this as the beginning of the next phase of AI, where you can use these models to do more complex tasks that require a lot of reasoning,” Altman said. “For the last day of this event we thought it would be fun to go from one frontier model to the next frontier model.”

The announcement comes a day after Google released its new Gemini 2.0 Flash Thinking model to the public, another rival “reasoning” model that, unlike the OpenAI o1 series, allows users to see the steps of its “thinking” process documented in text bullet points.

The release of Gemini 2.0 Flash Thinking and now the announcement of o3 show that the competition between OpenAI and Google, and the wider field of AI model providers, is entering a new and intense phase as they offer not only LLMs and multimodal models but advanced reasoning models as well. These models are better suited to harder problems in science, math, engineering, physics, and beyond.

Top performance on third-party benchmarks

Altman also says the o3 model is “incredible at coding,” and benchmarks shared by OpenAI back that up, showing the model outperforming o1 on programming tasks.

Outstanding Coding Performance: o3 outperformed o1 by 22.8 percentage points on SWE-Bench Verified and achieved a Codeforces rating of 2727, surpassing the 2665 score of OpenAI’s chief scientist.

Math and Science Mastery: o3 scored 96.7% on the AIME 2024 exam, missing only one question, and achieved 87.7% on GPQA Diamond, surpassing human expert performance.

Frontier Benchmarks: The model set new records on challenging tests such as EpochAI’s Frontier Math, solving 25.2% of problems where no other model has exceeded 2%. On the ARC-AGI test, o3 tripled o1’s score and exceeded 85% (as verified live by the ARC Prize team), a milestone in conceptual reasoning.

Deliberative alignment

With these improvements, OpenAI strengthens its commitment to safety and alignment.

The company introduced new research on deliberative alignment, a technique it credits with making o1 its most robust and aligned model to date.

This technique embeds human-written safety specifications into the models, enabling them to explicitly reason about these policies before generating responses.

The strategy seeks to solve common safety challenges for LLMs, such as vulnerability to jailbreak attacks and over-refusal of benign prompts, by equipping models with chain-of-thought (CoT) reasoning. This process allows models to recall and apply safety specifications dynamically during inference.
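Deliberative alignment itself is a training-time technique, but the behavior it targets can be roughly illustrated at the prompt level. The sketch below is a hypothetical approximation, not OpenAI’s method: it uses the standard OpenAI Python SDK, and the safety specification text, model name, and example prompt are all placeholder assumptions.

```python
# Hypothetical sketch: a prompt-level approximation of policy-conditioned
# reasoning. This is NOT OpenAI's deliberative alignment, which bakes the
# behavior into the model during training rather than supplying the spec
# at inference time.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder safety specification (the real specs are human-written policies).
SAFETY_SPEC = """\
1. Decline requests that meaningfully facilitate harm; explain the refusal.
2. Answer benign requests fully; do not over-refuse.
3. If intent is ambiguous, reason about likely use before deciding.
"""

def policy_checked_answer(user_prompt: str) -> str:
    """Ask the model to reason over the safety spec before answering."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": "Before answering, reason step by step about "
                           "whether the request complies with this policy:\n"
                           + SAFETY_SPEC,
            },
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

print(policy_checked_answer("How do I pick the lock on my own front door?"))
```

The key difference in the trained o1-series models is that this policy consultation happens inside the model’s own chain of thought rather than through an external prompt, which is what separates deliberative alignment from prompt-level guardrails.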

Deliberative alignment improves on previous methods such as reinforcement learning from human feedback (RLHF) and constitutional AI, which rely on safety specifications only for label generation rather than embedding policies directly into models.

By grounding LLMs in safety-relevant prompts and their associated specifications, this approach produces models capable of policy-driven reasoning without relying heavily on human-labeled data.

Results shared by OpenAI researchers in a new, not-yet-peer-reviewed paper show that this approach improves performance on safety benchmarks, reduces harmful outputs, and improves adherence to content and style guidelines.

Key findings highlight the o1 model’s improvements over predecessors such as GPT-4o and other state-of-the-art models. Deliberative alignment enables the o1 series to resist jailbreaks and provide safe completions while minimizing over-refusal of benign prompts. In addition, the method facilitates out-of-distribution generalization, showing robustness in multilingual and encoded-jailbreak scenarios. These improvements are in line with OpenAI’s goal of making AI systems safer and more interpretable as their capabilities grow.

This research will also play an important role in aligning o3 and o3-mini, ensuring that their capabilities are both powerful and responsible.

How to apply for trial access to o3 and o3-mini

Applications for early access are now open on OpenAI’s website and will close on January 10, 2025.

Applicants must fill out an online form asking for a variety of information, including their research focus, past experience, and links to previously published papers and code repositories on GitHub, and must choose which of the models (o3 or o3-mini) they want to test, as well as how they plan to use it.

Selected researchers will be given access to o3 and o3-mini to test their capabilities and contribute to safety evaluations, although the OpenAI form warns that o3 will not be available for several weeks.

Researchers are encouraged to develop robust evaluations, build controlled demonstrations of high-risk capabilities, and test models on scenarios not possible with widely adopted tools.

This initiative builds on the company’s established practices, including rigorous internal safety testing, collaborations with organizations such as the US and UK AI Safety Institutes, and its Readiness Framework.

OpenAI will review applications on a rolling basis, with selections beginning immediately.

A new leap forward?

The introduction of o3 and o3-mini marks a leap forward in AI performance, especially in areas that require advanced reasoning and problem-solving capabilities.

With their outstanding results in coding, math, and conceptual benchmarks, these models highlight the rapid progress made in AI research.

By inviting the broader research community to collaborate on safety testing, OpenAI aims to ensure that these capabilities are deployed responsibly.
