
After nearly two weeks of announcements, OpenAI concluded its 12 Days of OpenAI livestream series with a preview of its next-generation frontier model. “Out of respect for our friends at Telefónica (owner of the O2 cellular network in Europe), and in the great tradition of OpenAI being really, really bad at names, it’s called o3,” OpenAI CEO Sam Altman told those watching the announcement on YouTube.
The new model is not yet ready for public use. Instead, OpenAI is first making o3 available to researchers who want to help safety test it. OpenAI also announced the existence of o3-mini. Altman said the company plans to launch that model “by the end of January,” with o3 following “shortly after that.”
As you might expect, o3 offers better performance than its predecessor, but just how much better it is than o1 is the headline here. For example, when put through this year’s American Invitational Mathematics Examination (AIME), o3 achieved an accuracy score of 96.7 percent. By contrast, o1 earned a more modest 83.3 percent rating. “What this means is that o3 often just misses one question,” said Mark Chen, senior vice president of research at OpenAI. In fact, o3 performed so well on the usual suite of benchmarks OpenAI puts its models through that the company had to find more challenging tests to measure it against.
One of those is ARC-AGI, a benchmark that tests an AI algorithm’s ability to intuit and learn on the fly. According to the test’s creator, the non-profit ARC Prize, an AI system that successfully beats ARC-AGI would represent “an important milestone toward artificial general intelligence.” Since its inception in 2019, no AI model has beaten ARC-AGI. The test consists of input-output questions that most people can figure out intuitively. For instance, in the puzzle shown above, the correct answer is to create squares out of the four polyominoes using dark blue blocks.
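For readers curious what those input-output questions look like under the hood, here is a minimal sketch based on the publicly documented ARC-AGI task format: a few demonstration pairs plus a held-out test pair, each grid a small matrix of color codes. The toy grids below are invented for illustration and are not taken from the benchmark itself.

```python
import json

# A minimal sketch of an ARC-AGI task in its documented JSON format:
# "train" holds demonstration input/output pairs, "test" holds the
# held-out pair(s) to solve. Each grid is a 2D list of integers 0-9,
# where each integer maps to a color. (Toy data, not a real task.)
example_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[0, 2], [2, 0]], "output": [[2, 0], [0, 2]]},
    ],
    "test": [
        {"input": [[0, 3], [3, 0]], "output": [[3, 0], [0, 3]]},
    ],
}


def describe_task(task: dict) -> None:
    """Print the shape of every demonstration and test grid in a task."""
    for split in ("train", "test"):
        for i, pair in enumerate(task[split]):
            rows, cols = len(pair["input"]), len(pair["input"][0])
            print(f"{split}[{i}]: input {rows}x{cols} -> "
                  f"output {len(pair['output'])}x{len(pair['output'][0])}")


describe_task(example_task)
print(json.dumps(example_task["test"][0]["input"]))
```

A solver is given the “train” pairs, has to infer the transformation rule, and is scored on whether it produces the exact “test” output grid.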
In its low-compute setting, o3 scored 75.7 percent on the test. With additional processing power, the model achieved a rating of 87.5 percent. “Human performance is comparable to the 85 percent threshold, so surpassing it is an important milestone,” said Greg Kamradt, president of the ARC Prize Foundation.
OpenAI also previewed o3-mini. The new model features OpenAI’s recently announced Adaptive Thinking Time API, which offers three different reasoning modes: Low, Medium and High. In practice, this allows users to adjust how long the software “thinks” about a problem before delivering an answer. As you can see from the graph above, o3-mini achieves results comparable to OpenAI’s current o1 reasoning model at a fraction of the computational cost. As mentioned, o3-mini will arrive for public use before o3.
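The announcement did not show what the developer-facing side of those reasoning modes looks like. A request might plausibly resemble the sketch below, which uses the official `openai` Python package; the `o3-mini` model name and the `reasoning_effort` parameter are assumptions based on the announcement and on how OpenAI’s later o-series endpoints expose this setting, so check the current API docs before relying on them.

```python
from openai import OpenAI  # official `openai` Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical request illustrating the Low/Medium/High reasoning modes
# described above. Model name and `reasoning_effort` are assumptions,
# not confirmed by the announcement itself.
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",  # "low", "medium", or "high"
    messages=[
        {"role": "user", "content": "How many primes are there below 100?"},
    ],
)

print(response.choices[0].message.content)
```

The idea is simply that a higher setting lets the model spend more compute “thinking” before it answers, trading latency and cost for accuracy on harder problems.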