
The latest OpenAI o3 model achieved a breakthrough that has shocked the AI research community: it scored an unprecedented 75.7% on the notoriously tough ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%.
While the ARC-AGI result is impressive, it does not mean that the code to artificial general intelligence (AGI) has been cracked.
Abstraction and Reasoning Corpus
The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus (ARC), which tests the ability of AI systems to adapt to novel tasks and demonstrate fluid intelligence. ARC consists of a set of visual puzzles that require understanding of basic concepts such as objects, boundaries and spatial relationships. While humans can easily solve ARC puzzles with very few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most challenging benchmarks in AI.

ARC is designed so that it cannot be gamed by training models on millions of examples in the hope of covering all possible combinations of puzzles.
The benchmark consists of a public training set of 400 simple examples. The training set is complemented by a public evaluation set of 400 more challenging puzzles that measure the generalizability of AI systems. The ARC-AGI Challenge also includes private and semi-private test sets of 100 puzzles each, which are not shared with the public. They are used to evaluate candidate AI systems without the risk of leaking the data and contaminating future systems with prior knowledge. In addition, the competition places limits on the amount of compute participants can use, to ensure that the puzzles cannot be solved through brute-force methods.
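For a concrete sense of what these puzzles look like to a program, each ARC task is distributed as a small JSON file in the format of the public ARC repository. Below is a minimal Python sketch of reading one; the file name is hypothetical:

```python
import json

# Each ARC task is a JSON file with "train" and "test" lists; every entry
# pairs an "input" grid with an "output" grid, and grids are lists of
# lists of integers 0-9 (one integer per colored cell).
with open("arc_task.json") as f:   # hypothetical file name
    task = json.load(f)

for pair in task["train"]:
    inp, out = pair["input"], pair["output"]
    print(f"demo: {len(inp)}x{len(inp[0])} grid -> {len(out)}x{len(out[0])} grid")

# A solver must infer the transformation from the few "train" demos
# and apply it to each held-out "test" input.
test_input = task["test"][0]["input"]
```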
Success at solving novel tasks
o1-preview and o1 scored a maximum of 32% on ARC-AGI. Another approach, developed by researcher Jeremy Berman, used a hybrid method, combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter to achieve 53%, the highest score before o3.
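As a hedged illustration of that general pattern (not Berman’s actual code), an evolutionary loop over LLM-generated solver programs might look like the sketch below, where llm_propose_program and llm_mutate are hypothetical stand-ins for prompting Claude 3.5 Sonnet:

```python
import random

def llm_propose_program(demos) -> str:
    """Hypothetical stand-in: prompt the LLM to write a `solve` function."""
    return "def solve(grid):\n    return grid"   # placeholder candidate

def llm_mutate(program_src: str, demos) -> str:
    """Hypothetical stand-in: ask the LLM to revise a failing program."""
    return program_src

def fitness(program_src: str, demos) -> float:
    """Execute a candidate (the 'code interpreter' step) and score it by
    the fraction of demonstration pairs it reproduces exactly."""
    env = {}
    try:
        exec(program_src, env)
        solve = env["solve"]
        return sum(solve(d["input"]) == d["output"] for d in demos) / len(demos)
    except Exception:
        return 0.0   # crashing or malformed programs score zero

def evolve(demos, generations=10, pop_size=20):
    population = [llm_propose_program(demos) for _ in range(pop_size)]
    best = population[0]
    for _ in range(generations):
        population.sort(key=lambda p: fitness(p, demos), reverse=True)
        best = population[0]
        if fitness(best, demos) == 1.0:
            break                        # solves every demonstration
        parents = population[: pop_size // 4]
        children = [llm_mutate(random.choice(parents), demos)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return best
```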
In a blog post, François Chollet, the creator of ARC, described o3’s performance as “a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models.”
It is important to note that using more compute on previous generations of models could not achieve these results. For context, it took four years for models to go from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we don’t know much about o3’s architecture, we can be confident that it is not orders of magnitude larger than its predecessors.

“This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs,” Chollet wrote. “o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.”
It is worth noting that o3’s performance on ARC-AGI comes at a steep cost. On the low-compute configuration, it costs the model $17 to $20 and 33 million tokens to solve each puzzle, while on the high-compute budget, the model uses around 172X more compute and billions of tokens per problem. However, as the costs of inference continue to fall, we can expect these figures to become more reasonable.
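Taking the article’s figures at face value, and assuming purely for illustration that cost scales linearly with compute (which OpenAI has not confirmed), the high-compute bill works out roughly as follows:

```python
# Rough estimate from the figures above; assumes cost scales linearly
# with compute, which has not been confirmed.
low_cost = (17 + 20) / 2          # ~$18.50 per puzzle, low-compute
high_cost = low_cost * 172        # ~$3,182 per puzzle, high-compute
print(f"~${high_cost:,.0f} per puzzle, "
      f"~${high_cost * 100:,.0f} for the 100-puzzle semi-private set")
```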
A new paradigm in LLM reasoning?
The key to solving novel problems is what Chollet and other scientists refer to as “program synthesis.” A thinking system should be able to develop small programs for solving very specific problems, then combine these programs to tackle more complex problems. Classic language models have absorbed a lot of knowledge and contain a rich set of internal programs. But they lack compositionality, which prevents them from figuring out puzzles that lie beyond their training distribution.
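To make “combining small programs” concrete, here is a toy sketch of program synthesis: a handful of grid primitives and a brute-force search over their compositions, validated against a task’s demonstration pairs. The primitives are illustrative and not drawn from any real ARC solver:

```python
from itertools import product

# A few illustrative grid primitives; grids are lists of lists of ints.
def flip_h(g):
    return [row[::-1] for row in g]

def flip_v(g):
    return g[::-1]

def transpose(g):
    return [list(row) for row in zip(*g)]

PRIMITIVES = [flip_h, flip_v, transpose]

def synthesize(demos, max_depth=3):
    """Brute-force search for a composition of primitives that maps every
    demonstration input to its demonstration output."""
    for depth in range(1, max_depth + 1):
        for combo in product(PRIMITIVES, repeat=depth):
            def program(g, combo=combo):
                for fn in combo:
                    g = fn(g)
                return g
            if all(program(d["input"]) == d["output"] for d in demos):
                return program   # first composition consistent with all demos
    return None

# Usage: a toy task whose hidden rule is "rotate 90 degrees clockwise"
demos = [{"input": [[1, 2], [3, 4]], "output": [[3, 1], [4, 2]]}]
solver = synthesize(demos)
print(solver([[5, 6], [7, 8]]))   # -> [[7, 5], [8, 6]]
```

Real program-synthesis systems search far larger program spaces with learned guidance, but the compose-and-check loop is the same basic idea.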
Unfortunately, there is very little information about how o3 works under the hood, and here, the opinions of scientists diverge. Chollet speculates that o3 uses a type of program synthesis that uses chain-of-thought (CoT) reasoning and a search mechanism combined with a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in the past few months.
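OpenAI has not confirmed any of this, but the pattern Chollet describes, sampling many reasoning chains and letting a reward model choose among them, can be sketched in its simplest best-of-N form. generate_cot and reward_model below are hypothetical stand-ins, not real APIs:

```python
import random

# Hedged sketch of "search over chains of thought" with a reward model,
# per Chollet's speculation; OpenAI has not disclosed how o3 works.

def generate_cot(prompt: str, temperature: float = 1.0) -> str:
    """Stand-in for sampling one reasoning chain from a language model."""
    return f"step {random.randint(1, 100)}: ...reasoning for {prompt!r}..."

def reward_model(prompt: str, chain: str) -> float:
    """Stand-in for a learned verifier that scores a candidate chain."""
    return random.random()

def solve_with_search(prompt: str, n_samples: int = 64) -> str:
    # Sample many candidate chains, then return the highest-scored one
    # (best-of-N reranking, the simplest form of test-time search).
    candidates = [generate_cot(prompt) for _ in range(n_samples)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

print(solve_with_search("ARC puzzle #1", n_samples=8))
```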
Some scientists, such as Nathan Lambert from the Allen Institute for AI, suggest that “o1 and o3 may simply be forward passes from a language model.” On the day o3 was announced, Nat McAleese, a researcher at OpenAI, posted on X that o1 was “just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1.”

On the same day, Denny Zhou from Google DeepMind’s reasoning team called the combination of search and current reinforcement learning approaches a “dead end.”
“The most beautiful thing about LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g. MCTS) over the generation space, whether by a well-finetuned model or a carefully designed prompt,” he posted on X.

While the details of how o3 reasons remain scarce, its performance on ARC-AGI may be a sign of the next paradigm shift in the training of LLMs. There is currently a debate on whether scaling LLMs through training data and compute has hit a wall. Whether test-time scaling depends on better training data or different inference architectures could determine the next path forward.
Not AGI
The name ARC-AGI is misleading, and some have equated it with solving AGI. However, Chollet emphasized that “ARC-AGI is not an acid test for AGI.”
“Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don’t think o3 is AGI yet,” he wrote. “o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.”
Moreover, he notes that o3 cannot learn these skills autonomously; it relies on external verifiers during inference and on human-labeled reasoning chains during training.
Other scientists have pointed to flaws in OpenAI’s reported results. For example, the model was fine-tuned on the ARC training set to achieve its results. “The solver should not need much specific ‘training,’ either on the domain itself or on each specific task,” wrote scientist Melanie Mitchell.
To verify whether these models possess the kind of abstraction and reasoning the ARC benchmark was created to measure, Mitchell proposes “seeing if these systems can adapt to variants of specific tasks or to reasoning tasks using the same concepts, but in other domains than ARC.”
Chollet and his team are currently working on a new benchmark that is challenging for o3, potentially reducing its score to under 30% even at a high-compute budget, while humans would be able to solve 95% of the puzzles without any training.
“You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible,” Chollet wrote.