During the 12 days of Shipmas, OpenAI unveiled its latest advancements in artificial intelligence with the announcement of the o3 model and its counterpart, o3 Mini. Both models offer improved reasoning capabilities over their predecessors, giving developers new opportunities to tackle increasingly complex tasks. o3 sets a new bar for technical performance, particularly in coding and mathematics.
On SWE-bench Verified, a coding benchmark composed of real-world software tasks, o3 achieves 71.7% accuracy, more than 20 percentage points better than o1. Similarly, on Codeforces, a competitive programming platform, o3 scores a 2727 Elo rating under high compute settings. On the American Invitational Mathematics Examination (AIME) benchmark, the model achieves 96.7% accuracy, a leap from the 83.3% attained by o1.
On the ARC-AGI dataset, an example of which is shown above, problems are designed to test an AI system's ability to adapt to novel tasks. o3 scored 75.7% on the Semi-Private Evaluation set under the competition's $10k compute budget (around $20 per task) and 87.5% in a high-compute configuration (roughly $2,000-$3,000 per task). The performance/cost tradeoff is depicted in the figure below. ARC-AGI is notable as a challenge that previous models largely failed to crack. To attack it, o3 employs a paradigm that integrates natural-language program search and execution at test time, guided by a deep-learning-based evaluator, an approach reminiscent of AlphaZero's Monte Carlo tree search. François Chollet, the creator of the benchmark, acknowledged the progress made by o3 while also noting the ongoing room for improvement.
I don’t think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence. ... the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval. – François Chollet
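To make that search paradigm concrete, here is a minimal sketch of test-time program search: sample many candidate natural-language programs, execute each against the task's demonstration pairs, and break ties with a learned evaluator. This illustrates the general idea, not OpenAI's implementation; generate_candidates, execute_program, and evaluator_score are hypothetical stand-ins for the model, an interpreter, and the deep-learning-based evaluator.

```python
import random
from typing import List, Tuple

def generate_candidates(task_prompt: str, n: int) -> List[str]:
    # Stand-in for sampling candidate natural-language programs from the model.
    return [f"program-{i} for: {task_prompt}" for i in range(n)]

def execute_program(program: str, grid):
    # Stand-in for interpreting a candidate program on one input grid.
    return grid  # placeholder: identity transform

def evaluator_score(program: str) -> float:
    # Stand-in for a deep-learning-based evaluator's score.
    return random.random()

def solve(task_prompt: str, train_pairs: List[Tuple[list, list]], n: int = 64) -> str:
    # Prefer programs that reproduce the demonstration outputs exactly,
    # then break ties with the evaluator's score.
    best, best_key = None, (-1, -1.0)
    for program in generate_candidates(task_prompt, n):
        fits = sum(execute_program(program, x) == y for x, y in train_pairs)
        key = (fits, evaluator_score(program))
        if key > best_key:
            best, best_key = program, key
    return best

print(solve("recolor the largest shape", [([1, 1], [1, 1])], n=8))
```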
These advancements create a need for more challenging benchmarks, even as the model still struggles with certain simple tasks that humans find trivial. Epoch AI's FrontierMath benchmark is one attempt to address that need: with aggressive compute settings, o3 achieves roughly 25% accuracy on it, a result Tamay Besiroglu of Epoch AI said "arrives about a year ahead of my median expectations." Early testing on the forthcoming ARC-AGI-2 benchmark suggests that o3 could face significant challenges there as well, with predictions of under 30% accuracy even at high compute levels.
The hype around o3 is out of control. It’s not AGI, it’s not the singularity, and you definitely don’t have to change your worldview. – Elvis Saravia
OpenAI's development of its next-generation AI model, codenamed Orion, has encountered hurdles of its own. The anticipated GPT-5 model, initially expected to launch in early 2024, remains delayed as engineers grapple with rising costs, limited training data, and design challenges. The growing complexity of building and training such models has pushed the estimated cost of GPT-5's development past $1 billion.
o3 Mini offers three selectable levels of thinking time (low, medium, and high), allowing developers to balance performance against cost and latency. o3 Mini excels in code generation and problem-solving, achieving competitive Elo ratings on Codeforces and matching or exceeding o1's performance at a fraction of the cost.
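As a sketch of how these settings might surface to developers, the snippet below passes a reasoning_effort value through the OpenAI Python SDK's chat completions call. The parameter name and the "o3-mini" model identifier mirror what OpenAI has described, but the exact API surface remains an assumption until the model is generally available.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed API surface: an "o3-mini" model name and a reasoning_effort
# knob ("low" | "medium" | "high") trading accuracy for cost and latency.
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",
    messages=[
        {"role": "user", "content": "Find the bug in this sorting function: ..."},
    ],
)
print(response.choices[0].message.content)
```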
o3 Mini’s adaptability is exemplified in live demonstrations, where it efficiently generates complex Python scripts for automated tasks. In one example, the model created a local server that processed coding requests, executed the code, and displayed results. Such functionality demonstrates o3 Mini’s utility for streamlining development workflows and automating intricate processes.
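The demo code itself is not public, but a minimal standard-library sketch of a script with that shape might look like the following: an HTTP endpoint that accepts Python source, executes it, and returns the captured output. All names here are illustrative, and calling exec on untrusted input is unsafe outside a sandbox.

```python
import io
import contextlib
from http.server import BaseHTTPRequestHandler, HTTPServer

class CodeRunner(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body as Python source code.
        length = int(self.headers.get("Content-Length", 0))
        source = self.rfile.read(length).decode("utf-8")

        # Execute it and capture stdout. NOTE: exec on untrusted input
        # is dangerous; a real tool would sandbox this step.
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, {})
            body, status = buffer.getvalue(), 200
        except Exception as exc:
            body, status = f"error: {exc}", 400

        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), CodeRunner).serve_forever()
```

A client could then submit code with, for example, curl --data 'print(2 + 2)' http://127.0.0.1:8000/ and read back the printed result.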
Safety remains a top priority for OpenAI as these powerful models are developed. Through a "Deliberative Alignment" approach, o3 demonstrates the ability to explicitly reason over safety policies before responding to prompts, enhancing both compliance and adaptability. By integrating chain-of-thought (CoT) reasoning over those policies into its training process, the model is better able to balance safety and utility in everyday use.
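OpenAI has not released the training details, but the inference-time pattern can be sketched: place the policy text in context and instruct the model to reason over it before answering, with that reasoning also serving as a training signal. The policy text and build_prompt helper below are hypothetical illustrations, not OpenAI's actual specification.

```python
# Hypothetical illustration of deliberative alignment at inference time:
# the safety policy text is placed in context so the model's chain of
# thought can quote and reason over it before producing a final answer.
SAFETY_POLICY = """\
1. Refuse requests that facilitate serious harm.
2. For dual-use topics, answer at a high level and omit operational detail.
3. Otherwise, be maximally helpful."""

def build_prompt(user_request: str) -> list[dict]:
    return [
        {"role": "system", "content": (
            "Before answering, reason step by step about which policy "
            "clauses apply, then comply with them:\n" + SAFETY_POLICY
        )},
        {"role": "user", "content": user_request},
    ]

# The messages can then be sent to any chat-style model endpoint.
print(build_prompt("How do I secure my home Wi-Fi network?"))
```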
Developers interested in learning more about the new reasoning models may continue to monitor InfoQ into the new year. o3 and o3 Mini are slated for wider availability in early 2025, with o3 Mini expected to launch by the end of January and o3 following shortly after. Until then, developers and researchers can apply for early access through OpenAI's safety testing program.