When OpenAI released o1 in September 2024, the benchmark improvements on mathematics and coding were striking. Not incremental: substantial. The difference between o1 and GPT-4 on competition mathematics was the difference between an average student and an expert. The key change was that the model reasoned before answering.
Chain-of-thought reasoning had been known to improve performance since 2022. But o1 took this to a different level: the model was trained specifically to reason through problems in an extended internal process before producing a response. This reasoning was hidden from users, but it happened, and its length scaled with problem difficulty. For hard problems, the model might reason for thirty seconds or more before responding.
The practical implications took time to understand. For tasks where o1 was better, it was substantially better: multi-step mathematics, complex coding tasks, reasoning about edge cases in specifications. For tasks where GPT-4 was adequate (writing, summarisation, simple coding, conversation), o1 offered little improvement at significantly higher cost and latency.
DeepSeek R1, released in January 2025 by the Chinese AI lab DeepSeek, was significant for a different reason. It matched or exceeded o1 performance on many benchmarks at a fraction of the training cost, and was released as an open-weights model. The efficiency of DeepSeek's training raised serious questions about whether the enormous compute investments of OpenAI and Anthropic were necessary for frontier performance.
The reasoning model pattern has clarified what kinds of problems benefit from extended reasoning: problems with objectively verifiable correct answers, where intermediate steps can be checked, and where the answer space is constrained. Proof verification, code correctness, formal logical arguments.
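"Objectively verifiable" is the operative property: when an answer can be mechanically checked, a system can accept or reject model output without human judgment. A minimal sketch of that idea for the code-correctness case, using only the Python standard library (the function name and the choice of running tests in a subprocess are illustrative, not any particular vendor's API):

```python
import os
import subprocess
import sys
import tempfile


def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Check model-generated code against a test suite.

    Writes the candidate and its tests to a temp file and runs them in a
    fresh interpreter; the answer is "verified" only if every assertion
    passes (exit code 0). This is the kind of objective check that makes
    a problem a good fit for extended reasoning.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=10,  # guard against runaway candidates
        )
        return result.returncode == 0
    finally:
        os.remove(path)
```

The same shape generalises: for mathematics the verifier might be a proof checker, for logic a SAT solver. The point is that the verifier, not the model, decides what counts as correct.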
The problems that do not obviously benefit from reasoning models: tasks that require creativity rather than correctness, tasks where there is no clear right answer, tasks where speed matters more than accuracy.
For AI engineering practice, reasoning models introduced new considerations. You could not always predict how long a reasoning model would take to respond, and the cost of a single query could be much higher than with a standard model. But for specific hard problems the improvement in quality was worth the cost, and the engineering challenge became designing systems that used reasoning models selectively.
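Selective use comes down to a routing decision made before the expensive call. A minimal sketch, assuming hypothetical model names and a crude keyword heuristic (a production router would more likely use a trained classifier or explicit task metadata):

```python
# Keywords that loosely signal multi-step, verifiable work. This list is
# an illustrative assumption, not a tested taxonomy.
REASONING_KEYWORDS = {"prove", "edge case", "optimise", "debug", "derive"}


def needs_reasoning(task: str) -> bool:
    """Crude heuristic: route to the reasoning model only when the task
    description mentions work that benefits from extended reasoning."""
    lowered = task.lower()
    return any(kw in lowered for kw in REASONING_KEYWORDS)


def pick_model(task: str) -> str:
    # "reasoning-model" and "fast-model" are placeholder names standing in
    # for an o1-class model and a cheaper, lower-latency default.
    return "reasoning-model" if needs_reasoning(task) else "fast-model"
```

The routing logic is cheap to run on every request, so the expensive model is only paid for when the heuristic says the task warrants it; the cost of a misroute is an occasional slow or slightly weaker answer, not a broken system.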