GPT-4 Arrives and the Benchmark Moves Again

On March 14, 2023, OpenAI released GPT-4, the next major version of its flagship language model. The release came about three and a half months after the launch of ChatGPT and made clear that the rapid progress in capability that had defined late 2022 was not slowing down. GPT-4 was substantially better than GPT-3.5 on many tasks, accepted images alongside text as input, and demonstrated reasoning patterns that had not been reliably present in earlier models.

The benchmarks were a useful starting point even though they were already being criticised for not capturing the actual user experience well. OpenAI reported that GPT-4 scored around the top ten percent of test takers on a simulated bar examination. It scored well across a range of other academic and professional exams, although its results on competition mathematics such as the AMC tests were noticeably weaker. It improved significantly on coding benchmarks that GPT-3.5 had struggled with. The ability to handle longer context windows and maintain coherence across more complex tasks was a clear step forward.

What was less measurable but mattered more in practice was the qualitative experience of using GPT-4 for non-trivial work. The error rate on complex reasoning tasks dropped meaningfully. The model handled instructions with more precision. It was better at admitting when it did not know something, although still not as good as one would want. The hallucination problem was reduced but not solved.

The multimodal capability was demonstrated rather than fully released initially. The image input feature was rolled out gradually. When it became more broadly available later in 2023, it expanded what the model could be asked to do in ways that simple text prompting could not. Developers and users started exploring use cases that combined visual understanding with the language model’s existing capabilities.

The release was also notable for how little OpenAI shared about the underlying model. The technical report described capabilities and safety work but provided minimal detail about the model size, the training data, or the methods used in training. The shift from the relatively open posture of the GPT-3 paper toward a much more guarded posture for GPT-4 reflected the changed competitive environment as much as anything else.

Products built on top of GPT-4 started appearing within days of the release. On the same day as the GPT-4 launch, Anthropic announced expanded availability of its competing Claude assistant. The pace at which new capabilities were arriving had become difficult for any single team to track in detail.
