At the start of 2024, GPT-4 was the clear reference point. By the end of the year, the picture had changed considerably. Multiple models from different organisations had closed the gap on benchmarks. Open-source alternatives had become competitive on several tasks. The cost of inference had dropped significantly. What looked like a winner-takes-all market twelve months earlier looked more like a competitive field by December.
This matters because it changes the strategic calculation for anyone building with these models. A year ago, choosing GPT-4 was almost a default decision for applications that needed high capability. By the start of 2025, that choice requires more deliberation. The open alternatives are genuinely capable. The cost difference is real. For some applications, the proprietary models are still clearly better. For others, the gap has closed enough that the cost and control advantages of running your own model tip the decision the other way.
What struck me most about this shift was not the technical progress, which was expected, but how quickly the benchmarking ecosystem became unreliable as a decision tool. Models were being optimised for benchmark performance in ways that did not always translate to real-world task performance. Several evaluation frameworks that seemed robust a year ago are now understood to have significant weaknesses. The gap between benchmark scores and what actually happens in production has become a practical problem for teams trying to make model selection decisions.
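One way to ground a selection decision is a small evaluation over your own production-style cases rather than a public leaderboard. A rough sketch of what that can look like, with a hypothetical `call_model` wrapper and hand-written checks standing in for whatever client and task format you actually use:

```python
# A rough sketch of a task-level check. call_model() and the model names are
# placeholders, not a real harness or provider API.
from dataclasses import dataclass
from typing import Callable

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder for whatever provider client or local inference you use."""
    raise NotImplementedError

@dataclass
class Case:
    prompt: str                    # a real task pulled from production traffic
    check: Callable[[str], bool]   # does the output meet your acceptance bar?

def score(model_name: str, cases: list[Case]) -> float:
    """Fraction of production-style cases a candidate model handles acceptably."""
    passed = sum(case.check(call_model(model_name, case.prompt)) for case in cases)
    return passed / len(cases)

# Compare candidates on your own cases rather than on a public leaderboard.
# candidates = ["proprietary-frontier", "open-weights-70b"]   # illustrative names
# results = {m: score(m, cases) for m in candidates}
```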
The expansion of context windows also deserves mention. Most serious use cases that used to run into the limits of short context windows now have that problem solved at the infrastructure level rather than through clever workarounds in application code. A model that can hold a genuinely long conversation, or process a large document in a single pass, changes what you can build without complicated chunking and retrieval architecture.
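To make the contrast concrete, here is a minimal sketch of summarising a long document with and without chunking; the `call_model` wrapper is a placeholder for whatever API you actually call:

```python
# A minimal sketch of the contrast, assuming a hypothetical call_model()
# wrapper around whatever chat-completion API you actually use.
def call_model(prompt: str) -> str:
    """Placeholder for a real API call; swap in your provider's client."""
    raise NotImplementedError

def summarise_with_chunking(document: str, chunk_size: int = 8_000) -> str:
    """Short-context pattern: summarise chunks, then summarise the summaries."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    partials = [call_model(f"Summarise this excerpt:\n\n{chunk}") for chunk in chunks]
    return call_model("Combine these partial summaries into one:\n\n" + "\n\n".join(partials))

def summarise_single_pass(document: str) -> str:
    """Long-context pattern: the whole document goes in one request."""
    return call_model(f"Summarise this document:\n\n{document}")
```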
What I expect to see more of in 2025 is specialisation. The general-purpose frontier models will keep improving, but the more interesting development will be purpose-built models for specific domains. Medical, legal, financial, software engineering. Not because the general models cannot handle these domains, but because fine-tuned specialist models can be smaller, cheaper, faster, and more reliable within their defined scope. The economics of that will appeal to teams running serious production applications where cost and latency matter.
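The architectural pattern that falls out of this is a simple router: in-scope tasks go to the cheaper specialist, everything else falls back to the general model. A sketch, with hypothetical model names and the same placeholder wrapper as above:

```python
# An illustrative sketch of the routing pattern; the model names and
# call_model() wrapper are placeholders, not real systems.
SPECIALISTS = {
    "contract_review": "legal-specialist-small",   # hypothetical specialist models
    "code_review": "code-specialist-small",
}
GENERAL_MODEL = "general-frontier-model"           # hypothetical general model

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder for your provider client or local inference."""
    raise NotImplementedError

def route(task_type: str, prompt: str) -> str:
    """Send in-scope tasks to the cheaper specialist, everything else to the general model."""
    model = SPECIALISTS.get(task_type, GENERAL_MODEL)
    return call_model(model, prompt)
```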
The challenge for anyone building in this space right now is that the infrastructure is evolving faster than best practices can settle. The patterns that made sense six months ago need revisiting. That is uncomfortable, but it is also where the opportunity is. Teams that are genuinely learning and adapting, rather than copying last year's architecture decisions, are building things that will hold up.