OpenAI's announcement of GPT-2 in February 2019 was different from previous AI announcements. They described a model capable of generating coherent, convincing text and then declined to release it in full, citing concerns about misuse. The move was unprecedented and polarising.
The samples they released were startling. Given a prompt about the discovery of unicorns in the Andes, the model produced a plausible-sounding news article. Given a technical prompt, it produced technical-sounding prose. The text was not perfect, but it was coherent and confident in a way that the output of earlier language models had not been.
The decision not to release the full model generated significant debate. Some researchers argued it was paternalistic and that the research community should be trusted. Others argued it set an important precedent for thinking about the capabilities and risks of AI systems before releasing them. OpenAI's position was somewhere in the middle: a staged release, starting with the smallest version and monitoring for misuse before releasing larger ones. (The full 1.5-billion-parameter model was eventually released in November 2019.)
What I found most interesting was the sample text itself. Reading it felt odd. The model had no understanding of what it was writing, but the outputs looked like understanding from the outside. A competent human writer producing the same text would have thought about the words, their meaning, their implications. The model was doing something fundamentally different but producing something that resembled the output of understanding.
This raised questions that mattered beyond AI research. How would educational institutions handle work that might be AI-generated? How would journalism verify that quotes and sources were real? How would social media platforms deal with AI-generated propaganda? These questions were nascent in 2019 but would become urgent by 2022.
The technical achievement itself was less surprising to people following the field. Language models had been getting better with scale for years. More parameters, more data, more compute produced better outputs. GPT-2 was a point on a curve, not a discontinuity. The discontinuity was in public awareness.
GPT-2 was also significant for what it suggested about the path ahead. If a 2019 model could produce this quality of text, what would a 2022 model produce? The answer, as anyone who used ChatGPT would discover three years later, was that the improvement trajectory was steep. GPT-2 was the first public glimpse of what was coming.