By 2024, building production AI applications demanded a distinct set of skills and knowledge that did not map cleanly onto either software engineering or machine learning. Chip Huyen wrote about this, and the term AI Engineering started to appear in job descriptions. I have been thinking about what this discipline actually is.
Traditional machine learning engineering was about building and deploying models: data pipelines, feature engineering, training infrastructure, model serving. The primary artifact was a model you built yourself.
AI engineering in 2024 was about building applications on top of models you did not build: prompt engineering, retrieval architectures, evaluation frameworks, cost optimisation, reliability engineering for probabilistic systems. The primary artifacts were prompts, pipelines, and evaluation suites.
The skills required were different. You needed to understand how language models behaved under different conditions. How they failed and why. How to write prompts that consistently produced the output you needed. How to evaluate whether a change to your prompts improved or degraded output quality. How to build guardrails that caught bad outputs before they reached users.
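To make the guardrails point concrete, here is a minimal sketch of an output check that runs before a response reaches users. The rule names and thresholds are my own illustrative assumptions, not any particular product's rules:

```python
import re

# Hypothetical guardrail rules: a length cap, a leaked-refusal check,
# and an empty-output check. Thresholds are illustrative assumptions.
def check_guardrails(response: str) -> list[str]:
    """Return a list of violated rule names; an empty list means it passes."""
    violations = []
    if len(response) > 2000:
        violations.append("too_long")
    if re.search(r"(?i)as an ai language model", response):
        violations.append("boilerplate_refusal")
    if not response.strip():
        violations.append("empty_response")
    return violations

def safe_reply(response: str, fallback: str = "Sorry, please try again.") -> str:
    """Pass the model's response through only if every guardrail passes."""
    return response if not check_guardrails(response) else fallback
```

In a real system these checks sit between the model call and the user, and failures are logged so you can see which rules fire most often.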
Evaluation was the hardest part. Traditional software has test suites where a test either passes or fails. LLM application outputs are probabilistic and subjective. The same prompt produces different outputs across runs. "Better" is often a matter of judgment. Building evaluation infrastructure that could reliably tell you whether a change improved the system took significant thought, and often meant using another LLM as a judge.
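The LLM-as-judge pattern can be sketched in a few lines: format a grading prompt, send it to a second model, and parse a numeric score. The prompt wording is my own, and `call_llm` is stubbed so the example runs offline; in practice it would wrap whatever chat-completion client you use:

```python
# Hypothetical judge prompt; the 1-5 scale and wording are assumptions.
JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Answer: {answer}
Reply with a single integer score from 1 (poor) to 5 (excellent)."""

def call_llm(prompt: str) -> str:
    # Stub standing in for a real model call.
    return "4"

def judge(question: str, answer: str) -> int:
    """Ask a judge model to score an answer; validate the parsed score."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(raw.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score
```

Even this toy version shows why the pattern needs care: the judge's reply must be parsed and validated, because it is itself a probabilistic output.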
Observability had to be rebuilt for LLM systems. Traditional application observability measured latency, error rates, and resource utilisation. LLM application observability needed to measure prompt quality, retrieval relevance, response quality, token costs, and hallucination rates. New tools like LangSmith, Phoenix, and Helicone emerged to address this.
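A minimal version of that per-call instrumentation might record latency, token counts, and estimated cost for every model call. The prices and the word-count token proxy below are illustrative assumptions, not real rates or a real tokenizer:

```python
import time
from dataclasses import dataclass

# Illustrative per-1K-token prices; real rates vary by model and provider.
PRICE_PER_1K_TOKENS = {"input": 0.01, "output": 0.03}

@dataclass
class CallRecord:
    latency_s: float
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens * PRICE_PER_1K_TOKENS["input"]
                + self.output_tokens * PRICE_PER_1K_TOKENS["output"]) / 1000

records: list[CallRecord] = []

def instrumented_call(prompt: str) -> str:
    """Wrap a model call and log latency, tokens, and estimated cost."""
    start = time.perf_counter()
    response = "stubbed model response"      # a real client call goes here
    records.append(CallRecord(
        latency_s=time.perf_counter() - start,
        input_tokens=len(prompt.split()),    # crude stand-in for a tokenizer
        output_tokens=len(response.split()),
    ))
    return response
```

Tools like LangSmith and Phoenix do far more than this, but the core move is the same: every call becomes a structured record you can aggregate.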
Cost management was a discipline in itself. An LLM application that made many calls to GPT-4 could generate surprisingly large bills. Caching, prompt compression, routing simpler queries to cheaper models, and setting token limits all required understanding of both the cost structure and the quality trade-offs.
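Two of those levers, routing and caching, can be sketched together. The routing heuristic and model names below are placeholder assumptions; real routers use classifiers or query features rather than a length check:

```python
import functools

def pick_model(prompt: str) -> str:
    """Hypothetical routing heuristic: short, code-free prompts go cheap."""
    if len(prompt) < 200 and "```" not in prompt:
        return "cheap-model"
    return "expensive-model"

@functools.lru_cache(maxsize=1024)
def cached_call(prompt: str) -> str:
    """Cache identical prompts so repeat queries cost nothing."""
    model = pick_model(prompt)
    return f"[{model}] stubbed response"     # a real client call goes here
```

The quality trade-off lives in `pick_model`: every query sent to the cheap model is a bet that the cheaper output is good enough, which is exactly what your evaluation suite has to verify.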
The people who were good at AI engineering in 2024 combined software engineering discipline with empirical science habits: experiment, measure, iterate. The ability to build a hypothesis about prompt behaviour, test it systematically, and draw valid conclusions from the results was as important as the ability to write clean code.
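That experiment-measure-iterate loop can itself be sketched: run two prompt variants repeatedly against an evaluator and compare means, since a single run tells you nothing about a probabilistic system. The scorer here is a seeded stub with an assumed quality gap, standing in for a real evaluation suite:

```python
import random
import statistics

rng = random.Random(0)  # seeded so the toy experiment is reproducible

def score(variant: str) -> float:
    """Stub evaluator with an assumed quality gap between variants."""
    base = 0.7 if variant == "A" else 0.6
    return min(1.0, max(0.0, base + rng.gauss(0, 0.05)))

def compare(n_runs: int = 30) -> dict[str, float]:
    """Mean score per prompt variant over repeated runs."""
    return {v: statistics.mean(score(v) for _ in range(n_runs))
            for v in ("A", "B")}
```

The habit that matters is structural: enough runs to average out the noise, a fixed evaluator, and a conclusion drawn from the aggregate rather than from one lucky output.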