AI · 5 min read · 9 July 2025

When Everything Became Multimodal

The models that emerged in the first half of 2025 made the text-in, text-out assumption obsolete for most practical purposes. Here is what that actually changes.

AI · Multimodal · LLMs · Voice · Vision

The models that arrived in the first half of 2025 ended the text-in, text-out era for most practical purposes. Vision went from an interesting add-on to a baseline expectation in frontier models. Voice interfaces moved from voice-to-text transcription to genuine speech understanding and generation, at quality levels that change what is worth building. The frontier question shifted from "what can the model understand?" to "what can the model perceive?"

The practical implications took a while to surface in products. Having multimodal capability does not immediately tell you what to build with it, and early applications were often demonstrations of capability rather than solutions to real problems. Take a photo and ask about it. Describe an image. Generate audio from text. Each useful in isolation, but none of them the reshaping of how applications work that the underlying capability suggested.

What I started seeing in mid-2025 were the quieter applications. Document processing workflows that had always required humans to handle anything non-textual started to genuinely automate. Insurance claim processing, medical imaging report drafting, quality control inspection in manufacturing: all of these had multimodal AI in the workflow, reducing human involvement in specific, bounded steps rather than trying to remove humans from the entire process.
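To make "bounded step" concrete, here is a minimal sketch of what the insurance example might look like. The client object, model name, and prompt are my assumptions, not any particular vendor's API; the point is the shape of the step: one photo in, one draft out, a human still signing off.

```python
import base64

def draft_damage_summary(client, image_path: str) -> str:
    # Read and encode the claim photo.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    # One bounded step: photo in, draft summary out. The client object
    # and model name are placeholders, not a real vendor API.
    response = client.generate(
        model="vision-model",
        inputs=[
            {"type": "image", "data": image_b64},
            {
                "type": "text",
                "text": (
                    "Draft a damage summary for this claim photo. "
                    "List visible damage only and flag anything "
                    "unclear for human review."
                ),
            },
        ],
    )
    # The draft goes to an adjuster for review, not straight to the
    # customer: the human stays in the loop, just later in the process.
    return response.text
```

The design choice worth noticing is what the function does not do: it does not decide the claim. It handles one step a human used to do by hand, and leaves the judgment calls where they were.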

Voice was arguably disruptive to more products than vision, simply because so many existing products are built around text interaction. When voice becomes a genuine first-class interface rather than voice-to-text bolted onto a text product, the design assumptions change. Navigation, confirmation, error handling: all of it needs rethinking.

I spent time with a team building a customer support application that had always been primarily text-based. They were experimenting with voice, and the shift was not just adding a microphone; it was restructuring how conversations work. Text conversations can be scanned. Voice conversations are linear. Information that works in a written response does not always work the same way when spoken. Those design differences are not trivial.
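A small sketch makes the scannable-versus-linear difference concrete. The names and steps here are invented, not the team's code: the same three-step answer, rendered once for a screen and once for a speaker.

```python
STEPS = [
    "Open the billing page",
    "Select the disputed charge",
    "Attach the receipt photo",
]

def render_text(steps: list[str]) -> str:
    # Scannable: everything at once, and the user can jump to any step.
    return "To dispute a charge:\n" + "\n".join(
        f"{i}. {s}" for i, s in enumerate(steps, 1)
    )

def render_voice_turn(steps: list[str], step_index: int) -> str:
    # Linear: one step per turn, with an explicit confirmation so the
    # user never has to hold more than one instruction in memory.
    if step_index >= len(steps):
        return "That's everything. The dispute is filed."
    return f"{steps[step_index]}. Say 'done' when you're ready."
```

The text version is one message; the voice version is a loop of turns with checkpoints. That is the kind of restructuring a microphone alone does not buy you.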

What multimodal capability ultimately changes is the definition of what counts as an input. For a decade, AI applications assumed the primary input was text the user typed. That assumption is now worth challenging for almost any product category.
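You can see the shift in the request shape itself. The schema below is invented for illustration; real provider APIs differ, but the structural change is the same: a user turn becomes a list of typed parts rather than a single string.

```python
# A user turn with audio, image, and text parts. Field names are
# illustrative, not any specific provider's API.
request = {
    "messages": [
        {
            "role": "user",
            "parts": [
                {"type": "audio", "uri": "call-recording.wav"},
                {"type": "image", "uri": "receipt-photo.jpg"},
                {
                    "type": "text",
                    "text": "Does the receipt match what the caller described?",
                },
            ],
        }
    ]
}
```

When the input can be a photo, a voice note, or both at once, the product questions change with it.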
