I spent a lot of time in late 2023 and early 2024 figuring out what multimodal AI actually changed in practice. The theoretical use cases were obvious: analyse images, read documents with complex layouts, understand charts and graphs. The reality was richer and stranger than I expected.
The most immediate practical use case for me was debugging. Sharing a screenshot of a broken UI or an error message with the model and asking what was wrong worked better than I expected. The model could read error messages, identify visual inconsistencies, and suggest fixes without requiring me to type out every detail. The reduction in friction for "show me the problem" conversations was real.
For document understanding, GPT-4V was particularly impressive. PDFs with tables, handwritten notes, forms with unusual layouts: the vision model could handle inputs that had been painful or impossible to process with traditional OCR. A scanned invoice could be converted to structured data. A whiteboard photo from a meeting could be transcribed. The results were not perfect, but they were good enough to be useful.
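The invoice-to-structured-data workflow amounts to sending the image alongside an extraction prompt. The sketch below builds such a request payload using the OpenAI chat-completions message shape for image inputs; the model name and the exact field list are illustrative assumptions, not a recommendation.

```python
import base64

def build_invoice_request(image_path: str, model: str = "gpt-4-vision-preview") -> dict:
    """Build a chat-completion payload asking a vision model to extract
    invoice fields as JSON. Field list and model name are illustrative."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    prompt = (
        "Extract the vendor name, invoice number, date, and line items "
        "from this scanned invoice. Respond with JSON only."
    )
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Text instruction and base64-encoded image travel as
                    # separate content parts in the same user message.
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    }
```

In practice you would pass this payload to the API client and parse the JSON out of the response; asking for "JSON only" reduces, but does not eliminate, the need to validate what comes back.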
The healthcare implications particularly caught my attention. Medical imaging analysis, reading handwritten doctor's notes, interpreting lab reports in their visual format: these use cases had been discussed for years in the context of specialised models. A general multimodal model performing reasonably on medical images suggested the specialised model approach might be bypassed in some areas.
The accessibility implications were also significant and underappreciated. A user who cannot read text in an image (because of a visual impairment or because the image is in a different language) could ask the model to describe or translate it. Accessibility features that would have required specialised software were available through a general-purpose interface.
The cost and latency of vision inputs were higher than for text-only prompts, which affected architecture decisions. For applications processing many images, the economics required careful consideration. Running vision analysis on every frame of a video was not viable; selecting key frames intelligently was.
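One simple way to select key frames is a difference heuristic: only send a frame to the model when it differs enough from the last frame you sent. The sketch below is a minimal illustration on flat grayscale pixel lists; the threshold value and the representation are assumptions, and a real pipeline would decode frames with a video library and likely use a smarter change detector.

```python
def select_key_frames(frames, threshold=30.0):
    """Return indices of frames whose mean absolute pixel difference from
    the last *selected* frame exceeds `threshold`. Frames are flat
    grayscale pixel lists of equal length; threshold is an assumption."""
    if not frames:
        return []
    selected = [0]          # always keep the first frame
    last = frames[0]
    for i, frame in enumerate(frames[1:], start=1):
        diff = sum(abs(a - b) for a, b in zip(frame, last)) / len(frame)
        if diff > threshold:
            selected.append(i)
            last = frame    # compare future frames against this one
    return selected

# Three near-identical frames, then a scene change:
video = [[10] * 4, [12] * 4, [11] * 4, [200] * 4, [201] * 4]
print(select_key_frames(video))  # → [0, 3]
```

Here five frames collapse to two vision calls: the savings scale directly with how static the footage is, which is exactly the economic lever the paragraph above describes.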
By mid-2024, multimodal capability was less a differentiating feature and more a baseline expectation for frontier models. The question was not whether a model could see but how well it performed on specific visual tasks.