Article 6 — Multimodal AI: teaching machines to see, hear, and imagine
In Article 5, we explored how AI models learn — from tokenization to backpropagation. Now, we take a leap into multimodal AI — the technology that enables models to work with multiple types of data: text, images, audio, and video.
Multimodal AI refers to models that can process and understand multiple forms of data simultaneously. Instead of just interpreting text or images in isolation, these models can combine and relate information from various modalities.
Think of how humans read text, look at images, and listen to sounds all at once to make sense of the world; multimodal models aim for the same kind of combined understanding.
1. Shared Representations
Multimodal models learn shared representations where different types of data are mapped to a common space.
For example, both text and image embeddings are represented in the same vector space so the model can compare them.
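To make this concrete, here is a minimal sketch using made-up, already-projected vectors (nothing below is a real model's output): once a caption and an image live in the same space, comparing them is just a cosine similarity.

```python
import numpy as np

# Toy embeddings already projected into a shared 4-dimensional space.
# The values are invented purely for illustration.
text_embedding = np.array([0.9, 0.1, 0.0, 0.2])   # e.g., the caption "a cat"
image_embedding = np.array([0.8, 0.2, 0.1, 0.1])  # e.g., a photo of a cat

def cosine_similarity(a, b):
    """Higher values mean the two embeddings point in a similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Because both modalities live in the same space, we can compare them directly.
print(cosine_similarity(text_embedding, image_embedding))  # ~0.98 here: a strong match
```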
2. Cross-Modal Attention
Cross-modal attention lets the model focus on the relevant parts of each modality. For example, when describing an image, the model can “look” at specific regions of the image while processing the text.
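Here is a rough PyTorch sketch of the idea, with random tensors standing in for real text and image encoders (the sizes are illustrative, not taken from any specific model):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 6 text tokens and 16 image patches, all embedded in 64 dims.
d_model, n_text, n_patches = 64, 6, 16
text_tokens = torch.randn(1, n_text, d_model)     # (batch, tokens, dim)
image_patches = torch.randn(1, n_patches, d_model)

# Cross-attention: text tokens act as queries, image patches as keys/values,
# so each word can "look at" the image regions most relevant to it.
cross_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
attended_text, attention_weights = cross_attention(
    query=text_tokens, key=image_patches, value=image_patches
)

print(attended_text.shape)      # torch.Size([1, 6, 64])
print(attention_weights.shape)  # torch.Size([1, 6, 16]) - one row of weights per text token
```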
3. Multimodal Transformers
Transformers, as discussed in Article 4, are the backbone of many multimodal models.
These transformers extend traditional attention mechanisms to handle data from multiple modalities at once.
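One common pattern, sketched below with random tensors in place of real encoder outputs: concatenate the text and image token sequences, tag each token with a learned modality embedding, and let a standard Transformer encoder attend over the joint sequence. This is a generic illustration, not any particular model's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical setup: text tokens and image patches already embedded at the same width.
d_model = 64
text_tokens = torch.randn(1, 6, d_model)     # 6 text tokens
image_patches = torch.randn(1, 16, d_model)  # 16 image patches

# Learned "modality embeddings" tell the model which tokens came from where.
modality_embedding = nn.Embedding(2, d_model)  # 0 = text, 1 = image
text_tokens = text_tokens + modality_embedding(torch.zeros(1, 6, dtype=torch.long))
image_patches = image_patches + modality_embedding(torch.ones(1, 16, dtype=torch.long))

# One sequence, both modalities: self-attention now mixes text and image freely.
joint_sequence = torch.cat([text_tokens, image_patches], dim=1)  # (1, 22, 64)

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
fused = encoder(joint_sequence)
print(fused.shape)  # torch.Size([1, 22, 64])
```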
CLIP (Contrastive Language-Image Pre-training)
Developed by OpenAI, CLIP can understand both images and text.
It learns to associate images with their textual descriptions, so you can ask it to “find images of a cat wearing a hat” and it will match the description to images.
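If you want to try this yourself, the snippet below is one way to score captions against an image using a released CLIP checkpoint via the Hugging Face transformers library. It assumes transformers and Pillow are installed, and cat_in_hat.jpg is a placeholder for any local image file.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a publicly released CLIP checkpoint (weights download on first use).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "cat_in_hat.jpg" is a placeholder path: substitute any local image.
image = Image.open("cat_in_hat.jpg")
captions = ["a cat wearing a hat", "a two-story house", "a bowl of soup"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, prob in zip(captions, probs[0]):
    print(f"{prob.item():.1%}  {caption}")
```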
DALL·E
Another OpenAI model, DALL·E can generate images from text prompts.
For example, you could say, “A two-story house shaped like a shoe,” and DALL·E will create that image from scratch.
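As a sketch of what calling such a model looks like through the OpenAI Python SDK (assuming the openai package is installed, an OPENAI_API_KEY environment variable is set, and the named model is available to your account):

```python
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment.
client = OpenAI()

response = client.images.generate(
    model="dall-e-3",  # model name depends on what your account can access
    prompt="A two-story house shaped like a shoe",
    size="1024x1024",
    n=1,
)

# The API returns a URL (or base64 data) for the generated image.
print(response.data[0].url)
```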
Whisper
Whisper is a speech recognition model by OpenAI that transcribes audio into text.
It supports multiple languages and handles audio nuances, like background noise or accents.
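A minimal sketch using the open-source openai-whisper package (assuming it and ffmpeg are installed; interview.mp3 is a placeholder filename):

```python
import whisper  # the open-source `openai-whisper` package

# Load one of the released checkpoints; "base" is small enough to run on a laptop.
model = whisper.load_model("base")

# "interview.mp3" is a placeholder: point this at any audio file on disk.
result = model.transcribe("interview.mp3")

print(result["language"])  # detected language code, e.g. "en"
print(result["text"])      # the transcription
```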
Sora
Sora can generate video from text, combining vision and language models to create videos from written descriptions. It’s a glimpse into the future of AI creativity.
Input Is Tokenized
Each modality (text, image, audio) is tokenized into meaningful chunks.
Text: Words are split into tokens.
Images: Pixels are grouped into patches or embeddings.
Audio: Waveforms are converted into spectrograms (time-frequency representations that can be treated much like images), as sketched below.
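Here is a small sketch of the image and audio cases (text tokenization was covered in Article 5). It uses PyTorch and torchaudio on random tensors, with ViT-style 16×16 patches and an 80-band mel spectrogram as illustrative choices:

```python
import torch
import torchaudio

# --- Images: cut a 224x224 RGB image into 16x16 patches ---
image = torch.randn(3, 224, 224)                     # fake image tensor for illustration
patches = image.unfold(1, 16, 16).unfold(2, 16, 16)  # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(14 * 14, -1)
print(patches.shape)  # torch.Size([196, 768]) -> 196 patch "tokens" of 768 values each

# --- Audio: turn a raw waveform into a mel spectrogram ---
waveform = torch.randn(1, 16000)                     # 1 second of fake audio at 16 kHz
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
spectrogram = to_mel(waveform)
print(spectrogram.shape)  # roughly torch.Size([1, 80, 81]): 80 mel bands over ~81 time frames
```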
Shared Embedding Space
All modalities are projected into a shared vector space, where similar information across different modalities lies close together.
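One common way to build such a space, sketched here with made-up feature sizes: give each modality its own small projection head, L2-normalize the outputs, and train matching image-caption pairs (contrastively, as in CLIP) to end up close together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: the image encoder outputs 512-dim features, the text
# encoder 384-dim. Two projection heads map both into a shared 256-dim space.
image_features = torch.randn(4, 512)  # a batch of 4 image feature vectors
text_features = torch.randn(4, 384)   # the 4 matching caption feature vectors

image_projection = nn.Linear(512, 256)
text_projection = nn.Linear(384, 256)

# L2-normalize so that similarity is just a dot product (cosine similarity).
image_embeddings = F.normalize(image_projection(image_features), dim=-1)
text_embeddings = F.normalize(text_projection(text_features), dim=-1)

# A 4x4 grid of image-text similarities; contrastive training pushes the
# diagonal (matching pairs) up and everything else down.
similarity = image_embeddings @ text_embeddings.T
print(similarity.shape)  # torch.Size([4, 4])
```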
Cross-Modal Interaction
Attention layers operate over text and image tokens at the same time, enabling the model to link and cross-reference information between different types of data.
Model Processes Information
Similar to single-modality transformers, the information is passed through multiple layers, with the model learning deep, cross-modal associations.
Final Output
The output could be a prediction or transformation based on input from any or all modalities (e.g., generating a caption from an image, translating audio to text, or creating an image from a text prompt).
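As one concrete example of such an output, the snippet below captions an image with a publicly available BLIP checkpoint through the Hugging Face pipeline API (the model name and street_scene.jpg path are illustrative choices, not requirements):

```python
from transformers import pipeline

# Assumes the transformers package is installed; the checkpoint downloads on first use.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# "street_scene.jpg" is a placeholder: any local image path or image URL works.
result = captioner("street_scene.jpg")
print(result[0]["generated_text"])  # e.g. "a busy street with cars and pedestrians"
```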
Holistic Understanding
Humans don’t think in just one modality; we combine sight, sound, and language to understand the world. Multimodal AI mimics this.
Richer Context
Having access to more types of data allows models to develop a richer understanding of the context. For example, understanding a scene in a video is easier when the model knows both what’s in the image and what’s being said.
Generalization
Multimodal AI is more flexible — able to perform a variety of tasks across different domains by using multiple types of data.
Here are some exciting use cases:
Medical Diagnosis
AI can combine medical images (like X-rays) with patient records to provide more accurate diagnoses.
Autonomous Vehicles
Cars combine cameras (vision), microphones (audio), and other sensor data (such as lidar and radar) to perceive and navigate the world.
AI Content Creation
Models like DALL·E and GPT-4 can be combined to generate creative content: text, images, and even video.
Next, we’ll dive into AI Optimization Techniques — how do we make these powerful models more efficient? We’ll explore methods like pruning, quantization, and transfer learning.
Multimodal AI enables models to understand and combine text, images, audio, and even video. This approach allows AI to create deeper insights and drive richer, more diverse applications, from content generation to autonomous systems.