Multimodal AI

Multimodal AI refers to AI systems that process or produce more than one data modality, such as text, images, audio, video, documents, or structured data.

Expanded definition

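Multimodal AI systems can understand, generate, or transform information across modalities. Examples include visual question answering (image and text in, text out), image generation (typically text in, image out), speech transcription (audio in, text out), document understanding, video analysis, and assistants that combine text with screenshots or files. Capabilities differ from model to model: which input modalities are accepted, which output modalities can be produced, how much context fits in a single request, and which tools the model can call.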
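
One way to make these capability differences concrete is to record them per model and filter by what a task needs. The sketch below is illustrative only: `Modality`, `ModelProfile`, `supports`, and both model names are hypothetical and do not describe any real model or library API, and measuring context length in tokens is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    DOCUMENT = "document"
    STRUCTURED = "structured"


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical capability record; field names are illustrative."""
    name: str
    inputs: frozenset              # modalities the model accepts
    outputs: frozenset             # modalities the model can produce
    context_length: int            # assumed to be measured in tokens
    tools: frozenset = frozenset() # names of tool integrations, if any


def supports(model, needs_in, needs_out, min_context=0):
    """Return True if the model accepts every required input modality,
    can produce every required output modality, and has enough context."""
    return (
        set(needs_in) <= model.inputs
        and set(needs_out) <= model.outputs
        and model.context_length >= min_context
    )


# Made-up profiles for illustration only; these are not real models.
vqa_model = ModelProfile(
    name="example-vision-chat",
    inputs=frozenset({Modality.TEXT, Modality.IMAGE}),
    outputs=frozenset({Modality.TEXT}),
    context_length=32_000,
)
speech_model = ModelProfile(
    name="example-transcriber",
    inputs=frozenset({Modality.AUDIO}),
    outputs=frozenset({Modality.TEXT}),
    context_length=8_000,
)

# Visual question answering: text and image in, text out.
vqa_task = ({Modality.TEXT, Modality.IMAGE}, {Modality.TEXT})
candidates = [m.name for m in (vqa_model, speech_model) if supports(m, *vqa_task)]
print(candidates)  # ['example-vision-chat']
```

Treating inputs and outputs as separate sets reflects that a model may read a modality it cannot generate; the transcriber above accepts audio but only emits text.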
