Data Modality

Data modality refers to the type of input data that an AI system processes and interprets. In the context of large language models (LLMs) and multimodal AI systems, the main modalities are text, images, audio, and video, each a different form of information that a system must learn to understand and analyze. LLMs originally operated exclusively on text, but advances in multimodal architectures now allow a single model to accept several data types, often within the same input.

Types of Modalities

Common data modalities include text (written language), images (visual information), audio (spoken language and other sounds), and video (temporal sequences of image frames, often with an accompanying audio track). Each modality has distinct characteristics: text is discrete and symbolic, images are spatial and continuous, audio is temporal and continuous, and video combines spatial and temporal dimensions. Some systems also incorporate specialized modalities such as structured data, point clouds, or sensor readings, depending on their application domain.
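
The structural differences described above show up in how raw data is usually laid out. The following is a minimal sketch in plain Python, not a prescribed format: the sentence, the whitespace vocabulary, the array sizes, and the helper `shape` are all illustrative assumptions.

```python
import math

# Text: discrete and symbolic, stored as a sequence of integer token IDs.
# The whitespace vocabulary here is a hypothetical stand-in for a real tokenizer.
sentence = "a cat sat on a mat"
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence.split())))}
text_tokens = [vocab[word] for word in sentence.split()]

# Image: spatial and continuous, a height x width x channels grid of intensities.
H, W, C = 4, 4, 3
image = [[[0.0 for _ in range(C)] for _ in range(W)] for _ in range(H)]

# Audio: temporal and continuous, a 1-D sequence of amplitude samples.
sample_rate = 8000   # samples per second (assumed)
num_samples = 80     # 10 ms of audio at the assumed sample rate
audio = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(num_samples)]

# Video: spatial and temporal, a sequence of image frames.
T = 5
video = [[[[0.0 for _ in range(C)] for _ in range(W)] for _ in range(H)]
         for _ in range(T)]


def shape(nested):
    """Return the dimensions of a regularly nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


for name, data in [("text", text_tokens), ("image", image),
                   ("audio", audio), ("video", video)]:
    print(name, shape(data))
```

Text and audio are both one-dimensional sequences here, but text holds integer symbols while audio holds real-valued samples. Video adds a time axis in front of the image layout.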

Processing Multimodal Data

Multimodal AI systems must convert different modalities into compatible representations before processing them together. This typically involves separate encoders for each modality that transform diverse input types into a shared vector space or latent representation. The model then processes these unified representations to enable tasks like image captioning, visual question answering, or cross-modal retrieval, where understanding requires information from multiple input types.
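
The encoder-per-modality pattern can be sketched as follows. This is a toy illustration under stated assumptions, not a real model: the encoders are random linear projections rather than trained networks, and names such as `EMBED_DIM`, `make_projection`, `encode_text`, `encode_image`, and `rank_images` are hypothetical. In practice the encoders are learned so that related inputs from different modalities land close together; with the untrained weights used here, the similarity scores carry no meaning.

```python
import math
import random

EMBED_DIM = 8  # size of the shared vector space (assumed)


def make_projection(in_dim, out_dim, rng):
    """Random linear map standing in for a trained encoder."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0, 1) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, features):
    """Multiply a matrix (list of rows) by a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def encode_text(token_ids, vocab_size, proj):
    """Text encoder: bag-of-tokens counts projected into the shared space."""
    counts = [0.0] * vocab_size
    for token in token_ids:
        counts[token] += 1.0
    return l2_normalize(project(proj, counts))


def encode_image(pixels, proj):
    """Image encoder: flattened pixel values projected into the shared space."""
    return l2_normalize(project(proj, pixels))


def cosine_similarity(a, b):
    """Dot product of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_images(text_vec, image_vecs):
    """Cross-modal retrieval: image indices sorted by similarity to the text."""
    scores = [cosine_similarity(text_vec, vec) for vec in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)


rng = random.Random(0)
VOCAB_SIZE, NUM_PIXELS = 5, 16
text_proj = make_projection(VOCAB_SIZE, EMBED_DIM, rng)
image_proj = make_projection(NUM_PIXELS, EMBED_DIM, rng)

text_vec = encode_text([0, 1, 4, 3, 0, 2], VOCAB_SIZE, text_proj)
image_vecs = [encode_image([rng.random() for _ in range(NUM_PIXELS)], image_proj)
              for _ in range(3)]

ranking = rank_images(text_vec, image_vecs)
print(ranking)
```

Because both encoders emit unit-length vectors of the same dimension, a text vector and an image vector can be compared directly, which is what makes cross-modal retrieval possible once the encoders have been trained.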
