Text
Text is a fundamental data modality and the primary input and output format for large language models (LLMs). LLMs process sequences of tokens (discrete units representing words, subwords, or characters) and generate new text based on patterns learned from training data. Token-based processing remains the core computational mechanism of language models, even as these systems have expanded to handle additional modalities.
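To make the idea of tokens concrete, here is a minimal sketch of word-level and character-level tokenization with a toy vocabulary. The function names (word_tokens, char_tokens, build_vocab, encode, decode) are illustrative assumptions, not part of any LLM library; production models typically use learned subword tokenizers rather than the simple splitting rule shown here.

```python
import re


def word_tokens(text: str) -> list[str]:
    # Split into runs of word characters and individual punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text: str) -> list[str]:
    # Character-level tokenization: every character is its own token.
    return list(text)


def build_vocab(tokens: list[str]) -> dict[str, int]:
    # Assign integer IDs in order of first appearance (toy scheme).
    vocab: dict[str, int] = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    # Map each token to its integer ID.
    return [vocab[tok] for tok in tokens]


def decode(ids: list[int], vocab: dict[str, int]) -> list[str]:
    # Invert the vocabulary to map IDs back to tokens.
    inverse = {i: tok for tok, i in vocab.items()}
    return [inverse[i] for i in ids]


sentence = "the cat saw the cat."
words = word_tokens(sentence)
vocab = build_vocab(words)
ids = encode(words, vocab)
print(words)  # ['the', 'cat', 'saw', 'the', 'cat', '.']
print(ids)    # [0, 1, 2, 0, 1, 3]
```

Repeated words map to the same ID, which is what lets a model treat text as a sequence over a fixed, discrete vocabulary.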
Text in Multimodal Systems
In contemporary multimodal AI systems, text processing operates alongside other data formats, including images, audio, and video. Text frequently serves as a unifying representation layer: information from other modalities is converted into a shared format that LLMs can process. For example, an image can be described in a caption, or audio can be transcribed, and the resulting text is then integrated into an LLM’s reasoning pipeline. This makes text a practical bridge between different data types within integrated AI systems.
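The bridging role of text can be sketched as follows. This example assumes the captioning and transcription steps have already run elsewhere and only shows how their textual results might be merged into a single prompt. The Observation class, the LABELS mapping, and to_prompt are hypothetical names chosen for illustration, not an established API.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    modality: str  # e.g. "image", "audio", or "text"
    text: str      # textual rendering: a caption, a transcript, or raw text


# Hypothetical labels used to mark where each piece of text came from.
LABELS = {
    "image": "Image caption",
    "audio": "Audio transcript",
    "text": "User text",
}


def to_prompt(observations: list[Observation]) -> str:
    # Render every observation as one labeled line of plain text.
    lines = []
    for obs in observations:
        label = LABELS.get(obs.modality, obs.modality.capitalize())
        lines.append(f"[{label}] {obs.text}")
    return "\n".join(lines)


prompt = to_prompt([
    Observation("image", "A dog on a beach."),
    Observation("audio", "Waves and barking."),
    Observation("text", "Where was this recorded?"),
])
print(prompt)
```

Keeping the origin label on each line is one possible design choice: it lets the downstream model distinguish a caption from something the user actually wrote.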
Role in AI Agents
In AI agent applications, text functions as both the medium for instructions and the primary output mechanism. Users typically communicate goals and context through natural language text, which agents parse and reason about. Text generation remains the standard method for agents to produce responses, explanations, and action specifications, making it essential to agent usability and interpretability.
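As one illustration of text serving as an action specification, the sketch below parses a JSON string produced by an agent into a validated action. The JSON shape, the action names in ALLOWED_ACTIONS, and the parse_action helper are all assumptions made for this example; real agent frameworks define their own formats.

```python
import json

# Hypothetical set of actions this toy agent is allowed to request.
ALLOWED_ACTIONS = {"search", "reply"}


def parse_action(model_output: str) -> dict:
    # Parse the agent's text output as JSON (raises ValueError if malformed).
    spec = json.loads(model_output)
    if not isinstance(spec, dict):
        raise ValueError("action specification must be a JSON object")

    # Reject anything outside the allowed action set.
    name = spec.get("action")
    if name not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {name!r}")

    # Arguments are optional but must be an object when present.
    args = spec.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")

    return {"action": name, "args": args}


action = parse_action('{"action": "search", "args": {"query": "tide tables"}}')
print(action["action"], action["args"])
```

Because the specification is plain text, it can be logged and read by a person before it is executed, which is part of what makes text-based agents easier to inspect.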
Source Notes
- 2026-04-10: What is Multimodal AI? How LLMs Process Text, Images, and
- 2026-04-07: Multimodal AI Concepts Approaches and Data Processing by LLMs