Language Data
Textual and symbolic information used to train, evaluate, and fine-tune language models, encompassing raw corpora, tokenized sequences, structured datasets, and metadata.
Types & Sources
- Raw Corpora: Unstructured text from the web, books, and code; foundational for pre-training large language models (LLMs).
- Curated Datasets: Filtered subsets for alignment, safety, and domain specificity.
- Synthetic Data: Machine-generated text to augment reasoning or rare domains.
Processing & Representation
- Tokenization: Discretization of text; vocabulary design affects data efficiency.
- Embeddings: Vector representations of semantic content; the form in which a model internally represents its input tokens.
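The two steps above can be illustrated with a minimal sketch. It assumes a toy whitespace and character tokenizer and a randomly initialized embedding table; the helper names (`build_vocab`, `encode`) and all sizes are hypothetical, chosen only for illustration. Production systems typically use learned subword vocabularies and trained embeddings instead.

```python
import random
from collections import Counter

def build_vocab(tokens, max_size):
    """Map the most frequent tokens to integer ids; id 0 is reserved for unknowns."""
    most_common = [tok for tok, _ in Counter(tokens).most_common(max_size - 1)]
    return {"<unk>": 0, **{tok: i + 1 for i, tok in enumerate(most_common)}}

def encode(tokens, vocab):
    """Replace each token with its id, falling back to the unknown id."""
    return [vocab.get(tok, 0) for tok in tokens]

text = "the cat sat on the mat"

# Two vocabulary designs for the same text: words vs. single characters.
word_tokens = text.split()
char_tokens = list(text)

word_vocab = build_vocab(word_tokens, max_size=16)
char_vocab = build_vocab(char_tokens, max_size=64)

word_ids = encode(word_tokens, word_vocab)
char_ids = encode(char_tokens, char_vocab)

# Toy embedding table: one random vector per vocabulary entry (untrained).
rng = random.Random(0)
embedding_dim = 4
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(word_vocab))
]
word_vectors = [embedding_table[i] for i in word_ids]

print(len(word_ids), "word tokens vs.", len(char_ids), "character tokens")
```

The same sentence yields far fewer tokens under the word-level vocabulary than under the character-level one, which is one way vocabulary design affects how much text fits in a fixed sequence length.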
Quality & Governance
- Data Quality: Curation and filtering often outweigh volume in performance gains.
- Bias: Biases inherent in source text carry over to trained models; mitigation approaches include dataset balancing and adversarial training.
- Licensing: Constraints on usage affect deployment and commercialization.
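As a rough illustration of the curation and filtering mentioned above, the following sketch applies three simple heuristics: a minimum word count, a cap on the share of symbol characters, and exact deduplication after normalization. The function name `filter_corpus` and the thresholds are assumptions made for this example, not values from any particular pipeline.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def filter_corpus(docs, min_words=5, max_symbol_ratio=0.3):
    """Keep documents that pass the heuristics, dropping exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        # Drop empty or very short documents.
        if not norm or len(norm.split()) < min_words:
            continue
        # Drop documents dominated by punctuation or other symbols.
        symbols = sum(1 for ch in norm if not (ch.isalnum() or ch.isspace()))
        if symbols / len(norm) > max_symbol_ratio:
            continue
        # Drop documents whose normalized form was already seen.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "Language models learn statistical patterns from text.",
    "language models   learn statistical patterns from text.",
    "too short",
    "$$$ ### !!! @@@ %%% ^^^ &&& ***",
]
kept = filter_corpus(docs)
print(f"kept {len(kept)} of {len(docs)} documents")
```

Real pipelines usually add near-duplicate detection, language identification, and license metadata checks; this sketch covers only the simplest cases.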
Architectural Dependencies
- Large language model (LLM) architectures are trained by next-token prediction, which requires massive amounts of language data because statistical patterns and world knowledge must be reconstructed implicitly from text.
- Joint Embedding Predictive Architecture (JEPA) predicts within an abstract embedding space, avoiding token-level reconstruction; this is intended to reduce reliance on exhaustive language data while targeting direct world-model learning.
- Yann LeCun posits JEPA as a superior path beyond LLMs, arguing that reasoning capabilities scale better via representation-space prediction than via autoregressive text generation.
- Details: Yann LeCun’s JEPA Proposal: A Path Beyond LLMs.
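The difference between the two training objectives described above can be sketched numerically. This is a toy illustration under stated assumptions, not an implementation of either architecture: the logits and embedding vectors are made-up values, and the squared-error distance in embedding space is chosen here for simplicity.

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy of the target token under a softmax over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]

def embedding_prediction_loss(predicted, target):
    """Mean squared error between predicted and target embedding vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

# Token-space objective: score every vocabulary entry, penalize the true token.
logits = [2.0, 0.5, -1.0, 0.0]  # hypothetical scores over a 4-token vocabulary
token_loss = next_token_loss(logits, target_id=0)

# Embedding-space objective: compare a predicted vector to a target vector.
predicted_embedding = [0.1, -0.3, 0.8]  # hypothetical predictor output
target_embedding = [0.0, -0.2, 1.0]     # hypothetical target-encoder output
embed_loss = embedding_prediction_loss(predicted_embedding, target_embedding)

print(f"token-level loss: {token_loss:.4f}")
print(f"embedding-space loss: {embed_loss:.4f}")
```

The token-level loss depends on a distribution over the whole vocabulary, while the embedding-space loss compares two vectors directly and never reconstructs individual tokens.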