Text Embeddings

Text embeddings are numerical representations of discrete text units (words, sentences, documents) in a continuous vector space. They map high-dimensional, sparse data into lower-dimensional, dense vectors while preserving semantic relationships.
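
The sparse-to-dense mapping can be illustrated with a lookup table. The sketch below is a toy illustration, not how any particular model is built: the vocabulary, the 2-dimensional table values, and the helper names (one_hot, embed, one_hot_times_table) are all made up for this example. It shows that multiplying a one-hot vector by an embedding table is the same as selecting one row of that table.

```python
# Hypothetical five-word vocabulary; real vocabularies have tens of thousands of entries.
vocab = ["cat", "dog", "car", "road", "tree"]

# Hypothetical embedding table: one dense 2-dimensional row per vocabulary word.
# In a real model these values are learned; here they are invented.
embedding_table = [
    [0.9, 0.1],  # cat
    [0.8, 0.2],  # dog
    [0.1, 0.9],  # car
    [0.2, 0.8],  # road
    [0.5, 0.5],  # tree
]


def one_hot(word):
    """Sparse representation: a vocabulary-sized vector with a single 1.0."""
    return [1.0 if w == word else 0.0 for w in vocab]


def embed(word):
    """Dense representation: look up the word's row in the table."""
    return embedding_table[vocab.index(word)]


def one_hot_times_table(word):
    """Multiply the one-hot vector by the table; equivalent to the row lookup."""
    sparse = one_hot(word)
    dims = len(embedding_table[0])
    return [
        sum(sparse[i] * embedding_table[i][d] for i in range(len(vocab)))
        for d in range(dims)
    ]


print(one_hot("car"))              # vocabulary-sized, mostly zeros
print(embed("car"))                # short and dense
print(one_hot_times_table("car"))  # same values as the lookup
```

The one-hot vector grows with the vocabulary, while the dense vector keeps a fixed size chosen by the model designer.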

Core Concepts

  • Semantic Similarity: Texts with similar meanings map to vectors that lie close together, as measured by Euclidean distance or cosine similarity.
  • Dense vs. Sparse: Embeddings are dense vectors, unlike traditional bag-of-words or TF-IDF representations, which are sparse (mostly zeros).
  • Dimensionality: Embedding sizes commonly range from about 128 to 1536 dimensions. Larger vectors can capture finer distinctions but cost more to store and compare.
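
The two proximity measures named above can be sketched in a few lines. This is a minimal example using only the standard library; the three 3-dimensional vectors are invented for illustration and do not come from any real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b: 0.0 means identical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy embeddings (assumed values, chosen so that the two
# animal words point in a similar direction and the vehicle does not).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

print("cosine(cat, kitten):", cosine_similarity(cat, kitten))
print("cosine(cat, car):   ", cosine_similarity(cat, car))
print("dist(cat, kitten):  ", euclidean_distance(cat, kitten))
print("dist(cat, car):     ", euclidean_distance(cat, car))
```

Note that the two measures run in opposite directions: related texts have a higher cosine similarity but a lower Euclidean distance.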

Applications

  • Search & Retrieval: Semantic search can match a query to relevant text even when the two share no keywords, because it compares meaning rather than exact terms.
  • Clustering: Grouping similar documents or topics automatically.
  • Recommendation Systems: Matching user preferences with item attributes via vector proximity.
  • Input for LLMs: Embeddings typically drive the retrieval step of RAG (Retrieval-Augmented Generation) pipelines, selecting the passages that are passed to the model.
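
Retrieval by vector proximity, which underlies the search and RAG uses above, can be sketched as a nearest-neighbor ranking. Everything specific here is an assumption for illustration: the document titles, their 3-dimensional vectors, the query vector, and the top_k helper are hypothetical, and a real system would obtain the vectors from an embedding model and usually store them in a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical corpus: text mapped to precomputed (invented) embeddings.
documents = {
    "How to reset a password": [0.9, 0.1, 0.1],
    "Managing account settings": [0.6, 0.6, 0.1],
    "Quarterly sales report": [0.1, 0.1, 0.9],
}


def top_k(query_vector, doc_vectors, k=2):
    """Return the k document texts most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in doc_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]


# Invented embedding standing in for a query such as "I forgot my login".
query_vector = [0.9, 0.2, 0.0]

results = top_k(query_vector, documents, k=2)
print(results)
```

In a RAG pipeline, the texts returned by a step like top_k would then be placed in the prompt sent to the language model.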

See Also

  • Cosine Similarity
  • Word2Vec
  • BERT
  • High-Dimensional Space