Word Embeddings
Word embeddings are a type of word representation in which words with similar meanings have similar representations. They are a distributed representation of text and are often regarded as one of the key breakthroughs in NLP.
Core Concepts
- Numerical Representation: Converts discrete tokens (words, phrases, documents) into dense vectors of real numbers.
- Semantic Proximity: Words with similar meanings are located close to each other in the vector space (see Vector Space Model).
- Dimensionality Reduction: Maps high-dimensional, sparse one-hot vectors (see One-Hot Encoding) to lower-dimensional dense vectors that capture syntactic and semantic information.
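The contrast between one-hot and dense vectors can be sketched in a few lines. This is a toy illustration only: the embedding values below are hand-picked assumptions, not the output of any trained model, and the names EMBEDDINGS, one_hot, and cosine are hypothetical.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
EMBEDDINGS = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.90, 0.15],
    "apple":  [0.10, 0.20, 0.95],
    "banana": [0.15, 0.10, 0.90],
}

# A real vocabulary has tens of thousands of entries, so one-hot vectors
# are far longer than the dense vectors; here the gap is only 4 vs 3.
VOCAB = sorted(EMBEDDINGS)


def one_hot(word):
    """Sparse one-hot vector: one dimension per vocabulary entry."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors of two different words are always orthogonal.
one_hot_sim = cosine(one_hot("king"), one_hot("queen"))

# Dense vectors can place related words close together.
dense_sim_related = cosine(EMBEDDINGS["king"], EMBEDDINGS["queen"])
dense_sim_unrelated = cosine(EMBEDDINGS["king"], EMBEDDINGS["apple"])
```

With these assumed values, the one-hot similarity is exactly zero, while the dense similarity of "king" and "queen" is much higher than that of "king" and "apple".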
Key Algorithms & Models
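- Word2Vec: Includes the CBOW (Continuous Bag-of-Words) architecture, which predicts a word from its context, and the Skip-gram architecture, which predicts context words from a word.
- GloVe: Global Vectors for Word Representation, trained on global word-word co-occurrence statistics, combining ideas from global matrix factorization and local context-window methods.
- Transformer-based models (e.g., BERT, GPT): Generate contextualized embeddings, where a word's vector depends on the surrounding text.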
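To make the CBOW and Skip-gram distinction concrete, here is a minimal sketch of how training examples might be drawn from a sentence. It assumes a fixed symmetric context window and whitespace tokenization; actual Word2Vec implementations add details such as subsampling and negative sampling that are omitted here. The function names skipgram_pairs and cbow_examples are hypothetical.

```python
def _context_indices(i, length, window):
    """Indices within `window` positions of i, excluding i itself."""
    lo = max(0, i - window)
    hi = min(length, i + window + 1)
    return [j for j in range(lo, hi) if j != i]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (center, context) pair is one training example."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in _context_indices(i, len(tokens), window):
            pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the whole context is the input, the center word is the target."""
    examples = []
    for i, target in enumerate(tokens):
        context = [tokens[j] for j in _context_indices(i, len(tokens), window)]
        examples.append((context, target))
    return examples


sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
examples = cbow_examples(sentence, window=1)
```

With a window of 1, the word "cat" yields the Skip-gram pairs ("cat", "the") and ("cat", "sat"), and the single CBOW example (["the", "sat"], "cat").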
Applications
- Semantic Search
- Recommendation Systems
- Sentiment Analysis
- Text Classification
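As an illustration of the first application, semantic search can be sketched as ranking documents by how closely their embeddings align with a query embedding. The document titles, vectors, and the search function below are all assumptions made for this example; in practice the vectors would come from an embedding model.

```python
import math

# Hypothetical, pre-computed document embeddings (illustrative values only).
DOC_VECTORS = {
    "How to train a neural network": [0.9, 0.1, 0.2],
    "Best pasta recipes":            [0.1, 0.9, 0.1],
    "Introduction to deep learning": [0.8, 0.2, 0.3],
}


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def search(query_vector, doc_vectors, top_k=2):
    """Return the top_k titles ranked by dot product of unit vectors."""
    q = normalize(query_vector)
    scored = []
    for title, vec in doc_vectors.items():
        d = normalize(vec)
        score = sum(a * b for a, b in zip(q, d))
        scored.append((score, title))
    scored.sort(reverse=True)  # highest score first
    return [title for _, title in scored[:top_k]]


# Assumed embedding of a query such as "neural network training".
query = [0.95, 0.05, 0.15]
results = search(query, DOC_VECTORS)
```

Because the vectors are normalized first, the dot product equals cosine similarity, so the ranking reflects direction in the vector space rather than vector length.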
Integration & Notes
- For a detailed overview, see Thu Vu’s guide: Vector Embeddings: Semantic Representation for NLP and AI
- Embeddings serve as foundational features for downstream ML tasks.
- Covers conversion of text units into numerical formats for model ingestion.
- Highlights the shift from discrete symbols to continuous geometric spaces.
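One simple way to turn a text unit into a fixed-length feature vector for a downstream model is to average the vectors of its words. The sketch below assumes whitespace tokenization and a tiny hand-made vector table; WORD_VECTORS and embed_text are hypothetical names, and averaging is only one of several pooling strategies.

```python
# Hypothetical 2-dimensional word vectors; real models use far more dimensions.
WORD_VECTORS = {
    "great": [0.9, 0.1],
    "movie": [0.5, 0.5],
    "awful": [0.1, 0.9],
}


def embed_text(text, word_vectors):
    """Average the vectors of known words into one fixed-length vector."""
    tokens = text.lower().split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    if not known:
        return [0.0] * dim  # fallback when no token is in the table
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


features = embed_text("Great movie", WORD_VECTORS)
unknown = embed_text("zzz", WORD_VECTORS)
```

The resulting vector has the same length regardless of how many words the text contains, which is what lets it be fed directly to a classifier or regressor.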