The basic idea
How would you make a computer "understand" that "dog" and "puppy" are related, but "dog" and "thermostat" are not?
You can't program the rules; there are too many words, too many relationships, too much context. What you can do: represent each word as a list of numbers in such a way that similar words have similar lists. Then "similarity" becomes a matter of comparing the lists numerically.
This is what an embedding is. Each input — a word, a sentence, a product, an image — becomes a vector of numbers. The vectors are designed (by training a neural network) so that semantically related inputs end up near each other in the vector space.
Embeddings are the foundation of nearly every modern AI system that handles language, recommendations, search, or retrieval.
A concrete example
A typical word embedding might assign each word a 768-dimensional vector (a list of 768 numbers). After training:
- "dog" might be
[0.21, -0.55, 0.78, ..., 0.04] - "puppy" might be
[0.19, -0.51, 0.81, ..., 0.06] - "thermostat" might be
[-0.83, 0.42, -0.11, ..., 0.97]
"dog" and "puppy" have very similar vectors (close in the 768-dimensional space). "thermostat" has a very different vector.
No individual dimension is interpretable. You can't say "dimension 5 is dogness" — the meaning is distributed across all dimensions. But the geometry of the space encodes meaning.
You can do this for sentences, paragraphs, whole documents, images, audio clips, or anything else. The fixed-length vector is the universal representation.
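To make "close in the vector space" concrete, here's a minimal numpy sketch. It reuses the visible numbers from the example above as tiny 4-dimensional toy vectors; real embeddings have hundreds of dimensions, but the arithmetic is identical.

```python
import numpy as np

# Toy 4-dimensional stand-ins for the 768-dimensional vectors above
dog        = np.array([0.21, -0.55, 0.78, 0.04])
puppy      = np.array([0.19, -0.51, 0.81, 0.06])
thermostat = np.array([-0.83, 0.42, -0.11, 0.97])

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return np.linalg.norm(a - b)

print(euclidean(dog, puppy))       # ~0.06: very close
print(euclidean(dog, thermostat))  # ~1.92: far apart
```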
How they're trained
Different embedding models are trained in different ways, but the canonical approach, illustrated by word2vec (2013), is:
- Pick a word in a sentence; call it the "target."
- Look at words near it in the sentence; call them "context."
- Train a neural network to predict the context from the target (the "skip-gram" variant) or the target from the context (the "CBOW" variant).
- After training, the network's internal representations of each word are the embeddings.
The intuition: words used in similar contexts probably mean similar things. "Dog" and "puppy" appear in lots of overlapping contexts ("the ___ chased the ball"); "dog" and "thermostat" don't.
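If you want to watch this training loop run, the gensim library implements word2vec directly. A minimal sketch, assuming gensim 4.x; the toy corpus here is far too small to learn anything meaningful:

```python
from gensim.models import Word2Vec

# A real corpus has millions of sentences; this just shows the shape of the API.
corpus = [
    ["the", "dog", "chased", "the", "ball"],
    ["the", "puppy", "chased", "the", "ball"],
    ["the", "thermostat", "controls", "the", "temperature"],
]

# sg=1 selects skip-gram: predict the context words from the target word.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["dog"]                    # the learned 50-dimensional embedding
print(model.wv.similarity("dog", "puppy"))  # cosine similarity between two words
```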
Modern embeddings (like OpenAI's text-embedding-3 series, or sentence-transformers) use much more sophisticated training — usually based on transformer models (see the transformers article) — but the principle is the same: representations are learned so that semantic similarity = geometric proximity.
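For sentence-level embeddings, here's a sketch using the sentence-transformers package (all-MiniLM-L6-v2 is one small, widely used model, not the only choice):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "My dog loves chasing the ball.",
    "The puppy ran after its toy.",
    "Set the thermostat to 68 degrees.",
]
embeddings = model.encode(sentences)  # one fixed-length vector per sentence

# The first pair should score much higher than either sentence does with the third.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```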
The king-queen analogy
In the early days of word embeddings, researchers found a remarkable property: you could do arithmetic on the vectors.
embedding("king") − embedding("man") + embedding("woman") ≈ embedding("queen")
The vectors encoded relationships geometrically. "King is to man as queen is to woman" became a vector operation that could be verified numerically.
Other examples:
- Paris − France + Italy ≈ Rome (capitals)
- Walked − walk + drive ≈ drove (verb tenses)
- Better − good + bad ≈ worse (comparatives)
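You can reproduce the arithmetic with pretrained word vectors. A sketch using gensim's downloader and a GloVe model (roughly 130 MB on first download); most_similar adds the "positive" vectors, subtracts the "negative" ones, and returns the nearest words:

```python
import gensim.downloader as api

# Pretrained 100-dimensional GloVe vectors trained on Wikipedia (lowercase tokens)
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman -> typically "queen" as the top hit
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# paris - france + italy -> typically "rome"
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
```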
Modern sentence-level embeddings don't have this property as cleanly, but they capture much more nuance. The word2vec analogies became a folklore example for explaining what embeddings do.
What embeddings are used for
Semantic search. Instead of matching keywords, compare query embeddings to document embeddings. A query like "how to fix a leaky pipe" matches documents about plumbing even if they don't contain those exact words. This is what Google's modern search (and most "AI search" tools) does under the hood.
Recommendation systems. Product embeddings let you find "similar items" by comparing vectors; user embeddings let you predict what a given user will like. Netflix, Amazon, and Spotify all use embedding-based recommendation at scale.
RAG (retrieval-augmented generation). An LLM is given a question; the system embeds the question and finds the most-similar passages from a knowledge base; the passages are fed to the LLM as context. The model answers using the retrieved information. This dramatically reduces hallucination (see the hallucination article).
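Here's a minimal sketch of the retrieval step (the same mechanism powers the semantic search described above). The model name and passages are illustrative; a production system would precompute the passage vectors once and store them in a vector database:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

passages = [
    "To fix a leaky pipe, first shut off the water supply valve.",
    "Thermostats regulate temperature by switching heating on and off.",
    "Teflon tape helps seal threaded plumbing joints against drips.",
]
passage_vecs = model.encode(passages, normalize_embeddings=True)

question = "how do I stop a dripping pipe?"
q_vec = model.encode(question, normalize_embeddings=True)

# Normalized vectors: cosine similarity reduces to a dot product.
scores = passage_vecs @ q_vec
top_k = np.argsort(scores)[::-1][:2]  # indices of the two best passages
context = "\n".join(passages[i] for i in top_k)

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` is what gets sent to the LLM.
```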
Clustering and analysis. Embeddings let you group similar customers, similar bug reports, similar academic papers. An enormous amount of "what are the natural groups in this data?" analysis runs on embeddings.
Anomaly detection. Items whose embeddings are far from any cluster are outliers. Useful for fraud detection, content moderation, quality control.
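Both of these reduce to a few lines once you have the vectors. A sketch using scikit-learn, with random vectors standing in for real embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for an (n_items, dimensions) array of embeddings from any model
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 768))

kmeans = KMeans(n_clusters=5, random_state=0).fit(embeddings)
labels = kmeans.labels_  # cluster assignment for each item

# Anomaly score: distance from each item to its own cluster's center
dists = np.linalg.norm(embeddings - kmeans.cluster_centers_[labels], axis=1)
outliers = np.argsort(dists)[-10:]  # the ten most isolated items
```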
Multimodal AI. Modern embedding systems can put text and images in the same space — text("a red car") and image(actual photo of red car) end up near each other. This is what powers image search by description, automatic photo tagging, and the visual capabilities of multimodal models.
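CLIP-style models make this easy to try. A sketch using sentence-transformers with a CLIP checkpoint; the image path is hypothetical, so substitute any local photo:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# A model that embeds text and images into the same vector space
model = SentenceTransformer("clip-ViT-B-32")

text_emb = model.encode("a red car")
image_emb = model.encode(Image.open("red_car.jpg"))  # hypothetical file

# A high score means the caption matches the photo
print(util.cos_sim(text_emb, image_emb))
```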
How similarity is measured
The standard measure: cosine similarity — the cosine of the angle between two vectors.
- 1.0: vectors point in the same direction (very similar)
- 0.0: perpendicular (unrelated)
- −1.0: vectors point in opposite directions (which, in practice, rarely maps cleanly onto "opposite meaning")
For practical use, vectors are usually normalized to unit length, so cosine similarity reduces to the dot product (multiply corresponding elements and sum). This is fast and easy to compute even for million-vector databases.
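In code, the whole computation is a few lines of numpy (reusing the toy vectors from the earlier example):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.21, -0.55, 0.78, 0.04])   # "dog"
b = np.array([0.19, -0.51, 0.81, 0.06])   # "puppy"
print(cosine_similarity(a, b))            # close to 1.0

# Pre-normalized to unit length, the denominator is 1 and
# cosine similarity is just the dot product:
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(float(a_unit @ b_unit))             # same value
```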
You can also use Euclidean distance (straight-line distance) or other metrics, but cosine similarity is by far the most common because it compares direction (which way the vector points) while ignoring magnitude (how long the vector is).
Vector databases
A new category of database has emerged specifically for storing and searching embeddings: vector databases. Examples include Pinecone, Weaviate, Milvus, and Qdrant, and vector search has also been added to established databases (PostgreSQL via pgvector, Redis via its vector similarity features).
These databases let you store millions to billions of vectors and find "nearest neighbors" to a query vector very fast — typically a few milliseconds per query, using approximate algorithms.
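FAISS is an in-process search library rather than a full database, but it exposes the same approximate nearest-neighbor machinery (HNSW, in this sketch) that vector databases use. Random vectors stand in for real embeddings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768  # embedding dimension
rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, d)).astype("float32")
faiss.normalize_L2(vectors)  # unit length, so L2 ranking matches cosine ranking

# HNSW is a common approximate nearest-neighbor index; 32 is its
# graph-connectivity parameter.
index = faiss.IndexHNSWFlat(d, 32)
index.add(vectors)

query = rng.standard_normal((1, d)).astype("float32")
faiss.normalize_L2(query)
distances, ids = index.search(query, 5)  # the 5 nearest stored vectors
print(ids)
```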
This is the backbone of modern AI infrastructure. Every chatbot with company-specific knowledge, every "ask my documents" tool, every "find similar content" feature relies on a vector database somewhere.
Where embeddings get weird
A few non-obvious things:
Embedding spaces aren't perfectly interpretable. You can't directly tell what dimension 47 means. Some research has tried to identify interpretable axes (sentiment, formality, abstractness), but mostly the dimensions are entangled.
Bias. Embeddings reflect the biases of their training data. "Doctor" and "he" were closer in early word2vec embeddings than "doctor" and "she." Modern embedding models have been actively debiased, but not perfectly.
Different models give different embeddings. OpenAI's embeddings and Google's embeddings encode the same word as different vectors. You can't mix them. You have to stick with one model throughout a system.
Distance can be misleading. Two sentences with the same words but very different meaning ("the cat chased the dog" vs "the dog chased the cat") can have similar embeddings, because the words are the same. Modern transformer-based embeddings handle this better than older bag-of-words approaches, but it's still a limitation.
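You can check this empirically. A sketch with sentence-transformers; the score for the reversed pair is typically far higher than the difference in meaning would suggest:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
a = model.encode("the cat chased the dog")
b = model.encode("the dog chased the cat")

# Often surprisingly high, because the two sentences share every word.
print(util.cos_sim(a, b))
```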
The takeaway
An embedding is a vector of numbers representing some input — a word, sentence, image, anything — designed so that similar inputs have similar vectors. This turns "how related are these things?" into a numerical comparison computers can do at scale. Embeddings power semantic search, recommendation, retrieval, clustering, multimodal AI, and most of what modern AI systems do under the hood. Once you understand them, a lot of "how does AI know X is similar to Y?" makes sense.