Neptune Blog

What are LLM Embeddings: All you Need to Know

9 min
6th November, 2025

TL;DR

LLM embeddings are the numerical, vector representations of text that Large Language Models (LLMs) use to process information.

Unlike their predecessors, static word embeddings, LLM embeddings are context-aware: they change dynamically to capture semantic and syntactic relationships based on the surrounding text.

Positional encoding, like Rotary Positional Encoding (RoPE), is a key component that gives these embeddings a sense of word order, allowing LLMs to process long sequences of text effectively.

Applications of embeddings beyond LLMs include semantic search, text similarity, and Retrieval-Augmented Generation (RAG), with the latter combining an LLM with an external knowledge base to produce more accurate and grounded responses.

Embeddings are a numerical representation of text. They are fundamental to the transformer architecture and, thus, all Large Language Models (LLMs).

In a nutshell, the embedding layer in an LLM converts the input tokens into high-dimensional vector representations. Then, positional encoding is applied, and the resulting embedding vectors are passed on to the transformer blocks.

LLM embeddings are trained in a self-supervised manner alongside the entire model. Their value depends not only on an individual token but is influenced by the surrounding text. Furthermore, they can also be multimodal, enabling an LLM to process other data modalities, such as images. A multimodal LLM can, for example, take a photo as input and produce a textual description.

In this article, we’ll explore this core building block of LLMs and answer questions such as:

  • How do embeddings work?
  • What is the role of the embedding layer in LLMs?
  • What are the applications of LLM embeddings?
  • How can we select the most suitable LLM embedding models?

How do embeddings work, and what are they used for?

The LLM inference pipeline begins with raw text being passed to a tokenizer. The tokenizer is a component separate from the LLM that converts the text into tokens. Modern LLMs, such as Google’s PaLM (2022) and OpenAI’s GPT-4 (2023), employ subword tokenization methods (e.g., through the SentencePiece algorithm) that can handle new words not seen during training. The tokens are fed into the LLM’s embedding layer, which transforms them into vectors for the transformer blocks to process.
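
As a minimal illustration of the tokenization step, the snippet below uses the Hugging Face transformers library; the GPT-2 tokenizer is an arbitrary choice for this sketch, since every LLM ships with its own tokenizer.

from transformers import AutoTokenizer

# Load a subword tokenizer (GPT-2's byte-pair encoding, chosen purely for illustration)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Embeddings turn tokens into vectors."
tokens = tokenizer.tokenize(text)     # the text split into subword strings
token_ids = tokenizer.encode(text)    # the integer IDs passed to the embedding layer

print(tokens)
print(token_ids)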

The size of these vectors, known as the embedding dimension, is a key hyperparameter that significantly impacts an LLM’s capacity and computational cost. Embedding dimensions vary widely across models. For example, the smaller Llama 3 8B model (2024) uses a 4096-dimensional embedding, while the larger DeepSeek-R1 (2025) model uses 7168-dimensional embeddings. Generally, models with larger embedding dimensions have a higher capacity to store information, but they also require more memory and compute for training and inference.

A typical decoder-only LLM is structured like this (source):

[Figure: GPT-style decoder-only architecture diagram]

Following the Transformer architecture, the embeddings are fed into the multi-head attention layers, where the model processes context. Attention in LLMs measures the importance of each word in relation to every other word in the same sequence. This enables the model to extract information directly from the text.

Absolute positional encoding

At this stage, embeddings lack order, meaning a shuffled sentence would convey the same information as the original. This is because the computed vectors encode only tokens, not their positions. The next component in the diagram, Positional Encoding, resolves this issue. 

The original Transformer architecture used Absolute Positional Encoding (APE) to impose a sequence order. It achieved this by adding a unique vector to the token’s embedding at each position. This unique vector was generated using a combination of sine and cosine waves, where different dimensions of the embedding vectors correspond to different wavelengths. Specifically, the elements of the positional vector at position pos were calculated using the following formulas:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Here, $d_{\text{model}}$ is the embedding dimension and $i$ indexes pairs of embedding dimensions. By using these formulas, every position receives a unique, smooth, and deterministic positional signal, effectively informing the model of the token’s location and solving the problem of positionless vectors.
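
For illustration, here is a short NumPy sketch of how such a sinusoidal positional matrix can be computed; the function name and shapes are ours, not part of any particular library.

import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Absolute positional encodings as in the original Transformer."""
    positions = np.arange(max_len)[:, np.newaxis]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)    # one frequency per dimension pair

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# The positional signal is simply added to the token embeddings:
# embeddings = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)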

This method, however, limited the LLM’s ability to handle texts longer than its training data. This limitation arises because the model is only trained on positions up to a fixed maximum length, the so-called context window. Since APE uses a fixed, absolute formula for each position, the model cannot generalize to positions beyond this maximum length, forcing a hard limit on the input sequence size. 

[Figure] Absolute Positional Encoding. The value of sine and cosine waves of varying frequencies over the token position t is added to the embedding vector, with higher frequencies for earlier dimensions and lower frequencies for later dimensions. The x-axis shows the positions from t=0 to t=512, representing the model’s context window. | Source

Relative positional encoding

Rotary Positional Encoding (RoPE) was introduced in April 2021 by Jianlin Su et al. to address this problem and is the positional encoding method adopted by many LLMs, including Llama 3 and DeepSeek-R1.

RoPE works by encoding the distance between tokens through a rotation applied directly to the embedding vectors before they enter the attention mechanism. It rotates a token’s embedding vector by a multiple of a fixed angle that is determined by the token’s absolute position.

The insight of RoPE is that this rotation is applied in such a way that it integrates seamlessly into the self-attention layer, ensuring the interaction between two words remains consistent, regardless of where the pair appears in a sequence. Mathematically, this means the dot product of the rotated query and key vectors (QK) inherently depends only on the relative distance between the two tokens, not their absolute positions.
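
The sketch below shows this rotation on a toy embedding matrix, assuming the common interleaved-pair formulation of RoPE; production implementations apply it to the query and key vectors inside each attention layer and are heavily optimized.

import numpy as np

def apply_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of each token vector by an angle
    proportional to the token's position (minimal RoPE sketch)."""
    seq_len, d = x.shape
    positions = np.arange(seq_len)[:, None]             # (seq_len, 1)
    freqs = 1.0 / base ** (np.arange(0, d, 2) / d)      # (d / 2,)
    angles = positions * freqs[None, :]                 # (seq_len, d / 2)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x_even * np.cos(angles) - x_odd * np.sin(angles)
    rotated[:, 1::2] = x_even * np.sin(angles) + x_odd * np.cos(angles)
    return rotated

# Toy example: 4 tokens with 8-dimensional embeddings
rotated = apply_rope(np.random.randn(4, 8))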

[Figure] The effect of Rotary Positional Embedding (RoPE) on the token embeddings for the sequence “We are dancing.” The light blue circles represent the initial embeddings before RoPE is applied, with each token pointing in a distinct direction from the origin. After RoPE is applied, the green circles show that each token’s embedding has been rotated by an angle proportional to its position in the sequence: by θ for “we”, 2θ for “are”, and 3θ for “dancing”. In this particular example, θ=45°. | Source

In addition to being able to handle longer sequences, RoPE also contributes to better perplexity for long texts compared to other methods. Perplexity measures how effectively a language model predicts the next word in a text. A lower perplexity score indicates that the model is less surprised by the actual next word, leading to more coherent and accurate predictions. RoPE’s ability to maintain consistent word relationships based only on their relative distance over extended sequences allows models to achieve this lower perplexity, as the quality of word prediction is maintained even when dealing with very long contexts.

[Figure] Comparison of the perplexity of an LLM against the sequence length it processes, contrasting two positional encoding methods: Absolute Positional Encoding (red line) and RoPE (blue line). With APE, used in the original Transformer, perplexity remains relatively low and stable until the sequence length slightly exceeds the training sequence length (indicated by the yellow dashed line at 512), after which it increases dramatically. In contrast, RoPE demonstrates superior extrapolation capability, with perplexity increasing much more gracefully as the sequence length extends well beyond the training length, showcasing its ability to handle significantly longer contexts. | Source

A brief history of embeddings in NLP

Understanding the history of embeddings in NLP provides context for appreciating the advancements and limitations of LLM embeddings, showing the progression from simple one-hot encoding to sophisticated techniques like Word2Vec, BERT, and LLMs. The entire idea of embeddings is rooted in the Distributional Hypothesis, which states that words that appear in similar contexts have similar meanings.

In the field of natural language processing (NLP), there has always been a need to transform words into vector representations for processing text. Almost every embedding technique relies on a large amount of text data to extract the relationships between words.

Early embedding methods relied on statistical approaches that utilized the co-occurrence of words within a text. These methods are simple and computationally inexpensive, but they provide only a shallow understanding of the data.

Sparse word embeddings

In the early days of NLP, beginning around the 1970s, the first and most straightforward method for encoding words was one-hot encoding. Each word was represented as a vector with a dimension equal to the total vocabulary size, with a single dimension set to 1 (the “hot” dimension) and all others set to 0. This construction has two major drawbacks. First, for a large vocabulary, the resulting vectors are extremely long and mostly zeros, making them computationally inefficient to store and process. Second, the vectors carry no measure of similarity between words, since any two distinct one-hot vectors are orthogonal to each other.

In the 1980s, count-based methods were developed, such as TF-IDF and word co-occurrence matrices. They attempt to capture semantic relationships based on frequency and co-occurrence. They assume that if words frequently appear together, they share a closer relationship.

[Figure: Timeline of word embedding methods]

  • Sparse word embeddings: one-hot vectors (1970s), TF-IDF and co-occurrence matrices (1980s)
  • Static word embeddings: Word2Vec (2013), GloVe (2014)
  • Contextualized word embeddings: ELMo (2018), GPT-1 (2018), BERT (2018), LLaMA (2023), DeepSeek-V1 (2023), GPT-4 (2023)

Static word embeddings

Static word embeddings, such as word2vec in 2013, marked a significant development. The paradigm shift was that words could be automatically converted into dense and low-dimensional representations, achieved using gradient descent. Their ability to capture semantic and syntactic relationships within text was a key advantage, providing more value than previous methods.

Their limitation was that they captured only the statistics of the training corpus, assigning each token a single, fixed representation regardless of the input context. For example, they couldn’t differentiate the word “capital” in “capital of France” from “raising capital”. Overcoming this required a mechanism that adapts embeddings based on the surrounding words.

Contextualized word embeddings

In 2017, the Transformer architecture was introduced through the paper “Attention Is All You Need,” which changed how embeddings were encoded. 

Bidirectional Encoder Representations from Transformers (BERT), released in 2018, is considered one of the first contextual language models. It utilizes the encoder component of the Transformer architecture to process an entire input sequence simultaneously. This design allows it to generate dynamic, context-aware embeddings for every token. These rich embeddings proved highly effective for Natural Language Understanding tasks, such as text classification. BERT also significantly advanced the concept of transfer learning in NLP by allowing the pre-trained model to be fine-tuned for various downstream tasks.

The embedding layer within the LLM architecture

There are three embedding-related components of LLM architectures to distinguish between:

  1. Embedding (the vector): the numerical representation of a piece of data, like a token, word, sentence, or image. It is the output of the embedding layer and the input to the Transformer blocks.
  2. Embedding layer (the component): the learnable input component of the LLM that converts discrete tokens into initial dense vectors. It contains the embedding vectors.
  3. Embedding model (the system): a complete neural network, often a small Transformer or a simple model like Word2Vec, whose sole purpose is to generate embeddings, typically used for tasks like semantic search.

In an LLM like GPT-4, the embedding layer is the first component that the tokenized input interacts with. It functions as a lookup table, implemented as a weight matrix with one row per vocabulary entry. When an input token ID arrives, the layer simply looks up the corresponding row and outputs that vector. This maps the token ID, which conceptually corresponds to a high-dimensional one-hot vector over the vocabulary, to a dense, lower-dimensional, meaningful initial embedding vector.

The embedding layer’s weight matrix is fully learnable. When training from scratch, it is randomly initialized and trained in tandem with all other weights, like the attention mechanism and feed-forward networks, in a self-supervised manner. In comparison with static, non-contextual methods of the past, the embedding layer learns to place semantically similar tokens closer together in the vector space.
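
To see the lookup behavior in code, PyTorch’s nn.Embedding implements exactly such a learnable weight matrix; the vocabulary size and embedding dimension below are illustrative, not those of any specific model.

import torch
import torch.nn as nn

vocab_size, embedding_dim = 32000, 4096                     # illustrative sizes
embedding_layer = nn.Embedding(vocab_size, embedding_dim)   # learnable (vocab_size x embedding_dim) matrix

token_ids = torch.tensor([[17, 2049, 913]])     # one tokenized sequence of three token IDs
token_embeddings = embedding_layer(token_ids)   # row lookup in the weight matrix
print(token_embeddings.shape)                   # torch.Size([1, 3, 4096])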

Advanced embedding applications and optimizations

Along with the advances in embedding layers, the generation of embeddings for particular purposes has evolved as well.

Sentence embeddings

While an LLM’s primary input consists of individual token embeddings that become contextualized by the Transformer blocks, the field is evolving to represent larger chunks of meaning efficiently. Some approaches, like SONAR, aim to generate sentence embeddings, where a single vector captures the meaning of an entire sentence or a complete concept. This is useful for tasks like semantic search or retrieval-augmented generation (RAG), where you need to find relevant documents or passages quickly.

Meta and other research groups are actively exploring these advanced encoding methods. The goal is to move beyond word-level understanding to comprehending entire ideas and relationships across longer texts, creating more powerful and efficient language models. Sentence-BERT was the first model to successfully create high-quality, fixed-size sentence embeddings for tasks like semantic search and clustering, and other sentence embedding models followed, such as EmbeddingGemma.

Specialized embedding spaces

Embeddings fine-tuned on domain-specific data can offer performance benefits over general-purpose LLM embeddings. Examples of models created by transfer learning on extensive domain-specific text include ClinicalBERT, SciBERT, and LegalBERT. These models are BERT-based architectures whose final hidden states serve as the specialized embedding representation, which can be used directly for tasks like similarity search or classification.
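
A common way to obtain such embeddings is to mean-pool the final hidden states of a BERT-style checkpoint, as sketched below with the Hugging Face transformers library; the generic bert-base-uncased checkpoint is a stand-in for a domain-specific model such as ClinicalBERT.

import torch
from transformers import AutoModel, AutoTokenizer

# Stand-in checkpoint -- swap in a clinical, scientific, or legal BERT variant
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

inputs = tokenizer("Patient presents with acute myocardial infarction.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state    # (1, seq_len, hidden_dim)

# Mean-pool the token-level hidden states into one fixed-size text embedding
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)    # torch.Size([1, 768])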

This fine-tuning approach is distinct from the initial, general-purpose embedding layers inherent to LLMs. Furthermore, models like Mistral-7B-Instruct-v0.2 have been explicitly fine-tuned for instruction following and general question answering, which makes them well-suited for the generation step within a RAG pipeline.

Embedding caching

Embedding compression and caching reduce the size of embedding vectors while retaining their information. This allows LLMs to be deployed on devices with limited memory or to run inference faster. Recently, Google released Gemma 3n, a mobile-first open-weight large language model using Per-Layer Embeddings (PLE), a novel technique for optimizing the use of computational resources.

Traditionally, LLMs generate a single embedding for each token at the input layer, which then passes through all subsequent layers. This means that the entire embedding table, which can be large, must remain in active memory throughout the inference process.

With PLE, smaller and more specific per-layer embedding vectors are generated during inference for particular layers of the transformer network, rather than using one large initial vector. These specific, smaller vectors are then cached to slower storage, like a mobile device’s flash memory, and loaded into the model’s inference process as the corresponding layer runs.

This method optimizes memory usage: neither the full embedding table nor the large initial token embedding vector needs to be held in active memory throughout inference, since the per-layer embeddings are generated and stored separately from the main model weights.

Applications of LLM embeddings

The versatility of embeddings makes them useful for various applications, most of which make use of an embedding model’s ability to compress the semantics of a textual input into a small vector.

Text similarity

Embeddings represent the meaning of text in a numerical vector space: the closer two embedding vectors are in this space, the more similar their meaning. For text similarity, dedicated embedding models, such as BERT-based encoders or OpenAI’s embedding models, are often a good choice. They are specifically trained to produce embeddings where semantic similarity translates directly to vector proximity under cosine similarity. Compared to general-purpose LLMs, they are relatively small and thus efficient and cost-effective.

As of October 2025, Qwen3-Embedding ranks highly on the Massive Text Embedding Benchmark (MTEB). The following code snippet demonstrates the context-aware capabilities of Qwen3-Embedding-4B, an open-source embedding model that considers the entire context of sentences, not just word-level similarity.

The following example uses Sentence Transformers, the primary Python library for working with state-of-the-art embedding models. It allows you to compute embeddings and similarity scores in a few lines of code, facilitating applications like semantic search and semantic textual similarity. The library provides immediate access to over 10,000 pre-trained models on Hugging Face.

# Requires transformers>=4.51.0 and sentence-transformers>=2.7.0
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-4B")
texts_to_compare = [
    "Oh, that was a brilliant idea! (after something went wrong)",
    "That was a truly brilliant performance.",
    "That was a terrible idea.",
]
# Encode the sentences and compute their pairwise cosine similarity matrix
sentences_embeddings = model.encode(texts_to_compare)
similarity = model.similarity(sentences_embeddings, sentences_embeddings)
print(f'{texts_to_compare[0]} <- {int(similarity[0, 1]*100)}% -> {texts_to_compare[1]}')
print(f'{texts_to_compare[0]} <- {int(similarity[0, 2]*100)}% -> {texts_to_compare[2]}')

This generates the following output:

Oh, that was a brilliant idea! (after something went wrong) <- …% -> That was a truly brilliant performance.

Oh, that was a brilliant idea! (after something went wrong) <- …% -> That was a terrible idea.

Semantic search

Instead of keyword matching, semantic search interprets a user’s query and identifies semantically similar documents, even if there are no exact keyword matches. It works by preprocessing documents, including webpages or images, and converting them into embeddings using a model like Qwen3-Embedding or a vision model like OpenAI’s CLIP ViT. These embeddings are then typically stored in a vector database, such as Pinecone or PostgreSQL with the pgvector extension.

When a user submits a search query, the query is converted into an embedding using the same text embedding model. It is then compared against all the document embeddings in the vector database using cosine similarity. Finally, the documents with the highest similarity scores are retrieved and presented to the user as search results.
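
As a rough sketch of this flow, the snippet below embeds a handful of documents and a query with Sentence Transformers and ranks the documents by cosine similarity; the all-MiniLM-L6-v2 checkpoint and the in-memory document list are illustrative stand-ins for a production embedding model and vector database.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

documents = [
    "Neptune is the eighth and farthest known planet from the Sun.",
    "The Transformer architecture relies on self-attention.",
    "Cosine similarity measures the angle between two vectors.",
]
document_embeddings = model.encode(documents)

# The query is embedded with the same model and compared against every document
query_embedding = model.encode(["Which planet is farthest from the Sun?"])
hits = util.semantic_search(query_embedding, document_embeddings, top_k=2)[0]

for hit in hits:
    print(f"{hit['score']:.2f}  {documents[hit['corpus_id']]}")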

RAG

Retrieval-Augmented Generation (RAG) combines an LLM with an external knowledge base: relevant information is fetched from the knowledge base so that the model can generate accurate, current, and grounded responses.

When a user submits a prompt to the LLM, it is first embedded using one of the previously mentioned encoder models. A semantic search then runs against an external knowledge base. This knowledge base typically holds documents or text chunks, processed into multimodal embeddings and stored in a vector database. The most similar documents or text paragraphs are retrieved, serving as context for the prompt. This means they are added as input to an LLM (like GPT-4, Llama, or DeepSeek), where the final prompt includes both the original user query and the retrieved information.

The LLM then uses this combined input to generate a response. The input prompt, augmented with retrieved information, reduces hallucination and allows the LLM to answer questions about specific, current knowledge it may not have been trained on.
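
To make the retrieve-then-augment step concrete, here is a minimal sketch under the same assumptions as before (a toy in-memory knowledge base and an illustrative embedding model); the final generation call is left as a comment because it depends on which LLM or API you use.

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Toy knowledge base standing in for a vector database of pre-embedded chunks
chunks = [
    "Our support team is available Monday to Friday, 9am to 5pm CET.",
    "Premium subscriptions can be cancelled at any time from the account page.",
    "Invoices are sent by email at the start of each billing period.",
]
chunk_embeddings = embedder.encode(chunks)

user_query = "How do I cancel my subscription?"
query_embedding = embedder.encode([user_query])

# Retrieve the top-k most similar chunks
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
top_indices = scores.topk(2).indices.tolist()
context = "\n".join(f"- {chunks[i]}" for i in top_indices)

# Augment the prompt with the retrieved context
augmented_prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {user_query}"
)
print(augmented_prompt)
# The augmented prompt would now be sent to a generative LLM
# (e.g., GPT-4, Llama, or DeepSeek) to produce the grounded answer.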

[Figure] The Retrieval-Augmented Generation (RAG) architecture. A user’s prompt first goes into a middleware, which initiates a semantic search against a vector database containing documents that have been encoded as embedding vectors. The retrieved contextual data is combined with the original prompt to create an augmented prompt, which is then used by the LLM (represented by the brain icon) to generate an enriched response for the user.

How do you select the most suitable LLM embedding models?

Since applications, data, and computational capabilities vary, you need resources to choose the right tool. First, some LLM benchmarks for overall capabilities:

  • Massive Multitask Language Understanding (MMLU) is a benchmark that evaluates an LLM’s knowledge and reasoning across 57 subjects, including science, mathematics, humanities, and social sciences. It evaluates a model’s overall understanding and ability to perform across multiple domains.
  • HellaSwag tests an LLM’s common-sense reasoning by requiring it to complete a sentence from options that are designed to be easy for humans but hard for models. This assesses their ability to understand implicit knowledge and everyday situations.
  • TruthfulQA evaluates an LLM’s tendency to generate truthful answers, which is important for assessing a model’s reliability in combating misinformation and producing accurate content.

There are also a number of benchmarks specifically designed for LLM text embeddings:

  • Massive Text Embedding Benchmark (MTEB) is a comprehensive and widely recognized benchmark for text embeddings. It comprises a suite of tasks covering hundreds of embedding models and evaluates their quality across various datasets and task types, such as classification, retrieval, semantic textual similarity, and summarization.
  • Benchmarking Information Retrieval (BEIR) is a benchmark for semantic search, RAG, or document retrieval, offering datasets for assessing how embedding models, like Sentence-BERT, capture search relevance.

Multimodal embeddings are important, but their benchmarks are not as consolidated. Nevertheless, there are still some to highlight:

  • Microsoft Common Objects in Context (MS-COCO) is a vision benchmark that includes tasks such as image captioning, object detection, visual question answering, and object segmentation. These are important for evaluating tasks where models need to understand visual content and relate it to textual descriptions.
  • LibriSpeech is a large corpus of read English speech, primarily used for automatic speech recognition, which converts speech to text. Models trained on LibriSpeech learn to extract phonetic and linguistic features from audio, which can be understood as audio embeddings for speech recognition.

When selecting LLM embeddings, consider filtering by benchmark performance and these three features:

  • The number of parameters in an embedding model directly affects its memory usage. Qwen3-Embedding-4B, used earlier, requires nearly 8GB of memory to operate on either the CPU or GPU. This is a significant limiting factor for LLM execution.
  • Embedding dimensionality is the number of dimensions into which a token is expanded before being fed into an LLM. Higher dimensionality can capture more nuance, but it also increases memory and computation requirements. DeepSeek-R1 expands each token into 7,168-dimensional embeddings, while Llama 3 70B uses 8,192 dimensions.
  • Context length refers to the maximum number of tokens that the model can consider when generating a response or understanding an input. If a text exceeds this limit, the earlier parts of the input are dropped or ignored. Ideally, LLMs should have as large a context length as possible, but that comes at the expense of increased memory utilization. Self-attention memory requirements grow with the square of the input size, making processing huge text corpora prohibitively expensive (see the quick calculation after this list).

    Early models, such as BERT, had a context window of around 512 tokens, which was a significant improvement at the time but limited their ability to handle long documents. GPT-3 and Llama used 2,048 tokens as their standard context length. GPT-4 gradually increased this to 8,192 tokens (8K), then to 32K and 128K, and Gemini 1.5 Pro reached a 1-million-token context window.
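
The following back-of-the-envelope calculation illustrates the quadratic growth of self-attention memory: storing a single fp16 attention-score matrix (one per head) quickly becomes impractical as the sequence grows.

sequence_lengths = [2048, 8192, 131072]
for n in sequence_lengths:
    # n x n attention scores per head, 2 bytes each in fp16
    gigabytes = n * n * 2 / 1e9
    print(f"{n:>7} tokens -> {gigabytes:.2f} GB per attention matrix")

In practice, optimized attention implementations avoid materializing the full score matrix, but the compute still scales quadratically with sequence length.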

Final thoughts and conclusion

LLM embeddings convert text, images, and other data into numbers that neural networks can process. These vector representations are key to how language models function, influencing how they process information and enabling their many applications. They help AI systems understand context, locate similar information, and even translate between languages.

We discussed how embeddings function within LLM architectures, including positional encoding techniques such as RoPE, which allow models to handle longer texts. We also examined their applications in areas such as text similarity, word sense disambiguation, semantic search, and Retrieval-Augmented Generation (RAG).

Choosing the proper LLM embedding involves considering benchmarks, model size, embedding dimensions, and context length. Tools like Hugging Face Hub, Ollama, and Sentence Transformers simplify the process of finding, building, and using these embeddings. Unsloth AI helps fine-tune models for specific needs, making them more efficient.
