TL;DR
Grounding augments the pre-trained knowledge of Large Language Models (LLMs) by providing relevant external knowledge along with the task prompt.
Retrieval-augmented generation (RAG), which builds on decades of work in information retrieval, is the leading grounding strategy.
The key challenges in LLM grounding revolve around data. It has to be relevant to the task, available in the right quantity, and prepared in the right format.
When providing information to an LLM, less is more. Research shows that it is best to provide just enough information for the LLM to infer the relevant answer, and no more.
Large Language Models (LLMs) can be thought of as knowledge bases. During training, LLMs observe large amounts of text. Through this process, they encode a substantial amount of general knowledge that is drawn upon when generating output. This ability to reproduce knowledge is a key driver in enabling capabilities like question-answering or summarization.
However, there will always be limits to the “knowledge” encoded in an LLM. Some information simply won’t appear in an LLM’s training data and may therefore be unknown to the LLM. For example, this could include private or personal information (e.g., an individual’s health records), domain-specific knowledge, or information that did not exist at the time of training.
Likewise, since LLMs have a finite number of trainable parameters, they can only store a certain amount of information. Therefore, even if knowledge appears in the training data, there is little guarantee as to whether (or how) it will be recalled.
Many LLM applications require relevant and up-to-date data. Despite best efforts in training data curation and ever-growing model capacity, there will always be situations in which LLMs exhibit knowledge gaps. However, their pre-trained knowledge can be augmented at inference time. By providing additional information directly to an LLM, users can “ground” LLM responses in new knowledge while still leveraging pre-trained knowledge.
In this article, we’ll explore the fundamental concepts of LLM grounding as well as strategies for optimally grounding models.
What is LLM grounding?
Most people are familiar with the concept of grounding, whether knowingly or not. When solving problems or answering questions, we rely on our previous experience and memorized knowledge. In these situations, one might say that our actions are grounded in our previous experiences and knowledge.
However, when faced with unfamiliar tasks or questions for which we are unsure, we must fill our knowledge gaps in real time by finding and learning from relevant information. In these situations, we could say that our actions are “grounded” in this supplementary information.
Of course, our intrinsic knowledge plays a critical role in interpreting and contextualizing new information. But in situations where we reach for external knowledge, our response is grounded primarily in this newly acquired information, as it provides the relevant and missing context critical to the solution. This aligns with ideas from cognitive psychology, particularly theories of situated cognition, which argue that knowledge is situated in the environment in which it was learned.
LLM grounding is analogous. LLMs rely on vast general knowledge to perform generic tasks and answer common questions. When faced with specialized tasks or questions for which there is a gap in their knowledge, LLMs must use external supplementary information.
A strict definition of LLM grounding given by Lee and colleagues in 2024 requires that, given some contextual information, the LLM use all essential knowledge from this context and adhere to its scope, without hallucinating any information.
In day-to-day use, the term “LLM grounding” can refer to only the process of providing information to an LLM (e.g., as a synonym for retrieval-augmented generation) or the process of interpreting said information (e.g., contextual understanding). In this article, we will use the term “grounding” to refer to both, but forgo any strict guarantees on the output of the LLM.
Why do we ground LLMs?
Suppose we pose a question to an LLM that cannot be answered correctly using only its pre-trained knowledge. Despite lacking the necessary knowledge, the LLM will still respond. Although it may indicate that it cannot infer the correct answer, it could also respond with an incorrect answer as a “best guess.” This tendency of LLMs to generate outputs containing information that sounds plausible but is factually incorrect is known as hallucination.
LLMs are designed simply to predict tokens given previously predicted tokens (and their inherent knowledge), and have no understanding of the extent of their own knowledge. By seeding relevant external information as “previous” tokens, we introduce more knowledge that the LLM may draw upon, and thus reduce the likelihood of hallucination. (You can find a more thorough discussion of the underlying mechanisms in the comprehensive survey of hallucination in natural language generation published by Ji and colleagues in 2023.)
How do we ground LLMs?
In-context learning (ICL) is an emergent capability of LLMs. ICL allows LLMs to incorporate arbitrary contextual information provided in the input prompt at inference time. A notable application of ICL is few-shot learning, where an LLM infers how to perform a task by considering input-output example pairs included in the prompt.
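For illustration, here is a minimal sketch of how a few-shot prompt might be assembled; the sentiment-classification task, examples, and labels are invented for this example:

```python
# Minimal sketch of assembling a few-shot prompt for sentiment classification.
# The task, example reviews, and labels are illustrative placeholders.
examples = [
    ("The battery lasts all day, fantastic!", "positive"),
    ("The screen cracked after a week.", "negative"),
]

def build_few_shot_prompt(query: str) -> str:
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for review, label in examples:
        lines += [f"Review: {review}", f"Sentiment: {label}", ""]
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

print(build_few_shot_prompt("Shipping was slow, but the product works."))
```

The input-output pairs demonstrate the task format, and the LLM is expected to complete the final "Sentiment:" line in the same pattern.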
With the advent of larger LLM systems, ICL has been expanded into a formal grounding technique known as retrieval-augmented generation (RAG). In RAG, ICL is leveraged to integrate specific information relevant to a task at hand, retrieved from some external information source.
This information source typically takes the form of a vector database or search engine (i.e., an index of web pages) and is queried by a so-called retriever. For unimodal LLMs whose input is strictly textual, these databases store text documents, a subset of which will be returned by the retriever.
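As a minimal illustration of how such a retriever works, the sketch below scores documents against a query by cosine similarity. The character-frequency embed function is only a toy stand-in for a real embedding model, and the documents are illustrative:

```python
import numpy as np

# Toy character-frequency "embedding" that stands in for a real text encoder;
# in practice, queries and documents would be embedded with a learned model.
def embed(text: str) -> np.ndarray:
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec / (np.linalg.norm(vec) + 1e-9)

documents = [
    "Ottawa is the capital city of Canada.",
    "Canada is a country in North America.",
    "The Eiffel Tower is located in Paris.",
]
doc_vectors = np.stack([embed(d) for d in documents])  # the "vector database"

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = doc_vectors @ embed(query)   # cosine similarity (all vectors are unit length)
    top = np.argsort(scores)[::-1][:k]    # indices of the k highest-scoring documents
    return [documents[i] for i in top]

print(retrieve("What is the capital of Canada?"))
```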
The LLM’s input prompt must combine the task instructions and the retrieved supplementary information. When engineering a RAG prompt, we should therefore consider whether to:
- Summarize or omit parts of the retrieved information.
- Reorder retrieved information and/or the instructions.
- Include metadata (e.g., hyperlinks, authors).
- Reformat the information.
This is what a simple RAG prompt might look like:
Use the following documents to answer the following question.
[Question]
What is the capital city of Canada?
[Document 1]
Ottawa is the capital city of Canada. ...
[Document 2]
Canada is a country in North America. ...
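A prompt like the one above can be assembled programmatically from whatever documents the retriever returns. The helper below is a minimal sketch; the truncation length and prompt layout are illustrative choices:

```python
# Minimal sketch of assembling a RAG prompt from retrieved documents.
# The max_chars truncation and the [Question]/[Document] layout are illustrative.
def build_rag_prompt(question: str, documents: list[str], max_chars: int = 500) -> str:
    parts = ["Use the following documents to answer the following question.", ""]
    parts += ["[Question]", question, ""]
    for i, doc in enumerate(documents, start=1):
        parts.append(f"[Document {i}]")
        parts.append(doc[:max_chars])  # crude truncation; summarization is an alternative
        parts.append("")
    return "\n".join(parts)

prompt = build_rag_prompt(
    "What is the capital city of Canada?",
    ["Ottawa is the capital city of Canada. ...",
     "Canada is a country in North America. ..."],
)
print(prompt)
```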
Let’s consider a specific example: Suppose we wish to build an LLM application like Google Gemini or Microsoft Copilot. These systems can retrieve information from a web search engine like Google and provide it to an LLM.
A typical implementation of such a system will comprise three core steps:
- Query transformation: When a user submits a prompt to the RAG system, an LLM infers retriever search queries from the prompt. The queries collectively seek all web pages relevant to the task described in the prompt.
- Retrieve information: Each query is passed to and executed by a search engine, which produces a ranking of web page search results.
- Provide data to LLM: The top ten results returned for each query are concatenated into a prompt for the LLM, enabling the LLM to ground its answer in the most relevant content.
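A minimal sketch of this three-step pipeline is shown below; call_llm and search_web are hypothetical placeholders for whichever LLM API and web search API the application actually uses:

```python
# Sketch of the three-step web-search RAG pipeline described above.
# `call_llm` and `search_web` are hypothetical placeholders, not real APIs.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM API call here.")

def search_web(query: str, top_k: int = 10) -> list[str]:
    raise NotImplementedError("Plug in your search engine API call here.")

def answer_with_web_rag(user_prompt: str) -> str:
    # 1. Query transformation: ask the LLM to propose search queries.
    raw_queries = call_llm(
        "Write up to three web search queries, one per line, that would help "
        f"answer the following request:\n{user_prompt}"
    )
    queries = [q.strip() for q in raw_queries.splitlines() if q.strip()]

    # 2. Retrieve information: run each query and collect the top results.
    results: list[str] = []
    for query in queries:
        results.extend(search_web(query, top_k=10))

    # 3. Provide data to the LLM: concatenate the results into a grounded prompt.
    context = "\n\n".join(f"[Document {i + 1}]\n{r}" for i, r in enumerate(results))
    return call_llm(
        f"Use the following documents to answer the question.\n\n{context}\n\n"
        f"[Question]\n{user_prompt}"
    )
```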
Core strategies for optimally grounding LLMs
LLM grounding is not always as simple as retrieving data and providing it to an LLM. The main challenge is procuring and preparing relevant data.
Data relevance
LLM grounding reconfigures the problem of conceiving an answer into a problem of summarizing (or inferring) an answer from provided data. If relevant knowledge cannot be inferred from the data, then LLM grounding cannot yield more relevant responses. Thus, a critical challenge is ensuring that the information we are grounding LLMs on is high-quality and relevant.
Independent of LLMs, identifying data relevant to user queries is difficult. Beyond the issues of query ambiguity and data quality, there is the deeper challenge of interpreting query intent, inferring the underlying information need, and retrieving information that answers it. This difficulty underpins and motivates the entire field of information retrieval. Grounded LLMs inherit this difficulty directly, as response quality depends on retrieval quality.
Given these challenges, practitioners must design prompts and retrieval strategies to ensure relevance. To minimize ambiguity, user input should be limited to only what is necessary and incorporated into a structured prompt.
Search engines, indexes, or APIs can be used to obtain high-quality data relevant to the task at hand. Web search engines provide access to broad and up-to-date information. When building a custom retrieval system for an index or database, consider building a two-stage pipeline with both a retriever (to build a shortlist of relevant documents using simple keyword matching) and a ranker (to re-rank shortlisted documents with advanced reasoning).
For a retriever, basic term-statistic methods (e.g., TF-IDF, BM25) are widely preferred for their efficiency. However, rankers typically leverage “neural” architectures (often based on the transformer architecture proposed by Vaswani and colleagues in 2017) to detect semantic relevance. Regardless of the method, the usefulness of retrieved data depends greatly on the queries posed to retrievers and how well they capture the issuer’s intent. Consider designing and testing queries explicitly for the task at hand, or using an LLM for dynamic query refinement.
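As a concrete illustration, the sketch below builds such a two-stage pipeline with BM25 as the retriever and a cross-encoder as the ranker. It assumes the rank_bm25 and sentence-transformers packages and uses a toy corpus; swap in whichever retriever, ranker, and index you actually use:

```python
# Two-stage retrieval sketch: a BM25 retriever builds a shortlist, then a
# neural cross-encoder re-ranks it. Assumes the `rank_bm25` and
# `sentence-transformers` packages; the corpus is illustrative.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "Ottawa is the capital city of Canada.",
    "Canada is a country in North America.",
    "The maple leaf appears on the Canadian flag.",
]

# Stage 1: lexical retrieval with BM25 over a whitespace-tokenized corpus.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
# Stage 2: a cross-encoder that scores (query, document) pairs for relevance.
ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, shortlist_size: int = 2) -> list[str]:
    scores = bm25.get_scores(query.lower().split())
    shortlist = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    shortlist = shortlist[:shortlist_size]

    pair_scores = ranker.predict([(query, corpus[i]) for i in shortlist])
    reranked = [i for _, i in sorted(zip(pair_scores, shortlist), reverse=True)]
    return [corpus[i] for i in reranked]

print(retrieve_and_rerank("What is the capital of Canada?"))
```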
Data quantity
Another threat to the effectiveness of grounding LLMs lies in the amount of information provided to them. Although LLMs are technically capable of ingesting vast amounts of input (LLMs like Llama 4 “Scout” have context windows large enough to ingest entire books), their effectiveness can vary based on exactly how much input is provided.
Empirically, LLM performance typically degrades with increasing input size, especially when measured on reasoning or summarization-centric tasks. Intuitively, a simple strategy to mitigate this issue is to minimize the input size, namely by minimizing the amount of external data provided. In other words, “less is more”: provide enough information for the LLM to ground its response, but no more.
When grounding LLMs using RAG, consider retaining only a few of the top hits (i.e., top-k) for your retrieval queries. The ideal value for k will vary based on many factors, including the choice of retriever, the indexed data being retrieved, and the task at hand. To establish an appropriate value, consider running experiments across different values of k and then finding the smallest value that retrieves sufficient information. The ideal value of k could vary in different situations; if these situations are distinguishable, consider designing an algorithm to set k dynamically.
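One way to run such an experiment is sketched below; retrieve, answer_with_context, and is_correct are hypothetical placeholders for your retriever, grounded LLM call, and evaluation metric, and the evaluation set and 95% threshold are illustrative choices:

```python
# Sketch of an experiment for choosing the smallest sufficient top-k.
# The three stub functions are hypothetical placeholders.
def retrieve(question: str, k: int) -> list[str]:
    raise NotImplementedError("Plug in your retriever here.")

def answer_with_context(question: str, context: list[str]) -> str:
    raise NotImplementedError("Plug in your grounded LLM call here.")

def is_correct(predicted: str, expected: str) -> bool:
    raise NotImplementedError("Plug in your evaluation metric here.")

eval_set = [("What is the capital city of Canada?", "Ottawa")]  # illustrative

def accuracy_at_k(k: int) -> float:
    hits = sum(
        is_correct(answer_with_context(q, retrieve(q, k)), expected)
        for q, expected in eval_set
    )
    return hits / len(eval_set)

def choose_k(candidates=(1, 2, 5, 10, 20)) -> int:
    # Keep the smallest k whose accuracy is within 95% of the best observed.
    results = {k: accuracy_at_k(k) for k in candidates}
    best = max(results.values())
    return min(k for k, acc in results.items() if acc >= 0.95 * best)
```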
When given the option, consider working at finer granularities of text (e.g., prefer sentences or small chunks over paragraphs or documents). In keeping with “less is more,” endeavor to retrieve the text of the smallest granularity that (when combined with other hits) is sufficiently informative. When retrieving text at larger granularities (e.g., documents), consider extracting key sentences from retrieved documents.
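For example, a retrieved document can be reduced to its most query-relevant sentences before it is added to the prompt. The sketch below uses simple word overlap as a stand-in for a proper sentence ranker:

```python
import re

# Sketch of reducing a retrieved document to its most query-relevant sentences.
# Word-overlap scoring is a simple stand-in for a proper sentence ranker.
def key_sentences(query: str, document: str, max_sentences: int = 2) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    query_terms = set(re.findall(r"\w+", query.lower()))
    scored = [
        (len(query_terms & set(re.findall(r"\w+", s.lower()))), s) for s in sentences
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:max_sentences]]

doc = (
    "Canada is a country in North America. Its capital city is Ottawa. "
    "The country has ten provinces and three territories."
)
print(key_sentences("What is the capital of Canada?", doc))
```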
Where does FM training data come from, and what role does it play?
With the advent of deep learning and increased compute and memory capacity, machine-learning datasets became significantly larger. ImageNet-1K, the most popular edition of the widely used ImageNet dataset, contains 1.2 million images totalling 170 GB (about 140 KB per image).
Foundation models have brought yet another shift. The datasets are orders of magnitude bigger, the individual samples are larger, and the data is less clean. The effort that was previously spent on selecting and compressing samples is now devoted to accumulating vast datasets.
With the change in data sources used, the role of domain experts in the model training process evolved as well. Traditionally, they were involved in curating and annotating data ahead of training. In foundation model training, their core responsibility is to evaluate the models’ performance on downstream tasks.
Read more about the role of data in foundation model training and other topics in Neptune’s 2025 State of Foundation Model Training Report.
Data arrangement
In addition to relevance and quantity, the relative position (order) of data can significantly influence the response generation process. Research published by Liu and colleagues in 2024 shows that the ability of many LLMs to find and use information in their input context depends on the relative position of that information.
LLM performance is generally higher when relevant information is placed near the beginning or end of the input context and lower when placed in the middle. This so-called “lost in the middle” bias suggests that LLMs tend to “skim” when reading large amounts of text, and the resulting performance degradation worsens as the input context grows.
Mitigating “lost in the middle” bias can be difficult since it is difficult to anticipate which retrieved information (e.g., which retrieved documents) contains the context truly critical for grounding. Generally, “less is more” applies here, too. By minimizing the amount of information provided to the LLM, we can lessen the effect of this bias.
The “lost in the middle” bias can be measured empirically using tests like Greg Kamradt’s “Needle in the Haystack Test,” which enables LLM developers to optimize for robustness to this bias. To adjust for an LLM that exhibits this bias, consider sampling answers from multiple similar inference calls, each time shuffling (or even strategically dropping) external information. Alternatively, you could estimate the importance of different information and then rearrange it to place important information in preferred locations.
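As an illustration of the rearrangement strategy, the sketch below interleaves documents (assumed to be sorted from most to least relevant) so that the most relevant ones land at the beginning and end of the context and the least relevant ones end up in the middle:

```python
# Sketch of rearranging retrieved documents to counter "lost in the middle":
# the input list is assumed to be sorted from most to least relevant, and the
# output places the most relevant documents at the edges of the context.
def reorder_for_edges(docs_by_relevance: list[str]) -> list[str]:
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = ["doc1 (most relevant)", "doc2", "doc3", "doc4", "doc5 (least relevant)"]
print(reorder_for_edges(docs))
# ['doc1 (most relevant)', 'doc3', 'doc5 (least relevant)', 'doc4', 'doc2']
```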
Open challenges and ongoing research in LLM grounding
Grounding is an indispensable strategy for improving the performance of LLMs. Particularly when using retrieval-augmented generation, the extent of these improvements often hinges on secondary factors like the amount of external data and its exact arrangement. These difficulties are the focus of ongoing research, which will continue to reduce their impact.
Another focus of research in LLM grounding is improving provenance, which is the ability to cite specific data sources (or parts thereof) used to generate an output. Benchmarks like Attributed QA from Google Research are tracking the progress in this area.
Researchers are also working to apply targeted modifications to update language models in place (i.e., without fine-tuning). This would enable information to be added, removed, or changed after training and could improve the coverage of pre-trained LLMs, thus reducing the need for external information.