
Customizing LLM Output: Post-Processing Techniques

26th April, 2024

TL;DR:

LLMs generate output by predicting the next token based on previous ones, using a vector of logits to represent the probability of each token.

Post-processing techniques like greedy decoding, beam search, and sampling strategies (top-k, top-p) control in detail how the next token is determined, balancing predictability and creativity.

Advanced techniques, such as frequency and presence penalties, logit bias, and structured outputs (via prompt engineering or fine-tuning), further refine LLMs’ outputs by taking into account information beyond token probabilities.

If you’ve delved into the world of large language models (LLMs) like ChatGPT, Llama, or Mistral, you’ve likely noticed how adjusting input parameters can transform the responses you get. These models are capable of delivering a wide array of outputs, from creative narratives to structured JSON. This versatility makes LLMs incredibly useful for various applications, from sparking an author’s creativity to streamlining data processing.

All of this is possible thanks to the vast amount of information encoded in LLMs and the possibility of adapting them through fine-tuning and prompting. We can further control the output of LLMs through parameters such as “temperature” or a “frequency penalty,” which influence an LLM’s output on a token-by-token basis.

Understanding these parameters and output post-processing techniques more broadly can significantly enhance the utility of LLMs for specific applications. For example, altering the temperature setting shifts the balance between variety and predictability, while adjusting the frequency penalty helps minimize repetition.

In this article, we’ll zoom in on how LLMs generate their output and how we can influence this process. Along the way, you’ll learn:

  • How LLMs generate their output using a vector of logits and the softmax function.
  • How the Greedy Decoding, Beam Search, and Sampling post-processing techniques determine the next token to output.
  • How you can balance variability and coherence with top-k sampling, top-p sampling, and by adjusting the softmax temperature.
  • How advanced techniques like frequency penalties, logit bias, and structured output give you even more control over an LLM’s output.
  • What the typical challenges are when implementing post-processing techniques in the LLM space, and how to address them.

How do LLMs generate their outputs?

Before we dive into post-processing techniques for customizing LLM outputs, it’s crucial to understand how an LLM generates its output in the first place.
Generally, when we talk about LLMs, we refer to autoregressive language models. These models predict the next token based solely on the previous tokens. However, they don’t output the token directly. Instead, they generate a vector of logits, which, after applying a softmax function, can be interpreted as the probability of each token in the vocabulary.

Illustration of the process of generating text with a large language model (LLM)
Illustration of the process of generating text with a large language model (LLM). An input sequence of tokens is processed to produce a vector of logits, which represent unnormalized probabilities for each potential next token. These logits are then converted into actual probabilities using the softmax function, determining the likelihood of each token following the input. | Source: Author

This process happens for each generation step. To generate the next token in the output, we take the sequence of tokens generated thus far, feed it into the LLM, and select the next token based on the output of the softmax function.
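To make this concrete, here is a minimal sketch in Python of how a vector of logits is turned into next-token probabilities with the softmax function. The five-token vocabulary and the logit values are made up for illustration; a real LLM's vocabulary contains tens of thousands of tokens.

import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the maximum logit for numerical stability before exponentiating
    shifted = logits - np.max(logits)
    exp_logits = np.exp(shifted)
    return exp_logits / exp_logits.sum()

vocabulary = ["color", "fruit", "song", "name", "stop"]  # Toy vocabulary
logits = np.array([2.0, 1.5, 0.3, -0.5, -1.2])  # Mocked model output

probabilities = softmax(logits)
for token, probability in zip(vocabulary, probabilities):
    print(f"{token}: {probability:.3f}")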

We can improve the output of an LLM by manipulating, refining, or constraining the model’s logits or probabilities. Collectively, we call techniques for this “post-processing techniques” – and this is what we’ll cover in the following sections.

Post-processing techniques for LLMs

Greedy decoding

At this point, you might think: “If I have the probabilities, I just need to pick the token with the highest probability, and that’s it.” If that crossed your mind, you’ve thought of Greedy Decoding. This is the simplest algorithm of all. We take the token with the highest probability as the next token and continue choosing subsequent tokens in the same way.

Let’s look at an example:

Generating an LLM’s output using greedy decoding.
Generating an LLM’s output using greedy decoding. In red, we can see the token sequence chosen by the greedy decoding algorithm. The next token is always the one with the highest relative probability. | Source: Author

In the graph above, we can see that we start with “My” as the initial token. In the first generation step, the most probable next token is “favorite,” so we choose it and feed “My favorite” into the model. The most probable token in the subsequent generation step is “color,” leaving us with “My favorite color” as the LLM’s output.

Greedy decoding is widely used when aiming for reproducible, deterministic outputs. However, despite always choosing the tokens with the highest probabilities, we don’t necessarily end up with the sequence that is most probable overall. In our example above, the sentence “My name is” has a higher cumulative probability (0.27) than “My favorite color” (0.2). Nevertheless, we chose “favorite” over “name” as it had the highest relative probability of all tokens in the first generation step.

By consistently selecting the token with the highest probability in each generation step, we sometimes miss out on high-probability tokens that are “hidden” behind a token with a lower probability. We just saw this happen with the token “is” after the token “name”: even though “name” is not the most probable token in the first generation step, it hides a highly probable token.

Beam search

One way to address the issue of high-probability tokens hidden behind lower-probability ones is to keep track of several candidate sequences at each generation step. If we keep the n most probable next tokens in memory and consider them during the subsequent generation step, we significantly reduce the chance of missing a high-probability token hidden behind a lower-probability one. This approach is called beam search, a well-known algorithm in AI research and computer science that predates LLMs by decades.
Let’s see how this would play out in our example:

Generating an LLM’s output using beam search with n=2.
Generating an LLM’s output using beam search with n=2. In red, we see the two token sequences (beams) we keep track of. The continuous red line represents the sequence with the higher total probability, and the dashed red line the second most probable sequence. Even though “name” has a lower probability than “favorite,” we select “My name is” over “My favorite color,” as the former has the higher total probability. | Source: Author

As shown in the diagram above, we keep track of the two most probable token sequences. Beam search will always find a sequence whose probability is greater than or equal to the one found by greedy decoding. However, there’s a downside: we have to run inference n times at each generation step, once for each candidate sequence we keep in memory.

It’s also important to mention that, similar to greedy search, this approach is deterministic. This can be a drawback when we aim for more varied and diverse responses.
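Below is a rough sketch of beam search over a mocked model. The next_token_probs helper and its hard-coded probability table are invented for this example (they mirror the probabilities in the figure above); in a real implementation, each lookup would be a forward pass through the LLM.

from typing import List, Tuple

# Mocked model: returns (token, probability) pairs for the next token,
# given the sequence generated so far.
def next_token_probs(sequence: Tuple[str, ...]) -> List[Tuple[str, float]]:
    table = {
        ("My",): [("favorite", 0.4), ("name", 0.3), ("dog", 0.3)],
        ("My", "favorite"): [("color", 0.5), ("food", 0.5)],
        ("My", "name"): [("is", 0.9), ("was", 0.1)],
        ("My", "dog"): [("barks", 0.6), ("sleeps", 0.4)],
    }
    return table.get(sequence, [("<eos>", 1.0)])

def beam_search(start: Tuple[str, ...], n_beams: int, steps: int):
    # Each beam is a (sequence, cumulative probability) pair
    beams = [(start, 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, probability in beams:
            for token, p in next_token_probs(sequence):
                candidates.append((sequence + (token,), probability * p))
        # Keep only the n most probable sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:n_beams]
    return beams

for sequence, probability in beam_search(("My",), n_beams=2, steps=2):
    print(" ".join(sequence), f"(p={probability:.2f})")

With n_beams=2, this returns “My name is” (p=0.27) ahead of “My favorite color” (p=0.20), just as in the diagram.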

Sampling

To introduce a bit of variety, we must turn to randomness. One of the most common ways to do this is by randomly selecting a token based on the token probability distribution in each generation step.

Bar chart illustrating output probability distribution over tokens in a single generation step.
Output probability distribution over tokens in a single generation step. | Source: Author

To give a specific example, have a look at the token probability distribution depicted above. If we apply the sampling algorithm we just sketched out, there’s a 40% likelihood of choosing the token “beer,” a 30% chance of choosing the token “orange,” and so on. However, by selecting randomly among all tokens, we risk occasionally ending up with nonsensical outputs.
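In code, this simply means drawing a token from the probability distribution instead of taking its maximum. Here is a minimal sketch, assuming a toy vocabulary (only “beer” and “orange” and their probabilities come from the chart above; the remaining tokens and values are made up):

import numpy as np

rng = np.random.default_rng(seed=42)  # Fixed seed for reproducibility

vocabulary = ["beer", "orange", "water", "juice"]  # Toy vocabulary
probabilities = np.array([0.4, 0.3, 0.2, 0.1])  # Mocked softmax output

# Sample the next token according to the probability distribution
next_token = rng.choice(vocabulary, p=probabilities)
print("Sampled token:", next_token)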

Top-k sampling

To avoid choosing low-probability output sequences, we can restrict the set of tokens we sample from to the k most likely ones. This approach is called “top-k sampling” and aims to increase variety while ensuring that the output remains highly probable—and thus sensible.

In this algorithm, the idea is to select the top k tokens and redistribute the probability mass among these k tokens, i.e., tweak the probabilities of the top k tokens to ensure the sum of all probabilities remains 1. (E.g., if we select the top 2 tokens with probabilities 0.35 and 0.25, after tweaking the probabilities, the new values will be 0.58 and 0.42.)

There’s another benefit to this. Since LLMs have large vocabularies (usually tens of thousands of tokens), calculating the softmax function is often costly, as it involves computing exponentials for each input value. Therefore, selecting the top k tokens based on their logits allows us to narrow down the set over which the softmax will be applied, accelerating inference compared to naive sampling.

Generating an LLM’s output using top-k sampling with k=3
Generating an LLM’s output using top-k sampling with k=3. We sort the logits by value and calculate the next-token probabilities by applying the softmax function for only the top 3 tokens (“color,” “fruit,” and “song”). | Source: Author
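Here is a minimal sketch of top-k sampling on a mocked logit vector. The vocabulary and logit values are made up, chosen so that “color,” “fruit,” and “song” are the three most likely tokens, as in the figure; note that the softmax is applied only to the k largest logits.

import numpy as np

def top_k_sampling(logits: np.ndarray, vocabulary: list, k: int, rng) -> str:
    # Indices of the k largest logits
    top_indices = np.argpartition(logits, -k)[-k:]
    top_logits = logits[top_indices]
    # Softmax over the top-k logits only, redistributing the probability mass
    exp_logits = np.exp(top_logits - np.max(top_logits))
    probabilities = exp_logits / exp_logits.sum()
    chosen_index = rng.choice(top_indices, p=probabilities)
    return vocabulary[chosen_index]

rng = np.random.default_rng(seed=0)
vocabulary = ["color", "fruit", "song", "name", "stop"]
logits = np.array([2.0, 1.5, 0.7, -0.5, -1.2])
print(top_k_sampling(logits, vocabulary, k=3, rng=rng))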

While restricting sampling to the top k tokens will reliably weed out output sequences with extremely low probability, there are two scenarios where it falls short:

  • Say we set k to 10. If there are just three tokens with high logit values, we’ll nevertheless include seven additional tokens to sample from, even though they are highly unlikely. Indeed, since we’re redistributing the probability mass, we’re even (ever so slightly) elevating the likelihood they’re chosen.
  • If there is a large number of tokens with roughly the same logit values, restricting the set of tokens to k excludes a lot of tokens that are just as likely. In the extreme case where m > k tokens have the same logit value, which tokens we select would be an artifact of our sorting algorithm’s implementation.

Top-p sampling

The concept behind top-p sampling, also known as nucleus sampling, is similar to top-k sampling. However, instead of choosing the top k tokens, we select a set of tokens whose cumulative probability is equal to or greater than p. In other words, it dynamically adjusts the size of the considered token set based on the specified probability threshold, p.

Unlike top-k, top-p doesn’t aid in computational efficiency, as it requires having the probabilities calculated to apply the algorithm. Nonetheless, it ensures the most relevant tokens are used in all cases.

In theory, it doesn’t truly solve the problems described in the top-k section. It’s still possible that low-likelihood tokens are included and that high-likelihood tokens are excluded. However, in practice, top-p has proven to work well, as the number of candidates considered rises and falls dynamically, corresponding to the changes in the model’s confidence region over the vocabulary, which top-k sampling fails to capture for any one choice of k. (If you’re curious and want to dive into more detail, have a look at the paper The Curious Case of Neural Text Degeneration, which introduced nucleus sampling.)

Generating an LLM’s output using top-p sampling with p=0.9
Generating an LLM’s output using top-p sampling with p=0.9. We select only the first two tokens (“color” and “fruit”) because their cumulative probability surpasses the threshold of 0.9 as defined by our parameter p. | Source: Author

An example of this can be seen in the diagram above, where only the tokens “color” and “fruit” are selected, since their cumulative probability exceeds the defined threshold of 0.9. Had we used top-k sampling with k=3, we would have also selected the word “stop,” which may not be highly relevant to the input. This illustrates how top-p sampling effectively filters out less pertinent options, focusing on the tokens most relevant to the context and thus maintaining coherence and relevance in the generated content.
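Here is a minimal sketch of top-p sampling, again with made-up tokens and probabilities chosen so that “color” and “fruit” together cross the 0.9 threshold:

import numpy as np

def top_p_sampling(probabilities: np.ndarray, vocabulary: list, p: float, rng) -> str:
    # Sort tokens by probability, highest first
    order = np.argsort(probabilities)[::-1]
    sorted_probs = probabilities[order]
    # Keep the smallest prefix whose cumulative probability reaches the threshold p
    cutoff = int(np.searchsorted(np.cumsum(sorted_probs), p)) + 1
    kept_indices = order[:cutoff]
    kept_probs = sorted_probs[:cutoff]
    # Renormalize and sample from the retained "nucleus"
    kept_probs = kept_probs / kept_probs.sum()
    chosen_index = rng.choice(kept_indices, p=kept_probs)
    return vocabulary[chosen_index]

rng = np.random.default_rng(seed=0)
vocabulary = ["color", "fruit", "stop", "name", "song"]
probabilities = np.array([0.55, 0.36, 0.05, 0.03, 0.01])
print(top_p_sampling(probabilities, vocabulary, p=0.9, rng=rng))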

It’s also possible to use the top-p and top-k strategies together, restricting the candidate set by whichever condition is satisfied first (i.e., applying the more restrictive of the two filters). This hybrid approach leverages the strengths of both methods: top-k’s ability to limit the selection to a manageable subset of tokens and top-p’s focus on relevance by considering only tokens that collectively meet a certain probability threshold.

Temperature

If you want to have even more control over the “creativity” of your responses, perhaps the most important parameter to adjust is the softmax function’s “temperature.”

Typically, the softmax function is represented by the equation:

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)

where z_i is the logit for token i. When we introduce the temperature parameter, T, the formula is modified to look like this:

softmax(z_i) = exp(z_i / T) / Σ_j exp(z_j / T)

In the figures below, we can see the effect of varying the temperature T:

Bar chart illustrating resulting token probabilities for T=1.
Resulting token probabilities for T=1. | Source: Author
Bar chart illustrating resulting token probabilities for T≈0.
Resulting token probabilities for T≈0. | Source: Author
Bar chart illustrating resulting token probabilities for T ≫1.
Resulting token probabilities for T ≫1. | Source: Author

If we set T=1, we get the same result as before. However, if we use a value greater than 1 (a higher temperature), we shrink the difference between the probabilities assigned to high- and low-valued logits, bringing likely and unlikely tokens closer together. Conversely, using a value of T less than 1 favors likely tokens even more and reduces the probability of unlikely tokens. This manipulation of temperature allows for nuanced control over the diversity and predictability of the model’s output.
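The following sketch applies temperature scaling to a mocked logit vector, so you can see the probability distribution sharpen or flatten as T changes:

import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    # Divide the logits by T before applying the (numerically stabilized) softmax
    scaled = logits / temperature
    exp_logits = np.exp(scaled - np.max(scaled))
    return exp_logits / exp_logits.sum()

logits = np.array([2.0, 1.0, 0.1])  # Mocked logits
for temperature in (0.1, 1.0, 10.0):
    print(f"T={temperature}:", np.round(softmax_with_temperature(logits, temperature), 3))

# T close to 0 concentrates almost all of the probability mass on the top token,
# while a large T makes the distribution nearly uniform.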

Advanced techniques for constraining sampling

Beyond the techniques and strategies we’ve discussed so far, there are other methods to adjust probabilities for sampling. Here, we’ll discuss some techniques used to guide the generation of text through constraints applied before the softmax function.

Constraints are applied to the vector of logits, generating a new vector of constrained logits.
Constraints are applied to the vector of logits, generating a new vector of constrained logits. This vector is then fed into the softmax function to produce the token probabilities for sampling. | Source: Author

Frequency penalty

If you’ve used somewhat smaller LLMs, such as those with a few million parameters, or tried using a model trained in one language for another, you might have noticed that repetitions in the responses are quite common. (This behavior has been studied in detail, as documented in the paper Neural Text Generation with Unlikelihood Training.)

The output of an LLM trained on English-language text when prompted with a Portuguese QA task. The model produces the very same sentence over and over again. | Source

Since this behavior is so common for LLMs, it is useful to penalize tokens that have already been generated, preventing them from reappearing unless they truly have a high probability of occurrence.

This is where frequency and presence penalties come into play. These penalties work by modifying the logits. Specifically, the adjustment is made according to the formula:

Adjusted Logit = Original Logit - (Frequency Count * Frequency Penalty) - (Has Appeared? * Presence Penalty)

Where “Original Logit” is the model’s initial score for the token, “Frequency Count” is the number of times the token has already been used, “Has Appeared?” is 1 if the token has been used at least once and 0 otherwise, and “Frequency Penalty” and “Presence Penalty” control how strongly the original logits are adjusted.

To subtly discourage repetition, one might set the penalties between 0.1 and 1. For a more pronounced effect, increasing the values up to 2 can significantly reduce repetition, though at the risk of affecting the text’s naturalness. Intriguingly, negative values can be employed to achieve the opposite effect, encouraging the model to favor repeated terms, which can be useful in certain contexts. Through this formula, LLMs can finely balance novelty and repetition, ensuring content remains fresh and engaging without becoming monotonous.
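Here is a minimal sketch of the adjustment formula above applied to a mocked logit vector. The tokens, counts, and penalty values are made up for illustration.

import numpy as np

def apply_penalties(logits: np.ndarray, counts: np.ndarray,
                    frequency_penalty: float, presence_penalty: float) -> np.ndarray:
    # counts[i] = how many times token i has already appeared in the generated text
    has_appeared = (counts > 0).astype(float)
    return logits - counts * frequency_penalty - has_appeared * presence_penalty

logits = np.array([3.0, 2.5, 1.0])  # Mocked logits for ["the", "cat", "dog"]
counts = np.array([4, 1, 0])  # "the" used 4 times, "cat" once, "dog" never
adjusted = apply_penalties(logits, counts, frequency_penalty=0.5, presence_penalty=0.3)
print(adjusted)  # The repeated tokens "the" and "cat" now have lower logits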

Logit bias

Imagine you’re attempting to use your LLM for classification, where the classes are “red,” “green,” and “blue.” You expect the next token (the token to be predicted) to be one of these classes, but often, an unexpected token emerges, disrupting your workflow.

One way to address this issue is by using a technique known as logit bias, a term also used in the APIs of OpenAI and AI21 Studio. This method specifies a set of tokens and a bias value to be added to the logit of each token, thereby altering the probability of selecting that token during prediction.

Adjusting token probabilities through bias values.
Adjusting token probabilities through bias values. The token “green” receives a bias of +3, shifting its softmax-calculated probability to be the most likely choice. | Source: Author

The effect of logit bias can differ across models and contexts. Generally, bias values between -1 and 1 can finely tune the likelihood that a token is selected. Values of -100 or 100 can either completely remove a token from consideration or guarantee its selection. Thus, bias values are a versatile tool for ensuring the desired outcome in classification tasks.
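As a rough sketch, this is how a logit bias can be applied before the softmax. The vocabulary, logits, and bias values are made up, and real APIs differ in the details (OpenAI's logit_bias parameter, for instance, is keyed by token IDs rather than strings).

import numpy as np

def apply_logit_bias(logits: np.ndarray, vocabulary: list, bias: dict) -> np.ndarray:
    biased = logits.copy()
    for token, value in bias.items():
        # Add the bias value to the logit of the corresponding token
        biased[vocabulary.index(token)] += value
    return biased

vocabulary = ["red", "green", "blue", "hello"]
logits = np.array([1.2, 0.4, 0.9, 2.0])

# Push the model towards the expected class labels and suppress an unwanted token
biased = apply_logit_bias(logits, vocabulary, {"green": 3.0, "hello": -100.0})
probabilities = np.exp(biased - biased.max()) / np.exp(biased - biased.max()).sum()
print(dict(zip(vocabulary, np.round(probabilities, 3))))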

Structured outputs

As mentioned at the beginning of the article, you’ve likely noticed that some LLMs, like ChatGPT, can generate responses in JSON format. This capability can be incredibly useful, depending on your application. For instance, if your workflow includes a step that extracts specific entities from the text, a parsable structure is crucial for the steps that follow. Another example is generating output in the form of SQL queries for data analysis.

What these examples have in common is the expectation that the output has a specific, structured format that allows subsequent stages of a workflow to perform operations based on them.

There are two main approaches to enable LLMs to generate these structured outputs:

  • Prompt Engineering: Crafting a prompt that specifies the format in which the model should generate its response. This is the simpler method of the two, as it requires no changes to the model and works even for models that are only accessible via an API. However, there’s no guarantee the model will always follow the instructions, which is why the output should be validated, as shown in the sketch after this list.
  • Fine-tuning: Further training the pre-trained LLM on task-specific input/output pairs. This approach is more commonly used for this type of problem. Although it’s not entirely guaranteed to produce the expected output format, it’s much more reliable than using prompt engineering. Moreover, it’s worth noting that you tend to process fewer tokens at inference time since you don’t need to pass lengthy and complex instructions. This makes inference faster and more cost-effective.
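As a minimal illustration of the prompt-engineering approach, the sketch below asks for JSON and validates the response. The prompt wording and the call_llm helper are placeholders for whichever client you use, not a specific API.

import json

PROMPT_TEMPLATE = """Extract the entities from the text below.
Respond ONLY with valid JSON of the form {{"entities": ["...", "..."]}}.

Text: {text}"""

def extract_entities(text: str, call_llm) -> dict:
    # call_llm is a placeholder: a function that sends the prompt to the model
    # and returns its raw text response
    response = call_llm(PROMPT_TEMPLATE.format(text=text))
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        # Prompt engineering gives no hard guarantee, so always validate the output
        raise ValueError(f"Model did not return valid JSON: {response!r}")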

Implementing post-processing techniques

One of the hardest challenges when implementing post-processing algorithms is validating that your implementation is correct. LLMs usually have large vocabularies, and it is hard to know the expected output.

The best way to overcome this is to mock the logit vector and apply the algorithms to this mocked vector. This way, you can reliably compare actual and expected outputs. Here’s an example of how to test the greedy decoding algorithm:

import numpy as np
from typing import List

# Greedy decoding algorithm
def greedy_decoding(logit_vector: np.ndarray, vocabulary: List[str]) -> str:
    # Select the index with the highest score in the logit vector
    max_index = np.argmax(logit_vector)
    # Return the corresponding word
    return vocabulary[max_index]

vocabulary = ['hello', 'world', 'goodbye', 'the', 'is']  # Vocabulary of the model
logit_vector = np.array([2.1, 3.5, -1.2, 1.2, 0.2])  # Mocked logit vector
assert len(vocabulary) == logit_vector.shape[-1]

# Apply greedy decoding
next_token = greedy_decoding(logit_vector, vocabulary)
print("Next token:", next_token)

>> Next token: world

Sample implementation of a test of the greedy decoding algorithm in Python using the numpy library. As expected, the token with the highest logit value (“world”) was selected.

However, unless you want to implement these algorithms as a learning exercise, you do not have to write them yourself. Most LLM APIs and libraries already include the most common approaches.

In OpenAI’s API, you can specify arguments like temperature, top_p, response_format, presence_penalty, frequency_penalty, and logit_bias. The transformers library from Hugging Face implements various text-generation strategies as well.

If you are aiming for more structured outputs, you should check the constrained generation features of the guidance library. guidance allows you to use regex, context-free grammar, and simple prompts to constrain your generation. Another helpful tool is instructor, which uses Pydantic models to perform data extraction and introduces a concept called Validator, which uses an LLM to validate if the output matches some pattern.

Conclusion

Throughout this article, we’ve delved into the key techniques used to modify the outputs of LLMs. Given the burgeoning nature of this field, new techniques will likely emerge. However, with a solid understanding of the methods we’ve discussed, you should be well-equipped to grasp any new strategies that come along.

I hope you enjoyed the article and learned more about the often under-discussed topic of output post-processing. If you get the chance, I encourage you to experiment with these techniques and observe the diverse outcomes they can produce!

FAQ

  • What are large language models (LLMs)? Large language models (LLMs) are huge deep-learning models pre-trained on vast amounts of data. These models are usually based on an architecture called transformers, and their goal is to predict the next token given an input. Examples include ChatGPT, Llama, Gemini, Gemma, and Mistral.

  • How can you customize an LLM’s output? You can adjust the model’s output through fine-tuning, setting parameters like temperature and frequency penalty, or using post-processing techniques such as greedy decoding, beam search, top-k sampling, and top-p sampling.

  • What does the softmax function do? The softmax function converts a vector of logits, which represent unnormalized probabilities for each potential next token, into actual probabilities. The shape of the resulting probability distribution is controlled by the temperature parameter.

  • What is the difference between greedy decoding and beam search? Greedy decoding selects the token with the highest probability at each generation step. Beam search keeps track of multiple candidate sequences (beams) to potentially find a sequence with a higher overall probability than greedy decoding would.

  • What is the difference between top-k and top-p sampling? Top-k sampling limits the selection to the k most likely tokens. Top-p (nucleus) sampling dynamically chooses tokens that cumulatively reach a certain probability threshold, p, aiming to balance relevance and variety.

  • What does the temperature parameter do? Adjusting the temperature parameter modifies the softmax function, influencing the diversity and predictability of a large language model’s output. A higher temperature results in more diverse (but potentially less coherent) outputs, while a lower temperature favors more predictable (but less varied) outputs.

  • What are frequency and presence penalties? Frequency and presence penalties are techniques used to adjust the likelihood of tokens being repeated in a large language model’s output. The frequency penalty decreases the probability of a token proportionally to how often it has already appeared, aiming to reduce repetition and enhance the diversity of the text. The presence penalty discourages the repetition of any token that has been used at least once. These adjustments are crucial for generating more coherent and engaging text, especially in longer outputs where repetition might otherwise detract from the quality.

  • How can LLMs produce structured outputs? Structured outputs, such as responses in JSON format or SQL queries, can be generated through prompt engineering or fine-tuning. Prompt engineering involves crafting a prompt that indicates the desired format of the response. Fine-tuning, on the other hand, trains the LLM on specific input/output pairs to produce outputs in a particular format. Both methods aim to produce structured, parsable text that can be directly used in subsequent computational processes, enhancing the model’s utility for various applications.
