MLOps Blog

Deploying Large NLP Models: Infrastructure Cost Optimization

12 min
5th June, 2023

NLP models in commercial applications such as text generation systems have experienced great interest among the user. These models have achieved various groundbreaking results in many NLP tasks like question-answering, summarization, language translation, classification, paraphrasing, et cetera. 

Models like for example ChatGPT, Gopher **(280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) are predominantly very large and often addressed as large language models or LLMs. These models can easily have millions or up to billions of parameters making them financially expensive to deploy and maintain.

Graph showing that the size of large NLP models is increasing
The size of large NLP models is increasing | Source

Such large natural language processing models require significant computational power and memory, which is often the leading cause of high infrastructure costs. Even if you are fine-tuning an average-sized model for a large-scale application, you need to muster a huge amount of data. 

Such scenarios inevitably lead to stacking new layers of neural connections, making it a large model, moreover, deploying these models will require fast and expensive GPU, which will ultimately add to the infrastructure cost. So is there a way to keep these expenses in check?

Sure there is.

This article aims to provide some strategies, tips, and tricks you can apply to optimize your infrastructure while deploying them. In the following sections, we will explore these:

  • 1 The infrastructural challenges faced while deploying large NLP models.
  • 2 Different strategies to reduce the costs associated with these challenges.
  • 3 Other handy tips you might want to know to address this issue.

You may also like

How to Deploy NLP Models in Production

What Does GPT-3 Mean For the Future of MLOps? With David Hershey

Challenges of large NLP models

Computational resources

LLMs require is a significant amount of resources for optimal performance. Below are the challenges that are usually faced concerning the same. 

1. High computational requirements

Deploying LLMs can be challenging as they require significant computational resources to perform inference. This is especially true when the model is used for real-time applications, such as chatbots or virtual assistants. 

Consider ChatGPT as an example. It is capable of processing and responding to queries instantly within seconds (most of the time). But there are times when the user traffic seems to be higher, during those moments, the inference time gets higher. There are other factors that can delay the inference, such as the complexity of the question, the amount of information required to generate a response, et cetera. But in any case, if the model is supposed to serve in real-time, it must be capable of high throughput and low latency.

2. Storage capacity

With parameters ranging from millions to billions, LLM can pose storage capacity challenges. It will be good to store the whole model in a single storage device, but because of the size, it is not possible. 

For example, OpenAI’s GPT-3 model, with 175B parameters, requires over 300GB of storage for its parameters alone. Additionally, it requires a GPU with a minimum of 16GB of memory to run efficiently. Storing and running such a large model on a single device may be impractical for many use cases due to the hardware requirements. As such, there are three main issues around storage capacity with LLMs:

2.1 Memory limitations

LLMs require a lot of memory as they process a huge amount of information. This can be challenging, especially when you want to deploy them on a low-memory device such as a mobile phone. 

One way to deploy such models is to use a distributed system or distributed inference. In distributed inference, the model is distributed on multiple nodes or servers. It allows the distribution of the workload and speeds up the process. But the challenge here is that it may require significant expertise to set up and maintain. Plus, the larger the model, the more servers are required, which again increases the deployment cost. 

2.2 Large model sizes

The MT-NLG model released in 2022 has 530 billion parameters and requires several hundred gigabytes of storage. High-end GPUs and basic data parallelism aren’t sufficient for deployment, and even alternative solutions like pipeline and model parallelism have trade-offs between functionality, usability, and memory/compute efficiency. As the authors in the paper “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models put it, this, in turn, reduces the effectiveness of the model. 

For instance, a 1.5B parameter model on 32GB can easily run out of memory during inference if the input query is long and complicated. Even for basic inference on LLM, multiple accelerators or multi-node computing clusters like multiple Kubernetes pods are required. There are techniques discussed by researchers where they propose the idea of offloading parameters to the local RAM. But these techniques turned out to be inefficient in practical use-case scenarios. Users cannot download such large scaled models on their systems just to translate or summarise a given text. 

2.3 Scalability challenges 

Another area for improvement with LLMs is scalability. We know that a large model is often scaled using model parallelism (MP), which requires multiple storage and memory capacity. This involves dividing the model into smaller parts and distributing it across multiple machines. Each machine processes a different part of the model, and the results are combined to produce the final output. This technique can be helpful in handling large models, but it requires careful consideration of the communication overhead between machines. 

In Distributed inference, LLM is deployed on multiple machines, with each machine processing a subset of the input data. This approach is essential for handling large-scale language tasks that require input to pass through billions of parameters. 

Most of the time, MP works, but there are instances where it doesn’t. The reason being MP divides the model vertically, distributing the computation and parameters among several devices for each layer where the inter-GPU communication bandwidth is large. This distribution facilitates intensive communication between each layer in a single node. The limitation comes outside a single node which essentially leads to a fall in performance and efficiency.

3. Bandwidth requirements

As discussed previously, LLM has to be scaled using MP. But the issue we found was that MP is efficient in single-node clusters, but in a multi-node setting, the inference isn’t efficient. This is because of the low bandwidth networks. 

Deploying a large language model requires multiple network requests to retrieve data from different servers. Network latency can impact the time required to transfer data between the servers, which can result in slower performance, eventually leading to high latency and response time. This can cause delays in processing, which can impact user experience.

4. Resource constraints

Limited storage capacity can restrict the ability to store multiple versions of the same model, which can make it difficult to compare the performance of different models and track the progress of model development over time. This can be true if you want to adopt a shadow deployment strategy.

Energy consumption

As discussed above already, serving LLMs require significant computational resources, which can lead to high energy consumption and a large carbon footprint. This can be problematic for organizations that are committed to reducing their environmental impact.

Just for reference, below is the image showing the financial estimation of the LLMs, along with the carbon footprint that they produce during training.

Financial estimation of the large NLP models, along with the carbon footprint that they produce during training
Financial estimation of the large NLP models, along with the carbon footprint that they produce during training | Source

What is more shocking is that 80-90% of the machine learning workload is inference processing, according to NVIDIA. Likewise, according to AWS, inference accounts for 90% of machine learning demand in the cloud.


Deploying and using LLMs can be costly, including the cost of hardware, storage, and infrastructure. Additionally, the cost of deploying the model can be significant, especially when using resources such as GPUs or TPUs for low latency and high throughput during inference. This can make it challenging for smaller organizations or individuals to use LLMs for their applications.

To put this into perspective, it is expected that the running cost of the chatGPT is around $100,000 per day or $3M per month.

Tweet about ChatGPT costs
Tweet about ChatGPT costs | Source

Strategies for optimizing infrastructure costs of large NLP models

In this section, we will explore and discuss the possible solutions and techniques for the challenges discussed in the previous section. It is worth noting that when you deploy the model on the cloud, you choose the inference option and thereby create an end-point. See the image below. 

Graph with the general workflow for inference endpoints
The general workflow for inference endpoints | Source

Keep that in mind, and with all the challenges we discussed earlier, we will discuss techniques that can be used to optimize the cost around this infrastructure for deploying LLMs. Below are some of the steps that you can follow to deploy your model as efficiently as possible. 

Smart use of cloud computing for computational resources

Using cloud computing services can provide on-demand access to powerful computing resources, including CPUs and GPUs. Cloud computing services are flexible and can scale according to your requirements. 

One of the important tips is that you should make a budget for your project. Making a budget always helps you find ways to optimize your project that will not exceed your financial limitation. 

Now when it comes to cloud services, there are a lot of companies that offer their platform. Cloud providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform offer a range of options for deploying LLMs, including virtual machines, containers, and serverless computing. But despite you must do your own research and calculation. For instance, you must know these three things:

  • 1 The model size.
  • 2 Details about the hardware to be used.
  • 3 Right inference option.

Once you have the details, you can actually calculate how much-accelerated computing power you need. Based upon that, you can plan and execute your model deployment. 

Learn more

MLOps Tools for NLP Projects

Calculating model size

You can see the table below, which will give you an idea of how many FLOPs you might need for your model. Once you have an estimation, you can then go ahead and find the relevant GPU in your preferred cloud platform. 

Estimated optimal training FLOPs and training tokens for various NLP model sizes.
Estimated optimal training FLOPs and training tokens for various NLP model sizes | Source

A tool that I found under the blog post named “Estimating Training Compute of Deep Learning Models”  allows you to calculate the FLOPs required for your model both for training and inference. 

A screen from a tool that calculates the FLOPs required for both training and inference
A tool that calculates the FLOPs required for both training and inference | Source

The app is based on the works of Kaplan et al., 2020 or Hoffman et al., 2022 where they show how to train a model on a fixed-compute budget. To understand more on this subject you can read the blog here.

Selecting the right hardware

Once you have calculated the required FLOPs, you can go ahead and choose the GPU. Make sure you are aware of the features that the GPU offers. For instance, see the image below to get an understanding.

The list of GPU specifications offered by NVIDIA
The list of GPU specifications offered by NVIDIA | Source

Above you can see the list of specifications that NVIDIA offers. Similarly, you can compare different GPUs and see which one suits your budget. 

Choosing the right inference option

Once you have calculated the model size and selected the GPU, you can then proceed to choose the inference option. Amazon SageMaker offers multiple inference options to suit different workloads. For instance, if you require:

  1. Real-time inference, which is suitable for low-latency or high-throughput online inferences and supports payload sizes up to 6 MB and processing times of 60 seconds.
  2. Serverless inference, which is ideal for intermittent or unpredictable traffic patterns and supports payload sizes up to 4 MB and processing times of 60 seconds. In serverless inference, the model scales automatically based on the incoming traffic or requests. At times when the model is sitting idle you won’t be charged. It offers a pay-as-you-use facility. 
  3. Batch transform is suitable for offline processing of large datasets and supports payload sizes of GBs and processing times of days. 
  4. Asynchronous inference is suitable for queuing requests with large payloads and long processing times, supports payloads up to 1 GB and processing times up to one hour, and can scale down to 0 when there are no requests.

To get a better understanding and meet your requirement, look at the image below. 

Graph with choosing model deployment options
Choosing model deployment options | Source

When all the above points are satisfied, you can then deploy the model on any of the cloud services. 

To quickly summarize:

  • 1 Set a budget
  • 2 Calculate the size of the model 
  • 3 Compute the FLOPs required for model
  • 4 Find the right GPU
  • 5 Choose the appropriate inference option 
  • 6 Research the pricing offered by various cloud computing platforms
  • 7 Find the service that suits your needs and budget
  • 8 Deploy it. 

Optimizing the model for serving

In the last section, I discussed how the size of LLMs can pose a problem for deployment. When your model is too large, strategies like model compilation, model compression, and model sharding can be used. These techniques reduce the size of the model while preserving accuracy, which allows easier deployment and reduce the associated expenses significantly. 

Let’s explore each of those in detail. 

Graph showing different techniques or strategies to optimize LLMs for deployment. 
Different techniques or strategies to optimize LLMs for deployment | Source

Model compression

Model compression is a technique used to optimize and transform an LLM into an efficient executable model that can be run on specialized hardware or software platforms–usually cloud services. The goal of model compression is to improve the performance and efficiency of LLM inference by leveraging hardware-specific optimizations, such as reduced memory footprint, improved computation parallelism, and reduced latency.

This is a good technique because it helps you to play with a different combination, set performance benchmarks for various tasks, and find a price that suits your budget.  As such, model compression involves several steps:

  1. Graph optimization: The high-level LLM graph is transformed and optimized using graph optimization techniques such as pruning and quantization to reduce the computational complexity and memory footprint of the model. This, in turn, makes the model small while preserving its accuracy. 
  2. Hardware-specific optimization: The optimized LLM graph is further optimized to leverage hardware-specific optimizations. For instance, Amazon Sagemaker provides model serving containers for various popular ML frameworks, including XGBoost, scikit-learn, PyTorch, TensorFlow, and Apache MXNet, along with software development kits (SDKs) for each container.
Illustration of Amazon Sagemaker's workflow
How AWS Sagemaker Neo works | Source

Here are a few model compression techniques that one must know.

Model quantization

Model quantization (MQ) is a technique used to reduce the memory footprint and computation requirements of an LLM. MQ essentially transforms the model parameters and activations with lower-precision data types. The goal of model quantization is to improve the efficiency of LLM during inference by reducing the memory bandwidth requirements and exploiting hardware-specific optimizations optimized for lower-precision arithmetic.

PyTorch offers model quantization, their API involves the reduction of model parameters by a factor of 4, while the memory bandwidth required by the model by the factor 2 to 4 times. As a result of these improvements, the inference speed can increase by 2 to 4 times, owing to the reduction in memory bandwidth requirements and faster computations using int8 arithmetic. However, the precise degree of acceleration achieved depends on the hardware, runtime, and model used.

There are several approaches to model quantization for LLMs, including:

Model quantization can be challenging to implement effectively, as it requires careful consideration of the trade-offs between reduced precision and model accuracy, as well as the hardware-specific optimizations that can be leveraged with lower-precision arithmetic. However, when done correctly, model quantization can significantly improve the efficiency of LLM inference, enabling better real-time inference on large-scale datasets and edge devices.

  1. Post-training quantization: In this approach, the LLM is first trained using floating-point data types, and then the weights and activations are quantized to lower-precision data types post-training. This approach is simple to implement and can achieve good accuracy with a careful selection of quantization parameters.
  2. Quantization-aware training: Here, the LLM is quantized during training, allowing the model to adapt to the reduced precision during training. This approach can achieve higher accuracy than post-training quantization but requires more computation during training.
  3. Hybrid quantization: It combines both post-training quantization and quantization-aware training, allowing the LLM to adapt to lower-precision data types during training while also applying post-training quantization to further reduce the memory footprint and computational complexity of the model.
Model Pruning

Model pruning (MP) is again a technique used to reduce the size and computational complexity of an LLM by removing redundant or unnecessary model parameters. MP is to improve the efficiency of LLM inference without sacrificing accuracy.

MP involves identifying and removing redundant or unnecessary model parameters using various pruning algorithms. These algorithms can be broadly categorized into two categories:

  1. Weight pruning: In weight pruning, individual weights in the LLM are removed based on their magnitude or importance, using techniques such as magnitude-based pruning or structured pruning. Weight pruning can significantly reduce the number of model parameters and the computational complexity of the LLM, but it may require fine-tuning of the pruned model to maintain its accuracy.
  2. Neuron pruning: In neuron pruning, entire neurons or activations in the LLM are removed based on their importance, using techniques such as channel pruning or neuron-level pruning. Neuron pruning can also significantly reduce the number of model parameters and the computational complexity of the LLM, but it may be more difficult to implement and may require more extensive retraining and maybe fine-tuning to maintain accuracy.

Here are a couple of approaches to model pruning:

  1. Post-training pruning: In this approach, the LLM is first trained using standard techniques and then pruned using one of the pruning algorithms. The pruned LLM is then fine-tuned to preserve its accuracy.
  2. Iterative pruning: Here, the model is trained using standard training techniques and then pruned iteratively over several rounds of training and pruning. This approach can achieve higher levels of pruning while preserving accuracy.

You can explore this Colab notebook by PyTorch to better understand MP. 

Model distillation

(MD) is a technique used to transfer knowledge from an LLM called a teacher to a smaller, more efficient model called the student. It is used in the context of model compression. In a nutshell, the teacher model provides guidance and feedback to the student model during training. See the image below.

Illustration of DistilBERT’s distillation process
DistilBERT’s distillation process | Source

MD involves training a student, a more efficient model to mimic the behavior of a teacher, more complex LLM. The student model is prepared using a combination of labeled data and the output probabilities of the larger LLM. 

There are several approaches to model distillation for LLMs, including:

  1. Knowledge distillation: In this approach, the smaller model is trained to mimic the output probabilities of the larger LLM using a temperature scaling factor. The temperature scaling factor is used to soften the output probabilities of the teacher model, allowing the smaller model to learn from the teacher model’s behavior more effectively.
  1. Self-distillation: In this approach, the larger LLM is used to generate training examples for the smaller model by applying the teacher model to unlabeled data. The smaller model is then trained on these generated examples, allowing it to learn from the behavior of the larger LLM without requiring labeled data.
  1. Ensemble distillation: In this approach, multiple smaller models are trained to mimic the behavior of different sub-components of the larger LLM. The outputs of these smaller models are combined to form an ensemble model that approximates the behavior of the larger LLM.

Optimizing hardware and software requirements

Hardware is an important area when it comes to deploying LLMs. Here are some useful steps you can take for optimizing the hardware performance:

  1. Choose hardware that matches the LLM’s requirements: Depending on the LLM’s size and complexity, you may need hardware with a large amount of RAM, high-speed storage, or multiple GPUs to speed up inference. Opt for hardware that provides the necessary processing power, memory, and storage capacity, without overspending on irrelevant features.
  1. Use specialized hardware: You can use specialized hardware such as TPUs (Tensor Processing Units) or FPGAs (Field-Programmable Gate Arrays) that are designed specifically for deep learning tasks. Similarly, accelerated linear algebra or XLA can be leveraged during inference time. 

Although such hardware can be expensive, there are smart ways to consume them. You can opt for charge-on-demand for the hardware used. For instance, elastic Inference from AWS Sagemaker helps you lower your cost when the model is not fully utilizing the GPU instance for inference. 

  1. Use optimized libraries: You can use optimized libraries such as TensorFlow, PyTorch, or JAX  that leverage hardware-specific features to speed up computation without needing additional hardware. 
  1. Tune the batch size: Consider tuning the batch size during inference to maximize hardware utilization and improve inference speed. This inherently reduces the hardware requirement, thus cutting the cost. 
  1. Monitor and optimize: Finally, monitor the LLM’s performance during deployment and optimize the hardware configuration as needed to achieve the best performance.

Cost efficient scalability

Here’s how you can scale your large NLP models while keeping costs in check:

  1. Choose the right inference option, that scales automatically like the serverless inference option. As it will reduce the deployment cost when the demand is less. 

A rigid architecture will always occupy the same amount of memory even when the demand is low thus the deployment and maintenance costs will be the same. On the contrary, a scalable architecture can scale horizontally or vertically to accommodate an increased workload and go back to its original configuration when the model lies in a dormant state. Such an approach can reduce the cost of maintenance whenever the additional nodes are not being used. 

  1. Optimize inference performance, by using hardware acceleration, such as GPUs or TPUs, and by optimizing the inference code.
  1. Amazon’s Elastic inference is yet another great option as it reduces the cost by up to 75% because the model no longer has extra GPUs to compute for inference. For more on Elastic inference, read this article here

Cutting energy costs

  1. Choose an energy-efficient cloud infrastructure, that uses renewable energy sources or carbon offsets to reduce the carbon footprint of their data centers. You can also consider choosing energy-efficient GPUs. Check out this article by Wired to understand more. 
  1. Use caching which helps reduce the computational requirements of LLM inference by storing frequently requested responses in memory. This can significantly reduce the number of computations required to generate responses to user requests. It also helps in addressing bandwidth issues as it reduces the time to access data. You can store frequently accessed data in cache memory so that it can be quickly accessed without the need for additional bandwidth. This allows you not to opt for additional storage and memory devices. 

Deploying large NLP models: other useful tips

Estimating the NLP model size before training

Keeping your model size in check could in turn keep your infrastructure costs in check. Here are a few things you can keep in mind while getting your large NLP model ready.

  1. Consider the available resources: The size of the LLM for deployment should take into account the available hardware resources, including memory, processing power, and storage capacity. The LLM’s size should be within the limits of the available resources to ensure optimal performance.
  2. Fine-tuning: Choose a model with optimal accuracy and then fine-tune it on a task-specific dataset. This step will increase the efficiency of the LLM and keep its size from spiralling out of control.
  3. Consider the tradeoff between size and performance: The LLM’s size should be selected based on the tradeoff between size and performance. A larger model size may provide better performance but may also require more resources and time. Therefore, it is essential to find the optimal balance between size and performance.

Use a lightweight deployment framework

Many LLMs are too large to be deployed directly to a production environment. Consider using a lightweight deployment framework like TensorFlow Serving or TorchServe that can host the model and serve predictions over a network. These frameworks can help reduce the overhead of loading and running the model on the server thereby reducing the deployment and infrastructure costs.

Post-deployment model monitoring

Model monitoring helps optimize the infrastructure cost of deployment by providing insights into the performance and resource utilization of deployed models. By monitoring the resource consumption of deployed models, such as CPU, memory, and network usage, you can identify areas that can help you optimize your infrastructure usage to reduce costs. 

  • Monitoring can identify underutilized resources, allowing you to scale back on unused resources, and reducing infrastructure costs. 
  • Monitoring can identify resource-intensive operations or models, enabling organizations to optimize their architecture or refactor the model to be more efficient. This can also lead to cost savings. 

Check also

Tips and Tricks to Train State-Of-The-Art NLP Models

Key takeaways

  • 1 Set a budget.
  • 2 Calculate the size of the model.
  • 3 Use model compression techniques like pruning, quantization, and distillation to decrease the memory and computation required for deployment.
  • 4 Utilize cloud computing services like AWS, Google Cloud, and Microsoft Azure for cost-effective solutions with scalability options.
  • 5 Leverage serverless computing for a pay-per-use model, lower operational overhead, and auto-scaling.
  • 6 Leverage serverless computing for a pay-per-use model, lower operational overhead, and auto-scaling.
  • 7 Optimize hardware acceleration, such as GPUs, to speed up model training and inference.
  • 8 Regularly monitor resource usage to identify areas where costs can be reduced, such as underutilized resources or overprovisioned instances.
  • 9 Continuously optimize your model size and hardware to cost-efficient inference. 
  • 10 Update the software and security patch to ensure safety. 


In this article, we explored the challenges we face when deploying an LLM and the inflated infrastructural cost associated with them. Simultaneously, we also addressed each of these difficulties with the necessary techniques and solutions. 

Out of all the solutions we discussed, a couple of things that I would recommend the most when it comes to reducing infrastructure cost while deployment is elastic and serverless inference. Yes, model compression is good and valid, but when the demand is high, even the smaller model can act like a larger model, thus increasing the infrastructural cost. Thus, we need to have a scalable approach and pay-per-demand service. That’s where these inference services get handy. 

It goes without saying that my recommendation might not be the most ideal for your use case, and you can pick any of these approaches depending on the kind of problems you are dealing with. I hope what we discussed here will go a long way in helping you cut down your deployment infrastructure costs for your large NLP models. 


  1. Large Language Model Training in 2023
  3. Top 10 AI Chip Makers of 2023: In-depth Guide 
  5. LLaMA: A foundational, 65-billion-parameter large language model
  13. Jaime Sevilla et al. (2022), “Estimating Training Compute of Deep Learning Models”. Published online at Retrieved from: ‘‘ [online resource]
  21. Improving Language Model Behavior by Training on a Curated Dataset
  25. Large Language Models Can Self-Improve

Was the article useful?

Thank you for your feedback!