Deploying ML Models on GPU With Kyle Morris

27th July, 2023

This article was originally an episode of the MLOps Live, an interactive Q&A session where ML practitioners answer questions from other ML practitioners. 

Every episode is focused on one specific ML topic, and during this one, we talked to Kyle Morris from Banana about deploying models on GPU. 

Here is the written version of the conversation.

You’ll learn about:

  • Things to know before starting out with deploying GPUs
  • GPU tools and their utility
  • Common misconceptions around GPUs
  • Getting the most out of GPUs
  • Best GPU use-cases
  • Managing trade-offs with GPUs
  • …and more.

Sabine: Hello, everyone, and welcome to MLOps Live. This is an interactive Q&A session with our guest today, Kyle Morris, who is an expert in today’s topic: deploying models on GPU. Kyle, to warm you up a little bit, how would you explain deploying models on GPU in one minute?

Kyle Morris: Yes, good question.

I would say it feels similar to deploying a website in the early 2000s. When I started my career, there was no infrastructure to do it. We’ve built a lot of tooling to get websites live and be able to handle things like scalability and balancing out traffic, deploying in different countries. Once you’re doing stuff with GPUs, that infrastructure doesn’t exist. I would call it a very similar underlying infrastructure until you dig into the weeds and realize that it hasn’t been built yet.

When you ask yourself why, you realize that there are a lot of unique latency edge cases. I’d say the name of the game with GPUs is latency. Slow boot times, slow inference if you’re doing machine learning, and things like that are what has blocked the market. The lower you get latency in a production setting, and the cheaper you’re able to host, the more accessible it is to the world. Cost and speed are the two things blocking productionizing GPU-based systems. I’d say that’s the main difference from traditional systems.

Deploying GPUs: things to know before starting out

Stephen: Just to follow up on the question, what are the three things you wished you knew while you were starting out? Things around deploying GPUs, and generally just working with GPU-based models?

Kyle: There are two things I wish I knew when I started. One was containerization, being really proficient at that. Using Docker, Kubernetes, Pulumi, setting up, and being able to containerize apps that rely on GPUs. Knowing how to set up CUDA drivers, there’s a lot of time that you will sink into doing that. Say you’re an early startup, and you’re trying to deploy a production application, and you’re not familiar with those tools, you have almost no hope with GPUs.

The second thing, which I feel is really unintuitive but interesting, is that the applications you use to do machine learning are primarily built for training. PyTorch and TensorFlow are research and development tools, and with Flask and a lot of the servers you’ll set up, you’ll be in a dev mode. The big misconception people have is they go to production with tools that are meant for dev, and they don’t realize…

For a concrete example, consider the number of times we’ve had customers give us production applications where they haven’t enabled the GPU. They’ve launched on a GPU machine, but they’re actually just using the CPU, and they don’t even realize it. Then when you go under the hood, you find the PyTorch script is running on a CPU. Those are the little things, but then it goes deeper, and you realize that these tools haven’t been built to run these things fast. That’s the whole angle I’ve been coming in on: realizing we need a whole new suite of tools to deploy production GPU applications.
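Editor's note: the "launched on a GPU but running on the CPU" mistake can be caught with a small startup check. The sketch below is only an illustration, not something from the talk. The helper names `devices_in_use` and `require_gpu` are hypothetical, and the code assumes a PyTorch-style model whose `parameters()` yield tensors with a `.device.type` attribute.

```python
def devices_in_use(model):
    """Return the set of device types ("cpu", "cuda", ...) holding the model's parameters."""
    return {param.device.type for param in model.parameters()}


def require_gpu(model):
    """Fail loudly at startup instead of silently serving from the CPU."""
    devices = devices_in_use(model)
    if devices != {"cuda"}:
        raise RuntimeError(
            f"expected all parameters on 'cuda', found: {sorted(devices)}"
        )
    return model
```

With PyTorch, this would typically be called right after moving the model, for example `require_gpu(model.to("cuda"))`, and `torch.cuda.is_available()` reports whether a CUDA device is visible at all.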

GPU tools used for deployment

Stephen: You talked about most of these tools being optimized for training themselves. I just wanted to touch on those tools a little bit, just before we dig deeper into some other areas. What are those tools you use for deployment, for example? How relevant are they? Are there principles I have to learn just to think about deploying on GPUs? 

Kyle: I think, as in any kind of programming, understanding the hardware you’re running on is really important. I’m a robotics engineer by trade. I worked on autonomous cars before this, and if you don’t understand how the hardware executes, you can’t take advantage of the speed optimizations. That ties in with people not realizing they’re deploying on the wrong hardware in the first place.

A GPU machine on GCP or AWS also has a CPU on it. People don’t realize they’re not even taking advantage of the system. Then once you are, you need to realize that GPUs have a different way of reading, writing, and using memory. You can do a lot of things to speed up models significantly more than you would in a training setting. I think having a basic understanding of how GPU memory works versus CPU memory, just a basic operating system understanding, will give you a new lens. It’s not all just compute under the hood. There’s a different abstraction you’ll want to dig into if you’re seriously trying to stress test production machine learning or other GPU applications.

GPUs vs CPUs in production

Stephen: Right. In terms of computes, what are the biggest differences you’d say teams should know about in terms of when they’re deploying on GPUs, for example, versus CPUs? 

Kyle: Yes. I think the biggest thing is the GPU drivers you’re using and understanding whether they are optimized for an ML setting. Are you actually loading to and from a GPU quickly? For example, in the work I’ve done, I’ve taken models that take 30 minutes to load, and I’ve optimized them to load in 10 seconds. Then people say, “That’s impossible.” And I say, “No, you just haven’t gone under the hood and realized that you’re loading it in sequence like you would on a CPU, and GPUs don’t work that way. You can do these new things.” I really think an understanding of the memory interface is a big difference.

When you look at a computer, you have a processing unit and an operating system that reads from memory. Between the GPU and the CPU, the fundamental difference is parallelism in instructions, so it comes down to understanding whether you’re actually getting full use of that parallelism. I’d recommend digging into tools that let you visualize GPU usage, monitoring software basically, to see whether you are actually taking advantage of this expensive hardware you’re using.

Again, to attach stats to it: I’m in the ML hosting space, and I’ve worked with dozens of people, and 90% of them aren’t utilizing more than half of the GPU, and they don’t realize it. Even experts, like people coming from Tesla, Cruise, and car companies, are still just not taking advantage of it. Most CPU tooling has been built to take advantage of the CPUs we’ve had for many years. With GPUs, there hasn’t been a lot of production machine learning yet. It’s just starting to become a big thing, so that’s where the big advantage is.

Common misconceptions around GPUs

Stephen: Are there other mistakes you see that people make, especially making sense of how they utilize or do not utilize their deployments on the GPUs?

Kyle: What I said earlier about not actually using the GPU is the biggest, most expensive mistake. People will auto-scale up to 10 GPUs to handle the traffic. They’ll come to me and say, “Hey, I need to make inference faster.” Then I’ll realize they’re using a CPU under the hood even though they’ve deployed on a GPU machine, and it happens that the CPU on that machine is faster, so they get something like a 20% speed-up. They think that’s the power of the GPU. Then we’ll say, “No, actually run this on the GPU properly and enable parallelism at a lower level.”

I think the other blocker is the Python Global Interpreter Lock, if you’re not familiar with it. It’s really specific, but it blocks a lot of ML applications, because most machine learning applications are written in Python right now, and the lock prevents true parallelism across threads. If you figure out how to port something to C++ or another language (I haven’t worked much with Golang for machine learning or infra), you can have true parallelism. I’m talking 30x speed-ups, not 2x or 3x. I mean 10x, 30x, 32x specifically, based on GPU sizes.

You can do a lot of really cool stuff. It takes work, but yes, that’s a big blocker under the hood that people don’t usually realize. Then they don’t ever get low enough to recognize it. They just think, “Oh, the hardware must be slow”. I think a common mistake is people blame the hardware too soon.

They don’t realize that GPU hardware is pretty powerful if you know how to hack it. If you’re a new engineer, the typical thing I’ll see is they’ll try CPU and be like, “I need faster. Throw it on a GPU.”

Either they realize they’re not even using it, or they don’t realize, and then they’ll say, “I need faster, use a TPU.” They’re paying for $3,000-a-month hardware, and they’re throwing out-of-the-box stuff on it. I guess the point is, if you dig in, there are 10x to 30x improvements you can make. It takes work, and that’s why I’m trying to democratize it and start building the tooling there. That’s my focal point: I’m doing it so others don’t have to. If you’re way ahead on the frontier, say, you’re working at a big ML company, and there’s nothing available, that’s where you can create a lot of value.

Getting the most out of GPUs

Stephen: I’ll go a little deeper into that just before going to the community questions. The main goal of using a GPU, of course, is to speed up inference time. What are the first things you would do when you want to optimize your deployments, whether from the software side or the hardware side?

Kyle: Yes. The first thing is understanding what hardware you’re running on and what the fundamental limitations of that hardware are. Then two, understand how much of it you’re using and know how to use debuggers. On a CPU, for most code, if you’re doing Python, you can do step-through debugging. Once you get down to CUDA, there’s this level where a lot of people just stop looking under the hood. They say, “Oh, that’s CUDA. It’s not my thing.” Maybe this community is a bit different.

Give yourself a weekend to just say, “I’m going to be a GPU engineer,” and dive into CUDA. Don’t attach it to an outcome, just force yourself to get in the weeds a little. Learn how CUDA allocates memory and how to profile it, even if only with the most basic command-line tool, nvidia-smi. Then understand how you do the equivalent of step-through debugging on a GPU and see how much memory is being used.

A good first step, say you’re running ML inference, is a really common check. If I gave this to everyone and said, “Go try this,” half of you would probably find a way to make better use of your GPU. Run your application, let’s say it’s an ML server, and look at how much GPU memory is in use. Then ask yourself, “Am I using the full GPU in the first place?” Then realize that you can load things in parallel. That’s a huge advantage.
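Editor's note: one way to run that check from a script is to query nvidia-smi while the server is under load. This is a minimal sketch, not a tool mentioned in the talk. The function names are hypothetical, and it assumes `nvidia-smi` is on the PATH and that every queried field comes back as a plain number (some devices report `[N/A]` for certain fields, which this parser does not handle).

```python
import subprocess

# Fields requested from nvidia-smi's --query-gpu interface.
QUERY = "index,memory.used,memory.total,utilization.gpu"


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, used_mib, total_mib, util_pct = (f.strip() for f in line.split(","))
        stats.append({
            "index": int(index),
            "memory_used_mib": int(used_mib),
            "memory_total_mib": int(total_mib),
            "memory_used_fraction": int(used_mib) / int(total_mib),
            "utilization_pct": int(util_pct),
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed per-GPU statistics."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(output)
```

A low `memory_used_fraction` or `utilization_pct` while the server is handling traffic suggests there is headroom for batching or for loading more onto the same GPU.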

Then you’ll start realizing that a lot of infrastructure blocks memory virtualization, so it’s actually hard to do that. There are, again, these deep technical problems. I’m not saying these are easy, quick wins, but if you’re really trying to push big changes on these systems, that’s where you’d start. Just understanding: am I using all this GPU capacity, or is there more available? Then you can start doing batch inference, so you can adjust your model. Hopefully it’s okay that I’m using very ML-centric terms. I assume that’s the audience. If you’re doing ML workloads, you can start batch processing inputs and go from there.

For example, I’ve had models that take 3GB of RAM. I start batch processing, basically doing inferences in parallel, and now I’m using 12GB of RAM. It’s still one GPU, except it’s delivering higher throughput. That means 4x lower cost per inference and so on. That all starts with understanding how to debug what you’re doing: how to actually open it up, see what’s going on, and then just ask questions.
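Editor's note: the 4x figure follows from simple arithmetic if a batch of four takes about as long as a single inference. That equal-latency assumption, the $2/hour price, and the 0.5-second latency below are illustrative values, not numbers from the talk.

```python
def cost_per_inference(gpu_cost_per_hour, batch_size, seconds_per_batch):
    """Dollar cost of one inference when `batch_size` inputs share one forward pass."""
    inferences_per_hour = batch_size * 3600 / seconds_per_batch
    return gpu_cost_per_hour / inferences_per_hour


# Same GPU, same latency per forward pass, four inputs per pass instead of one.
single = cost_per_inference(gpu_cost_per_hour=2.0, batch_size=1, seconds_per_batch=0.5)
batched = cost_per_inference(gpu_cost_per_hour=2.0, batch_size=4, seconds_per_batch=0.5)
print(f"cost ratio: {single / batched:.1f}x")  # prints "cost ratio: 4.0x"
```

In practice a larger batch usually takes somewhat longer per pass, so the real saving would be measured rather than assumed.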

The better you can debug, the better tooling you have for debugging, and the quicker these obvious things will jump up. People are smart, you’ll see what’s wrong. But if you don’t have a debugging tool, everything’s just a black box. Then GPU is just this amorphous box that does things with numbers. You just won’t be able to take advantage of the performance they can offer.


Navigating through current ML frameworks

Stephen: Right. We’re going to come back to the understanding and tooling a bit later on. I think we have a couple of questions from the community.

Sabine: Yes. Kyle, you definitely touched upon this already. Pietra, in chat, also notes that before ML frameworks like TensorFlow, you had to go really low-level and code in native CUDA. It was harder and took way more time to develop, but you had a lot of control and the ability to optimize your code. He wants to know, with the current frameworks, what can you really do to make the inference faster? What are typical bottlenecks, and what can you do about them?

Kyle: Yes, I can say that you definitely can. I’m working on a project that is basically trying to optimize ML for production. One example is that we went in and made GPT-J, a really big transformer model, load almost 100 times faster, purely by opening up libraries, hacking stuff under the hood, and asking simple questions like “How are we allocating memory? Is this the best we can do? What is this operating system allowing us to do?” So, you definitely can. And yes, I agree. When I started programming, I was writing gradient descent from scratch in C++. I get that these new abstractions have removed that.

So you stop asking yourself those questions, but you can do a lot from the Python angle. There’s a certain level where you have to go down below the Global Interpreter Lock into C++ again to enable true parallelism, but you definitely can. I think the difference is between training optimizations and production inference. Those are very different categories of optimizing models. Training is more about parallelizing across machines: how do you converge on a set of weights faster? Whereas production inference is about how you do a forward pass.

Firstly, training involves a backward pass, while production is only a forward pass. It’s a very similar thing, except in production you’re trying to minimize how much compute you’re using. In training, you’re asking how to maximize compute parallelism in order to train faster. You just have to do a bunch of these things at once, so training is more about throughput optimization. In production, it’s about individual call latency and price per call. There’s a whole suite of things there. Just think about how many conversations you’ve had about training optimization versus production. People haven’t really explored it a lot yet. I see there’s a second question, but…

Are GPUs suitable for real-time inference?

Sabine: Yes, exactly. This second question is exactly about using GPU for inference in production. The question is, is it only for batch predictions, or can you really use it for near real-time? For example, REST API endpoint using GPU under the hood.

Kyle: Yes, I’m assuming again that this means machine-learning inference and considering using GPUs in production. The answer depends on your use case. I’ve taken maybe 50-100 projects from CPU to GPU, and the average speed-up you’ll see is 3-8x on inference just from running something on a GPU with an existing framework. You’ll see lower end-to-end latency, and you can use something like a REST API.

What I have built under the hood a bunch of times is basically an SDK that calls a backend, which is just a REST API. It gets a task ID and polls it, so you follow up on that task. You register it as a long-running thing in a database. Then you check the status of that task, and the inference server updates the task. You have a middle layer that routes across.
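Editor's note: the submit-then-poll flow described above can be sketched in a few lines. This is an in-process illustration under stated assumptions, not Banana's SDK: the `TaskStore` class, `submit`, `wait_for`, and `fake_inference` are all hypothetical names, a dictionary stands in for the database, and a background thread stands in for the inference server.

```python
import threading
import time
import uuid


class TaskStore:
    """Stand-in for the database that tracks long-running inference tasks."""

    def __init__(self):
        self._tasks = {}
        self._lock = threading.Lock()

    def create(self, payload):
        task_id = uuid.uuid4().hex
        with self._lock:
            self._tasks[task_id] = {"status": "queued", "payload": payload, "result": None}
        return task_id

    def update(self, task_id, **fields):
        with self._lock:
            self._tasks[task_id].update(fields)

    def get(self, task_id):
        with self._lock:
            return dict(self._tasks[task_id])


def fake_inference(payload):
    """Placeholder for a model call that takes a while."""
    time.sleep(0.05)
    return payload.upper()


def submit(store, payload, infer=fake_inference):
    """Register the task, start the work in the background, return the task ID at once."""
    task_id = store.create(payload)

    def run():
        store.update(task_id, status="running")
        try:
            store.update(task_id, status="done", result=infer(payload))
        except Exception as exc:  # record the failure so pollers stop waiting
            store.update(task_id, status="failed", error=str(exc))

    threading.Thread(target=run, daemon=True).start()
    return task_id


def wait_for(store, task_id, interval=0.01, timeout=5.0):
    """Client side: poll the task status until it finishes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        task = store.get(task_id)
        if task["status"] in ("done", "failed"):
            return task
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
```

In a real deployment, `submit` and the status lookup would sit behind REST endpoints, and the client would poll over HTTP instead of reading the store directly.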

It’s definitely faster with GPU. The big thing is the price. If you’re doing something like a chatbot, most people end up needing GPU inference because you can’t… If a customer sends a message and it takes 20 seconds to respond, that’s way different for retention than three seconds, five seconds. The question here is, should you consider using GPU for production? Yes. The big thing you need to look at is your cost then. Do you have an application where this is cost enabled?

I don’t want to shamelessly plug too much, but the big thing I’m doing at Banana, the project I’m working on, is making GPU hosting cheaper, full disclosure. What I can say is where we’re blocked. Say you have a GPU application, and even with serverless GPUs, where you can shut them off when you’re not using them, the cost of having them on at all outweighs the value of the customer interaction. Then you wouldn’t want to use GPUs yet.

For example, somebody comes in, they’re chatting with your chatbot, you’re using a GPU, and their chat is 10 minutes. If that 10 minutes of GPU cost is more than that customer creates or pays for your business, then you have negative unit economics. I think the main blocker isn’t that GPUs are expensive, it’s that people have them always on. The main blocker is that always-on GPUs make your unit economics crappy. You need serverless to… you could probably reduce it a lot.
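Editor's note: the unit-economics argument can be made concrete with a small calculation. The functions below are a hypothetical sketch. They assume a flat hourly GPU price and ignore cold-start time and per-request overhead, and any numbers plugged in are illustrative rather than taken from the talk.

```python
HOURS_PER_MONTH = 730  # average month length, used only for illustration


def always_on_cost_per_session(gpu_cost_per_hour, sessions_per_month):
    """The GPU bills for every hour, so idle time is spread over the sessions served."""
    return gpu_cost_per_hour * HOURS_PER_MONTH / sessions_per_month


def serverless_cost_per_session(gpu_cost_per_hour, session_minutes):
    """The GPU bills only while a session is active (cold starts ignored)."""
    return gpu_cost_per_hour * session_minutes / 60


def margin(revenue_per_session, cost_per_session):
    """Negative margin means each customer interaction loses money."""
    return revenue_per_session - cost_per_session
```

For a 10-minute chat, `margin(revenue, serverless_cost_per_session(price, 10))` going negative is the situation described above, where the GPU time costs more than the customer brings in.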

I’d consider those different tradeoffs. Should you consider GPUs? Definitely, it’s faster, and customers love faster experiences. Here’s an edge case where you might not want to use a GPU: batch workloads where the customer doesn’t need an immediate response. Say, on Friday night you run a cron job, or whatever they’re called now, a basic automated job that batches through a bunch of data. CPUs can be significantly cheaper there. That might be a good option for you, but once speed’s a factor, GPUs should be considered for ML for sure.

Tips for real-time GPU usage

Stephen: Yes, I’ll just follow up on that question. We had a question from a community member. This particular team used to run GPU inference through their batch workflow. Now they’re moving to a more real-time workflow because that’s how the business has moved, and that’s what serves their users a lot better. The team is now asking what tips you have for them in terms of cost savings when using GPUs in real time, as well as resource optimization. I mean, that’s what we’ve been discussing so far, but maybe something more specific.

Kyle: Sorry. First, I want to clarify what you mean by batch, because there are two types of batching. There’s batching where your calls come in at a certain time of the week, and you just process them all at once. The second type of batching is parallel inferencing, where you’re actually sending a larger payload into your neural network. I’m assuming this means the latter, where you’re processing multiple inferences at once by sending in a larger payload. Then the question is, if you want to spread that out into individual calls instead of one huge clunky call, how do you make it real-time? That’s my understanding of the question.

The main thing I’d consider if you want to speed up is rewriting kernels for your use case. The main underlying operation a GPU does for a network is a kernel such as a matrix multiplication, a lot of the time. Again, there’s this misconception that the tools you use are perfectly optimized, and once you’re in production inference, you need to challenge that assumption a lot more.

You need to realize that there’s room if you go underneath. For example, I’ve gone into state-of-the-art libraries and looked at the kernels they run on GPUs. I’ve said, “Hey, there’s a better way to do matrix multiplication here,” and then sped up the inference 2-3x. I’d say that’s the immediate best way. Because you’re trying to speed up a single call, optimizing kernels like multiplication is the best way to go, or experimenting with hardware that has a faster clock cycle, anything like that, because if you’re not doing batch, then those are kind of your options.

Even a CPU is worth considering if you are willing to go down the JAX path, start doing just-in-time compilation, and look at the hardware you use, but I think kernel rewriting would be the lowest-hanging fruit if you wanted a 2x speed-up. That may enable real-time for you.

Best GPU use-cases

Sabine: Yes, we do have a question about these use cases for GPUs. Where do you see the best use cases? Obviously, ML training and inference, especially neural nets, but is there also a benefit for some data processing workflows or things like that? Joshua is asking. What are the best use cases?

Kyle: In my opinion, and again, I don’t know the ground truth. I would love for you to go discover what you think is the best, because there’s a lot to learn. What I’ve seen is that video processing is the next best, judging by our customers. They come to us for video processing, and they’re not using machine learning. I can’t get into too much detail, but they’re basically processing large videos and doing things like searching. There are no neural networks, it’s just traditional processing, but you can do that really well on a GPU because you can start batching frames, which can be 10-30 times faster than on a CPU.

I’m trying to think of what else. Robotics perception, which is what I worked on before when I was at Cruise: processing a scene. Again, a lot of it ends up being machine learning under the hood, but basically, processing complex, rich scenes fast is a GPU thing. Robotics perception ends up looking like video, because if you’re building an autonomous car, you’re constantly snapshotting the world, so you’re basically creating a video, but the frames can also be point clouds and other data types.

Anything like that where you need a high frame rate and that is non-ML. ML is predicting the output of something, doing an inference, whereas this is batch processing with traditional methods. Those would be the main things I’ve seen in practice.

Problems with serverless GPUs

Stephen: Speaking of infrastructure, we have one question submitted by the community. This person asks: what do you think of serverless GPUs, and what problems have you found working with them? They are planning to migrate their workflows to serverless, and they’re concerned about their low latency tolerance and the cold start problem with serverless.

Kyle: I mean, this is my passion problem, I’m going to try and answer correctly. Basically, the understanding here is why?

Stephen: What problems have you found with using serverless GPUs pretty much?

Kyle: Yes. The first problem is that they’re not really offered out there, so that’s why I’m trying to build a reliable solution for them. The main reason they don’t exist, from my understanding, is the difficulty of cold starting a model. If you were to go build serverless GPUs, I can explain how your next three months would basically go.

You would start off by having a model and wanting to set up autoscaling or something. The first thing you’re going to find is that provisioning GPUs can take multiple minutes just to get access to something. Once you have a GPU available, loading a typical model into GPU memory can take 5 to 20 minutes using Torch, TensorFlow, or any of the traditional libraries. You’ll need to go in and really optimize model boot time. At Banana, that’s our sole focal point: how do we take booting a model from 20 minutes down to seconds? Basically, if you can do that, and your inference speed stays the same, then you can shut the computer off.

I think the analogy is this: imagine you had a car, and you know that you might have to make a really important trip, say, to a hospital or to a friend’s place, and you don’t know when. You just know that at some point you’re going to have to do it, and that car takes 30 minutes to start. Then you’re probably just going to leave the car on all the time. That’s the problem we have with GPUs: since machine learning workloads take so long to boot, people just leave their GPUs on because it’s the only way to guarantee customers have a good experience.

The main problem with serverless GPUs is that startup latency, and getting rid of it, because then you have the equivalent of an always-on machine: the same good customer experience, except you’re saving 90% or more. We’re on our way to doing that. We’ve got it down 95% to 99%, but that’s still not enough for a lot of applications. Real-time chatbots can be okay if, say, a customer comes in, you can notify the GPU to turn on, and you have 5 or 10 seconds. But if your application needs sub-second, it’s just not there yet. It’s extremely hard.

I can explain why it’s hard. If you want to boot a machine in under a second, you need to be able to load memory onto a GPU in under a second, and state-of-the-art SSDs load at a rate of 3-4GB/s. So there’s a theoretical boundary on the ability to load memory onto hardware: it’s not fast enough to put a 20GB model into memory in less than five seconds. We’re trying to look at ways around that constraint. We’re experimenting with sharded loading, loading across multiple GPUs at once, and doing sharded inference, so that each GPU only has to load one second’s worth of the model. Spreading out the layers of the network is basically the problem we’re trying to solve.
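Editor's note: the load-time bound is just model size divided by read bandwidth, and sharding divides the size each device has to read. The sketch below uses hypothetical function names and assumes each shard streams from its own disk at the full 3-4GB/s in parallel, which a real system may not achieve.

```python
import math


def min_load_seconds(model_gb, read_gb_per_s, shards=1):
    """Lower bound on load time when each shard is read in parallel at full bandwidth."""
    return model_gb / shards / read_gb_per_s


def shards_needed(model_gb, read_gb_per_s, target_seconds):
    """Smallest shard count whose per-shard load time fits within the target."""
    return math.ceil(model_gb / (read_gb_per_s * target_seconds))
```

With the figures quoted above, a 20GB model at 4GB/s needs at least 5 seconds on one device, and getting each device down to one second of loading takes 5 shards at 4GB/s or 7 shards at 3GB/s.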

Model maintenance with GPUs

Sabine: Then we have more of a model maintenance question from Piatra. Let’s assume that we have a model in production that is retrained manually by data scientists, potentially a change of network architecture, extra features, stuff like that every three months. Does it mean that the production code has to be rewritten by, for example, ML engineers manually to be optimized for GPU with each update?

Kyle: That’s a really good question. I’d ask: is it just the weights that are changing, or are there other things? ML models have different components, the code and the weights. Basically, you have a network defined in some language, and then you have a bunch of weights loaded onto the network. Typically, if the weights change, you don’t have to change the underlying code. Weights are what changes most frequently. If you fine-tune the model, you don’t have to change the underlying code, and therefore you don’t have to re-optimize.

The second thing is, even if the architecture changes, typically you’ll want to write abstractions that capture a broad set of neural nets. You don’t want to hard-code your optimizations for one net, that’s not usually how it works. For example, Keras is a really high-level abstraction with a layers API. If you’re optimizing for that, you want to make sure that any layers passed in through that API are compiled down into optimized GPU code, or interpreted, depending on what you’re using. Typically, you don’t have to change it a lot.

I’m trying to think of an exception to that, because I think that’d be more valuable than just saying you don’t have to. At Banana, we do a lot of work on booting machines fast. One cool edge case is that if you keep changing the weights, or you have many different sets of weights, you need to think about how to store and load them quickly, so storage becomes an actual bottleneck. That becomes where more of your problems come from: where do I put all these optimized weights, how do I track what needs to go into the model and then cache it, and how do I make that process fast? That’s where the most time will be spent, not actually in your code.

Managing trade-offs with GPUs

Sabine: Thanks. We have another question from Marcelo. As you were saying, there seems to be a contradiction between having low latency when doing inference and reducing costs. If you want to be able to do inference fast enough, the model has to be loaded in memory, and you’re paying for that all the time, even when it’s not used. Is there a workaround to this, basically something that gives both low GPU latency and charges by use?

Kyle: Yes. I think it depends on the end customer experience. I’m assuming everyone’s building for customers. Maybe that’s an incorrect assumption, but somebody’s consuming the output of your model. If your customers need a really quick response, you can do things like predictive availability of GPUs. Basically, speed and cost have a trade-off, and the only way I’ve found to work around it is by having GPUs available only when you need them: serverless GPUs. But then there are two camps. Serverless GPUs still take a few seconds to boot up, so if that’s okay for your application, I would personally just use serverless GPUs. That’s the main way to get a ton of savings.

If your application needs a real-time response, and it’s completely unpredictable when requests are coming, that’s still an unsolved problem, where you basically have to make inference faster without the cost going higher. Again, kernels are one angle. Going down the ASIC path, creating a custom circuit that performs operations on a hard-coded network architecture, is definitely another angle.

I think a simple one would be really trying to predict traffic. Say your application needs to respond quickly, and you have five GPUs that are always on. Can you find a way to have just one always running and have the other four auto-scale when they’re needed? Then you can handle 99% of traffic spikes, except you’re not paying for five always-on machines. That’s where you’ll save the most.

I guess I’m rephrasing the question as: how do you save money without sacrificing latency? That would be the best way I’ve thought of so far. If you want to do it the other way, it’s just really deep R&D, solving these hardware and energy bottlenecks. At Cruise, we were getting into this, going really, really low level, down into the hardware. Most teams probably don’t do that. If anyone here is interested in serverless GPUs, I don’t know the best way to get in contact, but DM me. That’s the thing I’m trying to do, and I would love to help.
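Editor's note: the saving from keeping one GPU warm and scaling the other four on demand depends on how often the extra GPUs actually run. The sketch below is an illustration under stated assumptions: the function names are hypothetical, billing is assumed to be per GPU-hour, and the 10% burst fraction in the usage note is made up for the example.

```python
def fleet_gpu_hours(always_on, burst, burst_fraction, hours=730):
    """GPU-hours billed when `burst` extra GPUs run only `burst_fraction` of the time."""
    return (always_on + burst * burst_fraction) * hours


def savings_fraction(baseline_hours, new_hours):
    """Share of the baseline bill that is no longer paid."""
    return 1 - new_hours / baseline_hours
```

For instance, `savings_fraction(fleet_gpu_hours(5, 0, 0.0), fleet_gpu_hours(1, 4, 0.1))` compares five always-on GPUs with one always-on GPU plus four that run 10% of the time, and comes out at roughly 72% fewer GPU-hours.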

Cloud vs On-Premises GPUs

Sabine: We have a bit of an on-prem versus cloud question from Srikant. How do you look at an On-Premises GPU cluster, managed by NVIDIA AI enterprise software suite in combination with Red Hat OpenShift or VMware Tanzu, over something like AWS stack or Azure stack for the same GPU cluster managed by EKS, for example?

Kyle: I want to make sure I get this. They’re asking specifically about the on-prem experience. In a VPC, a virtual private cloud, having access to your resources, I assume there’s a privacy concern or something, and you’re asking about multiple different clouds. I haven’t worked in-depth with all of them, I’ve used GCP, and AWS and experimented with some others. I think what to optimize for when you’re starting is if you’re using Kubernetes and you have a container abstraction on everything, you can basically change cloud providers fairly easily.

I found there are some headaches, but I’d be looking at the one that provides the abstractions that are easiest to deploy your application on as a first start, because I think the bottleneck most people have early on as a company is not yet latency or cost. It’s just: are they building something valuable for people? Especially if you’re a startup, I’d be optimizing for deploying quickly and learning. Then what I’ve noticed is that at a large scale, a lot of these providers balance out: GCP will cost way more in one area, but then it’ll save you money on this other offering.

For example, I can’t quote exact numbers for Azure, but I know there’s a set of EC2 instances on AWS that are a lot more expensive than the equivalents on GCP. That’s a quick win, but then you go in, and you look at network costs and more, and it just starts normalizing. I’d be asking questions like: which one is your team familiar with? Are you using abstractions that make it easy to be cloud-agnostic? If not, that’s more of a problem.

If you’re not using stuff like Pulumi, I’d be deeply considering that. Pulumi is one of my favorite tools I’ve found in the last year because, basically, you can deploy on AWS or GCP using the same Python code. You don’t have to go into the console, and that’ll reduce a lot of these decision variables. It’s about focusing on the decision that saves you the most time globally, like how do you make your code cloud-agnostic? Then picking a provider isn’t as high stakes. Right? Because really what you’re asking is, oh, all these providers vendor-lock me, and so I have to make this big decision because I know it’s hard to switch.

If you remove that assumption, you make it easy to switch. You can get the best of both worlds. You could do a multi-cloud deployment: you could say, for a high GPU workload, use GCP, which I think is a very good option and would be my recommendation (I’m not sure about Azure), and then if you’re doing something with high network bandwidth or something privacy-centric, AWS has a lot of good offerings like EKS.

Cutting GPU usage costs

Stephen: Right, just before we go into the community, of course, one thing we’ve been talking about a lot is cost savings, especially in terms of GPUs. One of the key barriers to using GPUs, of course, is the price. You don’t want to wake up one day owing thousands of dollars because you simply left an EC2 instance on with your GPUs and everything. What are your top tips for saving costs? We talked about serverless, which is one thing, but are there other things I could do to save costs when playing around with GPUs or deploying models on GPUs?

Kyle: That’s a good question. The easiest? I’m measuring this based on how hard it is to do and how much it saves you. The easiest is what I said earlier, which is to look at how much memory your model takes and put it on a GPU with less memory, if you can. Like that, you just cut your cost in half. That’s pretty amazing.
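
As a rough illustration of "put it on a GPU with less memory", the selection can be written as a lookup over a catalogue. This is a sketch under stated assumptions: the GPU names, memory sizes, prices, and the 20% headroom factor are placeholders, not real quotes or a recommended margin.

```python
def cheapest_gpu_that_fits(model_gb, gpus, headroom=1.2):
    """Pick the cheapest GPU whose memory covers the model plus some headroom."""
    required = model_gb * headroom
    candidates = [g for g in gpus if g["memory_gb"] >= required]
    if not candidates:
        return None  # nothing fits; the model needs shrinking or sharding
    return min(candidates, key=lambda g: g["usd_per_hour"])

# Hypothetical catalogue: names and prices are placeholders.
CATALOGUE = [
    {"name": "gpu-16gb", "memory_gb": 16, "usd_per_hour": 1.0},
    {"name": "gpu-24gb", "memory_gb": 24, "usd_per_hour": 1.5},
    {"name": "gpu-40gb", "memory_gb": 40, "usd_per_hour": 3.0},
]

choice = cheapest_gpu_that_fits(model_gb=12, gpus=CATALOGUE)
print(choice["name"])
```

The `model_gb` figure should come from a measured peak, not just the size of the weights, since a leaking server will creep past whatever headroom is chosen.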

A lot of inference servers have memory leaks in Torch. A lot of Torch code has memory leaks, like more than 50%. What happens is it keeps creeping up in your memory trace. If you’re able to fix stuff like that, you can keep it on a machine with less GPU memory. The other would be, how do you pack multiple models onto one GPU? That’s a problem we’re trying to solve right now. We’re working on it because, under the hood, these tools don’t do that naturally. There’s not an accessible virtualization layer for GPU memory, and so that becomes a headache, but that would be the second bet, the natural thing: how do you put multiple models with different memory needs on the same GPU?
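
Packing several models onto one GPU is, at its simplest, a bin-packing problem. The sketch below uses first-fit decreasing as a stand-in; it is not how Banana or any specific tool does it, the model names and sizes are made up, and it ignores runtime overhead and memory fragmentation.

```python
def pack_models(model_sizes_gb, gpu_memory_gb):
    """First-fit decreasing: place each model on the first GPU with enough room."""
    gpus = []  # each entry tracks free memory and the models placed on that GPU
    largest_first = sorted(model_sizes_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in largest_first:
        if size > gpu_memory_gb:
            raise ValueError(f"{name} does not fit on a single GPU")
        for gpu in gpus:
            if gpu["free"] >= size:
                gpu["free"] -= size
                gpu["models"].append(name)
                break
        else:
            # No existing GPU had room, so start a new one.
            gpus.append({"free": gpu_memory_gb - size, "models": [name]})
    return [gpu["models"] for gpu in gpus]

# Hypothetical models and their memory footprints in GB.
models = {"asr": 6, "ocr": 3, "tts": 5, "ner": 2}
placement = pack_models(models, gpu_memory_gb=8)
print(placement)
```

Here four models that would otherwise occupy four GPUs end up on two.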

Third, and this is if you’re willing to sacrifice performance a little bit, which most teams in my experience aren’t willing to, but if you’re willing to give up 1% to 5% of performance: quantization. That’s reducing the size of your model weights by changing them from float64 or float32 down to an eight-bit float. You can reduce the model size by 2-4x, and you sacrifice performance there, but that’s a really easy win that I’d recommend. I recommend a hybrid approach. Again, I’m speaking to seed-stage and Series A companies with early-stage products, but say you’re doing a demo application.

You need speed. Can you do quantization on the model so it’s like four times cheaper? You just have it always on, and you can offer it as a demo. But then the power users that need that really good quality, the people that are paying you, you give them a different model that isn’t quantized and has that extra 5%. I see a lot of teams do this, where they’ll basically switch which model they’re using depending on the customer type.

I think if you’re smart about that, if you integrate the product mindset with the engineering mindset and not everything is just an engineering problem, it’s also like, what do your customers need? You can start being clever about that. You can say, here’s the cheap model for the front-end person, and then the people that are really forking over money, they get this model. I think if you do each of those things, you can easily cut your costs in half.
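
The size arithmetic behind quantization, and the per-customer model switch, can be sketched in a few lines. Assumptions to note: the 1B-parameter count, the model names, and the tier labels are hypothetical, and `int8` is used here as the 8-bit format because it is a common choice in practice. The sizes count weights only, not activations.

```python
BYTES_PER_PARAM = {"float64": 8, "float32": 4, "float16": 2, "int8": 1}

def model_size_gb(num_params, dtype):
    """Approximate weight storage in GB for a given parameter precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

def pick_model(customer_tier):
    """Send demo traffic to the quantized model and paying users to full precision."""
    return "model-int8" if customer_tier == "demo" else "model-float32"

params = 1_000_000_000  # hypothetical 1B-parameter model
full = model_size_gb(params, "float32")
small = model_size_gb(params, "int8")
print(full, small, full / small)
print(pick_model("demo"), pick_model("paid"))
```

Going from 32-bit to 8-bit weights is the 4x end of the range mentioned above; going from 16-bit to 8-bit is the 2x end.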

Distributed processing with GPUs

Sabine: All right. We have another question from Ricardo, which tools or frameworks would you use to build a composition of models that is an inference request that needs to be processed by many models, possibly running in a distributed environment of different machines. Each of them may be with separate resource constraints, auto-scaling policies, and such.

Kyle: Interesting, model chaining, I’m trying not to be too biased because I have one way I do this with the product I’m building, but I’m trying to challenge myself and think what are the other ways to do it. The simplest abstraction is to like treat each model as a separate entity that you can call with a couple of lines of code abstracted away. They’re not all wired on the same machine if latency–

Let’s look at the case where the goal is a really clean architecture. Then what you would probably want to do is have an orchestration service that is able to send a request to each of these models individually, and it basically does a pipeline of calls. Assuming these are big models, they might each run on a different server, and you would basically want a conveyor belt process that calls the server, gets the response, and plugs it into the next server.

That’s a really clean design because then you can do logging at each step of the pipeline. You’ll avoid the chaos of just running out of GPU memory, and then the optimizations would come in around network latency, right? You’re doing a network call in between all of these when you chain them together. That scales really well because then you can basically auto-scale parts of this chain based on the bottlenecks. That’ll be easier short term. Once you need long-term savings, then the bottleneck is network latency. You’d want that same architecture, but how do you pack models on the same machine? Then you can call both of them on the same server.
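
The conveyor-belt design can be sketched as an orchestrator that calls each stage in turn and logs timing per step. This is only an illustration: the three stage functions are local stand-ins for what would be network calls to separate model servers, and all names are hypothetical.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_pipeline(stages, payload):
    """Call each stage in order, feeding each output into the next stage."""
    timings = []
    for name, call in stages:
        start = time.perf_counter()
        payload = call(payload)
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings.append((name, elapsed_ms))
        # Per-stage logging is what makes the bottleneck visible later.
        log.info("stage=%s took %.2f ms", name, elapsed_ms)
    return payload, timings

# Stand-ins for remote model calls; in practice each would be a network request.
def clean(text):
    return text.strip()

def transform(text):
    return text.upper()

def shorten(text):
    return text[:5]

stages = [("clean", clean), ("transform", transform), ("shorten", shorten)]
result, timings = run_pipeline(stages, "  hello world  ")
print(result)
```

Because the orchestrator only sees named callables, two stages can later be moved onto the same machine without the caller changing anything.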

With Banana, what we have is like a middle layer. You have a Python SDK, or whatever SDK you’re using, with a function that calls this middle layer, which then routes it to a machine. If you have a chain of models, you just call six different run commands and tie them together, and then behind the scenes, you can pack them on the same machine as a performance optimization, but the end-user shouldn’t have to change their code.

I think the biggest headache you’ll get is having to keep changing application code to work with these new chained inference servers. Having an abstraction between the user and this lattice of models will save you a lot of headaches because each time you have to change the application code, that’s a new pull request. That’s new testing, and it really adds up. People underestimate it.

Model inference using GPUs

Sabine: Cool. We have some questions about inference. Gagan wants to know how well inference servers like TFServe, ONNX Runtime, et cetera, help us with Q2. That is, wrapping a REST API around a huge model like Hugging Face transformers, for example.

Kyle: To clarify, how much do these tools help as opposed to just doing that or just like in addition to– if you want to serve a model, basically, how much do these tools help.

Sabine: Yes, I would think so.

Kyle: Yes, I think the area they help a lot is if you’re in an ecosystem where you’re trying to quickly deploy an open-source model, like it’s part of a big repository that already has all this serving infra set up. You can do it really fast. I use a tool called Zeep. It’s not mine, it’s just a cool thing out there that basically lets you take these repos and quickly deploy them. I think TFServe and a lot of these types of tools are familiar from other deployment software. There’s a standard interface, and it’s easier to set it up, whereas if you’re doing a barebones Flask app, you have to think under the hood, like, okay, can this handle load balancing? Things like that.

There’s help there. I think I’m kind of biased because I’ve gotten into the weeds enough. I’ve used stuff like Triton. We’re now starting to, at least at Banana, build out our own stuff under the hood because we’re hitting the natural bottlenecks, so I think maybe how to phrase it is like if you’re new to deploying infra, production, and all, I think it could be really helpful and then if you’re trying to really pack stuff to be faster than anywhere else in the market, you’re trying to go where no one else is going, you can pick these tools.

Ultimately, you’re going to end up beneath them all, and they’re all just like Python, C++ under the hood anyway. It’s all just bits running on a GPU at that point, and eventually, you start realizing that all these frameworks are just little abstractions, and you lose the ability to see them as like tools in the same way. Basically, as a starter, I’d start with them until, in general, I’d be like, if there’s a tool that solves your problem, use the tool instead of building it out, but you’ll probably have to go under the hood eventually if you have any sort of real scalability needs. Then you eventually have to build up the hardware at the companies I’ve been at.

Sabine: Right, so just in case there was anything to add on these inference frameworks in general, we have Ricardo, who is interested in knowing about your opinion on these. Like, for example, RayServe, TorchServe, KFserving, and the one you mentioned NVIDIA Triton inference server, are they useful? Do they just add unnecessary complexity? Are there any of them that are starting to shine over the others? Anything to add there?

Kyle: Yes, I haven’t used all of these, to be honest. I haven’t used RayServe. I’ve used TorchServe, KF, and NVIDIA, so I can speak on those. Personally, I’m a fan of Triton. In a production setting, I’ve seen Triton go pretty far. As in, I’ve seen it scale to teams of hundreds and pretty large production workloads, and I think one thing that helps with tools like that is it’s in the NVIDIA ecosystem, so you have GPU compatibility built in, and they’ve done some of the under-the-hood kernel optimization. They’re working on it, so it’s getting better. I think that would be my default go-to if you want the best GPU compatibility.

TorchServe, haven’t used it in a while, honestly, so I can speak on it based on where I’ve gravitated to. I don’t feel like I have a global enough view to have a hard opinion on what’s better, but I would say Triton is a pretty good place to start for an inference server. Also, I guess I think a big thing is not just the inference server but the infrastructure that it’s deployed on like it’s the big thing.

Cool, you have maybe a barebones Flask app or a Sanic app or something that basically exposes a port, you’re able to make a post request, and we’re doing inference. That’s an inference server. I think what becomes a bigger thing is where are you deploying this. Is this just running on an EC2 instance without auto-restart? Is this on Docker? Is it on Kubernetes? Are you doing cloud-agnostic deployment? That’s where the reliability comes in more than the individual server.
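
For a sense of how little a barebones inference server involves, here is a sketch that exposes a port and answers POST requests. To stay dependency-free it swaps in Python's standard-library `http.server` in place of Flask or Sanic, and `predict` is a placeholder for a real model call; the port number and JSON shape are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(inputs):
    """Placeholder for a real model forward pass."""
    return [x * 2 for x in inputs]

def handle_request(raw_body):
    """Turn a JSON request body into a JSON response body."""
    body = json.loads(raw_body or b"{}")
    return json.dumps({"outputs": predict(body.get("inputs", []))}).encode()

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = handle_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port=8000):
    """Block forever serving requests; there is no restart or scaling logic here."""
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()

print(handle_request(b'{"inputs": [1, 2, 3]}'))
```

Calling `serve()` starts the server. Everything listed above as a reliability concern (auto-restart, containers, orchestration, scaling) lives outside this file.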

I’d almost zoom out and say if you hyper-fixate on these individual servers and you’re not getting a win out of them, then it might be unnecessary complexity. A lot of the applications I’ve deployed in production were literally Torch or TensorFlow: we got the model, and I just wrapped it up in an app. Then, I’ve noticed that the biggest failure modes come from, think about it, not the individual server. Once your server is running a lot, the problems become auto-scaling, cold starting, and load balancing. All those things are outside of the individual server.

If you’re getting a huge return out of it, keep drilling it, honestly, I wouldn’t fixate too much. I feel like the GPU game really becomes a lower-level game. There are two camps. If you’re trying to just deploy something, use a framework, use a tool, get out there, and have the reliability. If you’re trying to optimize a GPU production workload, then you’re going to have to just get in the weeds, and that goes below these frameworks. You’ll be committing pull requests to these probably. We need more people doing that.

Sabine: That’s all right. We did have a follow-up to the question before, any orchestration framework you would recommend for real-time inference?

Kyle: Yes, I’ll obviously plug Banana, because I think that’s one of the things we’re trying to solve. You’d have always-on machines, but like we’re working on a self-service tool, like trying to make it, so you just upload your model and then get back the ability to call it, and it would be real-time in the sense that it’s on a GPU it’s always on, it would cost more. If you’re doing serverless, you basically get extra latency there. I’d want to follow up.

There are some other tools I’ve seen. Again, I would recommend Zeep. If you’re trying to plug together a bunch of inference servers, I find that tool is really helpful. It’s cloud-agnostic. You can just click on a repo, deploy it, and then chain them together. The real-time nature is more about what hardware you’re running on and whether it supports your use case. It’s not really a framework-level thing, in my opinion. Real real-time inference is something you have when your inference is fast enough, like 32 frames a second or whatever. You could do that on many different frameworks. How you wire it in and the speed of your inference are two separate problems.
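
The "32 frames a second" figure translates directly into a per-request time budget. A small sketch of that arithmetic, with hypothetical latency numbers:

```python
def latency_budget_ms(target_fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / target_fps

def meets_real_time(inference_ms, network_ms, target_fps):
    """True if inference plus network overhead fits inside one frame's budget."""
    return inference_ms + network_ms <= latency_budget_ms(target_fps)

budget = latency_budget_ms(32)
print(budget)
print(meets_real_time(inference_ms=20, network_ms=5, target_fps=32))
print(meets_real_time(inference_ms=20, network_ms=15, target_fps=32))
```

The same model can pass or fail the budget depending only on the network hop, which is why the wiring and the inference speed are treated as separate problems.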

Other questions

Sabine: All right. We can take some questions from the chat. We have one from Andres. Do you have a take on cryptocurrency mining in the cloud, would it be feasible to deploy a cryptocurrency mining grid on the cloud and adapt the GPU load based on the complexity of the hashing algorithm?

Kyle: I’m not a crypto expert. I was in the scene early on, but then I’ve been doing ML for the last decade. I can give a little bit of context. Yes, you could. Typically, what I’ve seen is mining ends up not having the biggest ROI, the exception is if you’re doing the same model as Uber, which is you have an unused car, and so you rent it out, if you have an unused GPU and you do that for mining, that could be cool.

I think one use case that’d be really interesting, which I would love, and I’d be a customer if anyone here builds this, is to find people who have GPUs that they forgot to turn off in the cloud. If you forget to turn the GPU off, mine something with it, make use of it. That’s an example. If you just deploy them, I think it’s feasible.

I haven’t drilled into it enough to understand the hash rate and what your ROI is, but I’ve tried the mining scene. Typically, you make something, but it has to be a really deep focus. If you want to do it, technically, with something like Banana, you could deploy a Bitcoin miner. I don’t know what cloud providers think about it, so you’d have to check the terms of service, but I think you can do it in the same way as machine learning.

Sabine: We had a question as well from Marcello. Many times, ML models are available as PyTorch or TensorFlow models, but their performance increases when converting them to TensorRT. Does this always happen, or does it only apply to certain types of neural networks? Do you have a take on that?

Kyle: I don’t know, to be honest. I don’t know the underlying workings enough to say with confidence. I would encourage you to look into it. I don’t want to claim I know this, basically, but I would assume with stuff like recurrent networks, there are going to be better optimizations because you can cache weights for anything with a cyclical structure. But yes, I really wouldn’t know enough. I haven’t worked with the internals of these tools enough. I would just be speculating, and I don’t want to waste anyone’s time with a guess. I’d rather answer if I know.

Sabine: Cool, and I think we have time for one more question, which was from Piotra in the chat. Have you ever experienced any numerical stability problems when you downsize the model by reducing floating precision?

Kyle: That’s a complicated one. I want to make sure I understand the question. By numerical instability, do you mean exploding gradient? Basically, you change the weights, and then suddenly, the output just becomes garbage because of a propagating calculation through changed weights? You’ll get this. You’ll have this as an asymptotic behavior relative to the precision loss. What I’ll notice is that when you go from 64-bit floats to 32, there’s almost no instability, 32 to 16 tends to be okay, 16 to 8, you’ll start seeing it potentially, and then below 8, it just quickly drops off and becomes garbage. I don’t know all the math behind it, but in practice, I would treat it as an asymptotic thing that hits a wall once you approach 8-bit precision.

I think this also depends on network size: the longer the network is, the more multiplications there are, and the larger the chance for error to propagate through lost precision. To give you context, it’s especially a problem in recurrent networks. Maybe I’m using recurrent incorrectly here; I mean networks where you have feedback loops, where you’re running something through multiple times before you pass it to the next layer, doing repeated calculations. Basically, the more multiplications there are, the more the errors will grow.

When you think about it, the difference between 64 and 32 bits is a loss of very deep precision, whereas 8 bits to 7 bits or 8 bits to 4 bits is a huge gap, a huge information loss relative to the other type, so you’ll start seeing it faster. It’s very ad hoc. I’m sure there are papers that show the actual asymptotic bounds on this. I remember seeing some. I don’t think I have a deep enough math understanding or the numbers to let you know exactly when it happens.
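
The way error compounds through repeated multiplications at lower precision can be seen in a toy experiment. This is a sketch, not a model of a real network: it multiplies by one arbitrary weight 20 times, rounds through float32 and float16 using the standard `struct` module, and uses a simple uniform 8-bit quantizer over [-1, 1] as an assumed stand-in for 8-bit weights.

```python
import struct

def round_to(fmt, x):
    """Round a 64-bit Python float through a narrower IEEE format ('f' or 'e')."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def quantize_8bit(x, lo=-1.0, hi=1.0):
    """Uniform 8-bit quantization of x onto 256 levels in [lo, hi]."""
    x = min(hi, max(lo, x))
    step = (hi - lo) / 255
    return lo + round((x - lo) / step) * step

def chained_error(rounder, weight=0.9173, depth=20):
    """Relative error after `depth` multiplications at reduced precision."""
    exact, approx = 1.0, 1.0
    w = rounder(weight)
    for _ in range(depth):
        exact *= weight                # float64 reference
        approx = rounder(approx * w)   # rounded after every multiplication
    return abs(approx - exact) / exact

errors = {
    "float32": chained_error(lambda x: round_to("f", x)),
    "float16": chained_error(lambda x: round_to("e", x)),
    "8-bit": chained_error(quantize_8bit),
}
for name, err in errors.items():
    print(f"{name}: relative error {err:.2e}")
```

In this toy setup the error grows by orders of magnitude at each step down in precision, and raising `depth` makes the gap wider, which matches the point about longer chains of multiplications.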

Wrapping it all up!

Stephen: We’ve spoken a lot about tools today. I think we should bring it all together. Probably tie it all together. What are your key tools for GPU model deployments, the ones you work with and find useful?

Kyle: Basically, it’d be Docker, Kubernetes, and Pulumi. Those would be the three. If there were three tools that you could have on an island and you had to build a production stack, and you just had those three tools, those would be the three I would take. I know there are probably some specific container runtimes coming out that work with Kubernetes, like containerd, and I believe there are some others you might want to explore, but you need container orchestration if you’re trying to do a production workload because you need to be able to wire it into infra that does restart logic and all that. That’d be my very simple answer, those three.

Stephen: I know we are running out of time, but I just have one final question to wrap this all up, to tie everything down. I come to you, I’m trying to optimize my GPU model for like inference. I come to you, and you walk me through the process you would take me through, from just having my model to being able to run real-time inference on GPUs and everything.

Kyle: We’ve got two paths, right? We have self-service, which is in really early MVP right now. You can try it out. I could follow up with a Git repo. Basically, you follow some instructions, and then you can upload your model to our servers. You’ll get back two lines of code that you can call it with. You get an API model key, and that’ll give you serverless GPU speed. The second path is that we help you. For people that don’t want to go through the headache of figuring that stuff out, we can set up a Slack channel, and our team will literally work with you.

We’re still early on, so we want to have that hands-on experience of understanding individual customer pain points. If there’s a way to share a link after, it’s just Banana ML on GitHub. We have a repo available. It’s a fully automated pipeline. There’s not a lot of front-end UX yet, mainly because our team is not a front-end team.

We’re very deep, deep R&D engineers and so if there’s anything you get stuck on, we’d be happy to help, but what I can say we currently offer is typically a 5 to 10x cost reduction because you don’t have to have machines on all the time. We’ve got people in like the autonomous vehicle space, we’ve got audio applications, image generations like GANs, and a lot of different types of applications running. Would love to hear about what you’re working on.

Is there a way for people to reach out? I don’t know, I could drop a LinkedIn, or if you guys have it. But basically, for anyone here, and also informally, I like just helping people with their GPU problems. Do you have any more questions? Just hit me up. I’m happy to be a touchpoint. This is the one thing we’re probably good at. I like using that to help the industry. I think it’s very behind in terms of production right now, and so I could probably save you a few weeks or even months just with a couple of tidbits of advice if it wasn’t communicated in this conversation yet.

Sabine: Excellent. So, LinkedIn would be the place to connect with you? Is that right? And also through Banana and Slack?

Kyle: Yes, you can do that. I’ll have to double-check what my LinkedIn is, but I could drop it in here. I think it’s Kyle J. Morris. You can share it however you want.

Sabine: It is indeed time to wrap it up. Thank you so much, Kyle, for being with us here, it was wonderful to have you.
