Neptune Blog

Your First MLOps System: What Does Good Look Like? With Andy McMahon

Stephen Oladele


25th April, 2025

MLOps

This article was originally an episode of the MLOps Live, an interactive Q&A session where ML practitioners answer questions from other ML practitioners.

Every episode is focused on one specific ML topic, and during this one, we talked to Andy McMahon about your first MLOps system.

You can watch it on YouTube or listen to it as a podcast.

But if you prefer a written version, here it is!

You’ll learn about:

1 What is an MLOps system?
2 What does a good MLOps system look like?
3 How to implement it?
4 How to scale an MLOps system?
5 And much more!

Sabine: We are joined today by our esteemed guest, Andy McMahon, and our topic will be “Your first MLOps system. What does good look like?” Andy, welcome to the show.

Andy McMahon: Thank you so much for having me.

Sabine: Andy, you have an educational background in some interesting stuff: a Master of Science in Simulation of Materials and a Ph.D. in Physics, and then you got more into the machine learning side of things. You have a bunch of experience with data science and ML engineering. Currently, you’re the machine learning engineering lead at NatWest Group, a banking and insurance holding company. You’ve also published a book titled “Machine Learning Engineering With Python”, and you’re doing a podcast called “AI Right”. Is there anything you’re not doing in the space of machine learning?

Andy: Thank you. Sleeping is the main one.

What is an MLOps system?

Sabine: Fair enough. We do hope you get to rest every now and then. Okay, to warm you up, Andy, how would you explain to us MLOps Systems in one minute? We will time you.

Andy: To me, MLOps systems are software solutions that basically allow you to follow good operational practices for machine learning products. What that means, in a sense, is building ML solutions that are:

1 reusable,
2 scalable,
3 and reproducible.

Contained within that are several different sub-practices, some of which are very important, in particular, to machine learning software solutions, like:

Model monitoring,
How do you know your machine learning model is performing at the appropriate performance criteria?
How are you retraining?
How do you trigger retraining?
How often are you retraining?
Are you scheduling it?
Et cetera.

You then also have model management practices. You need to track and manage the metadata associated with your model artifacts and make sure that it is clearly labeled and articulated. Then all of that has to come together in a sustainable set of practices and processes that have a very clear route to live within them, so that you can take machine learning models from ideation through to production. That, to me, is MLOps systems.

Sabine: Excellent. That was just a few seconds over one minute, and it was very nicely encapsulated. Nicely done.

What does a good MLOps system look like?

Stephen: I also like to preface with trying to understand what good looks like because I think it’s one of the key things we’re emphasizing in the title. What is a good MLOps system? Especially when you’re trying to build it for the first time. Let’s start from there.

Andy: I think what’s really important for this is making sure that it makes your life easier.

The worst thing we can do as a community is build MLOps systems and solutions because we feel we have to: “It’s the latest fad or the latest trend, so I should incorporate MLOps tools or put my own MLOps processes in place.” That’s not true. You need to understand that we are solving a particular set of problems fundamentally.

I think what good looks like is when you feel that your MLOps systems and solutions are making your life as a data scientist or an ML engineer easier. We’ll go through, I suppose, through the chat, the iterations that you can go through and how you can start small and scale up.

Fundamentally for me, you’re doing this well if it’s making your life easier. That can manifest in multiple ways, which we can dive into. It could just be that the developer experience is easier, but you also see an uptick in things like the DORA metrics from the DevOps world:

1 Is your time to live reducing?
2 Is the number of failed deployments reducing?
3 Is the general performance improving?
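
As a rough illustration of how the first two questions could be tracked, here is a minimal Python sketch. The `Deployment` record and its field names are assumptions made for this example, not something described in the episode; in practice the numbers would come from whatever your CI/CD system exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    # Hypothetical record shape; adapt it to what your CI/CD tool exports.
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when the change reached production
    failed: bool            # whether the deployment was flagged as failed


def median_lead_time(deployments: list[Deployment]) -> timedelta:
    """Median time between commit and production deployment."""
    seconds = [
        (d.deployed_at - d.committed_at).total_seconds() for d in deployments
    ]
    return timedelta(seconds=median(seconds))


def failed_deployment_rate(deployments: list[Deployment]) -> float:
    """Share of deployments flagged as failed."""
    return sum(d.failed for d in deployments) / len(deployments)
```

Computing these on a schedule and comparing them month to month is one simple way to check whether both numbers are actually going down.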

These things are not just making your life easier but also your customers’ lives easier. Good, to me, looks like something that helps you do ML more repeatably and scalably but also ultimately impacts your customers in a positive way.

How to set up a good MLOps system?

Stephen: Great. What does it take to set up a good system? At the high level anyways.

Andy: I think you need to break down the problem into its constituent parts. I mentioned some of that before, but your first system should always be, I think, relatively rudimentary. I’m a huge believer in bootstrapping your capability, as I call it. I’ve spoken about this in the past. You shouldn’t go into this problem thinking, “I want to solve all of those pieces I mentioned at the top of the call in one go,” because you’ll do that for five years, and by that time, your business problem’s disappeared and your customer base is gone. It’s very important that you pick what’s the most pressing pain point for you, as a group, as a team, as a data scientist, and as an organization, and chase that first.

Your initial MLOps systems, in my view, should always be ones that do the very basics in terms of model management and experiment tracking first. You need to have some way of understanding the experiments you’ve run when you’re building the model. There are tons of tools that do this, we can go into specific tools later, but you need to really have a way of tracking the different experiments you’ve got.

You then need to have a way of tracking, as I mentioned before, the model artifacts you generate through those processes. You don’t just want to run a thousand experiments trying different hyperparameters. You also need to say, “This is the best model, how do I store it somewhere in the target so I can use it later?”

You then need to have a way of monitoring your ML solution. You need to start thinking, How do I know when the performance is drifting? What does performance drift look like for me? That can be very basic, again, it can be very much, you define one performance metric that you think is the most important, you then define some scheduled thing that goes and pulls relevant data, runs a simple query on it, and then outputs to a file somewhere. That’s you still doing MLOps, it’s not the most sophisticated approach in the world, but it’s good enough for that version zero.
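
A version-zero monitor along those lines could look like the Python sketch below. It is only an illustration: the choice of accuracy as the single metric, the 0.8 threshold, and the CSV report format are all assumptions, and the two lists stand in for whatever your scheduled query would return.

```python
import csv
from datetime import date
from pathlib import Path


def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the ground truth."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def run_daily_check(y_true: list, y_pred: list, report_path: Path,
                    threshold: float = 0.8) -> bool:
    """Append today's metric to a CSV report and return True if it fell below the threshold."""
    score = accuracy(y_true, y_pred)
    drifted = score < threshold
    is_new_file = not report_path.exists()
    with report_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new_file:
            writer.writerow(["date", "accuracy", "drifted"])
        writer.writerow([date.today().isoformat(), f"{score:.4f}", drifted])
    return drifted
```

Run from a scheduler once a day, this gives you a growing file of metric values to inspect, which is the "outputs to a file somewhere" step in its most basic form.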

Then I think you fundamentally need to think as well about what practices you need to develop to keep building on top of that. Do you have the right software engineering capability in your team? Do you have the right understanding of integration points, et cetera? I think you start small and then iterate up, I would say. Again, you should see the uptick in those different metrics that you’ve hopefully defined at the beginning of your MLOps journey.

Difference between an ML system and an MLOps system

Stephen: Is there any clear difference between me talking about an ML system and an MLOps system? Because the way I think is I just want to deploy something out there. I’m not thinking of any experiment tracking and anything like that, I just want to put a model out there. Maybe you can give that a clear distinction.

Andy: Absolutely, I think it’s a really good question, actually.

You can take machine learning models through to production or build a solution and think you’re not doing MLOps, but realistically, you’re just doing MLOps very badly.

What I mean by that is, you build your model, you wrap it up in a pipeline, you run that pipeline in some way. If you don’t have any tracking of your experiments, any tracking of the model artifacts, if you’re not monitoring the end result, you’re almost not doing MLOps, but the MLOps you’re doing is just the most basic possible, which is where you assume everything operationally is fine. It’s like MLOps version 0.00.

I think it is important that, from the ground up, any ML solution, to my mind, has some element of MLOps in it. Now whether you disentangle that into a different system is an interesting question.

MLOps, to me, is a bit more general than just systems. I know we’re focusing on systems tonight, but it’s also a set of holistic practices and a way of viewing the world. It’s like DevOps was for software engineering: it’s about understanding that the solution you’re building won’t just be built and then forgotten about. It’s a living, breathing thing.

That’s particularly true in machine learning, obviously, where we have retraining requirements, et cetera. You could separate it into different systems and have it hook in at multiple places in your ML solution, but to me, MLOps practices should be embedded within what you’re doing as an ML practitioner anyway. Then it’s just a question for that particular organization, team, et cetera, whether it’s separate systems or it’s just embedded within the tools you’re using.

How to scale an MLOps system

Stephen: I think we have a clear distinction now. You spoke about the very basic version 0.00. How would you differentiate the very basic V0 to V1? How can we start thinking that we are at V0, how is it different from moving to V1, V2, and then start iterating going forward?

Andy: Any problem you’re solving you will optimize in certain dimensions versus others. You only have finite time, energy, and effort to expend.

To my mind, version zero is about maybe dropping down on repeatability, scalability in a sense and just optimizing for understanding the basic principles and really going through the process end to end.

I often think that in any team I’ve built up or worked in, the first problem we go through, and all of you on the call can probably sympathize with this, is not great when I look back, but the point was to go through that process the first time. In MLOps, what that means is just doing some basic exporting of your models somewhere and just solving the problem in any way. Again, it could be very rudimentary: it could simply be the case of the name of your pickle or joblib file telling you the model version. That can be very much version zero because what you’re optimizing the first time is understanding what the entire end-to-end process looks like.
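
To make the file-name idea concrete, here is a small Python sketch using `pickle` from the standard library. The `<name>_v<version>.pkl` naming scheme and both helper functions are assumptions chosen for illustration; any consistent convention would do for a version zero.

```python
import pickle
from pathlib import Path


def save_model(model, directory: Path, name: str, version: int) -> Path:
    """Pickle the model to <directory>/<name>_v<version>.pkl and return the path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}_v{version}.pkl"
    with path.open("wb") as f:
        pickle.dump(model, f)
    return path


def latest_version(directory: Path, name: str) -> int:
    """Read the highest version number back out of the file names (0 if none)."""
    versions = [
        int(p.stem.rsplit("_v", 1)[1])
        for p in directory.glob(f"{name}_v*.pkl")
    ]
    return max(versions, default=0)
```

The weakness is easy to see: everything you know about a model lives in its file name, which is exactly the sort of thing later iterations replace with proper model management.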

Then for me, version 1, 2, 3, et cetera is about starting to move the other way and upping the quality, repeatability, and scalability. It’s up to you and your particular use case, which you optimize first. I think just anything you can do to make it as simple as possible will help in all these dimensions. If the code you’re building is modular, if the systems you’re building reference good architecture patterns, if everything’s quite distinct and embodies separation of concerns, that’s often a good sign. Once you get to the most sophisticated, so version N, where N is quite large, I’d say you’re very much at the point where scaling from one use case to 1000 use cases shouldn’t scare you too much.

There are some problems maybe to work out there. Maybe the bill you’ll have to foot for the infrastructure is concerning you, but you know that the processes and toolset you’ve put in place are ones that can scale that way. And that’s the stage I’m at in NatWest, as we’ve built our MLOps capabilities that way. I’d say it’s about that: version zero is about optimizing for just understanding the process, building out the initial principles, and learning a lot. Versions 1, 2, and 3 are about iterating on that and building something much more repeatable.

What should small teams prioritize while building an MLOps system?

Stephen: I think one thing this podcast focuses a lot on is reasonable-scale teams. In the second episode, we had this call with Jacopo, and I think it was similar: a lot of the things we discussed were about just starting small, putting something out there that solves a problem, and iterating going forward.

Check the podcast’s written version

Setting up MLOps at a Reasonable Scale With Jacopo Tagliabue

In your opinion, if we are looking at a team that has six people, maybe two data scientists, three data scientists, one Ops engineer, and then they have just, say, three, four, or a handful of models in production. They are building starters.

What advice would you have for such a team? Just thinking about that problem first, and then thinking about what components they need to start setting up first just to ensure that they’re showing that immediate ROI before they start thinking about, “Oh, I want to build a bigger platform and house lots of models and scale up,” and things like that.

Andy: I was actually in a similar scenario a few years ago. When I started out, I was in a scale-up of 12 people. I was head of data science and machine learning, which meant I was in charge of just myself because I was the only data scientist. Very similar scenario, very resource-constrained, and had a few software engineers who would help. I think in that scenario, you have to really think about how to not reinvent the wheel. I mentioned doing things in a very rudimentary way. Thankfully, now, there are so many tools, and packages out there that you can do things in a rudimentary way in terms of you’ve maybe not solved all of the scaling issues you know you’ll come up against, but you can at least leverage what’s out there.

There are lots of great packages in Python, there are lots of great tools that have open-source or freemium models where you can at least get started. I’d recommend doing your research and understanding which of these you can leverage, and which of these you can use in a way that means you build a minimal set of workarounds that really leverages it as much as possible. Harking back to some of the stuff we’ve already mentioned, keep it simple as well. I believe this for ML model development as well, always start with the simplest case. If you can solve it with linear regression, don’t go to a neural network.

The same thing applies to MLOps systems. If you can solve it with a cron job and a Python script, do a cron job and a Python script first, and then start probing it and understanding it in the later iterations: “Why would that fall down?” or “Cron’s not very stable, it’s got some issues, I should go this way.” Maybe move towards more sophisticated orchestration pieces or whatever the particular part of the problem you want to go after is.

One thing that is not covered there is that I think any ML team at that scale has to really focus on data quality upfront because that’s very intimately tied to the MLOps challenge. If you have very poor data quality, no matter how good your ML engineers and your AI engineers are, your performance is going to be all over the place. You’re going to be triggering incidents. You’re going to be retraining and debugging that model all the time.

That’s just not something that you can do when you’re that small. You can’t absorb all your time doing these incident management issues.

I think making sure the data quality is really good upfront is also an important one that I would say applies at any scale, but particularly when you’re small and when you’re very resource constrained.

Baseline tool stack for an MLOps system

Stephen: I would love to zoom into your early experience a little bit, before NatWest. What’s your typical baseline tool stack? You are thinking about this problem firsthand, and then you just want to put a few things together. What are those components you really prioritize just on a general level? Are there hidden blind spots that teams would often miss when thinking about the components they need to put together for their first MLOps systems?

Andy: Good question. I think we can often in this field get very attracted by the shiniest tools that seem to have the slickest videos or really cool demos and a cool UI. That sometimes belies the importance of more fundamental things like you’re mentioning. For me, one huge thing that I always come up against, and that I always think is fundamentally important, is orchestration: having a very clean, very simplified orchestration layer. Apache Airflow in particular is amazing for this. In my book, I talk about Managed Workflows for Apache Airflow, the AWS-managed service for this.

If you have that orchestration layer in place and you can schedule your pipelines and create the processes that will then trigger other processes, you can then start building very sophisticated things very quickly. Even if you don’t have a tool that has an amazing bias or explainability tool set or an amazing model monitoring capability, you can do what I mentioned before and have the basic Python script running. Something like Airflow, a really good orchestration layer, means that you’re still doing that from a very solid foundation and solid base.
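
To show the core idea an orchestration layer provides, the toy Python sketch below runs tasks in dependency order using `graphlib` from the standard library (Python 3.9 or later). It is not Airflow and does not use Airflow's API; the task names and the `run_pipeline` helper are made up for illustration, and a real orchestrator adds scheduling, retries, logging, and much more.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_pipeline(tasks: dict[str, Callable[[], None]],
                 dependencies: dict[str, set[str]]) -> list[str]:
    """Run each task once, after everything it depends on, and return the order used.

    `dependencies` maps a task name to the set of task names that must run first.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order
```

For example, with `dependencies = {"train": {"extract"}, "monitor": {"train"}}`, the monitoring step only runs once training has finished, which is the "processes that trigger other processes" behavior in miniature.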

Then eventually, you can swap out a simple Python script for a very fancy ML tool. I think my baseline tool stack is to solve your orchestration problem first, and then the other two I mentioned, model management and model monitoring, which are really important. Again, just start small, and do that from simple Python scripts first. A very important one that is a blind spot, I think, is how complex it is to do model management. Things like MLflow, Comet, and lots of other tools are solving a very acute problem. The quicker you can use something like that, I think you’ll find that it makes your life a lot easier.

I’d almost chase after model management before monitoring. It’s far easier for me to imagine how to code up some monitoring logic in vanilla Python than it is to build a model management piece of software. That’s a very complex problem. In the previous teams I worked in, that was always a challenge for us, as we didn’t necessarily have a tool off the shelf ready for us. We spent a lot of time building these horrible JSONs that tracked where our model artifacts were and what data we used for things. So, if you can get orchestration and then model management sorted, everything else you can do in the first instance with quite vanilla Python, is my feeling. Then you can build on that as much as you need to.
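
For a sense of what such hand-rolled JSON tracking tends to look like, here is a deliberately naive Python sketch. The registry layout, the field names, and the `register_model` function are assumptions for illustration only, not a description of what those teams actually built.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Fingerprint a file so the registry records exactly which data was used."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def register_model(registry_path: Path, name: str, version: int,
                   artifact_path: Path, training_data_path: Path,
                   metrics: dict) -> dict:
    """Append one entry describing a model artifact to a JSON registry file."""
    entries = json.loads(registry_path.read_text()) if registry_path.exists() else []
    entry = {
        "name": name,
        "version": version,
        "artifact": str(artifact_path),
        "training_data_sha256": file_sha256(training_data_path),
        "metrics": metrics,
    }
    entries.append(entry)
    registry_path.write_text(json.dumps(entries, indent=2))
    return entry
```

Even this small version raises questions it does not answer, such as concurrent writers, duplicate versions, and moved artifacts, which is a fair hint at why dedicated model management tools exist.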

Solving the buying vs. building dilemma

Stephen: I think the reality of most teams is that maybe they hire one data scientist or an ML engineer to come and build the full system. We have this argument in the community that platforms are not enough. You have platforms that claim to be able to solve the end-to-end problem, and so forth, and then you find that inflexibility. Do you have any argument against buying platforms as a system or something, especially for early-stage teams?

Andy: I love this question because it’s a perennial debate. I think it relates to what I said about the shiny new tools and the fixation we sometimes have.

I think tools, platforms, SaaS, PaaS, all of these solutions will only help if you know what you’re doing in the first place. If you subscribe to a silver bullet methodology where you think, “You know what, I buy this thing, I spend a million dollars,” or I’m a much smaller company, few thousand dollars or whatever, “I’m going to buy this tool that’s going to solve things.” You’ll just find that you’re facing the same challenges, but now in front of a shiny UI, and you’re burning through lots of cash.

Back to the point we mentioned before, I would much rather teams go through building up what they can themselves, the exception maybe being the orchestration piece and model management piece. There are lots of open-source MLOps tools that are able to help you with those things. So, I would say, see what you can get with open-source tooling. There’s also great open-source tooling for generally building ML and MLOps pipelines as well.

If you can get to a stage where you’re like, “Actually, there’s an acute need for something else,” then invest the money. If you put the cart before the horse, as it were, you’ll just burn a lot of money and be very disappointed because you’ve not solved the fundamental problem. The fundamental problems are often more process-specific and architecture-design-specific, but not really what’s the best tool. You can always spend more money on tools, but if you don’t stick them together properly, I think you’re going to run into trouble.

How to implement an MLOps system well while being an early-stage team?

Stephen: Yes, and speaking about the processes, what are some practices that you think would enable these early-stage teams to think about these systems properly and properly implement them?

I watched one of your podcasts with the MLOps community, and there, you talked about the chasm between idea and production, and in the middle there, you have this bridge, this gap that needs to be filled. I think beyond just the tools, which you’ve spoken about, there are also some practices that can make things work. Some, like culture, you can view as how a team tries to think about these systems properly. What are those practices that you think teams should start thinking about at the early stage when thinking about systems?

Andy: I’m glad someone watched that podcast. The thing that I often come back to is what I call the four P’s:

1 people,
2 process,
3 pattern,
4 products.

Just covering them very quickly. I think you can never be too early thinking about this.

On the people front, and we alluded to this earlier, we should always avoid thinking there’s a unicorn person out there who can do everything we need. We need hybrid teams, blended teams that have very complementary capabilities. You can do that at any scale. As long as you have two or even three people, you can still get that blend of software engineering knowledge, ML knowledge, and then something maybe in the middle, or a translation layer towards the business, et cetera. People are such an important part of that. Who are the people you have, and do they complement each other and work well together?

Product is really what this whole podcast is about: ensuring you understand that you’re building systems that eventually impact customers, and how you are going to think about that differently from just a normal piece of prototype code. Well, you understand that with products, people expect them to work. That implies that you should be testing a lot. Do you have testing processes in place? Are you already thinking about unit testing, integration testing, or regression testing? If you’re not, start thinking about them because that’s the only way to build scalable and usable products.
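
As a small example of the unit-testing habit, the Python sketch below tests a scoring function with `unittest` from the standard library. The `predict_price` function and its coefficients are invented for illustration; the point is only that both the expected output and the failure behavior are pinned down by tests.

```python
import unittest


def predict_price(area_m2: float, slope: float = 2000.0,
                  intercept: float = 50000.0) -> float:
    """Toy linear pricing model; the coefficients are made up for this example."""
    if area_m2 <= 0:
        raise ValueError("area_m2 must be positive")
    return slope * area_m2 + intercept


class PredictPriceTests(unittest.TestCase):
    def test_known_value(self):
        # 2000 * 100 + 50000
        self.assertEqual(predict_price(100.0), 250000.0)

    def test_rejects_non_positive_area(self):
        with self.assertRaises(ValueError):
            predict_price(0.0)
```

Saved in a test module, this can be run with `python -m unittest`, and the same command is what a CI server would execute on every change.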

Then also, think about the user experience. The user, in this case, might not be an actual person but maybe another system. Do they have a clear interface and a clear contract to consume from? It could be something as simple as, does the other system have access to the same S3 bucket your results are written to? That’s the sort of thing you sometimes have to think about in that product space. Then, to your particular question around process and patterns, which I think are really linked.

Pattern, for me, is about whether you are using really well-known architecture patterns, or at least ones that make sense: using microservice architectures, or using architectures that are already there and used by some of the best companies out there.

Then on the process side, do you have clear development guardrails? Do you know how to develop high-quality code? Do you at least know how you’ll improve the quality of your code? Can you automate anything you can? The earlier you can embed CI/CD practices, the better. GitHub Actions is a great example, as are Jenkins and all of these other tools; having CI/CD servers in place means that the process can go faster and faster.

I think the earlier you think about all of these things together, the sooner you’ll start doing the right things that put you in a good place for the future. Being challenged more on issues like massive scale, or the structure of things like account security, networking, et cetera, can come a bit later.

Those four P’s, for me, are fundamental and something you should always think about for any team really, but they’re especially pertinent to ML and MLOps teams, I think: the people, pattern, process, and product viewpoints.

Skill set required for building an MLOps system

Stephen: I think if you speak to teams about this, most teams would agree that it’s really hard, just linking these four P’s together and just trying to coordinate around the people, the process, the product, and the pattern itself. How do you think that teams can appropriately achieve this? A good follow-up question as well to that is, who should I hire for us to build up that system, my first MLOps system? Should I hire a data scientist or an MLOps engineer or an ML engineer or stuff like that?

Andy: Good. In terms of who you should hire first, the challenge there is that it’s a trick question. If you’re hiring one person, you’re already in unicorn thinking, which I think we should avoid. If you’re hiring two people, which I would always recommend as a minimum viable team, at least, I think you need someone with a good data engineering mindset. As I mentioned, data is super important.

Then it could be a data scientist, ML engineer, MLOps engineer, it doesn’t matter what they call themselves, but I think someone that complements that data knowledge quite strongly with the knowledge of pipelining, for example. How do you build ML pipelines? How do you build MLOps pipelines? By which we mean all the things we mentioned before, something that runs into some monitoring, something that checks what model version to build, but that will require a few basic things.

They’ll need to understand models and maybe even build those models or use an off-the-shelf model. Even if it’s an ML engineer who is reusing Hugging Face models, that’s absolutely fine as well, but there needs to be someone who understands models, because how else can you build the monitoring logic behind that and understand what you’re doing with model artifact management?

They also need to have enough software engineering capability that they can start building these systems that are robust and reliable. That’s the whole point of Ops and MLOps: you’re not just doing a flash in the pan, you’re building something that has to work again and again and again. You really need that software engineering capability there as well. As for how they can coordinate, that’s always a challenge, but I think splitting it out into those four P’s often helps me rationalize it and break down the problem.

From the people side, we’ve just discussed that: we’ve got the complementary capabilities that cover off the key things. On patterns, again, leverage what’s out there and don’t reinvent the wheel. AWS has their architecture LENS framework, I think it’s called AWS LENS, where they publish a lot of really well-detailed architectures. Even if you’re not on AWS, you can at least see them and see the different components and how they interact together. That ticks off patterns.

Product is really the end goal, constantly iterating towards the business goal, but just always thinking about reliability and robustness, not just breach testing. Then, in terms of process, it’s back to that point of starting small and iterating. Go through the first cycle, constantly iterate, and you’ll improve. A lot of those problems will not be new problems; they will be problems already solved in software engineering. Leverage the software development and software engineering ecosystem as well.

Is simple Ops the right first step?

Stephen: This is something that is quite popular in the MLOps community, and the thing is to keep the first model simple, or you should try to get the infrastructure right, especially when you’re trying to deploy your first model or just deploy your first iteration pushing it out there. Can you elaborate on something like this, this particular statement?

Andy: Yes, definitely. I 100% agree with this, we should always start simple Ops. The key difference that I maybe alluded to earlier between just doing some research-based data science and machine learning versus building a product with MLOps at its core is that you are thinking about it as something that has to work again and again and again.

Your simple sklearn model that does some regression, you could take one of them, the Boston housing data set, that’s a very simple thing, lots of tutorials on that. Building that ML model is really easy. What is difficult is if you start asking something like, and this can be particular to the business use case, “How am I going to serve a request to score with that across 50,000 or 100,000 users? How am I going to run that, maybe as a batch or maybe as a live microservice that can be requested by API?”

I think all of that may be flavored by the business you’re operating in.

If you know you’re supporting a customer-facing web application, you’re maybe going to naturally go down the REST API microservices route.
If you’re servicing a very large organization that has a lot of overnight processes, like we often do, you’re maybe thinking more in a batch way and thinking about using far more scalable technologies like PySpark, et cetera.

Just mapping out what your business challenges are going to be then automatically starts helping you make architecture and design decisions.
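
The two routes can share the same model code and differ only in how it is wrapped, as the Python sketch below tries to show. The feature names, the coefficients in `score`, and the response shape are all hypothetical; a real service would sit behind a web framework, and a real batch job might run on something like PySpark.

```python
def score(features: dict) -> float:
    """Toy linear model; the features and coefficients are made up for illustration."""
    return 0.5 * features["tenure_months"] - 2.0 * features["support_tickets"]


def score_batch(rows: list[dict]) -> list[dict]:
    """Batch route: score every row in one pass, e.g. in an overnight job."""
    return [{"id": row["id"], "score": score(row)} for row in rows]


def handle_request(payload: dict) -> dict:
    """Request route: score one record and report bad input instead of crashing."""
    try:
        return {"status": 200, "id": payload["id"], "score": score(payload)}
    except KeyError as missing:
        return {"status": 400, "error": f"missing field: {missing.args[0]}"}
```

Keeping `score` independent of how it is called means the business context decides the wrapper, not the model.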

Then the model piece becomes, again, something that you can always iterate on, but fundamentally, it’s probably relatively simple compared to these other choices you’ve had to make. Then you start thinking, “Right, how do I set that up in a minimum viable product fashion? How do I stick it all together back to the orchestrator and make sure that it’s all running at the right time, et cetera?” Definitely, I agree with that, and I think always draw back to the business problem you’re trying to solve.

That also drives your operational considerations and what MLOps looks like for you.

Again, if you’re running a big batch process every night, do you really need some sort of complicated live streaming of metrics for your model? No, that’s overkill. You maybe just need, again, a nightly report that runs just after your batch production.
If you’re doing a very scalable, customer-facing application, you do need some more real-time metrics, and you maybe also need to be aware of some of the metrics that are just classics from DevOps, like the memory, the CPU utilization, all of these things, and not just what’s the recall of my model.

Then another challenge that just came to mind, and I think it’s important as well, that comes from the business problem is that you’ll also be able to put constraints on those processes in a different way. Something I’ve come across quite a lot is that the business, once they understand what you’re trying to do with MLOps, will often say, “Right. I want to know how the model’s doing every single day.” I’ll say, “Right. How often can I get the truth data for this model?” and they’ll say, “Every month.” Automatically, there’s a disconnect between the business, the technology, and how it’s all implemented together. Just always drawing back to the business problem really helps hone that in a bit and understand what choices you need to make.

Busting myths about MLOps systems

Stephen: Curious, are there any misconceptions in the MLOps community about MLOps systems that you don’t agree with? Let’s thrash it out here.

Andy: Oh, that I don’t agree with? I think the tooling obsession annoys me a bit. I think we do forget as a community the importance of just good design, good processes, and good software development techniques. We often get obsessed about the latest demos and the latest big announcements, and I do it as well. I’ll sign up for about 10 webinars that I’ll never attend on all these different technologies because there’s a new release or a new version, but I think we often forget just how relatively simple the problem we’re trying to solve in MLOps is.

To my mind, there are only a few different pipelines you’re building:

1 You’re building your training pipeline to train the model.
2 You’re building the inference pipeline to bring out the results.
3 And then you’re building an MLOps-type pipeline to do the other bits.

That’s it fundamentally. I think I do sometimes dislike how we will oversell the importance of specific tool choices.
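As a rough illustration of how small those three pipelines can be at their core, here is a minimal Python sketch. The toy "model" (it just predicts the mean), the function names, and the error threshold are all assumptions made for illustration; none of them come from the conversation.

```python
from statistics import mean


def training_pipeline(targets):
    """Train the toy model: it simply remembers the mean of the targets."""
    return {"prediction": mean(targets)}


def inference_pipeline(model, n_records):
    """Produce one prediction per incoming record."""
    return [model["prediction"]] * n_records


def monitoring_pipeline(predictions, actuals, max_mae=2.0):
    """Compare predictions with ground truth and flag degraded performance."""
    mae = mean(abs(p - a) for p, a in zip(predictions, actuals))
    return {"mae": mae, "needs_attention": mae > max_mae}


model = training_pipeline([10, 12, 14])
predictions = inference_pipeline(model, 3)
report = monitoring_pipeline(predictions, [11, 12, 19])
```

Each function could be replaced by something far more sophisticated without changing how the three pieces relate to each other, which is the point being made about design versus tooling.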

You should very much be comfortable swapping out tools as you progress through your journey, as they solve slightly different flavors of the problem.

Your model management software starts with the open-source version, then you say, “Actually, I want the benefit of being supported at an enterprise level, so I’ll switch to a paid model with this provider,” but it should not fundamentally change the design you have. If your design is tightly coupled to your tool choice, you’ve made a massive error I think, because it should really be a swap out. You’re just doing a different API call, or you’re just writing to a different location.
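One way to picture that "swap out" is to keep pipeline code behind a small interface, so that changing tools means writing a different implementation rather than a different design. The sketch below is only an illustration under assumptions: `ModelStore`, `InMemoryStore`, and `LocalDirStore` are hypothetical names, and the two classes stand in for whichever real tools you might choose.

```python
import json
import tempfile
from pathlib import Path
from typing import Protocol


class ModelStore(Protocol):
    """Hypothetical interface: the only thing pipelines depend on."""

    def save(self, name: str, params: dict) -> None: ...

    def load(self, name: str) -> dict: ...


class InMemoryStore:
    """Stand-in for one tool, for example a quick prototype backend."""

    def __init__(self) -> None:
        self._models = {}

    def save(self, name: str, params: dict) -> None:
        self._models[name] = dict(params)

    def load(self, name: str) -> dict:
        return self._models[name]


class LocalDirStore:
    """Stand-in for a different tool: same interface, different location."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def save(self, name: str, params: dict) -> None:
        (self._root / f"{name}.json").write_text(json.dumps(params))

    def load(self, name: str) -> dict:
        return json.loads((self._root / f"{name}.json").read_text())


def training_pipeline(store: ModelStore) -> None:
    """The pipeline never mentions a concrete tool, only the interface."""
    store.save("recommender", {"weight": 0.5})


stores = [InMemoryStore(), LocalDirStore(Path(tempfile.mkdtemp()))]
for store in stores:
    training_pipeline(store)
loaded = [store.load("recommender") for store in stores]
```

Both stores return the same parameters, so `training_pipeline` does not need to change when the backend does.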

You shouldn’t be so tied to a product that you suffer from lock-in, which is one of the other dangers you can have as well. Either with cloud providers or specific tools, you can just become so wedded to it that when you have to change because the company’s gone bust or the tool’s no longer available because of major upgrades, you have to fix so much technical debt.

I think that the big bugbear for me is the obsession with tooling. I am maybe being too harsh, there are a lot of people I know who work in the community on really amazing tools, and there are amazing tools out there, especially in the open-source community.

I just think as practitioners trying to build these solutions for organizations, we shouldn’t just think there’s a silver bullet out there. We really need to bring it back to basics: what are the processes we need to develop, how are we making sure they’re robust and monitored, and do we have good metrics for their performance? Then just work against that.

Role of MLOps in scaling a system

Stephen: We have a question from the MLOps Community. This person says, “I’m working towards building a restaurant recommendation system that provides the restaurant’s business similarity between two people’s tastes. I’m planning to deploy it as a web app. How should I proceed towards this, knowing that I’ll be scaling this to 50 or more users? Then how does MLOps come into this particular scenario?”

Andy: It sounds like this person’s thinking of a very particular use case, which is great for bringing this to life. If you’re building a web application that’s going to have 50 or 50,000 users, and you have to run this ML process in the background, this recommendation engine, what’s important to my mind at the beginning is not putting all that together in your head, because you’ll be a bit overwhelmed, and you’ll probably try to solve it all at once and create some spaghetti code or something that’s not very modular.

If you separate out all those pieces, you can start breaking down the problem and understanding how to solve each one. Take the front end: how are you going to scale it to 50,000 users? This is done all the time, so look online and see how that’s done for general web applications; that’s not something new. You have the frontend system, and you have the application database that stores the data needed to run the actual web interface. Think about the user experience, and get good UX design in place if possible. That’s a solved problem, but it’s only the first piece, the entry point to the rest of the solution.

You then have to run your recommendation engine updates, your retraining, your re-running. I’m not a recommendation engine expert at all, so I’ll just assume that’s a black box in a sense. You fundamentally have this process that you need to run at a very large scale and that can often be very computationally intense. Put that on one side.

How do you solve that problem, and how often are you going to run those updates?
Do you have the infrastructure you need in place?
Do you need to think about things like auto-scaling for particularly chunky compute updates, et cetera?
Do you need to think about moving to the cloud in order to service these questions?

I think you think about that recommendation engine update as its own process and separate that as well.

You then have the interaction between the two. Something I love talking about is interfaces and contracts.

What’s the contract going to be between your frontend and your recommendation engine?
Is it a direct API call to some basic Flask app or something else that just surfaces the results of the recommendation engine?
Is it going to be something a bit more complex?
Is it actually going to be that the recommendation engine can work in a batch, offline mode, and the web application just needs to pick up the results from some S3 bucket or other location?
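To make the idea of a contract concrete, here is a minimal Python sketch in which the frontend depends only on an agreed response shape, whether the results come from a batch output file or from a direct call to the engine. Everything in it is an assumption for illustration: the `Recommendations` fields, the one-JSON-file-per-user layout, and the function names are hypothetical, and a temporary directory stands in for a bucket or other shared location.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class Recommendations:
    """Hypothetical contract: what the frontend can rely on receiving."""

    user_id: str
    restaurant_ids: list[str]
    model_version: str


def parse_recommendations(payload: dict) -> Recommendations:
    """Validate a raw payload against the contract before the frontend uses it."""
    ids = payload["restaurant_ids"]
    if not all(isinstance(r, str) for r in ids):
        raise ValueError("restaurant_ids must be strings")
    return Recommendations(
        str(payload["user_id"]), list(ids), str(payload["model_version"])
    )


def from_batch_file(results_dir: Path) -> Callable[[str], Recommendations]:
    """Batch, offline mode: the engine wrote one JSON file per user earlier."""

    def fetch(user_id: str) -> Recommendations:
        raw = (results_dir / f"{user_id}.json").read_text()
        return parse_recommendations(json.loads(raw))

    return fetch


def from_live_engine(
    engine: Callable[[str], list[str]], version: str
) -> Callable[[str], Recommendations]:
    """Direct-call mode: the engine is invoked on each request."""

    def fetch(user_id: str) -> Recommendations:
        return parse_recommendations(
            {
                "user_id": user_id,
                "restaurant_ids": engine(user_id),
                "model_version": version,
            }
        )

    return fetch


# Simulate last night's batch output in a temporary directory.
results_dir = Path(tempfile.mkdtemp())
(results_dir / "u1.json").write_text(
    json.dumps({"user_id": "u1", "restaurant_ids": ["r7", "r2"], "model_version": "v1"})
)

batch_result = from_batch_file(results_dir)("u1")
live_result = from_live_engine(lambda user_id: ["r7", "r2"], "v1")("u1")
```

Because both modes return the same `Recommendations` object, the frontend code does not need to know which one is in use.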

Then where MLOps really comes in is making sure all that hangs together from the ML point of view. For the recommendation engine:

How do you know that’s performant?
How are you going to check in on the status of that?
Then what actions are you going to take based on it – that’s your model monitoring process?
How often are you going to run that as well?
If it’s a nightly batch process, do you run the MLOps pipeline every night as well to check the monitoring performance, or do you run that less frequently?
How do you manage the actual versions of the recommendation engine as well because you might want to do rollbacks if something goes wrong? You start thinking about that as well.
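The questions above about monitoring, acting on the results, and rolling back versions can be sketched in a few lines of Python. This is only an illustration under assumptions: `ModelRegistry` and `nightly_check` are hypothetical names, the registry lives in memory, and the recall figures and threshold are made up.

```python
class ModelRegistry:
    """Hypothetical in-memory registry; a real one would persist this state."""

    def __init__(self):
        self._metrics = {}   # version -> metrics recorded at training time
        self._history = []   # versions that have been live, oldest first

    def register(self, version, metrics):
        self._metrics[version] = metrics

    def promote(self, version):
        if version not in self._metrics:
            raise KeyError(f"unknown version: {version}")
        self._history.append(version)

    def rollback(self):
        """Drop the current live version and return to the previous one."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def live(self):
        return self._history[-1] if self._history else None


def nightly_check(registry, observed_recall, min_recall=0.7):
    """Monitoring step: roll back if the live model has degraded."""
    if observed_recall < min_recall:
        return f"rolled back to {registry.rollback()}"
    return f"{registry.live} is healthy"


registry = ModelRegistry()
registry.register("v1", {"recall": 0.80})
registry.promote("v1")
registry.register("v2", {"recall": 0.83})
registry.promote("v2")

# Made-up observation: v2 performs badly on live data.
outcome = nightly_check(registry, observed_recall=0.55)
```

Whether the rollback is automatic, as here, or a human decision prompted by the monitoring report is a process choice rather than a tooling one.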

Then finally, I think in this scenario, orchestration comes in through, again, that decision: is it a dynamic request that triggers an ML process? In which case, you’re thinking about event-driven architectures, things like Kafka and Pub/Sub architectures. Or is it actually, again, really just about retrieving results on the back of a user request, with the ML process on a batch schedule? In which case, you could use a cron job or some other scheduler, or go back to Apache Airflow, which I mentioned earlier.

I think the key thing is breaking it down into those constituent parts and then working out how you solve each of those problems individually. Then, for the most pressing problems you’re not sure how to solve, go and get the resources and help you need to understand them.

For me, the bit that I would be less comfortable with is the frontend. I have no UX skills, and no understanding of how to build a good front end at all, so I would need help to do that. The other pieces, I’d probably know. MLOps is really about managing that back end and just making sure it’s monitored, looked after, and then retrained appropriately when necessary.

Building a strong foundation for future scalability

Stephen: Yes, and following up to that particular question, I think one of the challenges when building your first ML system, for example, is that when you want to scale, it literally just breaks. Especially if you don’t take that scale into account, your system just breaks apart. Maybe you’re running a cron job and a Python script, and then you don’t know how to handle 50k requests, 100k requests because all of a sudden, the business has grown. How do you start thinking about scalability at the onset when building a good first MLOps system?

Andy: There are some choices you can make earlier to help with these things. If your problem lends itself to, say, a batch-process-type architecture, or at least some element of batch processing, doing things like building everything around PySpark, for example, means that scalability is really a question of how much infrastructure you’re willing to pay for.

I’ll come back to AWS just because it’s the one I’m most familiar with, but it applies to the other cloud providers. If I use their cloud Spark clusters, the Elastic MapReduce (EMR) clusters, I can start putting in things like auto-scaling policies and scaling up that infrastructure as and when I need it, and the fundamentals of my code don’t need to change. I think that’s a decision you can make early on, because I can run PySpark on my laptop (it’s probably not much use, it’s a very small cluster), but I can also run it on a 10,000-node cluster if I have the capability to pay for it. So even choices like that help.

If you think more about the microservice architecture we were talking a little bit about before, you start thinking about things like load balancers.

Do you start bringing in load balancers?
Do you have the expertise for that?
Do you understand how to route this traffic appropriately, the networking, and the questions that come from that?
Then are you able to spawn the processes you need to maybe run your ML model?

Then I would start leveraging things like maybe cloud functions or Lambda as it is in AWS, so very lightweight pieces of code that you can run in an extremely scalable fashion where you don’t have to think about that underlying infrastructure.

I think in general the cloud just helps with scalability. You pay a little bit of a premium per unit, but you sleep better at night because you know scaling is very much easier there. I would always recommend that you at least explore and understand the options available in the cloud. Then, if you are building in a more on-prem or local way, you at least know whether you can port up to the cloud.

A great example there is PySpark. Even if I’m running on my laptop, if I’m building everything in PySpark, porting it up to use a very scalable cloud service later is not a big deal. Whereas it would be a big deal if I had written everything in vanilla, serial Python and then had to refactor for scalability. There are some choices you can make early in the process that should help, I think.

Timeline for the entire project and managing expectations

Sabine: We have a question in chat from Penny Johnson. Penny is asking, “Can you give an insight into actual timescales in the field from model ideation to solution delivery, monitoring cycles, and improvements? Also, how do you manage the business expectations for these?”

Andy: Oh, great question. This is what my job is. I was worried about this. I’ve actually just finished a webinar on the work we’ve done on decreasing our time to value in that way, so I can mention some of the figures and things now because it’s in the public domain.

Typically for us, before we adopted some of the basic practices and tooling on the cloud, it took roughly a year to get a model from ideation to production. Now, that is long, and the big factor there for us was being in financial services: there’s a lot of governance and all the things we have to go through.

In my previous job, we were delivering a model every quarter, roughly, and I would say that’s more feasible: every few months taking something from ideation to production. If you’re talking about iterative improvements on models rather than the full whiteboard-through-to-solution journey, which I think, Penny, is what you’re asking about, those iterative improvements can, I think, be at a sprint or sub-sprint level if you’ve got good CI/CD practices in place.

We’ve now been able at NatWest to get that down to around three months, that once-a-quarter level, for any particular team because we invested time, energy, and effort in building out an MLOps platform that used SageMaker and surrounding ecosystems. That was a case of harking back to what I mentioned before. We understood how to do the processes well first, then we understood the fundamentals and what good design looked like. We upgraded everything and were able to refactor all of the internal processes as well for that.

I’d say, for me, that a once-a-quarter pace is reasonable for most scaled organizations. The scale factor comes in for larger organizations because they can do a lot in parallel. For a smaller company, once a quarter means literally one ML model for the company; for a huge organization like NatWest, it might mean hundreds per quarter. MLOps building, et cetera, should just be part of that process, so as long as you’ve got the understanding, the design, and the architecture in place, you should be able to also incorporate that into that once-a-quarter cycle. That’s just my view. I think there’ll be a million different views on this across the piece.

Managing expectations is the fun part, so I think you’ve got a few challenges you need to overcome.

One is really making sure that your stakeholders, your customers, your colleagues understand the benefits of machine learning in the first place, but they also understand why MLOps is important. It’s one thing to solve a problem using a machine learning algorithm.
The next thing is to make sure you can solve it every day for the rest of the time, and that’s where the MLOps piece comes in.

You need to win hearts and minds so that they understand why you’re investing time, energy, effort, and money into developing these extra bits of the solution, the monitoring capabilities, the model management pieces, et cetera. I think you really need to do that. Then they understand why you’re investing all of that in those extra pieces, but again, it comes down to just simplifying, making sure they understand the basics of what you’re doing and you’re constantly updating them and making sure they understand when you’re running into issues and where the bottlenecks are. That means you can then iterate on that for your next set of projects.

I’ve had to do that many a time where we thought we’d deliver in three months, and then it’s taken a lot longer. As long as you’re clearly communicating with your stakeholders, they’ll understand that those expectations are shifting, and they’ll buy into that, I think. That’s a really great question. I think stakeholder management is one of the most important challenges.

Fitting retraining scenario in an MLOps system

Sabine: Yes, for sure. It’s not just about the tech stack, but sometimes, it’s about the people and communications and all of that. We have another question from Nabil Belgasmi. “If we want our ML models to be retrained automatically on new data, what is the impact of this requirement on a simple MLOps workflow?”

Andy: If you want it trained on new data every single time, first of all, you could challenge that assumption: “Do I really need it trained on new data every time, or do I just need it trained when there’s a shift in the distribution of the data or when there’s a drop in performance?”
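As a rough sketch of that alternative, the snippet below retrains only when the data has shifted or performance has dropped. It is an illustration under stated assumptions: the mean-shift check is a deliberately crude stand-in for a proper drift test, and the function names, thresholds, and numbers are all made up.

```python
from statistics import mean, pstdev


def mean_shift_in_std(reference, recent):
    """How far the recent mean has moved, in units of the reference std dev."""
    spread = pstdev(reference)
    if spread == 0:
        return 0.0 if mean(recent) == mean(reference) else float("inf")
    return abs(mean(recent) - mean(reference)) / spread


def retrain_reasons(reference, recent, live_metric, min_metric=0.75, max_shift=1.0):
    """Return the reasons to retrain; an empty list means leave the model alone."""
    reasons = []
    if mean_shift_in_std(reference, recent) > max_shift:
        reasons.append("data drift")
    if live_metric < min_metric:
        reasons.append("performance drop")
    return reasons


reference = [10, 11, 9, 10, 10, 12, 8, 10]   # feature values seen at training time
stable = [10, 9, 11, 10]                      # new data that looks the same
shifted = [15, 16, 14, 15]                    # new data with a clear shift

no_action = retrain_reasons(reference, stable, live_metric=0.82)
drifted = retrain_reasons(reference, shifted, live_metric=0.82)
degraded = retrain_reasons(reference, stable, live_metric=0.60)
```

With a trigger like this, new data arriving is not enough on its own to start a training run.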

Let’s assume, for the sake of your question, that we’ve settled on retraining the model whenever new data comes in. The downstream impact on your MLOps processes and system is going to be: okay, I’ve retrained the model, but what do I do with it? What is your process for determining if it’s the actual model that goes through into production?

What you don’t want to do is just automatically push it to production. That’s the first gotcha, because it could be retrained, and it’s really bad, terrible. It’s basically absolute garbage. You push it to production, everything goes down, so you need some mechanism to view the performance of that newly trained model, not just the one already in production. That’s again where your model management and other tools come in. Can you track the appropriate metadata and the metrics for the training run?
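A minimal sketch of such a mechanism is a promotion gate that compares the tracked metrics of the newly trained model with those of the model in production before anything is pushed. The record layout, the choice of recall as the metric, and the numbers are assumptions made for illustration.

```python
def promotion_decision(candidate, production, metric="recall", min_gain=0.0):
    """Return (promote, reason) by comparing tracked metrics of two models."""
    if metric not in candidate["metrics"]:
        return False, f"candidate has no tracked {metric}"
    cand = candidate["metrics"][metric]
    prod = production["metrics"][metric]
    if cand <= prod + min_gain:
        return False, f"candidate {metric} {cand:.2f} does not beat production {prod:.2f}"
    return True, f"candidate {metric} {cand:.2f} beats production {prod:.2f}"


# Made-up records of the kind a model management tool might track per run.
production = {"version": "v1", "metrics": {"recall": 0.80}}
good_retrain = {"version": "v2", "metrics": {"recall": 0.84}}
bad_retrain = {"version": "v3", "metrics": {"recall": 0.31}}
untracked = {"version": "v4", "metrics": {}}

promote_good, _ = promotion_decision(good_retrain, production)
promote_bad, bad_reason = promotion_decision(bad_retrain, production)
promote_untracked, _ = promotion_decision(untracked, production)
```

A retrained model with no tracked metrics is rejected outright, which is one reason to record metadata and metrics for every training run.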

I think what’s important as well, if you are thinking about pushing a specific model into production, is, “Have you actually simulated production-like conditions? Do you have a test environment set up that runs in the same way as your production model?”

Actually, I missed this earlier from your question, Stephen, but I’ve seen a lot of people develop models with a particular set of assumptions. They have five years of data, for example. They’ve done their training, test, and validation split. Then they think it’s going to work in production, but actually, in production, you get a thousand new records every single day, and they don’t know what the fluctuation of behavior is going to be like. I think you need to make sure that, going back to Nabil’s question, if you’re retraining and then pushing a model into production, you have some testing available that shows how it will work on production-like data coming in at the cadence and frequency that it will in production.
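One simple way to approximate production-like conditions is to replay historical data in daily batches and score each day using only what had been seen before it. The sketch below assumes a toy model that predicts the running mean; the function name, batch size, and data are hypothetical and chosen only to show the mechanics.

```python
from statistics import mean


def replay_daily(history, warmup_days, records_per_day):
    """Walk forward through history one 'day' at a time.

    For each day after the warm-up, fit the toy model on all earlier records,
    score the new day's records, and record the mean absolute error.
    """
    days = [
        history[i:i + records_per_day]
        for i in range(0, len(history), records_per_day)
    ]
    daily_errors = []
    for day_index in range(warmup_days, len(days)):
        seen = [value for day in days[:day_index] for value in day]
        prediction = mean(seen)  # toy model: predict the running mean
        todays = days[day_index]
        daily_errors.append(mean(abs(prediction - value) for value in todays))
    return daily_errors


# Made-up history in which behavior shifts on the final "day".
history = [10, 10, 10, 10, 10, 10, 20, 20]
errors = replay_daily(history, warmup_days=2, records_per_day=2)
```

A single train, test, and validation split would average over that final shift, whereas the day-by-day replay shows exactly when the error jumps.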

Then, underpinning all of that, back to the process point earlier, you need a good MLOps process in place to say, “That’s actually okay. That fits within our operational risk profiles. That’s within our governance controls.” Whatever the mechanism is. You basically need a way to say, “Push the button. Push it into production.” I would say all of that has to be factored into what your MLOps system looks like and is capable of doing.

It’s a really good question. A lot of people come up against this very quickly, and I think the most important pieces there are operating or testing in some sort of production-like environment and then having a good process for saying, “Everything’s okay. I can now push it into production.” Things like blue-green deployments, which you might sometimes hear about (give that a google), are about how you can basically run the two solutions in parallel, but one’s in an air-gapped environment. Then, once you’re happy, you just swap them around seamlessly. Building in processes like that is often a really good MLOps practice as well.
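The swap at the heart of a blue-green deployment can be sketched in a few lines of Python. This is a simplified, in-process illustration under assumptions: `BlueGreenRouter` is a hypothetical class, the "models" are plain functions, and a real setup would run the two slots as separate environments behind a load balancer or similar.

```python
class BlueGreenRouter:
    """Hypothetical router holding a live slot and an idle (staging) slot."""

    def __init__(self, live_model):
        self._slots = {"blue": live_model, "green": None}
        self._live = "blue"

    @property
    def idle(self):
        return "green" if self._live == "blue" else "blue"

    def stage(self, model):
        """Put a new model in the idle slot without touching live traffic."""
        self._slots[self.idle] = model

    def predict(self, x):
        """Serve a user request from the live slot."""
        return self._slots[self._live](x)

    def shadow_predict(self, x):
        """Exercise the staged model for testing only."""
        return self._slots[self.idle](x)

    def swap(self):
        """Make the staged model live; the old one stays available in the idle slot."""
        if self._slots[self.idle] is None:
            raise RuntimeError("nothing staged in the idle slot")
        self._live = self.idle


router = BlueGreenRouter(lambda x: x * 2)   # current production model
router.stage(lambda x: x * 3)               # newly retrained model

live_before = router.predict(3)
shadow = router.shadow_predict(3)
router.swap()
live_after = router.predict(3)
```

Because the previous model is kept in the idle slot after the swap, calling `swap()` again acts as a rollback.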
