Neptune Blog

Setting up MLOps at a Reasonable Scale With Jacopo Tagliabue

Stephen Oladele

23rd August, 2024

MLOps

This article was originally the second episode of MLOps Live, an interactive Q&A session where ML practitioners answer questions from other ML practitioners.

Every episode is focused on one specific ML topic, and during this one, we talked to Jacopo Tagliabue about reasonable scale MLOps.

The episode is also available on YouTube and as a podcast. If you prefer a written version, here it is!

You’ll learn about:

1 What reasonable scale MLOps is
2 How to set up MLOps at a reasonable scale
3 What tools to use and whether to buy or build them
4 How to deliver models to customers
5 What are the limits of reasonable scale
6 And much more.

Let’s kick it off.

Sabine Nyholm: It’s our pleasure to introduce Jacopo Tagliabue, who has even been called the father of reasonable scale MLOps. Definitely a machine learning rockstar. Jacopo, could you introduce yourself a little bit to our audience?

Jacopo Tagliabue: Of course. I’m currently the director of AI at Coveo. For those of you who do not know Coveo, it is a company publicly traded on the TSX. We basically provide machine learning models to customers to serve different types of use cases. The one I know best is eCommerce, because that’s where I come from.

The idea is that let’s say you have a website and you want to power your website with smart recommendations, then you will ask Coveo to provide you with a model to do that.

This is important not only for telling you a bit about myself, but it will be important in the next hour to tell you where we come from with the reasonable scale ML.

We’re a B2B company. We do not have customers using models directly, but we provide customers with models to use on their shoppers or something like that. I think that’s an important distinction to be made.

I joined Coveo due to the acquisition of my company, which was called Tooso, which was doing pretty much the same thing, natural language processing and AI for eCommerce in San Francisco, with my co-founders Ciro and Mattia. So I’m a former entrepreneur as well.

I also like to be part of the research and academic community. I am an adjunct professor at NYU, where I teach machine learning systems, so actually how to put together all these machine learning pieces to make them work. And I try to contribute to the eCommerce tech community as much as I can with my team – with papers, open-source, and data.

What is reasonable scale MLOps?

Sabine: Excellent. Thank you so much, Jacopo. Just to warm you up Jacopo, how would you explain reasonable scale MLOps in one minute?

Jacopo: Sure. A lot of the ML guidelines and practices that we read every day in blog posts, in a business review, or wherever, boil down to “Google did this, or Facebook did this, and you should do this too”.

Reasonable scale ML is the idea that if you’re not Google, then it’s okay. It’s okay not being Google. Actually, the vast majority of use cases of ML in the near future are going to be inside enterprises or in companies that are not necessarily big tech.

That’s the context for a reasonable scale, but the exciting part of a reasonable scale is that thanks to the growing and blooming ecosystem of open-source tools or solutions (like Neptune, for example) that are now available for everybody with a very low barrier to entry, it’s possible to do cutting-edge ML at reasonable scale. This is a huge difference because four years ago, when we started Tooso, my own company, a lot of things that now we take for granted were not there.

You needed huge resources to make that happen, but now for the first time:

1 if you start with a small team
2 or with not so many resources,

you can still do very good ML. Again, thanks to this incredible ecosystem.

What we’re trying to do is evangelize. This is not even our own business, but we’re trying to evangelize people who will work for us, to help them let go of this idea that doing ML in production is super complex. It is if you don’t have the right tools. It is not if you know what you’re doing.

Stephen Oladele: Just for a follow-up, Jacopo, I’m wondering what made you start thinking about reasonable scale? What’s the story behind that?

Jacopo: As we said before, remember, B2B companies are very different from B2C companies. Take a typical B2C company like Airbnb or, let’s say, Amazon: they have one recommender system (to simplify), and they have one website to handle. It’s a huge and very important website, and a very hard one, but at the end of the day, they control this website end-to-end.

Every time they make a small improvement in the recommender system, they’re going to pocket the difference in whatever money they’re going to make, which in the case of Amazon is billions of dollars, and that justifies the investment.

B2B companies like Tooso or Coveo have a different way of growing. We grow by adding customers. Our ML needs to be, of course, good, but we need to optimize not for one specific client but for robustness across literally hundreds of them.

When you go to each of these customers, none of them is going to be Amazon. Maybe all together they add up to terabytes and terabytes of data, but each one of them is a reasonable scale customer with millions of events per day, not trillions. The idea of the reasonable scale stack came from the realization that most of our business problems are per customer or organization, and most organizations in the world are reasonable scale organizations.

Most organizations worldwide are not Amazon, as a matter of fact, and we want to help them as well. There’s no reason why you shouldn’t have good ML just because you’re not Amazon. That’s the TL;DR tagline of all this.

Setting up MLOps at a reasonable scale

Hyperscale vs reasonable scale companies

Stephen: Talking about Amazon, Google, and so forth, these are hyperscale companies, as you’ve mentioned. If you are setting up ML for a small company of, say, four engineers or two data scientists versus one at Amazon scale, what are the biggest differences you could point out?

Jacopo:

The first one, going back to my previous point, is a difference in incentives.
- If you’re at Google or Amazon, there’s an incentive to be marginally better at something because a 0.1 improvement on X will translate into billions of dollars.
- This is not necessarily true for a reasonable scale company. On the incentive side, the way in which you structure your team (for example, how many data engineers, how many data scientists, and how much machine learning research you have) depends on this business constraint.

The second point is build versus buy versus open source versus whatever you want: it’s the question of how you use your resources.
- If your resources are constrained in some way, you really want to spend the vast majority of your time, possibly all of your time, doing high-value, high-margin activities.

For example, if I’m a data scientist building a recommendation model at Coveo, I want to do recommendation models. I don’t want to deal with infrastructure, scaling, experiment tracking, you name it. It’s much better for me to buy or somehow leverage something that is already very good at one of these things, which are not my core. They’re necessary to do my job, but they’re not what I do, and that way I can focus on the high-value work.

At humongous scales like Google and Facebook, they have already solved all of the ancillary problems because these are so important to them. They may not choose other providers because there are some peculiarities to their level of scale. Most reasonable scale companies are actually similar in that respect. As in, if you use Metaflow or Snowflake or whatever at Coveo, I’m sure you can use it at a very similar company with mostly the same patterns.

*Comparison of “reasonable scale” and “hyperscale” companies*

Tool stack of a reasonable scale MLOps

Chris: Jacopo, I appreciate your definition of B2B versus B2C. We sell to other businesses, so we call ourselves B2B, but really we are B2C. Those are just customers. We have our own product, and we have an interface. (I say that to frame out or help you with the answers.)

We’re at phase zero. I’m a machine learning engineer who was brought in only a couple of months ago and has spent all that time doing software engineering so far, but I am tasked with figuring out what I think is the most important piece of machine learning experimentation, as a small data science team continues to try to advance the three commercialized data science models we have today.

The big question is, we’re an AWS shop, and so I’m struggling with the MLflow, SageMaker, and Neptune platform paradigm to get started.

Jacopo: You’re talking just about the experiment tracking part of the entire pipeline? Do you just want to focus on experiment tracking?

Chris: I got to take small bites. I think it’s a mistake for a team of three to try to do it all.

AWS would like us to just sign on and do it all. I’m not yet sold on SageMaker experiments.
I’m not yet sold on the cost of a small company taking on the resource requirements of MLflow, although we have that stood up on Kubernetes on AWS.

We’re there. Small bites. Right now, we’re talking model experimentation, get that under control, get better visibility inside the team of what each of us is doing to our primary core components.

Jacopo: I will go back to the general point afterward, but let me answer this specific question first. For experiment tracking, I am intimately familiar with the Neptune, Weights & Biases, and Comet products.

Bookmark for later

→ Check an in-depth comparison between neptune.ai and Weights & Biases and between neptune.ai and Comet.

→ Read about lessons learned by engineers behind neptune.ai when building an experiment tracking tool.

I think they cater to slightly different types of users, but my general suggestion is that I much prefer the SaaS experience that Neptune provides compared to MLflow.

Not because MLflow is not great, we used that before as well, but because it doesn’t justify the additional maintenance and cost of ownership for such a feature in our team, which is heavy on cutting-edge experiments and not so much on infrastructure and bare-bones maintenance.

I think the very first choice you have to make here is between something that you host yourself and one of these SaaS tools.

In my opinion, this is a no-brainer SaaS thing. This is one of the things that especially small teams should go SaaS with, but I understand that some other people may have different constraints on security or all sorts of other problems. If you ask me, this is SaaS, this is 100% buy. There’s no universe in which this is a build-it-yourself kind of thing.

Chris: Yes, I think in phase one security is our biggest concern because it’s all PII, in the sense of the company’s customers and the customers’ customers, so it has to be behind our VPN. We haven’t opened that conversation with Neptune yet on the cost and complexity of getting that set up.

Jacopo: I super totally understand that. As far as I know, with all these solutions, you can start – for the experiment tracking part, there doesn’t seem to be much sensitive information that you need to send to get value out of this tool, actually. It’s more loss metrics, aggregate metrics, and evaluation. I think for all of these tools, even a SaaS adoption can be a journey.

You start with something that is light on the security side but still provides value for your experimentation cycle. Then the more you build, the bigger the appetite you build up. I think there’s a path here to have a short adoption cycle, see how it goes, and then postpone the security discussion for when you need to upload artifacts or data or something like that. Does it make sense?
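To make the "light on the security side" starting point concrete, here is a minimal sketch of what a metrics-only tracking payload could look like. It is not the Neptune, MLflow, or any other vendor API: `build_tracking_payload` and all field names are hypothetical. The assumption is that only numeric, aggregate values are allowed to leave your network, and everything else is dropped before anything is sent.

```python
import json
import math
from numbers import Real


def build_tracking_payload(run_name, metrics):
    """Keep only finite numeric metrics; drop anything that could carry raw data.

    Hypothetical helper for illustration, not part of any tracking library.
    """
    safe = {}
    for key, value in metrics.items():
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(value, Real) and not isinstance(value, bool) and math.isfinite(value):
            safe[key] = float(value)
    return {"run": run_name, "metrics": safe}


candidate = {
    "val_loss": 0.412,
    "ndcg_at_10": 0.63,
    "n_train_rows": 120000,
    "sample_user_email": "someone@example.com",  # PII: must never be sent
    "raw_batch": [[0.1, 0.2], [0.3, 0.4]],       # raw data: must never be sent
}

payload = build_tracking_payload("baseline-run", candidate)
print(json.dumps(payload, sort_keys=True))
```

A filter like this can sit in front of whichever tracker you adopt, so the security conversation about artifacts and datasets can happen later without blocking the first experiments.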

Chris: A lot. Thank you, yes.

Might be useful

Many SaaS tools can also be deployed on your server. Check the on-prem offering of Neptune.

Jacopo: Another thing that I want to say about AWS is that there’s a tool that we really super suggest (it’s open-source, so nobody’s getting any money out of that), and it’s called Metaflow. We actually have several open-source repos on GitHub that show you how to build an entire system end-to-end. For people using AWS, we always find Metaflow is a good backbone on top of which to put other tools, like Neptune and then SageMaker or Seldon for serving, or whatever you need.

My suggestion is if you’re already sold on AWS, it doesn’t mean you need to go the entire SageMaker way, you can choose some other open-source tool. They’re very good at making you productive.

Sabine: Yes, we actually have a follow-up question in chat about Metaflow. Is it ready for Kubernetes or not?

Jacopo: I cannot say officially because I don’t know the latest release. I think I can say it’s in the works. I don’t know if it’s already out or if it’s in beta. The way we use it, though, is with AWS as the compute layer. For us, it’s not a big deal because we’re using it on top of AWS Batch and AWS Lambda, which are already part of our infrastructure.

I think the Kubernetes thing is interesting because many people use it, of course, as the existing backbone of their infrastructure.

I think there’s a nice post by Chip saying that data scientists don’t need to know Kubernetes. I will go one step further, and I’m going to say a lot of ML teams don’t really need to use Kubernetes at all.

At a reasonable scale, there are solutions as good as Kubernetes to run your computing, training, and serving, probably with a fraction of the maintenance headache of actually going to Kubernetes.

Irrespective of the status of Metaflow and Kubernetes today, and I’m sure that’s going to improve, we’ve never had any problem running Metaflow on AWS PaaS services, and we’re actually very, very, very bullish on the idea of using these PaaS services instead of maintaining a cluster ourselves.

Sabine: What are the most costly parts of the MLOps tooling and processes for you, whether it’s from a time, compute, storage, or other standpoint?

Jacopo: That really depends on what you do. For us, it’s compute, in particular AWS Batch, and in particular GPUs. For my team, in proportion, this is the biggest part of our bill.

But what my team does is either research, which means cutting-edge modeling, or prototyping, which is also somewhat cutting-edge: deep learning stuff, a lot of data, and so on and so forth. The lion’s share of our bill would be like that.

The other important part, which tends to be now less expensive, is the Snowflake component. All of our data is stored in a data warehouse, for us, it’s Snowflake. We call it a single source of truth because we store everything there by design.

When you query a lot of terabytes and terabytes of data, even at a reasonable scale, at some point, the bill piles up.

What I want to say, though, is that as much as GPUs, Snowflake, or SageMaker cost, it’s really a fraction of what my time, or the time of a person like me, costs. It’s a negligible fraction of what a skilled ML/AI person in the United States would cost. Every minute that I spend worrying about infrastructure is way more costly than paying Jeff Bezos to give it to me.

Of course, at some point, this analogy breaks down.

But again, at a reasonable scale, when you have few people, your cost is people. Literally, your only cost is people. Because if somebody’s not happy because they do infra, instead of ML, they’re going to leave. Replacing them is going to cost even more.

There are all these hidden costs that we don’t take into consideration because it’s people’s costs. They are way bigger than your AWS bill, infinitely bigger. Optimize for people’s happiness at a reasonable scale. Then when you really become big, you can optimize for computing as well.
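The trade-off in this answer is simple arithmetic. The sketch below uses made-up placeholder numbers (they are not Coveo figures or real prices) purely to show the shape of the comparison; plug in your own. The function and constant names are hypothetical.

```python
def monthly_people_cost_of_infra(hourly_cost, infra_hours_per_week, team_size, weeks_per_month=4):
    """Cost of the time a team spends on infrastructure instead of ML."""
    return hourly_cost * infra_hours_per_week * weeks_per_month * team_size


# Placeholder assumptions, for illustration only.
HOURLY_COST = 100              # fully loaded cost of one ML person, per hour
INFRA_HOURS_PER_WEEK = 8       # hours each person loses to infrastructure work
TEAM_SIZE = 3
MANAGED_SERVICES_BILL = 2_000  # extra monthly spend on managed services

people_cost = monthly_people_cost_of_infra(HOURLY_COST, INFRA_HOURS_PER_WEEK, TEAM_SIZE)
print(f"Time spent on infra: {people_cost}/month")
print(f"Managed services:    {MANAGED_SERVICES_BILL}/month")
print("Buying is cheaper" if MANAGED_SERVICES_BILL < people_cost else "Building is cheaper")
```

The calculation leaves out the cost of replacing someone who leaves because they do infrastructure instead of ML, which, per the answer above, only pushes the comparison further toward buying.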

Sabine: We had a hardware question from Shrikant. Jacopo, do you use NVIDIA GPU-powered compute for your model training and inferencing? If yes, how well are the GPUs abstracted for custom consumption needs across different functions within one model?

Jacopo: Yes, we use the standard NVIDIA stuff that is provided through AWS, both in Batch and in SageMaker. We are also evaluating NVIDIA’s open-source inference server, Triton, as the software stack to actually do the deployment.

These two things are not exclusive. You can have a SageMaker-powered endpoint, which runs Triton as your inference software.

The abstraction is actually fairly transparent. When you tell Metaflow what to run on AWS Batch, you can point Metaflow to a specific container that you built with the right dependencies.

What happens is: you tell Metaflow, “hey, run this training code for this Keras model on this container, and this container is proven to run on GPU-enabled AWS Batch instances”, and it will just run out of the box. For SageMaker serving, it is the same thing. The serving instance that is pre-made for GPU is going to work with GPUs automatically.

For us, this has been pretty much abstracted away. I don’t think we’re at the 99th percentile of complexity on the GPU side because most of our GPU work is for research and prototyping. This setup may not satisfy somebody who wants to do GPU serving with very low latency. This is a problem that, for now, we haven’t had. But for research and iteration, our GPU abstraction with AWS Batch has worked very well so far.

Sabine: A question about orchestrating an ML workflow. Rafael has been using Prefect and Airflow. He’s saying, “I’m willing to use SageMaker inference endpoints to make model deployment easier. Would you say it makes sense to use SageMaker pipelines to orchestrate the workflow?”

Jacopo: I have no experience with SageMaker pipelines, so my opinion here may not be decisive. I have experience with both Airflow and Prefect as orchestrators, and if you consider Metaflow as an ML orchestrator, that’s also an alternative.

I don’t think you’re forced to buy into the whole SageMaker pipeline just to get value out of this.

For example, we don’t. We use SageMaker for training. We find it a bit clunky as an API compared to just normal Python. I think you can make your evaluation independently. There’s no reason to jump into the full SageMaker stuff if you don’t want to.

On the other side, there are many reasons to go to SageMaker for things that are fairly decoupled from the rest of your orchestration (for example, endpoints).

How to build MLOps at a reasonable scale

Stephen: How do you think about what components to put in first when you’re building at a reasonable scale? Especially if you have two data scientists on your team or four engineers?

Jacopo: For your first ML project, the first rule is always: never do ML if you can avoid it. To get a feeling for an ML feature, remember that the vast majority of the value will be in the data, especially at a reasonable scale, where again, a marginal difference in the model will make less of a difference in your business outcome.

If your team is small, I would suggest:

1 Start with the data. Make sure that it’s clean and that it’s properly stored. We can discuss how to do that.
2 Then, you can work on this data to produce a small endpoint. In the very beginning, it doesn’t need to be an ML endpoint. It can be a set of rules, it can be the stupidest model you can think of, it can be a bag of words with scikit-learn to do text classification.
3 Then make sure that things work end-to-end: from the data to your model, to the prediction, and then to capturing whatever feedback your use case is about. If it’s recommendations, it’s a click. If it’s text classification, maybe it’s a thumbs up or thumbs down, if the user wants to leave a review or something like that.
4 Once your two data scientists know how to go from data to clean data, to model, to endpoint, and to feedback, once all these pieces are in place, decouple.
5 As long as you keep the contract between these parts intact, now you can start improving. Now we remove the bag of words and replace it with whatever you want. Now we make data quality better. Instead of just checking how much data we have, we can use Great Expectations to check distributions or whatever.

Start small but start small in a thin slice. Because if you start small by focusing only on one part of this, you lose focus on the problem. What you need to solve is end-to-end. It goes from data to user and then back from user feedback to data again.
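The steps above can be sketched as a thin slice in plain Python. Everything here is hypothetical and deliberately naive: the `clean`, `predict`, and `FeedbackStore` names, the keyword rules standing in for "the stupidest model you can think of", and the in-memory list standing in for a data warehouse. The point is only that data, model, prediction, and feedback are connected end to end before any single part is made good.

```python
from dataclasses import dataclass, field


def clean(rows):
    """Data step: drop empty texts and normalize case and whitespace."""
    return [" ".join(r.lower().split()) for r in rows if r and r.strip()]


# Model step: a rule-based stand-in for a first text classifier.
POSITIVE_WORDS = {"great", "love", "good"}
NEGATIVE_WORDS = {"bad", "broken", "refund"}


def predict(text):
    words = set(text.split())
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    return "positive" if score >= 0 else "negative"


@dataclass
class FeedbackStore:
    """Feedback step: an in-memory stand-in for wherever feedback is stored."""
    events: list = field(default_factory=list)

    def record_feedback(self, text, prediction, thumbs_up):
        self.events.append({"text": text, "prediction": prediction, "thumbs_up": thumbs_up})

    def accuracy(self):
        if not self.events:
            return None
        return sum(e["thumbs_up"] for e in self.events) / len(self.events)


raw = ["  Great product, love it ", "", "Arrived BROKEN, want a refund"]
store = FeedbackStore()
for text in clean(raw):
    label = predict(text)
    # In production the thumbs up/down comes from the user; here it is simulated.
    store.record_feedback(text, label, thumbs_up=True)

print(store.events)
```

Swapping the keyword rules for a real model later only touches `predict`; the data and feedback steps keep the same inputs and outputs.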

*Examples for pillars of MLOps*

What we always try to do is prototype a thin slice of a feature. Then you go back, and you improve all of that. You improve it well. You can improve for months. But the first thing is to spend as little time as possible to get something out of the door. It is possible if you use good tools. If you use a SaaS experiment tracker. If you don’t waste your time on dumb stuff. That’s very important.

The time you spend thinking about, let’s say, buying a SaaS experiment platform is going to cost more in the end than if you just sign up for it. You use it for a month, you ship something, you see how it works, and then you decide. It’s really important.

Again, this comes from a very specific way of building a reasonable scale company, and comes from a very startup mentality when you continuously iterate. Where velocity is a virtue.

I do understand that’s not true in all contexts. Of course, I wouldn’t suggest doing this for healthcare or autonomous vehicles. There are places where moving fast is not a value, but there are a lot of places where moving fast is actually a significant component of being good at ML.

Sabine: We have a question in chat by Robert. He has some data privacy concerns here, and he’s asking if you have any recommendations about tooling to set up a full MLOps pipeline, except serving maybe for a small ML/AI team, three to five people with a hard requirement to run on-premises. They have some GitLab CI and DVC already set up and running.

Jacopo: Unfortunately, or fortunately for my life, e-commerce is a very non-sensitive area of data because e-commerce data is always hashed or anonymized in the first place. A lot of the data we deal with is interaction data, what people do on a website. It has normal security concerns, but it’s not healthcare data and it’s not particularly sensitive. Once you’re compliant with GDPR and general regulations, your work as a data scientist is made somewhat easier by the business case.

I don’t have huge experience in privacy-first environments, unfortunately. I know DVC is great, but our whole life is about cloud and services. The pitch of the reasonable scale stuff is that cloud and services are good, as in they allow you to move much faster than if you have to maintain stuff yourself, with the obvious implication that when that’s not possible, you have to go on-prem and do things yourself.

I don’t know the use case in particular, but another thing that I found in my experience is that a lot of people tend to drastically overestimate the security of their own setup as compared to cloud providers or SaaS providers. They offer the same if not better security if somebody actually were to look into that.

Again, I don’t know the specific use case, but I know cases in other sectors where the resistance to moving to the cloud is more cultural than an actual, factual problem about the cloud not being secure, under whatever definition of secure or private.

Maybe, in this case, setting up good data governance for datasets or for data access or for Snowflake or Metaflow or whatever, may be a way to make this privacy concern a bit less heavy.

Stephen: What about companies where it’s against the culture to allow their data and their processes outside of their own stack? How do they navigate this problem when their culture relies on having everything internal, and how do they adopt the reasonable scale MLOps approach?

Jacopo: That’s one of the hardest questions there is. It’s a question about humans, it’s not a question about Python or SageMaker or whatever. It’s very hard. In my experience working with dozens of organizations that are our previous customers or current customers or friends that I have all over the industry, the cultural aspect is the hardest one to change.

In fact, a lot of the pushback we get on our Bigger Boat repo or other solutions we put out as open source is, “Well, you’re lucky because they allow you to do all of these things in your company. In my company, there’s a team for data, there’s a team for the model, there’s a team for deployment or whatever (you can map this into your own experience).”

For us, this is the mother of all battles, if you can fight it. We subscribe to the idea of the end-to-end machine learning practitioner or data scientist. There’s a person that:

1 Can see the data that has been cleaned and prepared, of course, by a data expert,
2 Can prepare their own features,
3 Can train their own model,
4 and can ship it out.

In everything we do, going back to the thin slice idea, the goal is to empower one person (maybe more, but it can even be just one) to do all of this and produce business value without ever talking to anybody else. Without talking to a DevOps person, without talking to an engineer, without talking to security, without talking to anybody.

Everything that is not abstracted away enough for one machine learning engineer to go from data to endpoint is an opportunity for improvement. The more companies understand this, the more value they’re going to get out of their ML initiatives.

One of the problems with companies that start with a small team of data scientists is that instead of embedding them into the company workflow and culture, they put them in a silo to do notebooks or whatever on their laptops.

Then after one year, Gartner says 97% of the projects never make it to production (my LinkedIn feed is literally filled with Gartner saying that people never ship stuff to production). We can say it’s the fault of the data scientists, maybe, but we didn’t really set them up for success. We put them in a bubble and said, “Hey, do some magic stuff.” Now whatever magic they do is impossible for us to consume in the company.

My suggestion for people starting up is counterintuitive, but start with very few people and get people that understand the end-to-end value.

They don’t need to be the best people in modeling, they don’t need to be the best people in SQL, and they don’t need to be the best people in infrastructure, but they understand all of these problems. I understand these people tend to be expensive because there are not very many people who have actually done this before, end-to-end. But one or two people who know what they’re doing will make all the next 10 hires infinitely more productive.

If you do the other route, which is hiring 20 people that know PyTorch out of a university (which I’ve seen happen so many times), now you have 20 people that are relatively expensive, maybe less expensive per person, but relatively expensive. They run around to find data, ship in a notebook, but nothing ever gets done. And after one year, the exec says, “Well, this ML thing doesn’t work.”

For sure, it doesn’t work like that. I suggest another approach. When you hire expert people first, you build a productivity tool, and then you hire people to actually do the modeling and to improve on that. That would be my suggestion for people starting up. Again, this is the hardest part because it’s a cultural battle; it’s not a technical fight.

Sabine: Would you recommend going with full-stack engineers or more specialized data scientists, ML engineers, or software engineers?

Jacopo: For building the MLOps part, I think you need somebody who understands a bit about the life cycle of an ML model.

1 It may be an exceedingly talented software engineer, I don’t know, coming from Netflix, who has been exposed to modeling as well.
2 It can be a data engineer who has some knowledge of ML.
3 Or it can be an ML person who has some knowledge of data.

In my experience, people coming from a pure software engineering background tend to be very good at some of these tools, but they tend to underestimate some other problems that arise in ML, typically the fact that data continually changes. The behavior of the system is actually a distribution; it’s not a linear path as in normal software.

And somebody who is only a data scientist and has always focused on modeling may underestimate the other complexity: how hard it is to make the data clean, available, scalable, and so on and so forth.

That’s why I’m saying, I know it’s hard to find these people that know a bit of everything. But in my experience, they tend to be the best people to at least set up a good practice. They’re your data leaders. Then after you get your data leaders, you can get people with the most specialized expertise to do their job.

Delivering models to customers and monitoring them in production

Sabine: We then have a few questions about a different topic.

1 How do you deliver models to customers?
2 Do they use your API, or does it work differently?
3 And how do you know how the models are performing in production?
4 How do you know when to update them, for example?

Jacopo: Those are two very good questions. For the first question, I’m going to split it in two. One is, let’s say, the prototype, internal deployment, and one is, let’s say, the public-facing, globally available deployment. Because, again, remember that what we do is literally APIs: our product at Coveo, or at Tooso before, is literally APIs. The model is the product. We’re not a model embedded in Airbnb, the model is actually the product.

For the globally available stuff, there’s an entire infrastructure that’s been built by Coveo engineers over the years, and it’s based on our own internal tooling. It’s actually based on Kubernetes, and it was done at a time when PaaS services like Fargate or whatever there is were less available.
For the newer stuff, like the prototyping and research that we do, especially in my team, we’re very happy to use hosted services instead.
- If it’s very simple stuff, we can use, let’s say, Fargate. (FastAPI with Fargate is like the standard data science tutorial that you can find online.)
- For slightly more complex models, we use SageMaker because it gives you this Python-based API so that if you’re using Metaflow, like in our case, you can have a final step in your pipeline, after the experiment tracking and so on, where you deploy your models with two lines of Python.
  It plays nicely with Metaflow because the model artifacts are already stored in S3, so you just tell SageMaker where the model artifact is, and then SageMaker is going to spin up an endpoint for you. We find it’s a very good way to iterate and build internal endpoints.
  SageMaker is not perfect by any measure. We find the management of dependencies and custom containers could be vastly improved, but once you get around that, it’s actually very quick to spin up something from your Python code.

That hopefully answered the first part of the question. The second part about monitoring is even more interesting stuff. But it’s a newer part for us as a team and as a company.

The monitoring space has been blooming in the last six months or so. We’re actually evaluating the open-source platforms and SaaS providers in that space because we do something that’s very quick, which is information retrieval.

Most of what we do is either recommendation or search, and that kind of use case is not the usual one that people mention when they talk about monitoring. Maybe they mention the loan prediction problem or NLP, but most of our company actually does information retrieval. So it deals with pseudo feedback, and it has different constraints from most of these.

When we have an answer, we’ll probably publish a blog post or another open-source repo to tell you what we think. But for now, the monitoring space is a very work-in-progress area also for us.

Scaling and improving the MLOps workflow

Stephen: Say you’re starting out with reasonable scale MLOps, with that thin slice, and then you start building things on top of it. How do you determine when something should just be a bash script that you automate, rather than something very complex picked off the shelf, like a SaaS platform?

Jacopo: For me, very pragmatically, everything that works is fine. As the saying goes, perfection is achieved not when there is nothing more to add but when there is nothing left to take away. The fewer moving pieces you have, the more robust your system will be. That’s for sure. There’s no need to get an entirely new piece of infrastructure if you can just run three lines of Python to solve your problem.

I typically find that if you get the functional components of your work right, let’s say data cleaning, training, tracking, deployment, etc., and you get them right as separate boxes, some of these boxes may be super easy at first, maybe a small script, three lines of Python, whatever, but that leaves you room to grow in the future when those three lines of Python won’t do anymore.

Start with the bash script or whatever you do for one use case, but always put that script or that line of Python into a pipeline that can grow. People are typically not slowed down by what’s in a box; people are slowed down by how the boxes are connected together. That’s why I say: first, figure out how the boxes are connected, and then go into each of these boxes (if you want, if you need) to make them better.
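
As a rough illustration of “boxes first, contents later”, here is a minimal sketch in plain Python. The `Pipeline` class and the step functions are hypothetical names invented for this example (they are not Metaflow or any other library); the point is only that the wiring between steps stays fixed while each step can start as a few lines and be replaced later.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Callable[[Dict[str, Any]], Dict[str, Any]]


class Pipeline:
    """Fixed wiring between named boxes; each box can be swapped on its own."""

    def __init__(self, steps: List[Tuple[str, Step]]):
        self.steps = list(steps)

    def replace(self, name: str, new_step: Step) -> None:
        """Upgrade one box without touching how the boxes are connected."""
        for i, (step_name, _) in enumerate(self.steps):
            if step_name == name:
                self.steps[i] = (name, new_step)
                return
        raise KeyError(name)

    def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
        for name, step in self.steps:
            context = step(context)
            # Record which boxes ran, without mutating earlier contexts.
            context = {**context, "history": context.get("history", []) + [name]}
        return context


# Day-one boxes: each one is only a few lines of Python.
def clean(ctx):
    return {**ctx, "rows": [r for r in ctx["rows"] if r is not None]}


def train(ctx):
    # Toy "model": predict the mean of the training rows.
    return {**ctx, "model": sum(ctx["rows"]) / len(ctx["rows"])}


def track(ctx):
    return {**ctx, "metrics": {"n_rows": len(ctx["rows"])}}


def deploy(ctx):
    return {**ctx, "endpoint": "local"}


pipeline = Pipeline([("clean", clean), ("train", train),
                     ("track", track), ("deploy", deploy)])
result = pipeline.run({"rows": [1.0, None, 3.0]})
print(result["model"], result["history"])
```

When the toy `train` box stops being enough, `pipeline.replace("train", better_train)` swaps it for a more serious implementation while the other boxes and their order stay untouched.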

Stephen: I also have a question about scaling your stack. For example, say I have a use case where I’ve built one vision model, and I’m serving maybe a handful of requests a day. But I know for sure that things are definitely going to scale. I’m going to be building models to solve different problems, like 20 models in one year, and those will probably be serving millions of requests.

How do I think about going from zero to a reasonable scale and not breaking things along the way? Not breaking my bash scripts or stuff like that, that I’ve built?

Jacopo: Your concern here is future scalability, meaning that we’re starting with a small dataset, small feedback, small serving, and we can increase that over time. An important point is that the increase may not be uniform.

Maybe the training data stays pretty much the same, but now you’re serving a billion users because your model is way more successful.
Or maybe everything scales up.
Maybe it’s information retrieval, so what you’re actually ingesting is feedback data, and your training data is also going to go up because your serving is successful.

Maybe it’s slightly correlated, but it also may be completely independent. So it’s important that your pipeline is able to sustain this independent scaling.

A Snowflake solution will give you that out-of-the-box for data ingestion and preparation. It will work with 100 gigabytes of data or with 100 terabytes. The only thing that’s going to change is your bill, but your code is going to run exactly as it is with no changes.
On the training side, let’s say you’re using Metaflow with AWS Batch: you can scale up to a certain point just in code. For example, if at first you were using one GPU, now you can use four, and it’s a negligible change to your code as you progress. Everything else stays the same.
Now we get to the deployment part.
- Let’s say at day zero, you can even do SageMaker serverless. I haven’t tried it yet. I know that it’s been out for two months or something, but now there’s a SageMaker option where you pay just for inference. You don’t even pay when it is not queried. For the first model, as you said, it’s a couple of requests per day, so it may be a super cheap solution for you. It’s going to basically be free for your first month.
- Then, when usage grows, maybe you can go to standard SageMaker with auto-scaling. It’s probably going to be a two-line difference in your pipeline.
- And when you’re really at scale, SageMaker is going to cost a lot, and you’ll probably have to move to other deployment options out there. But this is an easy way to build something that works on day one and on day 365 is going to work literally 98% the same.
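
The progression from serverless to standard endpoints can be sketched as a small change in how the same model object is deployed. This is an illustrative sketch only: the traffic threshold, memory size, concurrency, and instance type are made-up assumptions, and `deploy_options` and `deploy` are hypothetical helper names. `ServerlessInferenceConfig` and `Model.deploy` come from the SageMaker Python SDK, which is imported lazily here.

```python
from typing import Any, Dict

# Hypothetical cut-off; choose one from your own traffic and pricing.
SERVERLESS_MAX_REQUESTS_PER_DAY = 10_000


def deploy_options(requests_per_day: int) -> Dict[str, Any]:
    """Pick deployment settings from expected traffic (illustrative rule)."""
    if requests_per_day <= SERVERLESS_MAX_REQUESTS_PER_DAY:
        return {"mode": "serverless",
                "memory_size_in_mb": 2048,
                "max_concurrency": 5}
    return {"mode": "instances",
            "initial_instance_count": 1,
            "instance_type": "ml.m5.large"}


def deploy(model: Any, options: Dict[str, Any]):
    """Deploy a SageMaker SDK model object with the chosen options."""
    if options["mode"] == "serverless":
        # Pay-per-inference endpoint: no instances to size up front.
        from sagemaker.serverless import ServerlessInferenceConfig
        config = ServerlessInferenceConfig(
            memory_size_in_mb=options["memory_size_in_mb"],
            max_concurrency=options["max_concurrency"],
        )
        return model.deploy(serverless_inference_config=config)
    # Standard instance-backed endpoint.
    return model.deploy(
        initial_instance_count=options["initial_instance_count"],
        instance_type=options["instance_type"],
    )
```

The rest of the pipeline does not change between the two modes. Auto-scaling for the instance-backed endpoint is configured separately, through AWS Application Auto Scaling, and is not shown here.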

The “buy vs build & maintain” dilemma

Stephen: Looking at small teams, for example, one crucial thing is the everlasting argument in the software industry: build versus buy versus open source.

How do you think about that in the context of reasonable scale stacks?

Jacopo: For us, it boils down to infrastructure being always bought (I think maybe because I personally really suck at infrastructure). Anything that needs maintenance for me is always a buy at the end of the day. Even maintenance in a derived setup.

For example, my company Tooso was built on EMR and Spark for our data processing. And I was one of the happiest people in the world that Snowflake got invented because I didn’t have to deal with Spark anymore. I’m super happy to buy all the computing just to better abstract away all the problems of distributing queries and so on and so forth.

For training as well. I can’t be bothered by spinning up one GPU or two GPUs. That’s really not a good use of my time.

For endpoints, as we discussed, up to a certain point it’s also okay to buy them just because they’re there. Past a certain point, that becomes very expensive. Again, things will change. For me, always buy computing. Always offload how things are run and maintained and how they scale.

Because at the end of the day, the value of my work is in the line of Python, in my Metaflow step. That’s what I’m being paid for.

Then to do that job, I need to have computing, I need to have experiment tracking, I need to have deployment – but my job, at the end of the day, is what I put in those lines and everything else for me is almost always a buy.

I know that I’m on the very extreme end of the spectrum here. But again, if you think about:

1 how much an ML engineer costs today in the United States,
2 how much it would cost me to replace them,
3 and how much I’m investing in the people on my team,

ultimately, as much as PaaS solutions may cost, they cost a fraction of what my team’s happiness is actually worth to my company.

Stephen: How do you compare the pricing of these solutions? Say, for example, an open-source solution versus a typical SaaS solution. Do you have something you look at? Do you have a framework you use to compare costs in the tools you use for your stack?

Jacopo: Some of the tools we use do not really have an alternative, e.g. Snowflake. The alternative to Snowflake is to change your architecture, for example by going to Spark, and that’s a big no for what we do. That’s one thing.
For other stuff, it really depends on the use case. For a SaaS experiment tracking platform, the barrier to entry is very low. It’s a small change in the code base that makes the team very happy, and these platforms typically cost a very reasonable amount of money at a reasonable scale. You can start with very little. It’s something that I would suggest buying every time.
For other things, let’s say Prefect, an orchestrator, it’s a very important part of your stack. It comes either as open source that you host yourself or as a platform. What do you do?
It depends on what you feel, but most of these tools have a SaaS entry level that is very small. You can start with that, and then you can always go back.

My suggestion at the end of the day is to start with something. If:

You don’t know what you want, or maybe you know what you want,
But you didn’t get to the maturity level of being able to do the thin slice, to do a new model in one day end-to-end (this is where we are at),
But you’re still starting up,

Buy stuff first and see if you like them.

And then if you really like Prefect, and there are 100 people that want to use Prefect in your company, okay – maybe you can put it in-house and maintain it yourself.

Let’s not put the cart before the horse. Let’s see if we like it first. Then if we like it, we can always decide what to do at scale.

That’s why I say with SageMaker: before building your own deployment on Kubernetes, build a model on SageMaker and see if it works, see if it provides value. After that, you can always go back and redo this in Kubernetes.

I think a lot of people are afraid of SaaS costs, when actually building stuff up front that you don’t even know you need is way more costly in expectation.

Start with SaaS, and then you can always go back and do something else, because that would be a “happy problem”. If the model that you built is so successful that now millions of people use it, the problem of scaling SageMaker is a “happy problem” to have.

It would be terrible, however, if we had taught two data scientists how to use Kubernetes (which would take six months of salaries), and then the model is not used by anybody, so we wasted six months of time. These people will leave because they’re unhappy. All that instead of just buying what is out there.
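
The “more costly in expectation” argument can be written down as a small calculation. This is only a sketch of the reasoning: the function names are hypothetical, and every figure below is a placeholder chosen for illustration, not a real price, salary, or success rate.

```python
def expected_cost_build_first(build_cost: float) -> float:
    """Building up front is paid whether or not the model is ever used."""
    return build_cost


def expected_cost_buy_first(saas_cost: float, build_cost: float,
                            p_success: float) -> float:
    """Buy first; pay to build in-house only if the model succeeds."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be a probability")
    return saas_cost + p_success * build_cost


# Placeholder figures for illustration only.
BUILD_COST = 100_000.0  # e.g. months of engineering time
SAAS_COST = 10_000.0    # e.g. a period of a hosted service
P_SUCCESS = 0.3         # chance the model outgrows the hosted service

build = expected_cost_build_first(BUILD_COST)
buy = expected_cost_buy_first(SAAS_COST, BUILD_COST, P_SUCCESS)
print(f"build first: {build:,.0f}  buy first: {buy:,.0f}")
```

Under this simple model, buying first is cheaper in expectation whenever `saas_cost < (1 - p_success) * build_cost`, so the less certain you are that the model will succeed, the stronger the case for buying.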

The limits of reasonable scale

Sabine: What are the limits of reasonable scale? When are you too big to be reasonable? Do you see it as a matter of team size or amount of models or model accuracy, the relevance?

Jacopo: In the TDS series that I co-edited with my colleagues, Andrea and Ciro, we have a bunch of dimensions. They are somehow correlated with each other. It’s not a precise definition, but it gives you the feeling of where you are.

One is data size, of course – if you have a petabyte of data a day, the reasonable scale doesn’t really apply to you.
One is team size: until you’re like 5, 6, 7, 10 people, the overhead of maintaining stuff almost never pays off in team productivity and happiness. So the reasonable scale fully applies. When you go bigger, of course, things may change.
The other point is the use case. Again, I think a lot of people (and I say this with no harm, no foul) overestimate how good their model needs to be on day one to produce value.
This is a problem for the Amazons of this world. It is not a problem for most companies, and certainly not for most companies at a reasonable scale.
Many people are starting up with ML right now, and some of the most exciting ML use cases are inside enterprises, where the bar to beat is an Excel spreadsheet or a clunky workflow.
This is where the value is right now, for the most part, if you don’t work in one of those five companies. The most important thing for your ML is literally that it works at all. If it works at all with no maintenance, good monitoring, and good scaling, that’s literally 90% of the value. I think a lot of people think they’re not at a reasonable scale because they overthink a bit what the bar to beat is.
The bar to beat is to provide business value versus the status quo at a reasonable cost. That literally covers probably 90% of use cases. Unfortunately, it’s not what you read on the internet. My metaphor is always that we’re all trying to learn how to play tennis, and the only thing we watch is Roger Federer training. That is very inspirational, but:
- we’re not Roger Federer; nobody is,
- even if we’re not Roger Federer now, the people that are going to be Roger Federer at the end of the training are still very few.

[Figure: graph showing the area covered by the reasonable scale]

Look at what YouTube, Lyft, and Uber do. There are a lot of lessons there. But don’t really try to map them one-to-one onto your life, because your life is very different. It’s different in some of its constraints, but also in some of its opportunities, as in you can use a lot of tools that wouldn’t make sense at Uber scale.

Resources for learning MLOps

Sabine: We have a question about a bit of a different topic. Shrikant wants to know if you have any reading recommendations on gaining new perspectives in improving the machine learning delivery lifecycle. Any books?

Jacopo: There are two amazing books.

One is already out. It’s by Ville Tuulos, a Finnish guy who co-created Metaflow, and it’s called “Effective Data Science Infrastructure”.
Then there’s an upcoming book by Chip Huyen on ML systems as well.

These are two books that I wholeheartedly recommend.

If you want to start with something that is open source and freely available:

Made with ML is a very, very, very incredible resource by my friend Goku.
Chip’s course at Stanford is also open source; it’s a course on ML systems.
A part of my course at NYU is also open-source. You can find it on GitHub.

These are more courses, so slides and snippets of code, more than books, but they’re free, and you can start tomorrow.

