Neptune Blog

MLOps Journey: Building a Mature ML Development Process

Albin Sundqvist

25th September, 2024

Building a great AI system takes more than creating one good model. Instead, you have to implement a workflow that enables you to iterate and continuously improve.

Data scientists often lack the focus, time, or knowledge needed to apply software engineering principles. As a result, poor code quality and reliance on manual workflows are two of the main issues in ML development processes.

Using the following three principles helps you build a mature ML development process:

Establish a standard repository structure you can use as a scaffold for your projects.

Design your scripts, jobs, and pipelines to be idempotent.

Treat your pipelines as artifacts, not only the model or data itself.

When I first started working with AI, I was surprised at how complex and unstructured the development process was. Where traditional software development follows a widely agreed-upon streamlined process for developing and deploying, ML development is quite different. You need to think about and improve the data, the model, and the code, which adds layers of complexity. To meet deadlines, teams often rush or skip the essential refactoring phase, resulting in one good (enough) model but poor-quality code in production.

As a consultant, I’ve been involved in many projects and have filled different roles as a developer and advisor. I started as a full-stack developer but have gradually moved toward data and ML engineering. My current role is MLOps engineer at Arbetsförmedlingen, Sweden’s largest employment agency. There, among other things, I help to develop their recommendation systems and MLOps processes.

After deploying and managing multiple AI applications in production over the last few years, I realized that building a great AI system takes more than creating one good model. Instead, you will need to master a workflow that enables you to iterate and improve the system continuously.

But how do you achieve that, and what does maturity in ML development look like? Together with a colleague, I recently gave a talk at the MLOps Community IRL meetup in Stockholm, where we discussed our experience. This prompted me to think about this topic further and summarize my learnings in this article.

What is a mature ML development process?

Deploying can often be complex, time-consuming, and scary. A mature development process allows you to deploy models and pipelines confidently, predictably, and rapidly, helping with the swift integration of new features and new models.

Moreover, a mature ML process emphasizes horizontal scalability. Effective knowledge sharing and robust collaboration are essential, enabling teams to scale and manage more models and data per team.

Why do you need a mature ML process? And why is it hard?

A mature ML development process is tough to implement since it doesn’t just happen organically—quite the opposite. Data scientists focus mainly on developing new models and examining new data, preferably working from a notebook.

From these notebooks, the project grows. And once you find a model that is good enough, the switch from proof-of-concept to production happens fast, leaving you with an exploratory notebook that now runs in production.

All this makes the project very hard to maintain. Adding new data and adapting the project to it becomes tedious and error-prone. The root cause is that data scientists lack the focus, time, or knowledge needed to apply software engineering principles, so there is no well-thought-out plan for the development process.

Problems x2

When I join teams, I often notice that many team members are scared of deploying. They might have had previous bad experiences or do not trust their underdeveloped development process.

For these teams, deployment typically includes a series of manual steps that must be executed in precisely the correct order. I’ve been in teams where we had to manually execute commands within a deployed container after discovering bugs in the new version. Executing commands in a running container is far from best practice and creates stress.

Deployment should be a reason for celebration. You should feel confident on release day, not uncertain and scared.

But why is it all too often not like that? Why do many teams regularly end up with brittle processes and faulty code in production?

At the end of the day, I believe it comes down to two problems:

Bad code quality: Data scientists are typically not software development experts and don’t focus on that aspect of their work. They create tightly coupled and complex code that is difficult to maintain, test, and support as projects evolve.
Manual workflows: A manual process makes each deployment treacherous and time-consuming. This slows down development and makes it hard to scale to more projects. As time passes, adapting to changes becomes increasingly difficult because the developers forget what needs to be done or, even worse, the only people who know have left the team.

The solution to the problems

Addressing the two main challenges—integrating software best practices and reducing manual workflows—is key to being effective and scalable.

Code best practices

It’s good to follow best practices when writing code, and of course, this applies to an ML project as well.

There are many practices that you can adapt to and integrate to improve the code’s functionality and maintainability. Selecting those that bring the most value to your team is important. Here are some that I find well-suited for ML development:

Shift-left testing: This software development approach moves testing earlier in the development cycle instead of leaving it until the end. In ML development, this translates to validating the data and the model assumptions early in the project lifecycle. Adopting a holistic testing approach that covers the entire ML pipeline is crucial.

You should test the entire ML model development chain, for example:

Data collection: Test the quality, accuracy, and relevance of the data collected to ensure it meets the needs of the model.
Feature creation: Validate and test the processes used to select, manipulate, and transform data.
Model training: Track all model training runs and assess the final model.
Model deployment:
- Test the model in a production-like environment to ensure it performs as expected when serving predictions.
- Test the integration between different components of your system.
Monitor: Track the data you collect and check if the predictions provide value to the application.

The iterative ML model development process without (left) and with shift-left testing (right). By incorporating testing in the different phases (and not just in the deployment/inference phase), teams can identify and resolve all types of issues early.

Loosely coupled code: My absolute favorite practice is loosely coupled code. At a high level, it means organizing the system so that each component and module operates independently of the inner structure of another. This modularity allows teams to update or replace parts of the system without affecting the rest.

Here’s a small example of loosely coupled code:

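def train_model(training_config):
    # load_data() and get_trainer() are project-specific helpers defined elsewhere.
    X_train, X_test, y_train, y_test = load_data()

    # The trainer is selected by configuration, so it can be swapped
    # without touching this function.
    trainer = get_trainer(**training_config)

    model, metrics = trainer.train(X_train, X_test, y_train, y_test)
    saved = trainer.save(model, metrics)
    return saved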

In this example, you can easily swap or modify the trainer without affecting the training code. As long as the trainer adheres to the interface (i.e., it provides train() and save() methods), the code functions the same. Loosely coupling components makes development and writing tests easier.
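To make the interface idea concrete, here's a minimal, self-contained sketch. The Trainer protocol, the MeanBaselineTrainer class, and the run_training function are hypothetical names invented for illustration, and the toy data stands in for whatever load_data() would return in a real project.

```python
from typing import Any, Dict, Protocol, Tuple


class Trainer(Protocol):
    """Interface every trainer must satisfy (hypothetical, for illustration)."""

    def train(self, X_train, X_test, y_train, y_test) -> Tuple[Any, Dict]: ...

    def save(self, model: Any, metrics: Dict) -> Dict: ...


class MeanBaselineTrainer:
    """Toy trainer that always predicts the mean of the training targets."""

    def train(self, X_train, X_test, y_train, y_test):
        model = sum(y_train) / len(y_train)
        # Mean absolute error of the constant prediction on the test targets.
        mae = sum(abs(y - model) for y in y_test) / len(y_test)
        return model, {"mae": mae}

    def save(self, model, metrics):
        # A real trainer would write to a model registry; this one returns a record.
        return {"model": model, "metrics": metrics}


def run_training(trainer: Trainer, data) -> Dict:
    """Depends only on the Trainer interface, not on a concrete class."""
    X_train, X_test, y_train, y_test = data
    model, metrics = trainer.train(X_train, X_test, y_train, y_test)
    return trainer.save(model, metrics)


# Toy data in the order (X_train, X_test, y_train, y_test).
data = ([[1], [2], [3]], [[4]], [1.0, 2.0, 3.0], [4.0])
saved = run_training(MeanBaselineTrainer(), data)
```

Any other class that provides the same two methods can be passed to run_training without changes, which is also what makes it easy to substitute a lightweight fake trainer in tests.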

Testable code: Writing tests and validating the functionality of individual components is key for maintainability. In ML projects, this typically entails creating tests for data preprocessing, transformations, and inference stages. Ensuring you can test each component independently speeds up debugging and development.

Enhancing data validation: Frameworks like Great Expectations and Pydantic significantly strengthen data consistency, making pipelines robust and dependable.

Code conventions and linting: A semi-low-hanging fruit is to enforce unified formatting rules and lint your code, for example, with a tool like Ruff. Paired with good naming conventions, it helps you create coherent code with relatively little effort.
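The testability and data-validation practices above can be sketched together in a few lines. This example uses only the standard library as a stand-in and does not use the Great Expectations or Pydantic APIs, which cover this ground far more completely. The JobAd schema, its field names, and the validation rules are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobAd:
    """Hypothetical record schema; the fields are illustrative only."""

    ad_id: str
    title: str
    salary: float

    def __post_init__(self):
        if not self.ad_id:
            raise ValueError("ad_id must not be empty")
        if not isinstance(self.salary, (int, float)) or self.salary < 0:
            raise ValueError(f"salary must be a non-negative number, got {self.salary!r}")


def validate_rows(rows):
    """Split raw dicts into valid records and (row index, error message) pairs."""
    valid, errors = [], []
    for i, row in enumerate(rows):
        try:
            valid.append(JobAd(**row))
        except (TypeError, ValueError) as exc:
            # TypeError covers missing or unexpected fields.
            errors.append((i, str(exc)))
    return valid, errors


# pytest-style tests: small, independent, and runnable without any infrastructure.
def test_valid_row_is_accepted():
    valid, errors = validate_rows(
        [{"ad_id": "a1", "title": "Data engineer", "salary": 52000.0}]
    )
    assert len(valid) == 1
    assert errors == []


def test_bad_rows_are_reported_with_their_index():
    rows = [
        {"ad_id": "", "title": "Nurse", "salary": 41000.0},   # empty id
        {"ad_id": "a2", "title": "Teacher", "salary": -1.0},  # negative salary
        {"ad_id": "a3", "title": "Chef"},                     # missing field
    ]
    valid, errors = validate_rows(rows)
    assert valid == []
    assert [index for index, _ in errors] == [0, 1, 2]
```

Because validate_rows is a pure function with no dependency on storage or a running service, the same checks can run in CI and at the start of an ingestion pipeline.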

By integrating these best practices into ML development, teams can create more robust, efficient, and scalable machine learning systems that are more resilient, easier to manage, and painless to adapt to changes.

Workflow automation

If you increase your level of automation, you become faster and less error-prone. Manual processes often create dependencies on individual people. Having processes that anyone in the team can confidently execute improves the delivery’s quality and maintainability.

Automating just a single step in a previously entirely manual process already provides substantial value.

In one project I was working on, everything was set up using a UI, which made releases a hassle. We often forgot to remove outdated resources or made mistakes that we couldn’t trace afterward. Our solution to that problem was GitOps. Storing the resources and configuration we needed in git and then using scripts to set it up in our cluster helped us create a stable release process.
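The core idea behind such scripts is to compare the desired state stored in git with what is currently deployed. The sketch below is not the setup from that project. It is a simplified illustration in which plan_changes, the resource names, and the configuration values are all hypothetical.

```python
def plan_changes(desired, current):
    """Compare desired state (from git) with current state (in the cluster).

    Both arguments map a resource name to its configuration dict.
    Returns the names to create, update, and delete, each sorted.
    """
    create = sorted(name for name in desired if name not in current)
    delete = sorted(name for name in current if name not in desired)
    update = sorted(
        name
        for name in desired
        if name in current and desired[name] != current[name]
    )
    return {"create": create, "update": update, "delete": delete}


# What the repository says should exist.
desired = {
    "train-job": {"image": "trainer:2.0", "schedule": "daily"},
    "ingest-job": {"image": "ingest:1.4", "schedule": "hourly"},
}
# What is actually deployed right now.
current = {
    "train-job": {"image": "trainer:1.9", "schedule": "daily"},
    "legacy-job": {"image": "legacy:0.3", "schedule": "weekly"},
}

plan = plan_changes(desired, current)
```

Here, plan lists ingest-job for creation, train-job for an update, and legacy-job for deletion. Computing deletions explicitly is what prevents forgotten resources from lingering, and because every change goes through git, each one is traceable.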

Additionally, leveraging tools like feature stores, model registries, and job schedulers allows you to outsource routine functions, letting you focus on the tasks that are specific and important to your context and goals.

How to implement the solution

You’ll probably find that most people will agree that good coding practices and workflow automation are essential for a mature ML development process. But getting there is a real challenge.

Let’s break down the process of moving from where you are now to where you want to be into clear, achievable steps.

Repository structure

If I could only recommend one thing, I would tell you: Get a solid repository structure!

Organizing your code and configuration is essential in improving your ML development process. You can use a template like Cookiecutter Data Science or Azure’s ML Project Template as a starting point. Using this inspiration, create a template that serves your team.

The template provides a standard directory and file structure and dictates how you add essential workflows, automated tests, and validations. It allows you to build automation based on the standardized repository structure.

Here’s how a unified repository structure enables key practices:

Automated tests: CI/CD pipelines can expect that each repository contains a test folder (e.g., named /tests) and routinely try to run the tests before running or updating a job or pipeline.
Workflow standardization: Similarly, a well-defined repository structure will enforce workflow standards, creating an environment where you can reuse modules or even whole pipelines across projects. For example, a pipeline used for ingesting data might ensure that data loaded into the feature store gets passed through the tests and validations that are defined at a specific location in the repository.

Code examples and standards: The repository template can also contain definitions for coding standards and examples that help data scientists move their work from the exploratory notebook into production-ready modules and packages. These standards and examples serve as a guide for best practices and enhance the maintainability of the code, which increases efficiency and reduces the error rate in general.
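A standardized layout is what makes this kind of automation possible. As a rough sketch, a CI step could verify that a repository follows the template before running anything else. The required paths below are an assumed example layout, not the structure of any particular template.

```python
import tempfile
from pathlib import Path

# Hypothetical layout; adapt it to whatever your team's template defines.
REQUIRED_PATHS = ("src", "tests", "pipelines", "README.md")


def missing_paths(repo_root, required=REQUIRED_PATHS):
    """Return the required entries that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in required if not (root / name).exists()]


# Demo on a throwaway directory that lacks the "pipelines" folder.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "src").mkdir()
    (repo / "tests").mkdir()
    (repo / "README.md").write_text("# demo project\n")
    missing = missing_paths(repo)
```

In this demo, missing contains only "pipelines". A CI job could fail the build whenever the list is non-empty, which keeps every project close enough to the template for shared automation to work.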

Establishing a standardized repository structure sets a clear path toward maintaining high standards throughout the project life cycle.

Shift the mindset

Establishing a mature ML development process requires the entire team to focus on code quality and architectural design.

Here are three ways in which you can facilitate this mindset shift:

Deploying pipelines vs. deploying models: As you progress in maturity, you move from deploying individual models or datasets straight from a data scientist’s workspace to deploying the whole assembly line that manufactured them. This is a more mature operational approach, as it greatly enhances the development process’s robustness and ensures it is well-controlled and repeatable.

Idempotent workflow design: It is crucial to design jobs, workflows, and pipelines so that running the same job or pipeline multiple times with the same input will always generate the same result. This makes your processes more foolproof, removes unwanted side effects due to re-execution of the job, and ensures proper outcome consistency. It also helps your team build confidence when deploying and executing jobs in the production environment.

Emphasizing shift-left testing: Moving testing to the earliest stage possible ensures that you identify issues as soon as possible and that a model’s deployment and integration are consistent. It also forces the team to devise a thorough plan for the project right from the beginning. What data do you need to track to operate the model in production? How will the users consume the predictions? Will the model serve predictions in batch mode or in real-time? These are just some of the questions you should have an answer to when going from PoC to product. Early testing and planning will ensure smoother integration, fewer last-minute fixes, and increased reliability of the whole system.
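To illustrate the idempotent workflow design described above, here's a minimal sketch of a job that can be rerun safely. It assumes that the output location can be derived from the input, which won't fit every pipeline. The run_job function, the record format, and the file naming scheme are hypothetical.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def run_job(records, output_dir):
    """Write a deterministic result file whose name is derived from the input."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(payload).hexdigest()[:16]
    target = Path(output_dir) / f"result-{key}.json"

    # Same input -> same key -> same file, so a rerun has nothing left to do.
    if target.exists():
        return target

    result = {"count": len(records), "total": sum(r["value"] for r in records)}

    # Write to a temporary file first, then rename it into place, so readers
    # never see a partially written result.
    tmp_path = target.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(result, sort_keys=True))
    os.replace(tmp_path, target)
    return target


with tempfile.TemporaryDirectory() as out:
    records = [{"value": 2}, {"value": 3}]
    first = run_job(records, out)
    second = run_job(records, out)  # rerun with identical input
    files = sorted(p.name for p in Path(out).iterdir())
    content = json.loads(first.read_text())
```

Both runs return the same path, and the output directory contains exactly one file. Rerunning after a failure or an accidental double trigger therefore neither duplicates nor corrupts the results.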

Be pragmatic, patient, and persistent

Increasing your MLOps maturity level will take time, money, and expertise.

As ML engineers, we’re often very enthusiastic about MLOps and the automation of practically everything. However, aligning what we do with the project, team, and product goals is essential. Sometimes, manual workflows can work well for a long time. Focusing too much on “fixing the manual process problem” is easy since it’s a pretty fun engineering challenge.

Always think about the real value of what you’re doing and where you’ll get the biggest “bang for the buck.”

One consequence of more automation is the increasingly complex maintenance of workflows. Introducing too many new tools and processes too early may overwhelm data scientists while they are still working through the early hiccups of the new ML development approach and learning to embrace the new mindset.

My advice is to start small and then iterate. It’s crucial to recognize when tools or automation of a specific part meet your needs and then shift your focus to the next part of your ML development process. Don’t look too far ahead into the future, but improve bit by bit and keep yourself grounded in the needs of your projects. And don’t adopt new shiny tools or methodologies just because they are trending.

Final thoughts

I’ve always loved deploying to production. After all, it’s when you finally see the value of what you’ve built. A mature ML development process enables teams to ship confidently and without fear of breaking production.

Implementing such a process can seem daunting, but I hope you’ve seen that you can get there in small steps. Along your journey, you’ll likely find that your team’s culture changes and team members grow along with the systems they work on.
