GreenSteam is a company that provides software solutions for the marine industry that help reduce fuel usage. Excess fuel usage is both costly and bad for the environment, and the International Maritime Organization obliges vessel operators to go greener and cut CO2 emissions by 50 percent by 2050 (relative to 2008 levels).

Even though we are not a big company (50 people including business, devs, domain experts, researchers, and data scientists), over the last 13 years we have built several machine learning products that help major shipping companies make informed performance optimization decisions.

In this blog post, I want to share our journey to building the MLOps stack. Specifically, how we:
- dealt with code dependencies
- approached testing ML models
- built automated training and evaluation pipelines
- deployed and served our models
- kept a human in the loop in MLOps
The new beginning
In 2019, a project team was created, consisting of a data scientist, another machine learning engineer, and me. Our job was to build a new machine learning model that would replace some of the ones we had used before.
We started like everyone else: endless discussions, new notebooks, re-training, comparing results. It worked for building a prototype, but we needed to organize it better.
As GreenSteam was getting more customers:
- we needed to deal with more and more data,
- we were finding more edge cases where our models were not performing as expected,
- we observed data drift, and our models were not robust enough to handle it.
Another thing that bothered us was planning for the future. We knew our ML operations needed to grow with the company, and we needed a setup that could handle it. We wanted both a robust model that would work with the different kinds of data we get from our customers, and a scalable technological solution.
Sometimes, the best way to move forward is to take a step back, which is exactly what we did.
We decided to start from scratch and rethink our entire machine learning infrastructure and operations.
Step 1: Shared ML codebase
In the beginning, each new idea was a new Jupyter notebook. As we experimented, we had more and more notebooks spread across our laptops. At some point, we started losing track of all the code changes in the notebooks. Merging them to keep some common codebase was a real pain as well.
It was time to extract parts of the shared codebase to a Python package and create a git repository for it. By doing that, we could still experiment with our ideas in the notebooks while version-controlling the part of the codebase that wasn’t changing as quickly. It is also way easier to diff and merge scripts than notebooks.
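To give an idea of the shape this took, here is a hypothetical layout (the names are illustrative, not our actual repository):
```
ml-core/                      # shared, version-controlled Python package (hypothetical name)
├── ml_core/
│   ├── __init__.py
│   ├── data.py               # data loading and validation helpers
│   ├── features.py           # feature engineering shared across experiments
│   └── models.py             # model definitions and training utilities
├── tests/
├── setup.py
└── notebooks/                # free-form experiments that import ml_core
```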
Step 2: Dealing with dependencies and reproducibility
At this stage, we also noticed issues with reproducibility. While the setup of our laptops was similar, it was not the same.
Using Conda solved some of the problems, but it was a fragile solution. From time to time, someone would install something in their local environment, making it diverge from the environment we all thought we had. It led to small but irritating “it works on my machine” kinds of bugs.
As our model matured, gradually, we were testing it on larger datasets, using more powerful cloud machines, not just our laptops. But when you move from the local environment to the cloud, another problem appears. It is easy to forget about some dependencies, and so having environment configurations that you can trust becomes essential.
Docker helped us with that, and it was a start of a new era. It enabled us to have a unified, sealed setup, no matter who and where was running it. By setup, I mean the codebase, package versions, system dependencies, configs. Everything that you need to set up your environment.
At that point, we could launch Jupyter from inside the container, save the hash of all the code for versioning and generate results. This, together with storing everything in git, greatly improved the reproducibility and traceability of our codebase and model changes.
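A minimal sketch of what such a sealed setup can look like (the base image, package list, and entry point are illustrative, not our production Dockerfile):
```dockerfile
FROM python:3.8-slim

# System dependencies live next to the code that needs them
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Lock files keep everyone's package versions identical
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# The shared codebase travels together with the environment
COPY . .

# The same image can run training jobs or a Jupyter server for experiments
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser", "--allow-root"]
```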
Step 3: How to deal with tests that don’t test
Testing the code was a problem by itself.
From the beginning, we aimed at high test coverage, always using type hints and paying attention to code quality. But with machine learning code, it is not always that simple.
Some parts of the model code were just hard to unit test. For example, to test them correctly, we would need to train the model, and even with a small dataset, that takes far too long for a unit test.
What else can you do?
We could take the pre-trained model, but with multiple tests, the overhead for initializing it made the test suite slow enough to discourage us from running it frequently.
On top of all that, some of those tests were flaky, failing at random. The reason for that is simple (and hard to fix): you cannot expect the model trained on a small data subsample to behave like the model trained on the entire, real dataset.
To solve all of those problems, we decided to drop some of the unit tests and run smoke tests instead. For every new pull request, we would train the model in a production-like environment, on a full dataset, with the only exception that the hyperparameters were set to values that produce results quickly. We would monitor this pipeline for any problems, and catch them early.
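To illustrate the idea (with scikit-learn stand-ins rather than our actual training code), a smoke test reuses the real training entry point but overrides the hyperparameters so the run finishes quickly:
```python
# A minimal sketch of the smoke-test idea, with scikit-learn stand-ins
# instead of our production training code.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# "Fast" hyperparameter overrides: small enough to finish in minutes, not hours.
FAST_PARAMS = {"n_estimators": 10, "max_depth": 2}

def smoke_test() -> None:
    # Stand-in for the full production dataset.
    X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)
    model = GradientBoostingRegressor(**FAST_PARAMS).fit(X, y)
    rmse = mean_squared_error(y, model.predict(X)) ** 0.5
    # The point is to catch crashes and nonsense output early,
    # not to judge model quality -- that happens on real training runs.
    assert np.isfinite(rmse), "smoke test failed: training pipeline is broken"

if __name__ == "__main__":
    smoke_test()
```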
At the end of the day, we wanted our tests to flag failing code early on, and this setup solved it for us.

Step 4: Code quality, type checking, and continuous integration
Now that we could move between different execution environments (Docker) and had our tests set up, we could add continuous integration to make checks and reviews more efficient.
We also moved all the code checks into the Docker image, so the versions and configs of flake8, black, mypy, and pytest were unified as well. This helped us align the local development setup with what we used on Jenkins. No more problems with different linter versions showing different results locally and in Jenkins.
For local development, we had a Makefile to build the image, run it, and run all the checks and tests on the code. For code reviews, we set up Jenkins that was running the same checks as a part of the CI pipeline.
Another safeguard that helped us sleep well at night was using mypy. People sometimes think of it as “just” a static analyzer that checks the types (for a dynamically typed language!). From our experience, in many cases, mypy helped find issues in parts of the code that unit tests didn’t cover. Seeing that a function receives, or returns, something other than what we expected is a simple but powerful check.
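A toy example of the kind of bug mypy catches before any test runs (hypothetical code, not from our codebase):
```python
from typing import List

def latest_model_version(versions: List[str]) -> str:
    # BUG: returns None for an empty list although the signature promises str.
    return max(versions) if versions else None  # mypy flags the Optional[str] here

def load_model(version: str) -> None:
    print(f"Loading model {version}")

# No unit test covered the empty-list case, but mypy catches the mismatch statically.
load_model(latest_model_version([]))
```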
Step 5: Adding orchestration, pipelines and moving from monolith to microservices
Now we needed to test our model on multiple datasets from different clients in different scenarios. This is not something you want to set up and run manually.
We needed orchestration, we needed automated pipelines, we needed to plug it into the production environment easily.
Having our ML code Dockerized made it easy to move between different environments (local and cloud), but it also made it possible to plug it directly into the production pipeline as a microservice.
It was also the time when GreenSteam switched from Airflow to Argo, so plugging in our container was just a matter of writing a few YAMLs.
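For context, an Argo Workflow spec for a single training step can be as small as this (the image name, command, and dataset path are hypothetical):
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-vessel-model-    # one workflow per training run
spec:
  entrypoint: train
  templates:
    - name: train
      container:
        image: registry.example.com/vessel-model:latest   # our Dockerized ML code
        command: ["python", "-m", "vessel_model.train"]    # hypothetical module name
        args: ["--dataset", "s3://example-bucket/vessel-123.parquet"]
```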
Step 6: Keeping track of all those model versions and results
Now, we could efficiently scale our model building operation, but that triggered another problem: how do we manage all those model versions?
We needed to monitor the results of multiple models, trained on multiple datasets, with different parameters and metrics. And we wanted to compare all those model versions with each other.
After some research, we found two solutions that could help us do that: MLflow and Neptune. What convinced us to choose Neptune was that it was more of an out-of-the-box solution. It enabled us to tag the experiments and save the hyperparameters, model version, metrics, and plots in a single place.
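The integration itself is only a few lines. Here is a minimal sketch using a recent version of the Neptune client (the project name, parameters, and metric names are illustrative, and the exact calls may differ between client versions):
```python
import neptune

# Connects to the project; the API token is read from the NEPTUNE_API_TOKEN env variable.
run = neptune.init_run(project="my-workspace/vessel-models")  # hypothetical project name

run["sys/tags"].add(["tanker", "experiment"])                 # tag the run
run["parameters"] = {"n_iter": 2000, "learning_rate": 0.05}   # hyperparameters
run["dataset/version"] = "2021-03-01"                         # which data was used

for rmse in [1.9, 1.4, 1.1]:                                  # toy metric values
    run["metrics/rmse"].append(rmse)

run.stop()
```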
Step 7: Custom model serving
Let’s get back to Argo for a second.
It is a pretty low-level solution for training machine learning models. Why didn’t we use something more data science-friendly, some all-in-one ML platform, for example?
One of the obvious reasons was that we were already using Argo for running other processes, but for us, an even bigger problem was model serving and deployment.
In a standard machine learning scenario, you:
- start with a dataset,
- push it through a preprocessing pipeline,
- train your model,
- wrap it in some predict function that can be pickled or packed into a REST API,
- serve it in production.
In our case, however, our customers want us to make predictions for ships of different kinds and varieties. They can be slow dredgers, big cruise vessels, huge tankers, just to name a few, often sailing in very different conditions. In many cases, you cannot easily transfer what you learned from data on one ship to a different one.
Because of that, we need to train a separate model for each vessel type with constantly growing time-series datasets.
Moreover, unlike traditional machine learning, where a model predicts a single thing, we aim to have a model that predicts multiple characteristics of the vessel’s behavior.
Those predictions are then plugged into several services that help our customers plan things like trim, speed, and scheduling hull cleaning to reverse the impact of fouling. The bottom line is that each of those things affects fuel consumption in a big way, and our models can help make better decisions.
We looked at ready-made solutions like SageMaker or Kubeflow, but unfortunately, they did not serve our use case very well.
We ended up packing models into microservices, so that, if needed, we could easily replace one service with another, as long as it shared the same interface. Doing this with FastAPI turned out to be more effortless than we imagined.
To give an example of how flexible the setup is, we recently needed to roll out a dummy model that would run very fast, just to test our infrastructure. Packing simple Scikit-learn code into a microservice that we could plug into production took just four days, including code reviews.
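To illustrate the pattern (the field names and the formula are made up, and a real service loads a trained model instead of a hard-coded one), such a dummy microservice boils down to something like this:
```python
# Hypothetical dummy-model service; the fields and formula are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    speed: float  # knots
    draft: float  # metres

class PredictionResponse(BaseModel):
    fuel_per_hour: float  # tonnes

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest) -> PredictionResponse:
    # A real service would load a trained model here; the dummy uses a fixed formula.
    fuel = 0.05 * request.speed ** 3 + 2.0 * request.draft
    return PredictionResponse(fuel_per_hour=fuel)
```
Run it with uvicorn (for example, "uvicorn service:app" if the file is called service.py). As long as every model microservice exposes the same /predict contract, swapping one for another is trivial.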
Step 8: Human-in-the-loop model auditing
We scaled it, but one bottleneck remained. How do we make sure that the model provides good business insights (not just good machine learning metrics)?
Our setup assumed that each of the trained models was audited and approved by a domain expert. At the end of the training, we produce a report, a static web page with plots and statistics describing the results.
This does not scale well. We tried to automate the rejection of dubious results by creating a suite of checks that query the model and verify that the predictions are aligned with what we would expect from physics and common sense. Those checks kind of worked but sometimes failed when there were data quality issues. Garbage in, garbage out.
We keep looking for a more satisfactory solution, but since quality is so important, we still depend on a domain expert auditing every trained model’s results.
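One of those automated checks might look roughly like this (a deliberately simplified sketch; the real checks cover more physics and query the deployed model rather than a toy function):
```python
from typing import Callable, Sequence

def check_fuel_monotonic_in_speed(
    predict_fuel: Callable[[float], float], speeds: Sequence[float]
) -> bool:
    """Predicted fuel consumption should not decrease as speed increases."""
    predictions = [predict_fuel(s) for s in sorted(speeds)]
    return all(a <= b for a, b in zip(predictions, predictions[1:]))

def toy_model(speed: float) -> float:
    # Stand-in for a trained model's prediction function.
    return 0.05 * speed ** 3

assert check_fuel_monotonic_in_speed(toy_model, [8, 10, 12, 14, 16])
```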
Step 9: Feature store and microservices (in progress)
The next step was to generalize what we had learned into a proper microservice architecture that could easily handle many different models.
Our aim right now is to ensure that our architecture enables us to easily plug in any future model as a microservice:
- We separate ourselves from the data pipelines by creating our feature store, with pre-computed features for the models.
- Our models run as separate microservices.
- While we are still partially running in a push mode, where the trained model sends its results to other parts of the system, we aim to move to a pull mode, where services call our API whenever they need access to the results (see the sketch below).
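To make the pull mode concrete: a downstream service requests predictions when it needs them, instead of the training job pushing results out. A rough sketch (the endpoint and payload are hypothetical, not our real API):
```python
import requests

def get_vessel_predictions(vessel_id: str) -> dict:
    """Pull the latest model results for one vessel on demand."""
    response = requests.get(
        f"https://models.example.com/vessels/{vessel_id}/predictions",  # hypothetical endpoint
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```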
Our current MLOps stack (and some lessons learned)
After working for over a year on this project, we learned a lot about how we could efficiently solve similar problems and transform them into working solutions. Of course, there is still a long road ahead of us, but we are hopeful that we can handle new challenges as they come.
I mentioned many great tools already, but to give you the full picture, our current MLOps technology stack includes:
- Architecture: Microservices / Hybrid
- Pipelines: Argo
- Cloud: AWS EKS
- Environment management: Docker, Conda, Pip, Poetry, Makefiles
- Source code: git on BitBucket.com
- Development environment: Jupyter notebooks
- CI/CD: GitOps using Jenkins running code quality checks and smoke tests using production-like runs in the test environment
- Experiment tracking: Neptune
- Model testing and approval: Custom-built solution
- Code quality: pytest, mypy, black, isort, flake8, pylint
- Monitoring: Kibana, Sentry
- Model serving: Docker with FastAPI REST interface
- Machine Learning stack: Scikit-learn, PyMC3, PyTorch, TensorFlow
For me, the biggest lesson learned here was to keep it simple (you aren’t gonna need it!) and not to future-proof, but at the same time, keep things flexible enough that you can easily change them later. We constantly refactor the code and experiment. Having a unified development environment, tests in place, and the ability to easily plug components into (or swap them out of) the existing architecture makes that much easier.