Building ML Pipeline: 6 Problems & Solutions [From a Data Scientist’s Experience]
Long gone is the time when ML jobs started and ended with a Jupyter notebook.
Since all companies want to deploy their models into production, having an efficient and rigorous MLOps pipeline to do so is a real challenge that ML engineers have to face nowadays.
But creating such a pipeline is not an easy task, given how new the MLOps tools are. Indeed, the field itself is no more than a couple of years old for the vast majority of medium-sized companies. Thus creating such a pipeline can only be accomplished through trial and error, and the mastering of numerous tools/libraries is needed.
In this article, I will introduce you to
- common pitfalls I have seen at the companies I have worked for,
- and how I managed to solve them.
This is by no means the end of the story, though, and I am sure that the MLOps field will be at a way more mature level two years from now. But by showing you the challenges I faced, I hope you will learn something in the process. I sure did!
So here we go!
May be useful
If you’re interested in more generic guidelines, check this article on how to build an end-to-end ML pipeline.
A short note on the author
Before proceeding, it may be enlightening for you to have a little background on me.
I am a French engineer who did a master’s and a Ph.D. in particle physics before leaving research to join industry, as I wanted to have a more direct impact on society. At the time (2015), I had only written code for myself and maybe 1-2 co-authors, so you can guess my production-compatible coding abilities (in case you did not: there were none :)).
But since then, I have contributed to different codebases in different languages (C# and Python mostly), and even if I am not a developer by training, I have seen more than once what works and what does not :).
So as not to destroy all of my credibility before even starting the journey, let me hasten to add that I do have a non-zero knowledge of deep learning (this white book made available to the community on GitHub in 2017 can hopefully attest to this fact :)).
Building MLOps pipelines: the most common problems I encountered
Here are the 6 most common pitfalls I have encountered during my ML activity in the past 6 years.
I’ll dig into each one of them throughout the article, first presenting the problem and then offering a possible solution.
Problem 1: POC-style code
More often than not, I encountered code bases developed in a Proof Of Concept (POC) style.
For instance, to release a model into production, one may have to chain 5 to 10 click commands (or even worse, argparse!) in order to be able to:
- preprocess data
- featurize data
- train an ML model
- export the ML model into production
- produce a CSV report on the model performance
In addition, it is very common to need to edit the code between two commands for the full process to work.
This is normal in startups: they want to build innovative products, and they want to build them fast. But in my experience, leaving a code base at the POC level is a long-term recipe for disaster.
Indeed, adding new features in this manner becomes more and more costly as maintenance costs keep rising. Another factor worth considering is that, in companies with even ordinary turnover, each departure has a real impact on the team’s velocity when the code base is in this state.
Problem 2: No high-level separation of concerns
The separation of concerns in ML code bases is often missing at a high level. What this means is that, more often than not, so-called ML code also performs feature-transformation-like operations that have nothing to do with ML: think physical document ingestion, conversion of administrative data, etc.
In addition, the dependencies between these modules are often not well thought out. Look at this fantasy diagram (closer to real-life situations than you might think :)). It was created with a small wrapper I coded (I aim to release it on PyPI one day :)), based on the excellent pydeps, that shows the dependencies of a code base at the level of groups of modules:
To me, the most worrisome aspect of this diagram is the number of cyclic dependencies present between what seems to be low-level packages and high-level ones.
Another thing that I personally interpret as a sign of a poorly thought-out architecture is a large utils folder, and it is very common to see utils folders with dozens of modules in ML codebases.
Problem 3: No low-level separation of concerns
The separation of concerns in the code is unfortunately often missing at a low level as well. When this happens, you end up with 2000+ line classes handling almost everything: Featurization, preprocessing, building the model graph, training, predicting, exporting… You name it, those master classes have your bases covered (only coffee is missing, and sometimes you never know… :)). But as you know, this is not what the S of the SOLID would recommend.
Problem 4: No configuration Data Model
A data model for handling ML configuration is often missing. For instance, this is what a fantasy model hyperparameter declaration could look like (again, closer to real-life situations than you might think).
Even more problematic (but understandable), this allowed for dynamic modification of the model configuration (fantasy snippet inspired by numerous real-life situations):
As one can see in the fantasy code snippet above, the `params` attribute is modified in place. When this happens at several places in the code (and trust me, it does when you start going down that road), you end up with code that is a real nightmare to debug, as what you put into the configuration is not necessarily what arrives in the subsequent ML pipeline steps.
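To make the failure mode concrete, here is a minimal sketch of my own (it is not the original snippet, and the `build_model` function and its keys are hypothetical). It shows a step that silently rewrites the configuration it receives, and how a read-only view turns the same mistake into a loud error:

```python
from types import MappingProxyType


def build_model(params):
    """Hypothetical pipeline step that rewrites its own configuration."""
    if params["n_layers"] > 4:
        params["dropout"] = 0.5  # in-place edit: the caller's object changes too
    return {"layers": params["n_layers"], "dropout": params["dropout"]}


# With a plain dict, the declared value is silently overwritten.
config = {"n_layers": 6, "dropout": 0.1}
build_model(config)
print(config["dropout"])  # 0.5, no longer the declared 0.1

# With a read-only view, the same step fails loudly instead.
frozen_config = MappingProxyType({"n_layers": 6, "dropout": 0.1})
try:
    build_model(frozen_config)
except TypeError as error:
    print(f"mutation rejected: {error}")
```

The read-only mapping is only a stand-in here; the point is that an immutable configuration makes every hidden modification visible at the place where it happens.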
Problem 5: Handling legacy models
Since the process of training an ML model often involves manual effort (see problem 1), it can take a really long time. It is also error-prone (when a human is in the loop, so are errors :)). In that case, you end up with stuff like this (fantasy code snippet):
Hint: look at the docstring date 🙂
Problem 6: Code quality: type hinting, documentation, complexity, dead code
As the above fantasy code snippets can attest, type hinting is rarely present when it is needed the most. I can guess that n_model_to_keep is an int, but I would be hard-pressed to name the type of graph_configuration in the code snippet of problem 5.
In addition, the ML code bases I encountered often had few docstrings, and modern code quality concepts like cyclomatic/cognitive complexity or working memory (see this post to learn more about it) were not respected.
Finally, unbeknownst to all, a lot of dead code is often present in the solution. In this case, you might scratch your head for several days when adding a new feature before realizing that the code you cannot get to work with this new feature is not even called (again, true story)!
Building MLOps pipelines: how I solved these problems
Let’s now look at the solutions I found (of course with the help of my collaborators over the years) to the 6 pressing problems discussed above, and give you an overview of where I would be if I had to develop a new MLOps pipeline now.
Solution 1: from POC to prod
Thanks to Typer, a lot of click/argparse boilerplate code can be removed from command-line tools.
I am a big fan of a couple of mantras:
- The best code is the one you don’t need to write (funny folklore on this).
- When an observation starts to be used as a metric (in this case, the number of lines written to attest all the work done), it stops being a good observation.
Here is, in my opinion, a good high-level command signature to launch an end-to-end ML model training:
TL;DR: use Typer for all your command-line tools.
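As an illustration of how little code such an entry point needs, here is a minimal sketch. It is not the original signature: the command name `train` and the parameters `data_path`, `config_path` and `export` are assumptions made for the example.

```python
from pathlib import Path

import typer
from typer.testing import CliRunner

app = typer.Typer()


@app.command()
def train(
    data_path: Path = typer.Argument(..., help="Raw CSV used for training."),
    config_path: Path = typer.Option(Path("config.json"), help="JSON model configuration."),
    export: bool = typer.Option(False, help="Copy the trained model to production."),
) -> None:
    """Preprocess, featurize, train, report and optionally export in one command."""
    # The real pipeline steps would be called here, in order.
    typer.echo(f"Training on {data_path} with {config_path} (export={export})")


# Exercise the command in-process with Typer's test runner.
runner = CliRunner()
result = runner.invoke(app, ["data.csv", "--config-path", "prod.json", "--export"])
print(result.output)
```

In a real script, the last three lines would be replaced by a call to `app()` under a main guard, and the whole training would be launched with something like `python train.py data.csv --export`. Typer derives the option names, the help text and the type conversion from the function signature.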
Solution 2: Handling high-level separation of concerns – from ML monolith to ML microservices
This is a big one that took me a long time to improve on. Like, I guess, most of my readers today, I am on the microservice side of the microservice/monolith battle (though I know that microservices are not a miracle that solves all development issues with a finger snap). With docker and docker-compose used to encompass the different services, you can improve the functionalities of your architecture incrementally and in isolation from the rest of the already implemented features. Unfortunately, ML docker architecture often looks like this:
Now I would advocate for something more like this (with the data processing parts also acknowledged):
The data ingestion and storage functionalities that are not ML-related are now delegated to a dedicated feature-store container. It stores the data it ingests in a MongoDB container (I am used to working with unstructured documents, but of course, if you are also or only dealing with structured data, use a PostgreSQL container), after having processed the documents it is fed via calls to a gotenberg container (a very useful off-the-shelf container for handling documents).
The ML is here split into three parts:
- A Computer Vision part: a document-recognition container, applying computer vision techniques to documents. Think the usual suspects: OpenCV, Pillow… I have experience doing the labeling with the help of a label-tool container, but there are a lot of alternatives out there.
- An NLP part: a container applying NLP techniques to the texts extracted from the documents. Think the usual suspects: NLTK, spaCy, DL/BERT… I have experience doing the labeling with the help of a doccano container, and in my opinion there are no better alternatives out there :).
- A core DL part: a pytorch_dl container. I migrated from TensorFlow to PyTorch in all my DL activities, as interacting with TensorFlow was a source of frustration for me. Some of the problems I faced:
- It was slow and prone to error in my developments,
- Lack of support on the official GitHub (some issues have been sitting there for years!),
- Difficulty debugging (even if the eager mode of TensorFlow 2 has mitigated this point to some extent).
You must have heard that codebases and functionalities should only be changed incrementally. In my experience, this is true and good advice 95% of the time. But 5% of the time, things are so entangled and the danger of silently breaking something with incremental changes is so high (low test coverage, I am looking at you) that I recommend rewriting everything from scratch in a new package, ensuring that the new package has the same features as the old one, and then unplugging the faulty code in one stroke to plug in the new one.
The TensorFlow to PyTorch migrations I handled in my previous jobs were among these occasions.
To implement PyTorch networks, I recommend using PyTorch Lightning, which is a very concise and easy-to-use high-level library around PyTorch. To gauge the difference, my old TensorFlow codebases ran to thousands of lines of code, whereas with PyTorch Lightning you can accomplish more with ten times less code. I usually split the DL concepts across these modules:
Thanks to PyTorch Lightning, each module is less than 50 lines long (except for network :)).
The Trainer is a marvel, and you can use the experiment logger of your choice in a finger snap. I started my journey with the good old TensorBoard logger, coming from the TensorFlow ecosystem. But as you can see on the above screen, I recently started to use one of its alternatives: yes, you guessed it, neptune.ai, and I am loving it so far. With as little code as the one you see in the code snippet above, you end up with all your models stored in a very user-friendly manner on the Neptune dashboard.
For hyperparameter optimization, I switched from Hyperopt to Optuna over the years, following this in-depth blog post. Reasons for this switch were numerous. Among others:
- Poor Hyperopt documentation
- Ease of integration of Optuna with PyTorch Lightning
- Visualization of the hyperparameter search
A tip that will save you a LOT of time: to allow a graceful model restart after the pytorch_dl container crashes for whatever reason (server reboot, server low on resources, etc.), I replay the whole TPE sampling of the finished runs with the same random seed, and restart the unfinished trial from the last saved checkpoint. This allows me not to waste hours on an unfinished run each time something unexpected happens on a server.
For my R&D experiments, I use screen and, more and more, tmux (a good ref on tmux) scripts to launch hyperparameter optimization runs.
Hyperparameter comparison is very easy thanks to Plotly’s parallel coordinates plot.
Finally, I use a custom reporter container to compile a TeX template into a beamer PDF. Think of a jinja2-like TeX template that you fill with PNGs and CSVs specific to each run to produce a PDF that is the perfect conversation starter with businesses/clients when they want to understand the performance of a Machine Learning model (main confusions, label distribution, performance, etc.).
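One possible way to build such a template, sketched here with jinja2 itself and LaTeX-friendly delimiters (the `\VAR{}` and `\BLOCK{}` markers, the slide titles and the file names are assumptions for the example, not the original reporter):

```python
import jinja2

# Swap jinja2's default {{ }} and {% %} for markers that do not clash with LaTeX.
latex_env = jinja2.Environment(
    block_start_string=r"\BLOCK{",
    block_end_string="}",
    variable_start_string=r"\VAR{",
    variable_end_string="}",
    comment_start_string=r"\#{",
    comment_end_string="}",
    trim_blocks=True,
    autoescape=False,
)

TEMPLATE = r"""\documentclass{beamer}
\begin{document}
\BLOCK{for slide in slides}
\begin{frame}{\VAR{slide.title}}
  \includegraphics[width=\textwidth]{\VAR{slide.png}}
\end{frame}
\BLOCK{endfor}
\end{document}
"""

# Hypothetical per-run artifacts produced by the training pipeline.
slides = [
    {"title": "Confusion matrix", "png": "confusion.png"},
    {"title": "Label distribution", "png": "labels.png"},
]

tex = latex_env.from_string(TEMPLATE).render(slides=slides)
print(tex)
```

The rendered string is ordinary beamer source, so the container only has to write it to a `.tex` file and run its LaTeX compiler of choice on it.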
These architecture patterns drastically simplify coding new functionalities. If you are familiar with Accelerate, then you know it is no lie that having a good codebase can reduce the time taken to implement a new feature by a factor of 10 to 50, and I can attest to it :).
Should you need to add a message broker to your microservice architecture, I can recommend RabbitMQ, as it is a breeze to plug into Python code thanks to the pika library. But here I have nothing to say about the alternatives (except from my reading: Kafka, Redis…), as I have never worked with them so far :).
Solution 3: Handling low-level separation of concerns – good code architecture
Having a clear separation of concerns between containers makes for a very clean container-level architecture. Look at this fantasy (but the one I advocate! :)) dependency graph for a pytorch_dl container:
and the chronology of the different module actions:
High-level view of the different groups of modules I advocate for:
- Adapters transform a raw CSV to a CSV dedicated to a particular prediction task.
- Filterers remove rows of the passed CSV if they fail to pass given filtering criteria (too rare label, etc). For both filterers and adapters, I often have generic classes implementing all the adapting and filtering logic and inheriting classes overriding the specific adapting/filtering logic of each given filter/adapter (Resource on ABC/protocols).
- Featurizers are always based on sklearn and essentially convert a CSV into a dictionary of feature names (string) to NumPy arrays. Here I wrap the usual suspects (TfidfVectorizer, StandardScaler) into my own classes, essentially because (for a reason unknown to me), sklearn does not offer memoization for its featurizers. I do not want to use pickle as it is not a security-compliant library and does not offer any protection against sklearn version changes. I thus always use a homemade improvement on this.
- PyTorch contains the Dataset, DataLoader, and Trainer logic.
- Model reports produce the beamer PDF reports already discussed above.
- Taggers group together deterministic techniques (think expert rules) to predict on rare data, for instance. In my experience, the performance of DL models can be improved with human knowledge, and you should always consider the possibility of doing so if feasible.
- MLConfiguration contains the ML data model: enums and classes that do not contain any processing methods. Think Hyperparameter class, PredictionType Enum, etc. Side note: use Enums over strings at all places where it makes sense (closed list of things).
- The pipeline plugs together all the elementary bricks.
- Routes contain the FastAPI routes that allow other containers to ask for predictions on new data. Indeed, I left Flask aside for the same reasons that I left click aside for Typer: less boilerplate, ease of use and maintainability, and even more functionalities. Tiangolo is a god :). I glanced at TorchServe to serve models, but given the project sizes I have been working on in my career, I have not yet felt the need to commit to it. Plus, TorchServe is (as of July 2022) still in its infancy.
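The "generic class plus specific subclass" pattern mentioned for filterers, combined with the Enum advice, could look like the sketch below. The class names, the enum members and the row format are hypothetical; only the `PredictionType` name comes from the list above.

```python
from abc import ABC, abstractmethod
from collections import Counter
from enum import Enum

Row = dict  # one CSV row, e.g. as produced by csv.DictReader


class PredictionType(Enum):
    """Closed list of prediction tasks; the values double as label column names."""
    DOCUMENT_TYPE = "document_type"
    LANGUAGE = "language"


class Filterer(ABC):
    """Generic filtering logic; subclasses only decide which rows to keep."""

    def filter(self, rows: list[Row]) -> list[Row]:
        self.prepare(rows)
        return [row for row in rows if self.keep(row)]

    def prepare(self, rows: list[Row]) -> None:
        """Optional hook to compute statistics over the whole table."""

    @abstractmethod
    def keep(self, row: Row) -> bool:
        """Return True if the row passes the filtering criterion."""


class RareLabelFilterer(Filterer):
    """Drops rows whose label appears fewer than min_count times."""

    def __init__(self, prediction_type: PredictionType, min_count: int) -> None:
        self.label_column = prediction_type.value
        self.min_count = min_count
        self._counts = Counter()

    def prepare(self, rows: list[Row]) -> None:
        self._counts = Counter(row[self.label_column] for row in rows)

    def keep(self, row: Row) -> bool:
        return self._counts[row[self.label_column]] >= self.min_count


rows = [
    {"text": "a", "document_type": "invoice"},
    {"text": "b", "document_type": "invoice"},
    {"text": "c", "document_type": "passport"},
]
kept = RareLabelFilterer(PredictionType.DOCUMENT_TYPE, min_count=2).filter(rows)
print([row["text"] for row in kept])  # ['a', 'b']
```

Adding a new filtering rule then means writing one small subclass, while the iteration logic stays in a single place.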
I now always enforce the dependencies between groups of modules in my different codebases with a custom pre-commit hook. This means that each time someone tries to add code that introduces a new dependency, a discussion is triggered between collaborators to evaluate the relevance of this new dependency. For instance, I see no reason as of today to create a dependency on model reports from pytorch given the architecture I presented. And I would always vote against ml_configuration depending on anything.
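The core of such a hook can be sketched with the standard library alone. This is a guess at the principle, not the original hook: the `ALLOWED` table and the package names are hypothetical, and a real hook would run the check over the staged files and exit with a non-zero status on any violation.

```python
import ast

# Hypothetical allow-list: each group of modules -> groups it may import from.
ALLOWED = {
    "ml_configuration": set(),
    "featurizers": {"ml_configuration"},
    "pytorch": {"ml_configuration", "featurizers"},
    "model_reports": {"ml_configuration"},
    "pipeline": {"ml_configuration", "featurizers", "pytorch", "model_reports"},
}


def imported_groups(source: str) -> set[str]:
    """Return the top-level package names imported by a piece of source code."""
    groups = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            groups.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            groups.add(node.module.split(".")[0])
    return groups


def forbidden_imports(group: str, source: str) -> set[str]:
    """Internal groups imported by `source` that `group` may not depend on."""
    internal = imported_groups(source) & set(ALLOWED)
    return internal - ALLOWED[group] - {group}


new_code = "from model_reports.beamer import build_report\nimport numpy\n"
print(forbidden_imports("pytorch", new_code))   # {'model_reports'}
print(forbidden_imports("pipeline", new_code))  # set()
```

Third-party imports such as `numpy` are ignored on purpose; only dependencies between the project's own groups of modules are policed.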
Solution 4: Simple configuration Data Model thanks to Pydantic
To avoid config in code as an untyped giant dictionary, I enforce the use of Pydantic for all configuration/Data model classes. I even got inspiration from the best Disney movies 🙂 (see code snippet)
This enforces a configuration defined in one and only one place, ideally in a JSON file outside the code, and thanks to Pydantic, serializing and deserializing the configuration are one-liners. I kept an eye on Hydra, but as explained here (very good channel) for instance, the framework may be too young and will presumably be more mature and more natural in a few months/years.
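A minimal sketch of what such a configuration model can look like, assuming Pydantic v2 (the field names and enum members are hypothetical, and the v1 API differs):

```python
from enum import Enum

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class PredictionType(str, Enum):
    CLASSIFICATION = "classification"
    REGRESSION = "regression"


class Hyperparameters(BaseModel):
    # frozen: no in-place edits; extra="forbid": unknown keys are rejected.
    model_config = ConfigDict(frozen=True, extra="forbid")

    learning_rate: float = Field(gt=0)
    n_layers: int = Field(ge=1)
    dropout: float = Field(ge=0, lt=1)
    prediction_type: PredictionType


# In practice this string would be read from a JSON file outside the code.
raw = (
    '{"learning_rate": 0.001, "n_layers": 4, '
    '"dropout": 0.1, "prediction_type": "classification"}'
)

config = Hyperparameters.model_validate_json(raw)  # deserialize and validate
round_trip = Hyperparameters.model_validate_json(config.model_dump_json())
print(round_trip == config)

try:
    config.dropout = 0.5  # in-place modification is refused on a frozen model
except ValidationError:
    print("configuration is frozen")
```

Typos in keys, out-of-range values and unknown prediction types are all caught when the file is loaded, rather than somewhere in the middle of a training run.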
In order to update the frozen configuration with the optuna trial, I usually just define a dictionary of mute actions (a mute action value for each hyperparameter key present in the optuna trial).
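Here is one way to read this, sketched under assumptions: I interpret an "action" as a small function that maps one trial value to the configuration fields it overwrites, and a plain dictionary stands in for the `params` of an Optuna trial. The field names and the `ACTIONS` table are hypothetical, and the code assumes Pydantic v2.

```python
from typing import Any, Callable

from pydantic import BaseModel, ConfigDict


class Hyperparameters(BaseModel):
    model_config = ConfigDict(frozen=True)

    learning_rate: float = 1e-3
    hidden_sizes: tuple[int, ...] = (128, 128)


# One action per key a trial may contain: trial value -> fields to overwrite.
ACTIONS: dict[str, Callable[[Any], dict[str, Any]]] = {
    "lr": lambda value: {"learning_rate": value},
    "n_layers": lambda value: {"hidden_sizes": (128,) * value},
}


def apply_trial(config: Hyperparameters, trial_params: dict[str, Any]) -> Hyperparameters:
    """Build a new validated configuration; the frozen original is left untouched."""
    updates: dict[str, Any] = {}
    for key, value in trial_params.items():
        updates.update(ACTIONS[key](value))
    return Hyperparameters(**{**config.model_dump(), **updates})


base = Hyperparameters()
tuned = apply_trial(base, {"lr": 0.01, "n_layers": 3})  # stand-in for trial.params
print(tuned)
```

Because a new object is created through the normal constructor, the tuned values go through the same validation as values loaded from the JSON file.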
Solution 5: Handling legacy models with frequent automatic retrains
Since the entry point to train a model is a unique Typer command (if you followed solutions 1 to 4 :)), it is easy to schedule it with cron and automatically re-train models periodically. Thanks to the reports and the metrics they contain, you then have two levels at which to decide whether to put the new model in production or not.
- Automatic, high-level: if the macro performance of the new model is better than the old one, put the new model in production.
- Manual, fine-grained: an expert can compare the two models in detail and conclude that, even if one model is somewhat worse than the other in terms of overall performance, it is still preferable because its predictions make more sense when it is wrong. For instance (here comes a completely fake vision example on ImageNet to clearly illustrate the point), the second model conflates tigers with lions when it’s wrong, whereas the first model predicts bees.
What do I mean by exporting a model into production? In the framework depicted above, it is essentially just copying a model folder from one location to another. Then one of the high-level configuration classes can load all of this in one go, in order to make new predictions via FastAPI and (ultimately) PyTorch. In my experience, PyTorch eases this procedure. With TensorFlow, I had to manually tweak the model checkpoints when I moved models from one folder to another.
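The automatic, high-level rule combined with the "copy a folder" export can be sketched as follows. The `metrics.json` file name, the `macro_f1` key and the folder layout are assumptions made for the example, not the original implementation.

```python
import json
import shutil
import tempfile
from pathlib import Path


def read_metric(model_dir: Path, metric: str) -> float:
    """Read one metric from a model folder; a missing model counts as worst."""
    metrics_file = model_dir / "metrics.json"
    if not metrics_file.exists():
        return float("-inf")
    return json.loads(metrics_file.read_text())[metric]


def promote_if_better(candidate_dir: Path, production_dir: Path,
                      metric: str = "macro_f1") -> bool:
    """Copy the candidate folder over production if its metric is strictly higher."""
    if read_metric(candidate_dir, metric) <= read_metric(production_dir, metric):
        return False
    if production_dir.exists():
        shutil.rmtree(production_dir)
    shutil.copytree(candidate_dir, production_dir)
    return True


# Demo on temporary folders standing in for the real model locations.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, score in [("candidate", 0.91), ("production", 0.88)]:
        (root / name).mkdir()
        (root / name / "metrics.json").write_text(json.dumps({"macro_f1": score}))
    promoted = promote_if_better(root / "candidate", root / "production")
    production_score = read_metric(root / "production", "macro_f1")

print(promoted, production_score)  # True 0.91
```

The manual, fine-grained level stays outside this function on purpose: when the expert disagrees with the metric, the folder is simply copied (or not) by hand.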
Solution 6: Improving code quality, a constant battle with a little help from my tools
On code quality and related matters, I have several hobbyhorses:
- As already mentioned, all the data model classes I implement are based on Pydantic (another Python god: Samuel Colvin).
- I docstring every method (but try to ban comments inside methods, which are, in my opinion, the sign of an urgent need to apply the good old extract method refactoring pattern :)). The Google style guide is a must-read (even if you do not adhere to all its aspects, know why you do not :)).
- I use Sourcery to automatically hunt down bad designs and apply suggested refactoring patterns (you can find the current list here, and they add new ones on a regular basis). This tool is such a time saver: bad code does not survive long, and your colleagues neither have to read it nor point it out during a painful code review. In fact, the only extensions that I urge everyone to use in PyCharm are Sourcery and Tabnine.
- Among other pre-commit hooks (remember the homemade one on the high-level dependencies I already talked about), I use autopep8, flake8, and mypy.
- I use pylint to lint my code bases and aim for a 9-9.5 target. This is completely arbitrary, but as Richard Thaler said – “I’m sure there’s an evolutionary explanation for this, if you give them [men] a target, they will aim.”
- I use unittest (this is the one I have experience with and I did not feel the need to switch to pytest. Even if it does mean some boilerplate I am more tolerant on the test side, as long as the tests exist!). For the same reason as the one mentioned in the last point, I aim for 95% coverage.
- I adopt the sklearn pattern for imports, meaning that everything that code outside a folder (group of modules) imports from it must be listed in that folder’s __init__.py. Every class/method listed there is the interface of the “package” and must be tested (unit and/or functional tests).
- I often tried to implement cross-platform deterministic tests (read this and this) but failed (though I did succeed on a fixed platform). Since GitLab runners change from time to time, this often leads to a lot of pain. I settle for a “high enough” performance in end-to-end tests.
- To avoid code duplication across several containers, I advocate for a low-level homemade library that you then install in each of your containers (via a command line in their respective Dockerfiles).
- Concerning CI, build your docker images in their respective GitLab pipelines.
- Try not to mount code in production (but do so locally to ease development. A very good reference blog on docker+python).
- I do not ship the tests in production, nor the libraries needed to run the tests (you should thus aim for two requirement files, one of them, requirement-dev.txt, not used in prod).
- I often have a custom python dev docker-compose file to ease my life (and the onboarding of new members) which is different from the production one.
- I advocate to (extensively) use the wiki part of your GitLab repos :), as the oral tradition was good at some stage of human history but is definitely not for IT companies :).
- I try to minimize the number of volumes mounted on my containers, the best number being 0 but for some data sources (like model checkpoint) it can be complicated.
- Handling dead code has a simple solution: Vulture. Run it, inspect its output (closely, as there are some false positives), unplug dead code, rinse and repeat.
All too often, you see self-congratulatory articles hiding what real life really is in the ML field. I hope that you leave this post knowing this is not one of these articles. This is the honest journey I went on in the past six years developing MLOps pipelines, and I can be all the more proud when I look back at where I was when I started coding in 2006 (a one-line method of more than 400 characters in C code :)).
In my experience, some switching decisions are easy to make and implement (Flask to FastAPI), some are easy to make but not so easy to implement (like Hyperopt to Optuna) and some are hard to make as well as hard to implement (like TensorFlow to PyTorch), but all are worth the effort in the end to avoid the 6 pitfalls I presented.
This mindset will hopefully allow you to transition from a POC-like ML environment to an Accelerate-compliant one where implementing new features can take less than an hour, and adding them to the code base takes less than another hour.
At a personal level, I learned an awful lot and I am deeply indebted to my previous employers and my previous colleagues for that!