Have you ever copy-pasted chunks of utility code between projects, resulting in multiple versions of the same code living in different repositories? Or, perhaps, you had to make pull requests to tens of projects after the name of the GCP bucket in which you store your data was updated?
The situations described above arise far too often in ML teams, and their consequences range from a single developer’s annoyance to the team’s inability to ship their code as needed. Luckily, there’s a remedy.
Let’s dive into the world of monorepos, an architecture widely adopted in major tech companies like Google, and how they can enhance your ML workflows. A monorepo offers a plethora of advantages which, despite some drawbacks, make it a compelling choice for managing complex machine learning ecosystems.
We will briefly debate monorepos’ merits and demerits, examine why they are an excellent architecture choice for machine learning teams, and peek into how BigTech is using them. Finally, we’ll see how to harness the power of the Pants build system to organize your machine learning monorepo into a robust CI/CD build system.
Strap in as we embark on this journey to streamline your ML project management.
What is a monorepo?
A monorepo (short for monolithic repository) is a software development strategy where code for many projects is stored in the same repository. The idea can be as broad as all of the company code written in a variety of programming languages stored together (did somebody say Google?) or as narrow as a couple of Python projects developed by a small team thrown into a single repository.
In this blog post, we focus on repositories storing machine learning code.
Monorepos vs. polyrepos
Monorepos are in stark contrast to the polyrepos approach, where each individual project or component has its own separate repository. A lot has been said about the advantages and disadvantages of both approaches, and we won’t go down this rabbit hole too deep. Let’s just put the basics on the table.
The monorepo architecture offers the following advantages:
- Single CI/CD pipeline, meaning no hidden deployment knowledge spread across individual contributors to different repositories;
- Atomic commits: since all projects reside in the same repository, developers can make changes that span multiple projects and merge them as a single commit;
- Easy sharing of utilities and templates across projects;
- Easy unification of coding standards and approaches;
- Better code discoverability.
Naturally, there are no free lunches. We need to pay for the above goodies, and the price comes in the form of:
- Scalability challenges: As the codebase grows, managing a monorepo can become increasingly difficult. At a really large scale, you’ll need powerful tools and servers to handle operations like cloning, pulling, and pushing changes, which can take a significant amount of time and resources.
- Complexity: A monorepo can be more complex to manage, particularly with regard to dependencies and versioning. A change in a shared component could potentially impact many projects, so extra caution is needed to avoid breaking changes.
- Visibility and access control: With everyone working out of the same repository, it can be difficult to control who has access to what. While not a disadvantage as such, it could pose problems of a legal nature in cases where code is subject to a very strict NDA.
The decision as to whether the advantages a monorepo offers are worth paying the price is to be determined by each organization or team individually. However, unless you are operating at a prohibitively large scale or are dealing with top-secret missions, I would argue that – at least when it comes to my area of expertise, the machine learning projects – a monorepo is a good architecture choice in most cases.
Let’s talk about why that is.
Machine learning with monorepos
There are at least six reasons why monorepos are particularly suitable for machine learning projects.
- Data pipeline integration
- Consistency across experiments
- Simplified model versioning
- Cross-functional collaboration
- Atomic changes
- Unification of coding standards
Data pipeline integration
Machine learning projects often involve data pipelines that preprocess, transform, and feed data into the model. These pipelines might be tightly integrated with the ML code. Keeping the data pipelines and ML code in the same repo helps maintain this tight integration and streamline the workflow.
Consistency across experiments
Machine learning development involves a lot of experimentation. Having all experiments in a monorepo ensures consistent environment setups and reduces the risk of discrepancies between different experiments due to varying code or data versions.
Simplified model versioning
In a monorepo, the code and model versions are in sync because they are checked into the same repository. This makes it easier to manage and trace model versions, which can be especially important in projects where ML reproducibility is critical.
The commit SHA at any given point in time identifies the state of all models and services.
Cross-functional collaboration
Machine learning projects often involve collaboration between data scientists, ML engineers, and software engineers. A monorepo facilitates this cross-functional collaboration by providing a single source of truth for all project-related code and resources.
Atomic changes
In the context of ML, a model’s performance can depend on various interconnected factors like data preprocessing, feature extraction, model architecture, and post-processing. A monorepo allows for atomic changes: a change touching several of these components can be committed as one, ensuring that interdependencies are always in sync.
Unification of coding standards
Finally, machine learning teams often include members without a software engineering background. These mathematicians, statisticians, and econometricians are brainy folks with brilliant ideas and the skills to train models that solve business problems. However, writing code that is clean, easy to read, and easy to maintain might not always be their strongest suit.
A monorepo helps because coding standards can be checked and enforced automatically across all projects with one shared set of tools, which not only ensures high code quality but also helps the less engineering-inclined team members learn and grow.
How they do it in industry: famous monorepos
In the software development landscape, some of the largest and most successful companies in the world use monorepos. Here are a few notable examples.
- Google: Google has long been a staunch advocate for the monorepo approach. Their entire codebase, estimated to contain 2 billion lines of code, is contained in a single, massive repository. They even published a paper about it.
- Meta: Meta also employs a monorepo for its vast codebase. To handle the size and complexity of that monorepo, the company adopted and heavily extended the Mercurial version control system, and later built its own source control system, Sapling, on that experience.
- Twitter: Twitter has been managing their monorepo for a long time using Pants, the build system we will talk about next!
Many other companies such as Microsoft, Uber, Airbnb, and Stripe are using the monorepo approach at least for some parts of their codebases, too.
Enough of the theory! Let’s take a look at how to actually build a machine learning monorepo. Because just throwing what used to be separate repositories into one folder does not do the job.
How to set up an ML monorepo with Python?
Throughout this section, we will base our discussion on a sample machine learning repository I’ve created for this article. It is a simple monorepo holding just one project, or module: a hand-written digits classifier called mnist, after the famous dataset it uses.
All you need to know right now is that in the monorepo’s root there is a directory called mnist, and in it, there is some Python code for training the model, the corresponding unit tests, and a Dockerfile to run training in a container.
We will be using this small example to keep things simple, but in a larger monorepo, mnist would be just one of the many project folders in the repo’s root, each of which would contain source code, tests, Dockerfiles, and requirements files at the least.
Build system: Why do you need one and how to choose it?
Think about all the actions, other than writing code, that the different teams developing different projects within the monorepo take as part of their development workflow. They would run linters against their code to ensure adherence to style standards, run unit tests, build artifacts such as docker containers and Python wheels, push them to external artifact repositories, and deploy them to production.
Take testing as an example. You’ve made a change in a utility function you maintain, run the tests, and all’s green. But how can you be sure your change is not breaking code for other teams that might be importing your utility? You should run their test suite, too, of course.
But to do this, you need to know exactly where the code you changed is being used. As the codebase grows, finding this out manually doesn’t scale well. Of course, as an alternative, you can always execute all the tests, but again: that approach doesn’t scale very well.
Another example: production deployment.
Whether you deploy weekly, daily, or continuously, when the time comes, you would build all the services in the monorepo and push them to production. But hey, do you need to build all of them on each occasion? That could be time-consuming and expensive at scale.
Some projects might not have been updated for weeks. On the other hand, the shared utility code they use might have received updates. How do we decide what to build? Again, it’s all about dependencies. Ideally, we would only build services that have been affected by the recent changes.
With a small codebase, all of this can be handled with a simple shell script, but as the codebase scales and projects start sharing code, challenges emerge, many of which revolve around dependency management.
Picking the right system
All of the above is not a problem anymore if you invest in a proper build system. A build system’s primary task is to build code. And it should do so in a clever way: the developer should only need to tell it what to build (“build docker images affected by my latest commit”, or “run only those tests that cover code which uses the method I’ve updated”), but the how should be left for the system to figure out.
There are a couple of great open-source build systems out there. Since most machine learning is done in Python, let’s focus on the ones with the best Python support. The two most popular choices in this regard are Bazel and Pants.
Bazel is an open-source version of Google’s internal build system, Blaze. Pants is also heavily inspired by Blaze and it aims for similar technical design goals as Bazel. An interested reader will find a good comparison of Pants vs. Bazel in this blog post (but keep in mind it comes from the Pants devs). The table at the bottom of monorepo.tools offers yet another comparison.
Both systems are great, and it is not my intention to declare a “better” solution here. That being said, Pants is often described as easier to set up, more approachable, and well-optimized for Python, which makes it a perfect fit for machine learning monorepos.
In my personal experience, the decisive factor that made me go with Pants was its active and helpful community. Whenever you have questions or doubts, just post on the community Slack channel, and a bunch of supportive folks will help you out soon.
Alright, time to get to the meat of it! We will go step by step, introducing Pants’ different features and how to use them. Again, you can check out the associated sample repo here.
Pants is installable with pip. In this tutorial, we will use the most recent stable version as of this writing, 2.15.1.
Pants is configurable through a global master config file named pants.toml. In it, we can configure Pants’ own behavior as well as the settings of downstream tools it relies on, such as pytest or mypy.
Let’s start with a bare minimum pants.toml:
In the global section, we define the Pants version and the backend packages we need. These packages are Pants’ engines that support different features. For starters, we only include the Python backend.
In the source section, we set the source to the repository’s root. Since version 2.15, to make sure this is picked up, we also need to add an empty BUILD_ROOT file at the repository’s root.
Finally, in the Python section, we choose the Python version to use. Pants will browse our system in search of a version that matches the conditions specified here, so make sure you have this version installed.
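Put together, the three sections described above could look like the sketch below, written as a shell heredoc so that it can be pasted at the repository root. The `==3.9.*` interpreter constraint is an assumption made for illustration; use whichever Python version your project targets.

```shell
# Write a minimal pants.toml at the repository root.
cat > pants.toml <<'EOF'
[GLOBAL]
pants_version = "2.15.1"
backend_packages = [
    "pants.backend.python",
]

[source]
# "/" marks the repository root as the single source root.
root_patterns = ["/"]

[python]
# Assumption: the project targets Python 3.9.
interpreter_constraints = ["==3.9.*"]
EOF

# Empty marker file at the repository root, as described above.
touch BUILD_ROOT
```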
That’s a start! Next, let’s take a look at any build system’s heart: the BUILD files.
Build files are configuration files used to define targets (what to build) and their dependencies (what they need to work) in a declarative way.
You can have multiple build files at different levels of the directory tree. The more there are, the more granular the control over dependency management. In fact, Google has a build file in virtually every directory in their repo.
In our example, we will use three build files:
- mnist/BUILD – in the project directory, this build file will define the python requirements for the project and the docker container to build;
- mnist/src/BUILD – in the source code directory, this build file will define python sources, that is, files to be covered by python-specific checks;
- mnist/tests/BUILD – in the tests directory, this build file will define which files to run with Pytest and what dependencies are needed for these tests to run.
Let’s take a look at the mnist/src/BUILD:
At the same time, mnist/BUILD looks like this:
The two entries in the build files are referred to as targets. First, we have a Python sources target, which we aptly call python, although the name could be anything. We define our Python sources as all .py files in the directory. This is relative to the build file’s location, that is: even if we had Python files outside of the mnist/src directory, these sources only capture the contents of the mnist/src folder. There is also a resolve field; we will talk about it in a moment.
Next, we have the Python requirements target. It tells Pants where to find the requirements needed to execute our Python code (again, relative to the build file’s location, which is in the mnist project’s root in this case).
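As a sketch, the two targets described above could be declared as follows. The target name `reqs`, the recursive `**/*.py` glob, and the assumption that requirements.txt sits next to mnist/BUILD are illustrative choices; the `python` target name and the `mnist` resolve come from the text.

```shell
mkdir -p mnist/src

# mnist/src/BUILD: Python sources, globbed relative to this BUILD file.
cat > mnist/src/BUILD <<'EOF'
python_sources(
    name="python",
    resolve="mnist",
    sources=["**/*.py"],
)
EOF

# mnist/BUILD: third-party requirements read from mnist/requirements.txt.
cat > mnist/BUILD <<'EOF'
python_requirements(
    name="reqs",
    source="requirements.txt",
    resolve="mnist",
)
EOF
```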
This is all we need to get started. To make sure the build file definition is correct, let’s run:
As expected, we get: “No required changes to BUILD files found.” as the output. Good!
Let’s spend a bit more time on this command. In a nutshell, a bare pants tailor can automatically create build files. However, it sometimes tends to add too many for one’s needs, which is why I tend to add them manually, followed by the command above that checks their correctness.
The double colon (::) at the end is a Pants notation that tells it to run the command over the entire monorepo. Alternatively, we could have replaced it with mnist: to run only against the mnist module.
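The check can be wrapped in a small helper, sketched below. The function name `check_build_files` is hypothetical; the command it runs assumes the `tailor` and `update-build-files` goals, each in `--check` mode, and takes the target spec as an optional argument.

```shell
# Hypothetical helper: verify BUILD files without modifying them.
# Usage: check_build_files [SPEC]   (SPEC defaults to "::", the whole repo)
check_build_files() {
  pants tailor --check update-build-files --check "${1:-::}"
}
```

Calling `check_build_files` with no argument checks the entire monorepo, while `check_build_files mnist:` restricts the check to the mnist module.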
Dependencies and lockfiles
To manage dependencies efficiently, Pants relies on lockfiles. Lockfiles record the specific versions and sources of all dependencies used by each project. This includes both direct and transitive dependencies.
By capturing this information, lockfiles ensure that the same versions of dependencies are used consistently across different environments and builds. In other words, they serve as a snapshot of the dependency graph, ensuring reproducibility and consistency across builds.
To generate a lockfile for our mnist module, we need the following addition to pants.toml:
We enable resolves (Pants’ term for lockfile environments) and define one for mnist, passing a file path. We also set it as the default. This is the resolve we passed to the Python sources and Python requirements targets earlier: this is how they know which dependencies are needed. We can now run:
This has created a file at mnist/mnist.lock. This file should be checked into git if you intend to use Pants for your remote CI/CD. And naturally, it needs to be updated every time you update the requirements.txt file.
With more projects in the monorepo, you would rather generate lockfiles selectively for the resolve that needs it, e.g. pants generate-lockfiles --resolve=mnist.
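A sketch of pants.toml after enabling resolves is shown below. It rewrites the whole file so that the `[python]` table stays valid TOML. The interpreter constraint is an assumption; the resolve name and lockfile path follow the text above.

```shell
cat > pants.toml <<'EOF'
[GLOBAL]
pants_version = "2.15.1"
backend_packages = [
    "pants.backend.python",
]

[source]
root_patterns = ["/"]

[python]
interpreter_constraints = ["==3.9.*"]  # assumption, adjust to your project
enable_resolves = true
default_resolve = "mnist"

[python.resolves]
# resolve name -> path of the lockfile Pants generates for it
mnist = "mnist/mnist.lock"
EOF

# With Pants installed, generate the lockfile(s) from the repository root:
#   pants generate-lockfiles
```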
That’s it for the setup! Now let’s use Pants to do something useful for us.
Unifying code style with Pants
Pants natively supports a number of Python linters and code formatting tools such as Black, yapf, Docformatter, Autoflake, Flake8, isort, Pyupgrade, or Bandit. They are all used in the same way; in our example, let’s implement Black and Docformatter.
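To do so, we add the two corresponding backends, pants.backend.python.lint.docformatter and pants.backend.python.lint.black, to the backend packages list in pants.toml.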
We could configure both tools if we wanted to by adding additional sections below in the toml file, but let’s stick with the defaults now.
To use the formatters, we need to execute what’s called a Pants goal. In this case, two goals are relevant.
First, the lint goal will run both tools (in the order in which they are listed in backend packages, so Docformatter first, Black second) in the check mode.
It looks like our code adheres to the standards of both formatters! However, if that was not the case, we could execute the fmt (short for “format”) goal that adapts the code appropriately:
In practice, you might want to use more than these two formatters. In this case, you may need to update each formatter’s config to ensure that it is compatible with the others. For instance, if you are using Black with its default config as we have done here, it will expect code lines not to exceed 88 characters.
But if you then want to add isort to automatically sort your imports, they will clash: by default, isort wraps lines at 79 characters. To make isort compatible with Black, you would need to include the following section in the toml file:
All formatters can be configured in the same way in pants.toml by passing the arguments to their underlying tool.
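For reference, the isort override mentioned above can be appended to pants.toml as sketched here. The snippet assumes the `pants.backend.python.lint.isort` backend has been added to the backend packages, and it uses isort’s `--line-length` option as one possible way to match Black’s 88-character limit.

```shell
# Append an [isort] section; "args" are passed through to the isort CLI.
cat >> pants.toml <<'EOF'

[isort]
args = ["--line-length=88"]
EOF

# With the formatter backends enabled, the two goals are run as:
#   pants lint ::   # check mode, reports violations without changing files
#   pants fmt ::    # rewrites files in place
```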
Testing with Pants
Let’s run some tests! To do this, we need two steps.
First, we add the appropriate sections to pants.toml:
These settings make sure that as the tests are run, a test coverage report is produced. We also pass a couple of custom pytest options to adapt its output.
Next, we need to go back to our mnist/tests/BUILD file and add a Python tests target:
We call it tests and specify the resolve (i.e. lockfile) to use. Sources are the locations where pytest is allowed to look for tests to run; here, we explicitly pass all .py files prefixed with “test_”.
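A sketch of both steps follows. The `[test]` and `[coverage-py]` options shown here enable a console coverage report, and the pytest arguments are illustrative assumptions rather than the author’s exact choices.

```shell
# Step 1: append test-related sections to pants.toml.
cat >> pants.toml <<'EOF'

[test]
use_coverage = true

[coverage-py]
report = ["console"]

[pytest]
# Assumed options: verbose output without the pytest header.
args = ["-vv", "--no-header"]
EOF

# Step 2: declare the tests target next to the test files.
mkdir -p mnist/tests
cat > mnist/tests/BUILD <<'EOF'
python_tests(
    name="tests",
    resolve="mnist",
    sources=["test_*.py"],
)
EOF

# With Pants installed, run the whole suite from the repository root:
#   pants test ::
```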
Now we can run:
As you can see, it took around three seconds to run this test suite. Now, if we re-run it, we will get the results immediately:
Notice how Pants tells us these results are memoized, or cached. Since no changes have been made to the tests, the code being tested, or the requirements, there is no need to actually re-run the tests – their results are guaranteed to be the same, so they are just served from the cache.
Checking static typing with Pants
Let’s add one more code quality check. Pants allows using mypy to check static typing in Python. All we need to do is add the mypy backend in pants.toml: “pants.backend.python.typecheck.mypy”.
You might also want to configure mypy to make its output more readable and informative by also adding the following config section:
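One possible version of that section is sketched below. The three flags are standard mypy options picked for illustration; they are an assumption, not necessarily the set used in the sample repo.

```shell
# Append a [mypy] section; "args" are passed through to the mypy CLI.
cat >> pants.toml <<'EOF'

[mypy]
args = [
    "--pretty",
    "--show-error-codes",
    "--show-error-context",
]
EOF

# With the mypy backend enabled, type checking is run as:
#   pants check ::
```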
Shipping ML models with Pants
Let’s talk shipping. Most machine learning projects involve one or more docker containers, for example, processing training data, training a model, or serving it via an API using Flask or FastAPI. In our toy project, we also have a container for model training.
Pants supports automatic building and pushing of docker images. Let’s see how it works.
First, we add the docker backend in pants.toml: pants.backend.docker. We will also configure our docker, passing it a number of environment variables and a build arg which will come in handy in a moment:
In the mnist/BUILD file, we define a docker image target and call it “train_mnist”. As a dependency, we need to pass it the list of files to be included in the container. The most convenient way to do this is to define this list as a separate files target. Here, we simply include all the files in the mnist project in a target called module_files, and pass it as a dependency to the docker image target.
Naturally, if you know that only some subset of files will be needed by the container, it’s a good idea to pass only them as a dependency. This matters because Pants uses these dependencies to infer whether a container has been affected by a change and needs a rebuild. Here, with module_files including all files, if any file in the mnist folder changes (even a readme!), Pants will see the train_mnist docker image as affected by this change.
Finally, we can also set the external registry and repository to which the image can be pushed, and the tags with which it will be pushed: here, I will be pushing the image to my personal dockerhub repo, always with two tags: “latest”, and the short commit SHA which will be passed as a build arg.
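The configuration and targets described above might look like the following sketch. The registry address, the repository placeholder `your-user/mnist`, and the exact list of forwarded environment variables are assumptions; the target names `train_mnist` and `module_files` and the `SHORT_SHA` build arg come from the text.

```shell
# Docker settings in pants.toml.
cat >> pants.toml <<'EOF'

[docker]
# Build arg whose value is taken from the environment Pants is launched with.
build_args = ["SHORT_SHA"]
# Environment variables made available to the docker CLI (assumed list).
env_vars = ["DOCKER_CONFIG=%(homedir)s/.docker", "HOME", "USER", "PATH"]
EOF

# Files and docker image targets appended to mnist/BUILD.
mkdir -p mnist
cat >> mnist/BUILD <<'EOF'

files(
    name="module_files",
    sources=["**/*"],
)

docker_image(
    name="train_mnist",
    dependencies=[":module_files"],
    registries=["docker.io"],
    repository="your-user/mnist",  # placeholder, replace with your own
    image_tags=["latest", "{build_args.SHORT_SHA}"],
)
EOF
```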
With this, we can build an image. Just one more thing: since Pants runs its processes in isolated environments, it does not pick up environment variables from the host on its own. Hence, to build or push the image that requires the SHORT_SHA variable, we need to pass it together with the Pants command.
We can build the image like this:
A quick check reveals that the images have indeed been built:
We can also build and push images in one go using Pants. All it takes is replacing the package command with the publish command.
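Both commands can be sketched as small helpers that pass SHORT_SHA inline with the Pants invocation. The function names are hypothetical, and deriving the value with `git rev-parse --short HEAD` is an assumption about how the short SHA is obtained.

```shell
# Hypothetical helpers; each takes an optional short SHA as its argument
# and falls back to the current commit when none is given.
build_image() {
  local sha="${1:-$(git rev-parse --short HEAD)}"
  # SHORT_SHA is set only for this Pants invocation.
  SHORT_SHA="$sha" pants package mnist:train_mnist
}

publish_image() {
  local sha="${1:-$(git rev-parse --short HEAD)}"
  SHORT_SHA="$sha" pants publish mnist:train_mnist
}
```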
This built the images and pushed them to my dockerhub, where they have indeed landed.
Pants in CI/CD
The same commands we have just manually run locally can be executed as parts of a CI/CD pipeline. You can run them via services such as GitHub Actions or Google CloudBuild, for instance as a PR check before a feature branch is allowed to be merged to the main branch, or after the merge, to validate it’s green and build & push containers.
In our toy repo, I have implemented a pre-push git hook that runs Pants commands on git push and only lets it through if they all pass. In it, we are running the following commands:
You can see some new flags for pants check, that is the typing check with mypy. They ensure that the check is only run on files that have changed compared to the main branch and on the code that transitively depends on them. This is useful since mypy tends to take some time to run. Limiting its scope to what’s actually needed accelerates the process.
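A sketch of such a hook is shown below. The exact set and order of commands is an assumption based on the goals used earlier in this article, and `main` is assumed to be the name of the default branch.

```shell
mkdir -p .git/hooks
cat > .git/hooks/pre-push <<'EOF'
#!/usr/bin/env bash
# Abort the push as soon as any check fails.
set -euo pipefail

pants tailor --check update-build-files --check ::
pants lint ::
# Type-check only files changed relative to main, plus their dependees.
pants --changed-since=main --changed-dependees=transitive check
pants test ::
EOF
chmod +x .git/hooks/pre-push
```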
How would a docker build & push look in a CI/CD pipeline? Somewhat like this:
We use the publish command as before, but with three additional arguments:
- --changed-since=HEAD^ and --changed-dependees=transitive make sure that only the containers affected by the changes compared to the previous commit are built; this is useful for executing on the main branch after the merge.
- --filter-target-type=docker_image makes sure that the only thing Pants does is build and push docker images; this is because the pants publish command can refer to targets other than docker: for example, it can be used to publish helm charts to OCI registries.
The same goes for pants package: on top of building docker images, it can also create a Python package; for that reason, it’s a good practice to pass the --filter-target-type option.
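Assembled from the three options listed above, the CI step could be sketched as follows. The helper name is hypothetical, and SHORT_SHA is assumed to be derived from the current commit unless passed explicitly.

```shell
# Hypothetical CI helper: publish only docker images affected by the last commit.
publish_changed_images() {
  local sha="${1:-$(git rev-parse --short HEAD)}"
  SHORT_SHA="$sha" pants \
    --filter-target-type=docker_image \
    --changed-since=HEAD^ \
    --changed-dependees=transitive \
    publish
}
```

No target spec is passed here: the set of targets comes from the `--changed-*` options, narrowed by the target-type filter.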
Monorepos are more often than not a great architecture choice for machine learning teams. Managing them at scale, however, requires investment in a proper build system. One such system is Pants: it’s easy to set up and use and offers native support for many Python and Docker features that machine learning teams often use.
On top of that, it is an open-source project with a large and helpful community. I hope after reading this article you will go ahead and try it out. Even if you don’t currently have a monolithic repository, Pants can still streamline and facilitate many aspects of your daily work!
- Pants documentation: https://www.pantsbuild.org/
- Pants vs. Bazel blog post: https://blog.pantsbuild.org/pants-vs-bazel/
- monorepo.tools: https://monorepo.tools/