How to Scale ML Projects – Lessons Learned from Experience
Over the past decade, machine learning has helped turn data into one of the most valuable assets an organization owns. Building business solutions with ML applications on vast datasets is easier than ever, thanks to advances in cloud infrastructure for hosting and running ML workloads.
At the enterprise level, ML models support many different aspects of the business, from data source integration and data field augmentation to customized service or product recommendations, operational cost savings, and much more.
As members of the data science community, we all like to brainstorm ideas for ML models on a whiteboard, and then develop models on our machines. This general process works as long as the data and/or model is manageable. However, doing so at scale can become extremely challenging.
For example, managing and processing large quantities of data, selecting the optimal yet efficient training algorithms, and finally deploying and monitoring large-scale models — none of these are trivial matters that can be solved on a whiteboard.
Luckily, there are better ways to scale and accelerate ML workloads. In this article, I’ll share the biggest challenges my team and I have faced, and the lessons we learned from working on ML projects at scale. Specifically, we’re going to discuss:
- What is scalability, and why does it matter?
- Challenges and lessons learned / best practices from my experience with scaling ML projects.
What is scalability, and why does it matter?
As in software development, it’s easy to build an ML model that works for you and your team, but it gets much more complicated when the model needs to work for people across the world. Nowadays, data teams are expected to build scalable applications that serve millions of users across millions of locations at a reasonable speed.
This has led to the rise of an entirely new field known as MLOps, dedicated to the containerization, orchestration, and distribution of ML applications — all of which make it easier to scale ML projects.
Why does scalability matter? Training an ML model on massive amounts of data with complicated algorithms is both memory- and time-consuming, and usually impossible on a local computer. Scalability is about processing and computing over huge datasets in a cost-efficient way, which makes working with large amounts of data easier and brings other benefits:
- Boost productivity
ML model development involves lots of trial and error before finding an optimal model. A seamless workflow with fast execution will allow data teams to experiment with more ideas and be more productive.
- Enhance modularization and team collaboration
Modularization at scale makes it easy to reproduce and reuse ML applications. Thus, rather than working in silos to create hundreds of pipelines on their own, data science teams within an organization can share and collaborate.
- Promote ML automation and reduce cost
A scalable automated process with minimum human interference not only makes the model less error-prone but also frees up resources to focus on more impactful tasks instead. Therefore, scaling helps to leverage resources to a maximum and reduce business costs.
Now that we have a basic understanding of ML scalability and the benefits it brings to the organization, let’s move on to five challenges that I have faced while developing scalable ML pipelines, and discuss the lessons I learned from dealing with those challenges.
Challenge #1 – Features are the hardest component to get right
ML features are measurable properties or characteristics that act as inputs to the system. Depending on the project, these features can come from almost anywhere: tabular data stored in a data warehouse, human language and image pixels, or predictors derived through feature engineering.
To serve a large-scale ML application, we need to access and experiment with billions of records to create a list of promising features that our model can digest. I have found the workflow of feature generation and storage quite challenging, especially when it’s done at scale. Admittedly, one may argue that this can be easily achieved by machines with more processing power or switching to cloud servers.
However, all of the above is just the tip of the iceberg—the most challenging part for my team always comes when the model is implemented in production. On such a large scale, how could we calculate and continuously monitor features to ensure that their statistical properties remain relatively stable? On top of that, as more observations keep being added, how could we store these features to ensure that our model is properly trained on this ever-increasing dataset?
To add another layer of challenge, once Model #1 is in production and the next model comes along, my team needs to do the large-scale feature processing all over again, because that logic is embedded in the previous model’s training jobs. You can imagine how much duplicated work this ML infrastructure causes!
With all of these feature challenges in mind, my team started to explore (large-scale) feature stores, a concept introduced by Uber’s Michelangelo platform in 2017. A feature store is a centralized operational data management system that allows data teams to store, share, and reuse curated features for ML projects.
It’s widely accepted that there are two types of ML features: offline and online. Offline features are mainly used in batch operations and are commonly stored in traditional data warehouses. Online features, by contrast, need to be computed and served in (near) real time, and are stored in a key-value database such as Redis for rapid lookups. Large-scale feature stores support both modes: offline batch processing and low-latency online serving.
From the perspective of model monitoring, feature stores maintain all historical data along with the most up-to-date values in a time-consistent manner. This makes comparing feature values and model performance as easy as an API call.
Additionally, feature stores also help minimize work duplications in the sense that all available features stored in this centralized location are easily accessible to internal data teams within an organization. Some data science experts even say that the next few years will be the years of the feature store.
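To make the storage/serving split concrete, here is a toy, in-memory feature store in Python. It is only a sketch: real feature stores (Feast, Michelangelo, and the like) back offline features with a warehouse and online features with a key-value store, and every name below is made up for the example.

```python
from dataclasses import dataclass, field
import time

@dataclass
class ToyFeatureStore:
    # {feature_name: {entity_id: [(timestamp, value), ...]}}
    _data: dict = field(default_factory=dict)

    def write(self, feature, entity_id, value, ts=None):
        # Append, never overwrite: keeping history enables point-in-time
        # training joins and later drift analysis.
        rows = self._data.setdefault(feature, {}).setdefault(entity_id, [])
        rows.append((ts if ts is not None else time.time(), value))

    def get_online(self, feature, entity_id):
        # Online serving: return only the latest value, for fast lookups.
        return self._data[feature][entity_id][-1][1]

    def get_history(self, feature, entity_id):
        # Offline use: the full history, for training and monitoring.
        return list(self._data[feature][entity_id])

store = ToyFeatureStore()
store.write("avg_order_value", "user_42", 18.5, ts=1)
store.write("avg_order_value", "user_42", 21.0, ts=2)
print(store.get_online("avg_order_value", "user_42"))   # -> 21.0 (latest)
```

Because every team reads and writes through one interface like this, the same feature is computed once and reused, rather than re-derived inside each model’s training job.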
Challenge #2 – Data science programming languages can be slow and choosing the right processors is critical
Once features/input predictors are ready to be accessed and used (probably in a feature store), data science teams will kick off the model development process.
Most ML applications are written in Python or R, the most popular languages among data scientists. For our team and probably plenty of others, we often build models with a combination of multiple languages.
As much as I enjoy coding in Python and R, I have to note that large-scale ML models are rarely productized in these languages, for reasons of speed. Faster production languages like C/C++ or Java are often preferred, just as in general software development. Hence, wrapping a Python or R model in C/C++ or Java poses another challenge for our team.
Beyond programming languages, there’s the requirement for powerful processors to perform iterative, heavy computational operations. This makes hardware one of the most critical decisions to make in ML scalability, especially focusing on fast executions of matrix multiplications.
In one sentence: as far as large-scale ML applications go, Java might not be the best language, and CPUs may not be optimal!
Java is widely considered the top ML production language, primarily because practitioners believe it performs faster than Python or R. However, this isn’t always the case. ML packages in Python and R are largely optimized wrappers over C/C++ code, which can make their computations faster than native Java implementations.
On the same note of computational speed: from my experience, CPUs are far from optimal for ML models at scale because of their sequential processing nature. GPUs are a great choice for computations that benefit from parallel processing.
Another advancement is ASICs (Application-Specific Integrated Circuits), such as Google’s TPUs (Tensor Processing Units). TPUs are designed around systolic arrays that perform matrix multiplications while minimizing memory access, reducing both computation time and cost. In fact, easy TPU initialization is part of the reason my team has stuck with Google’s TensorFlow framework, even considering the infamous compatibility issues between TensorFlow versions.
Challenge #3 – Analysis data is too large and/or ML algorithms are too complicated to handle for a single machine
One of the projects my team has worked on was to develop a large-scale image classification algorithm. For projects like this one with data and models at scale, the traditional way of conducting ML on a single computer no longer works. This leads to the solution of distributed ML, which can work with big data and complex algorithms in a computation- and memory-scalable way. The idea behind distributed ML is intuitive: if our algorithms can’t be completed on a single node, we can distribute the computational workload among multiple nodes.
A related concept to distributed ML is parallelism or parallel processing, which refers to both data (processing) parallelism and algorithm (building) parallelism. By partitioning data or ML models and allocating the computational work to different workers, we can leverage multiple cores at the same time to get accelerated performance. In fact, in our image classification project, we were able to gain a 10x speed improvement with distributed ML algorithms.
One of the most popular open-source distributed computing frameworks is Apache Hadoop, which provides a MapReduce API in several programming languages. Another is Apache Spark, which performs faster in-memory computations.
Implementation-wise, most mature ML frameworks, such as TensorFlow and PyTorch, offer APIs for distributed computation to get your team started.
Generally speaking, distributed ML works effectively, especially with data parallelism. With respect to the algorithm/model parallelism, however, it may not work as well.
Algorithm parallelism, in the context of ML applications, is mainly about matrix manipulation. For multiple workers to compute at the same time, we need to split a huge matrix into smaller parts, assign one or more parts to each worker, and finally collect the results back through communication among the workers. This last step is referred to as synchronization. Under certain circumstances, synchronization can take longer than the distributed computation itself, because workers rarely finish at the same time.
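The matrix-partitioning idea can be sketched as follows. Each row block’s product could run on a separate worker; re-assembling the partials in order is the synchronization step, and the slowest worker determines when it can happen. This is an illustration of the decomposition, not a distributed implementation.

```python
def matmul(a, b):
    # Plain single-worker matrix multiply over nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def blockwise_matmul(a, b, n_blocks=2):
    # Split A's rows into blocks; each block x B is independent work.
    size = (len(a) + n_blocks - 1) // n_blocks
    row_blocks = [a[i:i + size] for i in range(0, len(a), size)]
    partials = [matmul(block, b) for block in row_blocks]
    # Synchronization: stitch the partial results back in order.
    return [row for partial in partials for row in partial]

a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0], [0, 1]]  # identity, so the product equals a
print(blockwise_matmul(a, b))
```

Note that the blockwise result is exactly equal to the single-worker product; partitioning changes scheduling and communication, never the answer.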
Because of that, linear scalability (the ability to get the same percentage increase in performance as the percentage of added resources) almost never happens in real ML projects (at least from my experience).
You might want to spend extra time and energy to check the output after synchronization to ensure the correct config of your distributed systems and convergence of your ML applications.
Challenge #4 – Framework versions and dependency problems with ML deployment
Ask any data scientist who has launched large-scale ML applications what their biggest challenges were. Chances are that package versions and dependency hell will be one of them, and our team is no exception.
When we deploy an ML application, first and foremost, we have to make sure that the productized model consistently reproduces its training counterpart. This is particularly crucial when the ML models involve huge data and/or complicated algorithms, as they could take longer to train and be harder to debug if anything goes wrong.
For our team, package version and dependency issues were the top reasons why our ML models failed the moment they ran in production. Additionally, because of the requirements on package versions, we used to only host one productized model on each virtual machine.
If you ask software developers what their best practices are in terms of writing large programs, chances are that splitting your code into chunks is one of them.
Applying this idea to ML deployment, I would recommend leveraging Docker containers and bundling everything model-related into a single package that runs in the cloud.
It’s beyond the scope of this blog to discuss how to scale up ML using Docker. Nonetheless, it’s worth emphasizing that Docker containers let my team run large-scale work within the same container environment, and deploy multiple application instances on the same machine.
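To make the "bundle everything model-related" idea concrete, here is a minimal Dockerfile sketch. The file names (`requirements.txt`, `serve.py`, `model/`) and the Python version are illustrative assumptions for a generic Python model, not a prescription.

```dockerfile
# One image per model: code, pinned dependencies, and the trained
# artifact travel together, so production matches training exactly.
FROM python:3.10-slim

WORKDIR /app

# Pin dependencies to dodge the version/dependency hell described above.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the serving entry point and the trained model artifact.
COPY serve.py .
COPY model/ ./model/

CMD ["python", "serve.py"]
```

Because each model is pinned to its own image, several containers with conflicting package versions can run side by side on the same machine, which is exactly what freed us from the one-model-per-VM setup.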
As a side note, not all Docker images are created equal. In our experience, Google’s TensorFlow Docker images are highly performance-optimized, to the point that PyTorch ran faster inside TensorFlow images than inside PyTorch’s own images. This may be useful when you’re selecting data science frameworks for your next large ML application.
Another piece of infrastructure we’re currently exploring is microservices, an architecture that makes ML application deployment more modular. We’ll be creating an individual microservice for each stage of the ML pipeline, and these microservices will be accessible through RESTful APIs. This will keep our large-scale applications organized and manageable.
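Here is a minimal sketch of such a prediction microservice using only the standard library; the one-line "model" inside the handler (summing the feature values) is a stand-in for a real scoring function, and the `/predict` route is a hypothetical name.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body.
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        # Stand-in "model": a real service would load a trained artifact.
        score = sum(payload.get("features", []))
        body = json.dumps({"score": score}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging in this demo

# Serve on an ephemeral port in a background thread, then call it.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

req = urllib.request.Request(
    f"http://127.0.0.1:{port}/predict",
    data=json.dumps({"features": [0.5, 1.5]}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
print(result)  # -> {'score': 2.0}
```

With each pipeline stage exposed behind an API like this, scaling becomes a matter of running more container replicas behind a load balancer rather than redeploying the whole application.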
Challenge #5 – A large ML application is never a one-time deal; it requires efficient optimization methodologies to fine-tune or retrain through iterative processes
We’ve built a great model and launched it in production. Now it’s time for us to pop the cork on a bottle of champagne and celebrate the mission accomplished, right? Well, not yet, unfortunately.
This is just the first step of a long journey! Our application will need frequent fine-tuning and retraining as our organization acquires new data over time (after model deployment).
To address this, my team implemented automatic model retraining whenever data drift (a change in the statistical properties of incoming samples) is detected. For large-scale ML applications, tuning or retraining is challenging due to the high cost in time and computation, not to mention that it’s expected to happen on a regular basis.
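A drift check can be as simple as comparing a new batch’s feature mean to the training baseline, in units of the baseline’s standard deviation. The threshold and numbers below are illustrative; production monitoring would track many features and use more robust tests (e.g. population stability index or Kolmogorov–Smirnov) before triggering a costly retrain.

```python
import statistics

def drifted(baseline, new_batch, threshold=3.0):
    # How many baseline standard deviations has the batch mean moved?
    mu = statistics.fmean(baseline)
    sigma = statistics.pstdev(baseline)
    z = abs(statistics.fmean(new_batch) - mu) / sigma
    return z > threshold

baseline = [10.0, 11.0, 9.0, 10.5, 9.5]   # feature values at training time
stable   = [10.2, 9.8, 10.1]              # fresh batch, similar distribution
shifted  = [14.0, 15.0, 14.5]             # fresh batch, clearly moved

print(drifted(baseline, stable))   # -> False: keep serving
print(drifted(baseline, shifted))  # -> True:  trigger retraining
```

Wiring a check like this into the serving pipeline is what lets retraining run automatically only when it is actually needed, instead of on a blind schedule.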
In practice, the core of ML model retraining lies in hyper-parameter tuning. Therefore, it pays to select and set up an efficient hyper-parameter tuning algorithm: it lets us reach high-quality models with less compute time and cost, which is a primary goal for ML applications at scale.
In terms of hyper-parameter search methodologies, there are a few options: random search, grid search, and Bayesian optimization. From my experience with ML applications at scale, Bayesian optimization outperforms grid search and random search; the larger the dataset and parameter grid, the higher the potential efficiency gain.
This shouldn’t be surprising because Bayesian optimization is basically a smart search strategy, where the next hyper-parameter is selected in an informed way. It helps reduce the searching time for the optimal hyper-parameter combinations.
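For contrast, here is the random-search baseline that Bayesian optimization improves upon. A Bayesian optimizer would replace the uniform draws with proposals informed by a surrogate model fit to past (parameters, score) trials; libraries such as Optuna or scikit-optimize implement this. The objective below is a toy stand-in for "train a model, return its validation score."

```python
import random

def validation_score(lr, reg):
    # Toy objective with its optimum at lr=0.1, reg=1.0; in reality this
    # would be an expensive training-plus-validation run.
    return -((lr - 0.1) ** 2 + (reg - 1.0) ** 2)

def random_search(n_trials, seed=0):
    # Uninformed search: every trial is drawn independently at random.
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = (rng.uniform(0.0, 1.0), rng.uniform(0.0, 2.0))
        score = validation_score(*params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

params, score = random_search(50)
print(params, score)
```

Because each draw ignores everything learned from earlier trials, random search wastes evaluations in unpromising regions; picking the next trial based on past results is exactly the efficiency gain Bayesian optimization delivers.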
- Features are the hardest component to get right: feature stores provide feature storage, serving, and management for ML scalability.
- Data science programming languages can be slow, and choosing the right processors is critical: GPUs and TPUs are better choices than CPUs for parallel processing and complex matrix manipulation.
- Analysis data is too large and/or ML algorithms are too complicated for a single machine: distributed ML works effectively by leveraging multiple workers.
- Framework versions and dependency problems complicate ML deployment: Docker containers keep ML applications encapsulated and consistent across deployment environments for easy scalability.
- A large ML application is never a one-time deal: Bayesian optimization offers a computationally efficient way to scale up hyper-parameter tuning.
As a bonus tip, scaling well-performing ML applications across an entire organization can never be achieved by working in silos; it requires a connected, collaborative effort. Involvement and communication among data science, data engineering, data architecture, DevOps, and all other relevant teams is crucial for identifying potential risks across the stages of an ML application. It’s equally important to define roles and responsibilities upfront to avoid duplicated workloads across the ML pipelines.
Scaling up ML applications is not easy! Hopefully, my experience and lessons in this article will make your journey to large-scale ML applications a little easier. Thank you for reading!