What is machine learning and its life cycle about? You’ll get a different answer from different people.
- Programmers might say that it’s about programming with Python, and sophisticated mathematical algorithms.
- Business stakeholders usually associate machine learning with data, and a dash of mystery.
- Machine learning engineers tend to talk about model training and data wrangling.
So who is right? Everyone.
Machine learning is about data – no lie there. There’s no machine learning without a decent amount of data for the machine to learn from. The amount of available data is growing exponentially, which makes machine learning development easier than ever.
The connection between machine learning and algorithms is also on point. Indeed, there are complex mathematical methods that enable machines to learn. No math – no machine learning.
Lastly, model training and data preparation are indeed the core of every ML project. Machine learning engineers spend a substantial amount of time training models and preparing datasets. That’s why it’s the first thing ML engineers think of.
Machine learning is about development, manipulating data, and modeling. All of these separate parts together form a machine learning project life cycle, and that’s exactly what we’re going to talk about in this article.
High-level view of the ML life cycle
The life cycle of a machine learning project can be represented as a multi-component flow, where each consecutive step affects the rest of the flow. Let’s look at the steps in a flow on a very high level:
- Problem understanding (aka business understanding).
- Data collection.
- Data preparation (aka data wrangling).
- Data annotation.
- Model development, training and evaluation.
- Model deployment and maintenance in the production environment.
As you can see, the entire cycle consists of 6 consecutive steps. Each step is unique, with its own nature. These differences lead to variations in resources, time and team members needed to complete each step. Let’s have a detailed look at each component in the life cycle, and see what it’s all about.
ML life cycle in detail
Step 1: Problem understanding
Each project starts with a problem that you need to solve. Ideally, a clear problem definition should be numerically described. Numbers not only provide an ability to know where your starting point is, but also let you track the effect from the changes later on.
For example, the company where I work has calculations that show how much each manual operation costs the business. This approach helps us stratify our operations, and prioritize them based on how much we need to spend.
Our management recently kicked off a new machine learning project, aiming to bring automation to a particular manual operation that’s currently on top of our spending list. The team has also done research, benchmarking the costs of this operation to our competitors.
The result turned out disappointing: similar manual operations are up to 20% less expensive for other companies in our industry. In order to successfully compete in the market, we must drive our costs down. That’s why we launched the automation project.
Is this all there is to our problem? Not quite. Knowing the costs doesn’t mean that we can hand this problem to our machine learning team and expect them to fix it.
So far, we’ve only defined the problem in business terms. Before any machine learning happens, we need to move from monetary units and switch to other KPIs that our machine learning team can understand.
To do that, our management figured out that if we want to decrease the costs for a given manual operation by 20%, we should decrease the share of manually processed operations from 100% to at most 70%. This means that at least 30% of all operations should be processed automatically. Knowing that helps us narrow down the scope of the project: we only need to target a portion of the problem, not the whole problem.
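The step from a cost target to an automation target can be checked with a few lines of arithmetic. The sketch below is only an illustration, not our management's actual calculation: it assumes that total cost scales linearly with the share of operations in each mode, and that an automated operation costs one third of a manual one (a hypothetical ratio, chosen because it reproduces the 20% and 30% figures).

```python
def required_automation_share(target_cost_reduction: float, auto_cost_ratio: float) -> float:
    """Share of operations to automate to reach a relative cost reduction.

    Assumed cost model (hypothetical): with the cost of one manual operation
    normalised to 1, total cost = manual_share * 1 + auto_share * auto_cost_ratio.
    The reduction is then auto_share * (1 - auto_cost_ratio).
    """
    if not 0 <= auto_cost_ratio < 1:
        raise ValueError("automation must be cheaper than manual processing")
    share = target_cost_reduction / (1 - auto_cost_ratio)
    if share > 1:
        raise ValueError("target is unreachable even with full automation")
    return share


# Assumed inputs: 20% cost reduction, automated operation costs 1/3 of a manual one.
share = required_automation_share(target_cost_reduction=0.20, auto_cost_ratio=1 / 3)
print(f"Automate about {share:.0%} of operations")
```

With a different cost ratio the required share changes, which is why it pays to write the assumption down before handing a KPI to the ML team.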
Next, the manual operation we wanted to target was decomposed into pieces. Knowing how much each piece costs in terms of time (and money), the team was able to come up with a list of proposals for the tasks that might be automated.
After this list was discussed with the machine learning team, a few tasks were picked that can be solved via supervised machine learning algorithms if proper data is available.
Finally, the problem understanding is complete: each team in the company knows what they’re targeting and why. The project can begin.
Step 2: Data collection
Data is power. When the problem is clear, and an appropriate machine learning approach is established, it’s time to collect data.
Data can come from multiple sources. You might have an internal database that can be queried for relevant data. You can ask data engineers to extract the data for you, use existing services like Amazon Mechanical Turk (MTurk), or do it yourself.
Others receive data from their clients. This is typically the case when you work on a client’s problem side-by-side. The client is interested in the end result, and is willing to share data assets.
Another option to consider is buying data from third-party providers. Nielsen Media Research is a good example. It focuses on the FMCG (fast moving consumer goods) market. They do plenty of research, collecting data from different market populations. Companies that sell fast moving consumer goods are always learning about their customers and their preferences, in order to ride emerging trends into profitability. Third-party providers like NMR can be a great source of valuable data.
There are also open-source datasets. They’re especially handy if you work on a general problem that many businesses and industries might also have. There’s a good chance that the dataset you need is already somewhere on the web. Some of the datasets come from government organizations, some from public companies and universities.
What’s even cooler, public datasets usually come along with annotations (when applicable), so you and your team can avoid doing the manual operations that take a significant amount of project time and costs. Consider these articles as a guide that will help you find the right publicly available dataset for your project:
- Best Public Datasets for Machine Learning and Data Science: Sources and Advice on the Choice.
- Where to Find the Best Machine Learning Datasets.
- 25 Excellent Machine Learning Open Datasets.
Your goal is to collect as much relevant data as you can. For tabular data, this usually implies getting data for a wide timespan. Remember: the more relevant samples you have, the better your future model is likely to be.
Later in the life cycle, you’ll go through the data preparation step, which might remarkably reduce the number of samples in your dataset (I’ll explain why in a bit). That’s why it’s crucially important now, at the very beginning of the project life cycle, to accumulate as much data as you can.
If you haven’t collected enough, you have two options to go with: data augmentation and synthetic data.
Data augmentation will introduce extra variations to the existing dataset, making the model better at generalization. It doesn’t really add more samples, it just manipulates the current data to make the most out of it.
From personal experience, I can say that you should carefully consider the types of data augmentation that you apply. You only need to look for augmentation that reflects the real production environment the model will be used in. There’s no need to “teach” the model to be prepared for the cases that you surely know will never happen in real life. Later in this article, we’ll cover exploratory data analysis (EDA), which can reveal what kind of data you work with and what type of augmentation is appropriate.
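To make this concrete, here is a minimal augmentation sketch for images using only NumPy. The two transformations (horizontal flip and a brightness change of up to 20%) and the `augment` helper are assumptions chosen for illustration; pick the ones that match what your model will actually see in production.

```python
import numpy as np


def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly flipped, brightness-shifted copy of an HxWxC uint8 image."""
    out = image.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip
    out = out * rng.uniform(0.8, 1.2)  # scale brightness by up to +/-20%
    return np.clip(out, 0, 255).astype(np.uint8)


rng = np.random.default_rng(seed=0)
# Random pixels stand in for a real photo here.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = augment(image, rng)
print(image.shape, augmented.shape)
```

If, say, your production camera is fixed and never mirrors the scene, the flip should go: it would teach the model a case that never happens.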
Synthetic datasets, on the other hand, are new samples that can be used as inputs to your model. This is completely new data that you can artificially generate using either unsupervised deep learning (e.g. Generative Adversarial Networks), or libraries that work with images (e.g. in Python you can think of OpenCV or PIL).
Generative Adversarial Networks (GANs) generate new examples from existing ones. As a great example, we can refer to the computer vision industry, where engineers use this architecture type to create new unique images from existing, usually small, datasets. I personally can say that images generated by GANs are quite good in quality, and are quite useful for annotation (step 4 of the ML project life cycle) and further neural net training (step 5 of the life cycle).
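A GAN is too large to show here, but a much simpler way to see what “synthetic” means is to render samples procedurally. The sketch below is a toy example under my own assumptions (a bright rectangle on a dark noisy background, NumPy only, a hypothetical `make_synthetic_sample` helper); it is not a substitute for a GAN, and with this particular approach the bounding box label is known at generation time.

```python
import numpy as np


def make_synthetic_sample(rng: np.random.Generator, size: int = 64):
    """Draw one bright rectangle on a dark noisy background.

    Returns the image and the rectangle's box as (x0, y0, x1, y1).
    """
    image = rng.integers(0, 60, size=(size, size), dtype=np.uint8)  # background noise
    w, h = rng.integers(8, 24, size=2)  # rectangle width and height
    x0 = rng.integers(0, size - w)
    y0 = rng.integers(0, size - h)
    image[y0:y0 + h, x0:x0 + w] = 255
    return image, (int(x0), int(y0), int(x0 + w), int(y0 + h))


rng = np.random.default_rng(seed=42)
dataset = [make_synthetic_sample(rng) for _ in range(100)]
print(len(dataset), dataset[0][1])
```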
Step 3: Data preparation
Collected data is messy. There are many problems that machine learning engineers face when dealing with raw data. Here are the most common issues:
- Relevant data should be filtered. Irrelevant data should be cleaned up;
- Noisy, misleading and erroneous samples should be identified and removed;
- Outliers should be recognized and eliminated;
- Missing values should be spotted and either removed or imputed by proper methods;
- Data should be converted to proper formats.
As you can see, there are multiple issues that a machine learning engineer can face when dealing with raw data. Each dataset is unique in terms of the problems it brings. There’s no rule of thumb on how to approach data preprocessing. This process is creative and multifaceted.
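As an illustration of a few items from the list above, here is a small pandas sketch on a made-up table. The column names, the values, and the choice of the 1.5 × IQR rule for outliers are all assumptions; your dataset may call for different rules.

```python
import pandas as pd

# Made-up raw data: prices stored as text, one duplicate row, one obvious outlier.
raw = pd.DataFrame({
    "price": ["10.5", "11.0", "9.8", "10.2", "950.0", "10.9", "10.5"],
    "category": ["a", "b", "a", "b", "a", "b", "a"],
})

clean = raw.drop_duplicates().copy()
# Convert to a proper numeric format; unparsable strings would become NaN.
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")

# Keep only rows within 1.5 * IQR of the quartiles (one common outlier rule).
q1, q3 = clean["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range].reset_index(drop=True)

print(len(raw), "->", len(clean), "rows")
```

Note how even this tiny example drops rows: that is the reason to collect generously in step 2.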
Let’s consider missing values. It’s quite a common issue that most of the datasets have. ML engineers can simply drop these values and only work with the valid records in the dataset.
Alternatively, you can go with imputation and fill in the missing values (NaNs) with estimates. Looking for a rule of thumb again? Unfortunately, there is none here as well. Imputing can be done in multiple ways, based on the criteria you select. Mathematical algorithms for imputing also differ, and again you have multiple options to consider.
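Both options take a line or two in pandas. In this sketch the table is made up, and imputing with the median (numeric column) and the most frequent value (categorical column) is just one possible choice among many.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Oslo", "Rome", None, "Rome"],
})

# Option 1: drop every record that has at least one missing value.
dropped = df.dropna()

# Option 2: impute. Median for the numeric column, most frequent value for the categorical one.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(len(df), "rows originally,", len(dropped), "after dropping,", len(imputed), "after imputing")
```

Dropping halves this toy dataset, while imputing keeps every row at the price of introducing estimated values.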
Creating new features from existing ones is another option that machine learning engineers should consider. This process is called feature engineering. A closely related technique that I personally use quite often is dimensionality reduction via principal component analysis (PCA). PCA reduces the number of features in the dataset by projecting the data onto a smaller set of new features (principal components) that retain most of the variance in the original data.
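Here is a minimal PCA sketch with scikit-learn. The data is synthetic (10 correlated columns generated from 2 hidden factors, which is my assumption for the demo), and the 95% variance threshold is an arbitrary choice. Note that the resulting components are new features built as linear combinations of the original columns, not a subset of them.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))   # 2 hidden factors
mixing = rng.normal(size=(2, 10))    # spread them over 10 observed features
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# A float n_components keeps as many components as needed to explain that share of variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("explained variance:", pca.explained_variance_ratio_.sum())
```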
As soon as you finish the core part of data preparation, you might want to move to data preprocessing. Data preprocessing is a step that makes your data digestible for the neural net or algorithm that you’re training. It typically implies data normalization, standardization, and scaling.
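A short scikit-learn sketch of standardization and min-max normalization follows. The numbers are made up; the one design choice worth copying is that the scaler is fitted on the training data only and then reused for new data, so no information leaks from the test set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 500.0]])

# Standardization: zero mean and unit variance per feature.
standardizer = StandardScaler().fit(X_train)
X_train_std = standardizer.transform(X_train)
X_test_std = standardizer.transform(X_test)  # uses the training mean and std

# Normalization: rescale each feature to the [0, 1] range.
normalizer = MinMaxScaler().fit(X_train)
X_train_norm = normalizer.transform(X_train)

print(X_train_std)
print(X_train_norm)
```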
Generally, a data preparation step comes along with exploratory data analysis (EDA), which complements the overall preparation process. EDA helps engineers get familiar with the data they work with. It usually implies building some plots that help you look at the data from different perspectives. The intuition that engineers get from such an analysis helps later on in finding the right methods and tools for data preparation, model architecture / algorithm selection (step 5.2 of the ML project life cycle) and, of course, proper metrics choice (step 5.4 of the project life cycle).
Data preparation (aka data wrangling) is one of the most time consuming steps, yet one of the most vital ones, since it directly affects the quality of the data that will go to the net.
I usually finish the data preprocessing step by splitting the processed data into three separate subsets: data for training, validation and testing. For small datasets I allocate no more than 30% to validation and testing combined, leaving the rest of the data for training. For big datasets (more than 10k samples; I’m a computer vision engineer, so for me these are images) my personal practice has led me to the following split ratio:
- 10% for the tests,
- 10% for the validation during training,
- 80% for training.
The split strategy that I highly recommend is a stratified split, which keeps the proportion of classes the same in each subset. It’s important for proper performance estimation.
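The 80/10/10 stratified split can be sketched with two calls to scikit-learn’s `train_test_split`. The data below is synthetic (1,000 samples with a 90/10 class imbalance, an assumption for the demo); the `stratify` argument is what keeps the class proportions the same in every subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 900 + [1] * 100)  # imbalanced labels: 10% positives

# First cut: 80% for training, 20% held out.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# Second cut: split the held-out part in half for validation and testing.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0
)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(labels), "samples, positive share:", labels.mean())
```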
Step 4: Data annotation
In case your work is in the supervised learning domain, you will need a label for each sample in your dataset. The process of assigning labels to data samples is called data annotation or data labeling.
Data annotation is a manual operation, which is pretty time consuming and quite often performed by third parties. Machine learning engineers rarely do the labeling themselves. Given that you and your team will most probably not go through the annotation process yourselves, your main goal at this step is to design a comprehensive annotation guideline.
The guidelines will let annotators know what to do. That’s why it’s crucially important to come up with a well-rounded guide that covers the most essential aspects of the annotation job. Don’t forget about the edge cases that might occur during labeling. Your annotation team should be prepared for every possible scenario they might face. There’s no place for assumptions in the annotation job. Everything has to be clear and transparent. You should also assign a person who will assist the annotation team. If annotators can’t process a particular example, they should know who to contact with their questions.
There are some great examples that you can use to create your own annotation guidelines. Consider reading this research paper if you’re curious how annotation can impact the overall machine learning life cycle. Keep in mind that the quality of your data annotation directly affects how your end model will perform. Don’t skimp on the time you spend on the annotation guidelines. Make them easy to use and detailed enough. Examples are always helpful, and usually very welcome by annotators. The time spent on annotation guidelines is an investment in the quality of your end result.
Step 5: Modeling
By this step you should have a complete dataset that’s ready to be fed into the model. What’s next? It’s time to make a decision about the future model, and assemble it.
5.1. Try to solve your problem via transfer learning
Machine learning engineers rarely create models from scratch. They tend to reuse models that have already shown decent performance on big public datasets. These pre-trained models can be used for fine tuning. This approach is well established in deep learning. In computer vision, for example, fine tuning works well because the low-level features that CNNs extract are common to a broad range of tasks.
The places where you can find public pre-trained models are called model zoos. GitHub is a great source of pre-trained models with hundreds of possible options available. You just have to search for a model of the given architecture in the framework you work with.
For example, TensorFlow is a machine learning framework that provides an opportunity to import pre-trained models. Here’s an example of a zoo with detection models, created by TensorFlow. These models can be used for transfer learning in computer vision.
You should always look for a pre-trained model to start your project with. It will save you time and computational resources, and can even improve the quality of the end result.
5.2. Tune the model architecture accordingly
It’s important to note that a pre-trained model that we import needs to be modified to reflect the specific task we’re doing.
If you’re in computer vision, you probably remember that the number of classes that a classification model can identify depends on the top part of the model architecture. The last dense layer should have a number of units that’s equal to the number of classes you want to distinguish. Your job is to prepare a final model architecture design that’s suitable for your goals.
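Here is how importing a pre-trained model and replacing its top might look in TensorFlow/Keras. Treat it as a sketch under assumptions: MobileNetV2, the 224x224 input size, and the `build_classifier` helper are my choices for illustration, not a recommendation for your task. Passing `weights="imagenet"` downloads the pre-trained weights on first use.

```python
import tensorflow as tf


def build_classifier(num_classes: int, weights="imagenet") -> tf.keras.Model:
    """Pre-trained MobileNetV2 backbone with a new dense head for `num_classes`."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3),
        include_top=False,   # drop the original 1000-class ImageNet head
        weights=weights,
        pooling="avg",       # global average pooling gives one feature vector per image
    )
    base.trainable = False   # freeze the backbone; only the new head is trained at first

    inputs = tf.keras.Input(shape=(224, 224, 3))
    features = base(inputs, training=False)
    # The last dense layer has as many units as there are classes to distinguish.
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
    return tf.keras.Model(inputs, outputs)
```

A typical call would be `model = build_classifier(num_classes=5)`, followed by `model.compile(...)` and `model.fit(...)` on your own dataset. Once the new head has converged, you can unfreeze some of the backbone layers and continue training with a small learning rate.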
5.3. Experiment a lot
Machine learning engineers tend to experiment quite a lot. We love playing around with multiple model configurations, architectures and parameters. You probably won’t accept the baseline result you got and move it to production, because the baseline is rarely the best possible outcome. An iterative training process to find the best model configuration is a common practice among machine learning engineers.
At this point, you should give a shot to multiple alternative hypotheses that can potentially work for a task you have. To narrow down the list of possible options, you might consider using the hyperparameter tuning methods that most ML frameworks provide. These methods estimate performance for multiple configurations, compare them, and let you know about the top performing ones. You just have to specify values and parameters to be sampled.
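As one example of such a method, scikit-learn provides `GridSearchCV`, which tries every combination from a grid you specify and scores each with cross-validation. The dataset, the model, and the grid values below are assumptions for the sake of a runnable sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data stands in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # values to be sampled
    scoring="f1",
    cv=5,  # 5-fold cross-validation for every configuration
)
search.fit(X, y)

print("best configuration:", search.best_params_)
print("cross-validated F1:", round(search.best_score_, 3))
```

For larger search spaces, randomized search (`RandomizedSearchCV` in the same module) is usually cheaper than an exhaustive grid.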
Experiments are in the nature of what we do. If your computation resources aren’t limited, you should definitely take advantage of them. The outcomes that you might get can be quite unexpected. Who knows, maybe a new state-of-the-art model configuration will come from one of your experiments.
5.4. Evaluate properly
Evaluation always goes in conjunction with doing experiments. You need to know how each model behaves in order to select the top performing one. To compare models, a set of metrics needs to be defined.
Depending on the problem you’re working on, your set of metrics will be different. For regression problems, for example, we usually look at MSE or MAE. To evaluate a classification model, on the other hand, accuracy might be a good choice for a balanced dataset. Imbalanced sets require more sophisticated metrics. F1 score is a good metric for such cases.
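A tiny made-up example shows why accuracy can mislead on an imbalanced set. The “model” here predicts the majority class for every sample, and the regression numbers are likewise invented for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
)

# Classification: 95 negatives, 5 positives, and a model that always says "negative".
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
accuracy = accuracy_score(y_true, y_pred)        # 95 of 100 correct
f1 = f1_score(y_true, y_pred, zero_division=0)   # no positive was ever found

# Regression: errors are 1, 0 and 2.
targets = [3.0, 5.0, 2.0]
predictions = [2.0, 5.0, 4.0]
mae = mean_absolute_error(targets, predictions)  # (1 + 0 + 2) / 3
mse = mean_squared_error(targets, predictions)   # (1 + 0 + 4) / 3

print(f"accuracy={accuracy:.2f}, f1={f1:.2f}, mae={mae:.2f}, mse={mse:.2f}")
```

The accuracy looks great at 0.95, while the F1 score of 0 reveals that the model never detects the class we presumably care about.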
Evaluation during training is performed on a separate validation dataset. It tracks how well our model generalizes and helps detect overfitting early.
It’s always good practice to visualize model progress during the training job. TensorBoard is the first and most basic option to consider. Alternatively, neptune.ai is a more advanced tool that visualizes model performance over time, and also does experiment tracking. Having the right tool is essential. Take your time to find an experiment tracking tool that fits your particular needs. You will save a ton of time and improve your overall workflow when you get one.
Step 6: Model deployment
Excellent! You’ve got a brilliant model, ready to go to production. Now, engineers deploy the trained model and make it available for external inference requests.
This is the last step in the machine learning life cycle, but the job is far from over. We can’t just relax and wait for a new project.
Deployed models need monitoring. You need to track deployed model performance, to make sure it continues to do the job with the quality that the business requires. We all know about the negative effects that might appear over time: model degradation is one of the most common ones.
Another good practice is to collect the samples that the model processed incorrectly, figure out the root cause of each failure, and use these samples for retraining to make the model more robust. Such continuous small-scale research will help you better understand possible edge cases and other unexpected occurrences that your current model isn’t prepared for.
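Both practices, tracking quality and collecting failed samples, can be sketched in plain Python. The `ProductionMonitor` class, the 0.9 accuracy threshold, and the feedback records are all hypothetical; the sketch also assumes that true labels for production samples eventually become available.

```python
from dataclasses import dataclass, field


@dataclass
class ProductionMonitor:
    """Tracks accuracy on labelled production feedback and keeps the failed samples."""

    min_accuracy: float
    total: int = 0
    correct: int = 0
    failures: list = field(default_factory=list)

    def record(self, sample, predicted, actual) -> None:
        self.total += 1
        if predicted == actual:
            self.correct += 1
        else:
            # Keep the sample for root cause analysis and future retraining.
            self.failures.append((sample, predicted, actual))

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 1.0

    def needs_attention(self) -> bool:
        return self.accuracy < self.min_accuracy


monitor = ProductionMonitor(min_accuracy=0.9)
feedback = [  # (sample id, model prediction, true label), all made up
    ("img_1", "cat", "cat"),
    ("img_2", "dog", "cat"),
    ("img_3", "dog", "dog"),
    ("img_4", "cat", "dog"),
]
for sample, predicted, actual in feedback:
    monitor.record(sample, predicted, actual)

print(monitor.accuracy, monitor.needs_attention(), len(monitor.failures))
```

In a real system you would compute the metric over a sliding time window rather than over all history, so that recent degradation isn’t hidden by older, better results.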
By now you should have a solid understanding of the entire machine learning project life cycle. Let me highlight again that each consecutive step in a cycle might drastically affect the following steps, both in a positive and negative way. It’s essential that you go through each step carefully.
My personal practice has shown that step #2 (data collection), step #3 (data preparation) and step #4 (data annotation) are the ones that require the most time.
The quality of the data that goes into your model is a key driver of a good model. Don’t neglect these steps and always invest enough time and resources into them.