ML and data science are evolving rapidly. In the past few years, there has been a lot of progress in integrating ML with different technologies and business workflows.
If you’re working on a business ML project, there is a lot you might do before and after calling .fit() and .predict(). In this article, we’ll discuss the different aspects of an end-to-end project and how a few simple steps can increase the chances of your project’s success.
When to use ML?
Before creating a model, it’s a good idea to ask ourselves:
“Can this problem be solved without ML?”
Meaning, can a rule-based system work well enough to solve the problem?
A rule-based approach has unique characteristics. Rules are transparent and easy to debug. You always know why a prediction went wrong. This information comes in handy where interpretation and transparency of decisions are essential.
However, rule-based systems are hard to maintain. Writing rules can seem simple initially, but as the product develops, the rule set gets messy. It’s crucial to think about the maintenance burden of rules; an ML system, by contrast, can simply be retrained on a new dataset.
For example, if you’re building invoice parsing software, initially you might focus on specific invoice templates. It’s easy to write rules there because the document structure is almost consistent. However, once you grow and process more invoices, an ML-based approach makes more sense because it scales easily. You just need more data and/or a more complex model, instead of more engineers to write more rules.
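To make the contrast concrete, here is a minimal sketch of what a rule-based extractor for a single invoice template might look like (the field names and regex patterns are hypothetical). Every new template tends to need another hand-written block like this, which is exactly where an ML-based extractor starts to pay off.

```python
import re

# Hypothetical rules for ONE specific invoice template.
# Every new template needs its own set of patterns like these.
TEMPLATE_A_RULES = {
    "invoice_number": re.compile(r"Invoice\s*#\s*([\w-]+)"),
    "total_amount":   re.compile(r"Total\s*Due:\s*\$?([\d,]+\.\d{2})"),
    "due_date":       re.compile(r"Due\s*Date:\s*(\d{2}/\d{2}/\d{4})"),
}

def parse_invoice(text: str, rules: dict) -> dict:
    """Apply each regex rule and collect whatever fields it can find."""
    fields = {}
    for name, pattern in rules.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields

sample = "Invoice # INV-1042\nTotal Due: $1,250.00\nDue Date: 03/15/2024"
print(parse_invoice(sample, TEMPLATE_A_RULES))
# {'invoice_number': 'INV-1042', 'total_amount': '1,250.00', 'due_date': '03/15/2024'}
```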
Checklist before starting an ML project
Before you even write your first line of code, you should have a good idea about how your progress will look. Here’s a checklist that could help:
- What is the goal of your ML project? Defining clear goals for the project helps focus. For example, you could be building an ML model to automate credit card approval. The purpose should be as specific and transparent as possible.
- Do you have the data? I can’t stress this enough! Spend more time gathering high-quality data than implementing state-of-the-art models. In practice, whenever your model performs poorly, it’s probably due to a data issue. Your company might have no data whatsoever (as with some early-stage startups), in which case you fall back on third-party ML solutions until you acquire the data and can train your own models. Even when you have the data, it might not be annotated, and annotation takes a while.
- How will you measure the model performance? Of course, while training your model, you will keep track of a metric like accuracy, precision, or any other metric of interest. However, remember that these numbers are not the ultimate goal; your business metrics are derived from the project objective. For example, for the Netflix recommendation engine, a business metric might be the time spent watching per user, or the number of visits where a user watched nothing and closed the site. In short, there are two levels of metrics: model metrics and business metrics.
- Is the infrastructure really in place? Machine learning is 10% machine learning and 90% engineering.
As the figure above shows, actual ML code is just a tiny part of the whole system. Before the ML code, we need to think about data processes. It’s not just about collecting data and storing it somewhere; the system has to be highly fault-tolerant. For example, if you’re using a cloud service to collect data and the service fails, there should be a fallback mechanism to avoid data loss.
Also, ML models need frequent updates depending on the use case. It’s not good practice to have developers spend a lot of time on each update, so it’s beneficial to have systems and tools that save time and keep the process transparent. One such tool is a good model registry that lets you store and manage multiple trained models. There are open-source model registries like MLflow and ModelDB; for more sophisticated and collaborative solutions, check out neptune.ai. A minimal MLflow example is sketched below.
Learn more
Read how you can version, store, organize, and query models, and model development metadata with Neptune.
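As a concrete illustration of the registry workflow, here is a minimal sketch using MLflow’s Python API (the model, metric, and registry names are made up; exact arguments vary a bit across MLflow versions, and registering a model requires a registry-enabled tracking backend):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a toy model so there is something to register.
X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

with mlflow.start_run():
    # Keep the evaluation metric next to the model artifact.
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Registering under a name lets you version the model and promote it later.
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="credit-approval-classifier",  # hypothetical name
    )
```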
From an MLOps perspective, it is a good practice to containerize your ML inference code using Docker and deploy it using Kubernetes. Containerizing applications allows you to worry less about setting up the right environment and enables you to run your application anywhere in the cloud, consistently.
- Deployment requirements. It’s often a good idea to think about the deployment strategies in advance. Is a cloud deployment preferable, or an edge deployment? What are the latency constraints? Thinking about these questions often helps decide how large or complex your model should be.
A poor deployment strategy can degrade the user experience. For example, if you’re generating restaurant recommendations for a user, doing it on the fly as soon as they open the app can take too long, and the user might just quit the app. In such cases, you could generate recommendations overnight, when the user is inactive, and cache them on their device.
- Is explainability required? The general goal behind any model is to get the best metrics, and this usually comes at the cost of explainability: the more complex the model, the less we know about what’s going on inside it. However, in some cases we care about why the model made a particular prediction. For example, in automated loan application systems, an applicant will want to know why their application was rejected. In such cases, a simpler model with interpretable parameters might be preferred.
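For instance, with a logistic regression over a handful of understandable features, the influence of each feature can be read straight off its coefficients. A minimal sketch, with made-up features and data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical loan-application features and approval labels.
feature_names = ["income", "debt_to_income", "credit_history_years", "num_defaults"]
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] - X[:, 3] + rng.normal(scale=0.5, size=1000)) > 0

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)

# Each coefficient shows how a feature pushes the decision,
# which is much easier to explain to an applicant than a deep net.
for name, coef in zip(feature_names, clf[-1].coef_[0]):
    print(f"{name:>22}: {coef:+.2f}")
```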
Checklist after training an ML model
You thought about all the questions in the above sections and trained a model. So now what? Do you just deploy it?
Not so fast. Even if the data you trained the model on was carefully curated, there are ways your model can shortcut its way to a high metric. For example, consider an image classifier that predicts whether an image shows a panda or a polar bear. Suppose you trained a model with near-perfect accuracy.
Model prediction (image below) – Panda
Model prediction (image below) – Polar bear
Model prediction (image below) – Polar bear
The first two predictions make sense. What about the third one? It should be a panda.
After scrutinising the model, you find that it is not really a panda vs. polar bear classifier, but a snow vs. grass classifier. It has mostly seen pandas in jungles with green surroundings and polar bears in the snow.
Similar shortcuts and data leaks are common in NLP. If you’re building a sentiment classifier, make sure that the model is learning from words that express sentiment, and not from stopwords or other irrelevant tokens. Libraries like LIME and SHAP help explain model predictions.
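For example, here is a minimal sketch of probing a sentiment classifier with LIME’s text explainer. The training data and pipeline below are stand-ins; any model that exposes predict_proba over raw text works the same way.

```python
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in training set; in practice, use your real sentiment data.
texts = ["great movie, loved it", "terrible plot, waste of time",
         "wonderful acting", "boring and awful"]
labels = [1, 0, 1, 0]

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "the movie was great but the ending felt boring",
    pipeline.predict_proba,   # LIME perturbs the text and queries the model
    num_features=5,
)
# Check that sentiment-bearing words, not stopwords, carry the weight.
print(explanation.as_list())
```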
But how do you assess an unsupervised model? Consider a sentence embedding model. Its goal is to produce a high similarity score for semantically similar sentence pairs and a low score for dissimilar pairs, learning all of this in an unsupervised or self-supervised setting. In such cases, it’s good to have a custom test suite: hand-curated pairs of positive (similar) and negative (dissimilar) examples that can be used as ‘unit tests’ for any model. Design multiple levels of tests. Easy ones check model sanity, for example:
(It’s good ice cream, Political leaders must unite, 0)
(It’s good ice cream, This dessert is awesome!, 1)
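A minimal sketch of such a test harness, assuming a hypothetical embed() function that maps a sentence to a vector (swap in your actual embedding model before calling run_sanity_tests):

```python
import numpy as np

def embed(sentence: str) -> np.ndarray:
    """Placeholder: replace with your sentence embedding model."""
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-curated (sentence_1, sentence_2, expected_label) triples:
# 1 = semantically similar, 0 = dissimilar.
SANITY_TESTS = [
    ("It's good ice cream", "Political leaders must unite", 0),
    ("It's good ice cream", "This dessert is awesome!", 1),
]

def run_sanity_tests(threshold: float = 0.5) -> None:
    for s1, s2, label in SANITY_TESTS:
        score = cosine(embed(s1), embed(s2))
        predicted = int(score >= threshold)
        assert predicted == label, f"Failed: ({s1!r}, {s2!r}) score={score:.2f}"
    print("All embedding sanity tests passed.")
```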
A/B testing
Once your model is ready and sane, it’s almost time to release it into the wild. Remember that your model was built and tested in a local, restricted environment, and things could go a little differently in production. Hence, it’s vital to do a test run first, and only then a full-fledged deployment.
In simple words, A/B testing assesses two variants of a system in a randomised setting. In our movie recommendation example, we’d randomly pick a few thousand users and deploy the new ML model only for them. After a few days, we compare the new model with the old one in terms of recommendation acceptance rate, total time spent, and so on. If the new model performs better, we roll it out to all users.
A good A/B test has many design considerations depending on the objective of the test.
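As a rough sketch of the mechanics, here is one common way to bucket users deterministically and compare acceptance rates with a two-proportion z-test. The group sizes and counts below are made up, and the 5% treatment share is just an example.

```python
import hashlib
from statsmodels.stats.proportion import proportions_ztest

def assign_group(user_id: str, treatment_share: float = 0.05) -> str:
    """Deterministically bucket a user into 'treatment' (new model) or 'control'."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"

# Hypothetical results after the test window:
# number of users who accepted a recommendation, and group sizes.
accepted = [620, 8_950]      # [treatment, control]
exposed  = [5_000, 95_000]

z_stat, p_value = proportions_ztest(count=accepted, nobs=exposed)
print(f"z={z_stat:.2f}, p={p_value:.4f}")
# A small p-value suggests the difference in acceptance rate is not just noise.
```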
Measuring the model performance in the wild
You have labels while training. But what about when the model is churning out predictions in a live environment? It’s likely that you won’t see labels there. So how are you supposed to measure the performance of a live model?
Before getting into that, why do we want to measure a model’s performance after deployment? Didn’t we test it on a well-sampled test set? Yes, we did. However, the training data distribution can be drastically different from the one the model sees in a live environment. For example, tweets generated before and during COVID might differ in vocabulary, sentiment, and topics. Hence, the model is bound to make more mistakes on data it hasn’t seen much of. This is called data drift.
There are a few proxy metrics you can track that show whether the data distribution has changed. One is the distribution of the input features themselves. Since a trained ML model is a deterministic function of its inputs, a shift in the feature distribution also shows up as a shift in the prediction distribution. A change in the predicted label distribution can therefore be an alarm: if the training labels had a 40-60 split and the model now predicts labels with a 20-80 split, you might want to inspect the model.
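As a sketch, you could compare the live feature (or prediction) distribution against a training-time reference with a simple statistical test, for example a two-sample Kolmogorov-Smirnov test from SciPy. The data and threshold below are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def check_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag drift if the two samples are unlikely to come from the same distribution."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha

# Hypothetical example: a numeric feature logged at training time vs. in production.
rng = np.random.default_rng(7)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
live_feature  = rng.normal(loc=0.4, scale=1.2, size=10_000)   # shifted distribution

if check_drift(train_feature, live_feature):
    print("Feature distribution drift detected: consider inspecting or retraining the model.")
```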
In some cases, it’s possible to get weak supervision. In chatbots, for example, after a few system-generated messages, the system can ask whether the conversation helped the user (asking for such validation too frequently harms the user experience). Similarly, in search or recommendations, a user clicking on a result can serve as implicit feedback to the system. Such systems can be retrained online.
After you detect degradation in model quality, the next step would be to restore its performance by retraining it on the new data. But there are some considerations before retraining:
- How frequently should you retrain the model? That depends on how much data shift you expect. Remember that retraining has a cost, especially for large models.
- Should you retrain only the last few layers of a deep learning model, or the entire model?
- How much of your old data and how much new data should you use for retraining?
These choices mainly depend on your use case; the sketch below shows one way to retrain only the final layers of a network.
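A minimal PyTorch sketch of freezing everything except the task head; the model structure here is hypothetical and only illustrates the pattern.

```python
import torch
import torch.nn as nn

# Hypothetical model: a pretrained "backbone" plus a small task head.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),   # backbone layers
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 2),                # task head
)

# Freeze everything, then unfreeze only the last layer for retraining.
for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

# The optimizer only receives the parameters that are still trainable.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```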
Read also
How to Track Machine Learning Model Metrics in Your Projects
The Ultimate Guide to Evaluation and Selection of Models in Machine Learning
Bias and fairness
Modern deep learning models are black boxes; we don’t know what they really learn. A model generally assigns more importance to whichever features help it minimise the loss. The problem arises when a model gives prominence to certain words or features that encode a systematic prejudice.
Consider word2vec, probably the most famous of the early word embedding models. Its word vectors capture language semantics that can be expressed with simple vector algebra. For example:
King - Man + Woman = Queen
The same model also shows this relationship:
Computer programmer - Man + Woman = Homemaker
Do you notice the bias?
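You can probe such analogies yourself with gensim’s pretrained word2vec vectors. A small sketch using gensim’s downloader (the pretrained vectors are a large download, and the exact neighbours you get will vary):

```python
import gensim.downloader as api

# Pretrained Google News word2vec vectors (large one-time download).
word_vectors = api.load("word2vec-google-news-300")

# king - man + woman ≈ ?
print(word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Probe for occupational gender bias the same way.
print(word_vectors.most_similar(
    positive=["computer_programmer", "woman"], negative=["man"], topn=3
))
```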
A while back, Google image search results for the term ‘nurse’ showed an unusually large number of images of women.
Another infamous example is Amazon’s ML-based recruiting system. Amazon designed an algorithm to pick the top 5 CVs for a job out of hundreds of applications. The system turned out to be biased against women, rejecting their applications more often than men’s.
Such systems damage a company’s reputation and can cause substantial financial loss. If a bank uses ML to accept or reject loan applications, the model could have a systematic bias against certain races. The bank then loses potential customers, along with the revenue they would bring.
Bias treatment strategies
Machine learning can be used to remove biases from machine learning models. Such methods fall into two categories: using ML to remove biased signals, and using ML to add signals that reduce the bias in the dataset. An ML model is said to be biased with respect to an additional variable z if including z in the dataset changes the model’s predictions.
There are algorithms designed to detect bias in ML models. Packages like IBM AI Fairness 360 provide an open-source implementation of these algorithms. Besides, interpretability methods like LIME and SHAP also help understand the decision-making process of an ML model.
Another package, FairML, uses a simple idea that works well for black-box models. A model takes a vector or matrix of features and predicts an outcome; if changing a particular feature changes the model’s output drastically and consistently, then that feature could be introducing bias. FairML assesses the relative significance of input features to detect biases in the input data. According to the authors of the package:
“FairML leverages model compression and four input ranking algorithms to quantify a model’s relative predictive dependence on its inputs.”
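Before reaching for a full toolkit, a quick back-of-the-envelope check is to compute disparate impact (the ratio of positive-outcome rates between groups) directly from your model’s predictions. The sketch below uses made-up data and the commonly cited ‘four-fifths rule’ threshold of 0.8.

```python
import numpy as np

def disparate_impact(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Ratio of positive-prediction rates: unprivileged group / privileged group."""
    rate_unpriv = y_pred[group == 0].mean()
    rate_priv = y_pred[group == 1].mean()
    return rate_unpriv / rate_priv

# Hypothetical loan-approval predictions (1 = approved) and a sensitive attribute.
rng = np.random.default_rng(3)
group = rng.integers(0, 2, size=2_000)                 # 0 = unprivileged, 1 = privileged
y_pred = (rng.random(2_000) < np.where(group == 1, 0.55, 0.35)).astype(int)

di = disparate_impact(y_pred, group)
print(f"Disparate impact: {di:.2f}")
if di < 0.8:
    print("Potential bias: approvals for the unprivileged group are disproportionately low.")
```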
More recently, deep learning approaches to de-biasing models have become popular. In one such method, to rid the classifier of bias from a sensitive variable z, the model is asked to predict the task label y as well as the variable z. However, during backpropagation, the gradient flowing back from the z-prediction head is negated before it reaches the shared layers. So instead of using that gradient to get better at predicting z, the shared representation learns to depend less on z.
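A minimal PyTorch sketch of the gradient reversal trick described above; the encoder and heads are hypothetical.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Hypothetical shared encoder with two heads: task label y and sensitive variable z.
encoder = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
y_head = nn.Linear(16, 2)   # trained normally
z_head = nn.Linear(16, 2)   # receives reversed gradients via grad_reverse

x = torch.randn(8, 32)
features = encoder(x)
y_logits = y_head(features)
z_logits = z_head(grad_reverse(features))   # the encoder learns to hide info about z
```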
So, there are solutions to these problems, and you should look into them to make sure your model is fair.
Conclusion
In summary, developing an ML solution for a business use case is a multi-variable problem.
What we discussed in this article is the tip of the iceberg. The challenges of your project will vary depending on the problem you’re solving. Take some time to think about your requirements, and then proceed to execution. Planning in advance may seem time-consuming at the beginning, but it will save you much more time in the future.