How to Test a Recommender System

Dhruvil Karani
14th November, 2022

Recommender systems fundamentally address the question – What do people want?

Although this is a broad question, in the context of a consumer application like e-commerce, the answer could be to serve the best products in terms of price and quality for a consumer. For a news aggregator website, it could be to show reliable and relevant content.

In a case where a user would have to look through thousands or millions of items to find what they are looking for, a recommendation engine is indispensable. According to an article from lighthouselabs.ca on Netflix’s use of Data Science:

The engine filters over 3,000 titles at a time using 1,300 recommendation clusters based on user preferences. It is so accurate that personalised recommendations from the engine drive 80% of Netflix viewer activity.

However, building and evaluating a recommender system differs considerably from building a single ML model in terms of design decisions, engineering, and metrics. In this article, we will focus on testing a recommendation system. We will also talk about:

  • Types of recommender systems
  • Overview of the most popular model – collaborative filtering

Types of recommender systems

Recommender systems work on three main paradigms:

  1. Content-based similarity: the system retrieves items that are similar to the query content. For example, if you like a football video, it will show you another one. Alternatively, if you search for a blue T-shirt, it will show you more blue T-shirts. The match is based on the item content like image, description, title, etc.
  2. Wisdom of the crowd: modern recommender systems used in social media are based on this. If user A likes movies X, Y, and Z, and user B likes movies X and Z, then user B would probably like movie Y too. Instead of relying on item content, these recommendation models consider user preferences. These models are popular because they go beyond topic and content: they can, for instance, recommend a baseball video to a sports-loving user who just liked a football video.
  3. Session-based: session-based systems capture a user’s intent in a specific session and recommend items based on the session-level contextual information. For instance, if you are shopping for a new workstation and intend to buy a monitor, keyboard, mouse, chair, etc., you want the website to show you items that pertain to setting up a workstation in this session, even though you might have liked a certain book earlier.

The second and third require a lot of user-item interaction data. If that is not available, one might start with the first type of recommender system. Even when there is a lot of data for existing users, one might not have enough for a new user. This situation is called the cold-start problem in recommendation systems. In such cases, content-based recommendation systems can be a good proxy until one has enough interaction data for new users. 

Enough with the overview. Now let’s take a brief look at one of the popular recommender systems and see how we can test it.

Overview of collaborative filtering based models

Collaborative filtering is one of the most popular battle-tested recommendation models. The goal here is to train a vector representation of items and users such that the probability of a user U with a representation (embedding) Vu preferring an item I with representation (embedding) Vi is

Formula: probability that user U prefers item I, given Vu and Vi | Source: Author
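A common way to write this probability, assumed here for illustration rather than taken from the figure, is the sigmoid of the dot product of the two embeddings (with Vu and Vi written as V_u and V_i):

```latex
P(\text{U prefers I}) = \sigma\left(V_u \cdot V_i\right) = \frac{1}{1 + e^{-V_u \cdot V_i}}
```

Under this assumption, embeddings that point in similar directions give a probability close to 1, and embeddings that point in opposite directions give a probability close to 0.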

How does the model training take place?

For M unique users and N unique items in the dataset, we create embedding tables in which every user and every item gets a D-dimensional vector. This gives us D*(M+N) parameters to learn. Let’s say we are building this system for YouTube and want to predict whether a user will press the like button on a video. Our training data will contain billions of (userId, postId) pairs, one for every video (postId) that a user (userId) has liked.

Train/test split under the case of recommendation | Source

We initialise the embeddings randomly. Then, during training, we compute the predicted probability for each (userId, postId) pair and the cross-entropy loss against its label, which is 1 for a like. Doing this in batches over multiple epochs trains the user and item embeddings.

Model training in recommender systems | Source: Author
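The training loop described above can be sketched in plain Python. This is a minimal illustration, not the author's implementation: it assumes the probability is the sigmoid of the dot product of the two embeddings, and it pairs every observed like with one randomly drawn unseen item labelled 0, because a model trained only on label 1 has nothing to contrast against. The function name `train_embeddings` and all hyper-parameter values are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_embeddings(likes, num_users, num_items, dim=8, epochs=200, lr=0.1, seed=0):
    """Learn user/item embeddings with SGD on binary cross-entropy."""
    rng = random.Random(seed)
    # Random initialisation: D * (M + N) parameters in total.
    users = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_users)]
    items = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_items)]
    liked = set(likes)
    for _ in range(epochs):
        for u, i in likes:
            # Assumption: draw one item the user has not liked and label it 0.
            # Every user must have at least one unliked item for this to end.
            j = rng.randrange(num_items)
            while (u, j) in liked:
                j = rng.randrange(num_items)
            for item, label in ((i, 1.0), (j, 0.0)):
                p = sigmoid(dot(users[u], items[item]))
                grad = p - label  # derivative of cross-entropy w.r.t. the logit
                for d in range(dim):
                    grad_user = grad * items[item][d]
                    grad_item = grad * users[u][d]
                    users[u][d] -= lr * grad_user
                    items[item][d] -= lr * grad_item
    return users, items

# Toy data: users 0 and 1 like items 0 and 1; users 2 and 3 like items 2 and 3.
likes = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (3, 3)]
users, items = train_embeddings(likes, num_users=4, num_items=4)
score_liked = sigmoid(dot(users[0], items[0]))
score_other = sigmoid(dot(users[0], items[2]))
```

After training on this toy data, `score_liked` should end up higher than `score_other`, since user 0 liked item 0 but never item 2.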

The training-validation split happens at the user level. This means that X% of every user’s likes go into the training set and the remaining (100-X)% go into the validation set. X is usually 80-90%.
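A user-level split of this kind can be sketched as follows. The helper name `split_by_user` and the 80% default are illustrative assumptions. Note that with this simple rounding, a user with a single like ends up only in the validation set.

```python
import random
from collections import defaultdict

def split_by_user(likes, train_fraction=0.8, seed=0):
    """Split (userId, postId) pairs so each user keeps ~train_fraction of their likes in train."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user_id, post_id in likes:
        by_user[user_id].append(post_id)
    train, valid = [], []
    for user_id, posts in by_user.items():
        posts = posts[:]          # copy so the caller's data is not shuffled
        rng.shuffle(posts)
        cut = int(len(posts) * train_fraction)
        train += [(user_id, p) for p in posts[:cut]]
        valid += [(user_id, p) for p in posts[cut:]]
    return train, valid

likes = [("u1", i) for i in range(10)] + [("u2", i) for i in range(5)]
train, valid = split_by_user(likes)
```

Here `u1` contributes 8 likes to training and 2 to validation, and `u2` contributes 4 and 1.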

Recommender system: objective design

In the previous example, we trained a model to predict if a user would like a video on YouTube. The predicted variable was pretty straightforward and explicit. However, not all signals are explicit. For example, consider a predicted variable that captures whether or not a user watches at least 95% of a video’s length. If they do, we include the (userId, postId) pair in the dataset.

If we have a near-perfect model that predicts the probability of a >95% watch, we can say that we are recommending videos that the user likes, right?

Here is the catch – consider a one-minute video (V1) vs a thirty-minute video (V30). It takes 57 seconds to watch 95% of V1, and it takes 1710 seconds to watch 95% of V30. V1 could be a clickbait video that still gets watched to the end, while a user can like V30 and still watch just 1600 seconds of it. So does our definition ensure that the positive labels represent user preference?

Secondly, most platforms have multiple signals – like, share, download, clicks, etc. Which objective should one use to train the model? Usually, one is not enough. Let us say we train multiple models based on different objectives. We then have one score per model for each (userId, postId) pair. An aggregation formula combines these scores into a single number, which is used to create the final ranking.
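One simple aggregation formula is a weighted sum of the per-objective scores. The sketch below is only an illustration: the weights, objective names, and video ids are hypothetical, and real systems often tune such weights through online experiments.

```python
def aggregate_score(scores, weights):
    """Weighted sum of per-objective probabilities for one (userId, postId) pair."""
    return sum(weights[name] * scores[name] for name in weights)

# Hypothetical weights: a share is treated as a stronger signal than a like.
weights = {"like": 1.0, "share": 2.0, "download": 0.5}

# Hypothetical per-model scores for two candidate videos and one user.
candidates = {
    "video_a": {"like": 0.9, "share": 0.1, "download": 0.2},
    "video_b": {"like": 0.5, "share": 0.5, "download": 0.1},
}

ranking = sorted(
    candidates,
    key=lambda video: aggregate_score(candidates[video], weights),
    reverse=True,
)
```

With these numbers `video_b` ranks first even though `video_a` has the higher like probability, which shows how much the final ranking depends on the aggregation choice.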

The point is that if the training objective is not carefully designed, even a near-perfect model will not give good recommendations.

Evaluating recommender systems

Offline evaluation

Training a recommendation model offline on your local machine cannot give you certainty about its online performance. However, there are some metrics to analyze the expected model behavior.

ROC-AUC

The Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) on the Y-axis against the false positive rate (FPR) on the X-axis. For a binary classifier, we use a threshold above which an instance is predicted as positive, and negative otherwise. For a particular threshold,

TPR = % of total positives above the threshold = TP/(TP + FN)

FPR = % of total negatives above the threshold = FP/(FP + TN)

At threshold=0, all examples are classified as positives. Therefore, FN=0, since no example is classified as negative, and TPR=1. TN is also zero for the same reason. Therefore, FPR is also 1. This is the (1,1) point on the graph. 

At threshold=1, no example is predicted as a positive. Therefore TP, and FP both are 0, which represents (0, 0) on the graph. 

The curve is plotted by computing TPR and FPR for different thresholds in [0,1] and plotting them. The plotted curve looks like this:

AUC – ROC curve | Source

The area under the curve is at most 1. The diagonal x=y is the ROC curve of a classifier that assigns labels randomly, which corresponds to an area of 0.5.
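The construction above can be reproduced in a few lines of plain Python. This is a sketch on made-up labels and scores; it treats a score equal to the threshold as a positive prediction, and the function names are illustrative.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs for every distinct threshold, from (0, 0) to (1, 1)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
roc_auc = area_under_curve(points)
```

For this toy data the area is 8/9: of the nine positive-negative pairs, only one (the positive scored 0.6 against the negative scored 0.7) is ordered incorrectly.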

PR-AUC

Precision-Recall AUC or PR-AUC is similar to ROC-AUC, except on the Y-axis, we have precision, and on the X-axis, we have recall. Precision is the fraction of the model’s positive predictions that are correct. Recall, on the other hand, is the fraction of all existing positives that the model correctly classifies.

To understand PR curves better, consider a binary classifier. If we keep the classification threshold low, say 0.05, most examples are predicted as positives. Almost all the existing positives will be correctly classified as positives. Still, we will have many false positives, as many actual negatives are also classified as positives, which leads to high recall and low precision.

PR curve | Source

On the other hand, if we keep a very high threshold, most of the positive predictions made by the model will be correct, as the model is very conservative in what it calls a positive. However, we will miss out on many actual positives in pursuit of being correct all the time, which leads to high precision and low recall.

Notice that there is a trade-off between recall and precision, so the two cannot usually be optimised at the same time. The area under this curve is the PR-AUC.

For a perfect classifier, however, there exists a threshold where the class separation is crystal clear: all the positive examples are above this threshold, and all the negative examples are below it. In such cases, as with ROC-AUC, the PR-AUC reaches its maximum of 1.

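PR-AUC’s significant advantage over ROC-AUC is that it is less misleading when there is a class imbalance. When positives are rare, ROC-AUC can look high while PR-AUC stays low, because the FPR is computed over a very large pool of negatives and barely moves even when the number of false positives is large relative to the true positives.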
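The effect of class imbalance can be seen on a small synthetic example. The sketch below uses average precision as a step-wise summary of the PR curve (an assumption: it is a common stand-in for PR-AUC, not necessarily the exact area the article has in mind) and computes ROC-AUC as the fraction of positive-negative pairs ranked correctly.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision@k taken at the rank k of every positive."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits

# Synthetic data: 100 examples, only 2 positives, ranked 5th and 10th by the model.
labels = [1 if rank in (5, 10) else 0 for rank in range(1, 101)]
scores = [1.0 - rank / 100 for rank in range(1, 101)]

roc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

Here ROC-AUC is roughly 0.94 while average precision is only 0.2: the ranking looks excellent by one metric even though four out of the top five recommendations are wrong.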

Ranking metrics

Apart from class separation, we want to get some sense of the rank order of the scores. The goal of recommender systems is not just to pick out relevant items but also to rank them according to preference. According to an article by Forbes:

The first page of Google captures 71% of search traffic clicks and has been reported to be as high as 92% in recent years.  Second-page results are far from a close second coming in at below 6% of all website clicks.

It does not help if you pick out relevant items but do not order them. So how do we test if our model has ranking capabilities?

  • Normalized discounted cumulative gain (NDCG)

Imagine a series of ten recommendations made by your model to a user. You want to see the best order of recommendations to get maximum likes. Below is the response of the user to the ten videos:

1, 0, 0, 1, 0, 1, 1, 0, 1, 0 … (1)

The user liked the first, fourth, sixth, seventh and ninth recommendations. What could be the best case ordering here?

1, 1, 1, 1, 1, 0, 0, 0, 0, 0 … (2)

This means that had we recommended the first, fourth, sixth, seventh, and ninth videos first and then the rest, we would have achieved the best ranking. Note that any permutation among the first, fourth, sixth, seventh, and ninth would have yielded an equivalent ranking.

To calculate NDCG:

Normalised discounted cumulative gain (NDCG) | Source

rel_i denotes the relevance of item i – 0 or 1 in our case. p is the total number of items. For the lower ranks (lower i), the term under the sum carries more weight than for the higher ranks. IDCG_p takes only the relevant items and calculates the sum, which is the maximum DCG score one can achieve by ranking all the relevant items at the top (expression 2) and the irrelevant items at the bottom.

Notice that for irrelevant items, the numerator is 0 (2^0 - 1 = 0). DCG_p calculates the score over all p items (relevant and irrelevant) in the order in which our model would rank them after scoring (expression 1).

Note that NDCG lies between 0 and 1.
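As a worked example, the snippet below computes NDCG for expression (1). It assumes the commonly used form DCG_p = Σ (2^rel_i − 1) / log2(i + 1) for i = 1…p, which matches the numerator described above; the log2(i + 1) discount is an assumption, since the formula itself is not reproduced in the text.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with gain 2^rel - 1 and discount log2(rank + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )

def ndcg(relevances):
    """DCG of the given order divided by the DCG of the ideal order (IDCG)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

observed = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # expression (1)
best_case = sorted(observed, reverse=True)  # expression (2)

observed_ndcg = ndcg(observed)
best_case_ndcg = ndcg(best_case)
```

For expression (1) this gives an NDCG of about 0.82, while the best-case ordering in expression (2) scores exactly 1.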

  • Recall@k

As discussed above, recall is the fraction of positives captured by the model out of the total existing positives. For a set of ranked recommendations, consider a particular ranking at position k. The number of positives present in the positions 1 to k divided by the total number of positives gives us recall at k.

For many systems, getting all the relevant results is essential, even at the cost of a few irrelevant results. In such cases, recall@k gives us an idea about the coverage.

  • Precision@k

Similar to recall@k, precision@k computes the precision of the model at a rank k. This means it computes the number of positives present in positions 1 to k divided by k, the number of items recommended up to that position.

For cases where returning only correct results matters more than returning all of them, precision@k helps quantify it.
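Both metrics are straightforward to compute from a ranked list of 0/1 relevance labels. The sketch below reuses the ten responses from expression (1); the function names are illustrative.

```python
def precision_at_k(relevances, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(relevances[:k]) / k

def recall_at_k(relevances, k):
    """Fraction of all relevant items that appear in the top k."""
    total_relevant = sum(relevances)
    return sum(relevances[:k]) / total_relevant if total_relevant else 0.0

observed = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]  # user responses, in ranked order

p_at_4 = precision_at_k(observed, 4)
r_at_4 = recall_at_k(observed, 4)
```

At k=4, two of the four recommendations are relevant (precision 0.5), and they cover two of the five relevant items (recall 0.4).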

Deeper dive

Recommendation systems are notorious for being biased. In our example about building a recommendation model for YouTube, we may find that overall AUCs are good. However, when we analyze these metrics on different levels, for example, AUCs for longer vs shorter videos, we find that longer videos have poor metrics, which means that the model is not learning to recommend longer videos well.

A similar effect can happen for any property – geography, user demographics, topics. Knowing where your model is doing well and where it is not is helpful.

Addressing the bias

Recommender systems often push popular content more than less popular long-tail content, because popular content is more likely to be preferred by any random user. This allows the model to find a ‘quick-fix’ solution to minimise the loss. However, users have many unexplored interests and might like many not-so-popular things. Because such items don’t occur as frequently in the training dataset, their embeddings are not learned accurately, leading to a bias. Imagine, for example, that only a few pop artists get close to 90% of plays among the millions of artists on Spotify.

Recommender systems are trained in a loop. If the system recommends biased content to users, the following training will occur on these biased recommendations. Over time, the distribution skews towards popular items since feedback for such items is observed more than for the other items, so the bias compounds.

Why is addressing bias important? What’s wrong with recommending popular content? As mentioned earlier, this makes it hard to explore users’ other interests. In the short term, popular content may retain a user, but eventually, the user will find nothing novel on the app. Secondly, this makes it hard for new creators to gain traction on the app. New creators will have no incentive to create engaging, diverse content. Eventually, they may leave the app.

How do you measure bias?

One easy way is to look at the distribution of views: what percentage of views is captured by the top 1%, 5%, 10%, … of videos, and how often are these videos recommended to users compared with other videos? This 80-20 effect can be seen across topics (specific topics dominate the app), creators (few popular creators vs niche creators), etc. A machine learning model learns bias in the dataset, among other things. So if your dataset is biased, then chances are that your recommendation results will reflect it.
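A minimal sketch of this check is shown below. The view counts are made up for illustration, and `share_of_views` is a hypothetical helper name.

```python
def share_of_views(view_counts, top_percent):
    """Fraction of all views captured by the top `top_percent` % most viewed videos."""
    ranked = sorted(view_counts, reverse=True)
    top_n = max(1, len(ranked) * top_percent // 100)  # keep at least one video
    return sum(ranked[:top_n]) / sum(ranked)

# Hypothetical view counts for ten videos.
views = [9000, 500, 200, 100, 80, 50, 30, 20, 15, 5]

top_10_share = share_of_views(views, 10)
top_20_share = share_of_views(views, 20)
```

In this toy dataset a single video (the top 10%) captures 90% of all views. The same function can be applied to recommendation counts instead of views to see whether the model amplifies the skew.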

Often, the model learns certain biases based on a feature implicitly. For example, not long back, if you did a Google image search for the word ‘CEO’ the top results would be pictures of white males. Similarly, results for words like ‘nurse’ would mostly be females. However, the word CEO is gender-neutral. 

According to an article from washington.edu:

In some jobs, the discrepancies were pronounced, the study found. In a Google image search for CEO, 11 per cent of the people depicted were women, compared with 27 percent of U.S. CEOs who are women. Twenty-five per cent of people depicted in image search results for authors are women, compared with 56 per cent of actual U.S. authors.

By contrast, 64 percent of the telemarketers depicted in image search results were female, while that occupation is evenly split between men and women.

One common way to measure the bias caused by an attribute/feature is statistical parity. Simply put, it measures the difference of the results (probabilities) of a model given the protected attribute p (gender for example) vs the results without it. An unbiased model would have:

Formula: statistical parity condition for the protected attribute p

Having the extra information about p shouldn’t make a difference.
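In practice, statistical parity is often checked by comparing the rate of positive predictions across the values of the protected attribute, i.e. P(ŷ = 1 | p = a) against P(ŷ = 1 | p = b). The sketch below assumes binary predictions and this group-wise formulation, which is one common reading of the definition; the data is made up.

```python
def statistical_parity_difference(predictions, groups, group_a, group_b):
    """Difference in positive-prediction rate between two protected-attribute values."""
    def positive_rate(group):
        selected = [pred for pred, g in zip(predictions, groups) if g == group]
        return sum(selected) / len(selected)
    return positive_rate(group_a) - positive_rate(group_b)

# Hypothetical binary predictions (1 = item is recommended) and group membership.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["m", "m", "m", "m", "f", "f", "f", "f"]

gap = statistical_parity_difference(predictions, groups, "m", "f")
```

A gap of 0 would indicate parity. Here group "m" receives positive predictions 75% of the time against 25% for group "f", giving a gap of 0.5.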

How to mitigate bias?

Creating fairer recommender systems is an active area of research. A popular strategy to tackle bias is negative sampling. In our YouTube recommender example, we have data for clicks. If we want to create a recommendation model based on click prediction, the only click data we have is skewed towards popular videos. To balance this, we create samples by choosing random videos for users and assigning them the negative class. The idea is that the likelihood that a user likes a random video is very low. This way, we de-skew the data distribution.
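Negative sampling as described above can be sketched as follows. The function name, the uniform sampling strategy, and the one-negative-per-positive ratio are assumptions made for illustration.

```python
import random

def sample_negatives(positives, all_videos, per_positive=1, seed=0):
    """Add randomly chosen, never-clicked videos as label-0 rows for each user."""
    rng = random.Random(seed)
    clicked = {}
    for user, video in positives:
        clicked.setdefault(user, set()).add(video)
    rows = []
    for user, video in positives:
        rows.append((user, video, 1))
        # Assumption: sample uniformly from the videos this user never clicked.
        candidates = [v for v in all_videos if v not in clicked[user]]
        for negative in rng.sample(candidates, min(per_positive, len(candidates))):
            rows.append((user, negative, 0))
    return rows

positives = [("u1", "a"), ("u1", "b"), ("u2", "a")]
all_videos = ["a", "b", "c", "d", "e", "f"]
rows = sample_negatives(positives, all_videos)
```

Each click now comes with one random negative, so less popular videos such as "e" or "f" appear in the training data even though nobody clicked on them.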

Besides negative sampling, many scoring mechanisms measure the diversity of candidates. In session-based recommendations, more diverse recommendations can be introduced by taking the previous N items viewed by a user and recommending items from topics that differ from them. For example, if one reads a few articles on politics and the movie industry, the following recommendations can include a few items from the sports industry.

Maximal Marginal Relevance (MMR) is a measure used in information retrieval to strike a balance between relevance and diversity. According to the paper:

Maximal marginal relevance (MMR) | Source

C is a document collection, Q is a query, R is a ranked list of documents retrieved by an IR system for a given C and Q. S is the subset of documents in R already selected; R\S is the set of documents in R but not in S; Sim1 is the similarity metric used in document retrieval and relevance ranking between documents (passages) and a query; and Sim2 can be the same as Sim1 or a different metric. MMR computes incrementally the standard relevance-ranked list when the parameter λ=1, and computes a maximal diversity ranking among the documents in R when λ=0.
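The definitions above correspond to the commonly cited form MMR = argmax over D_i in R\S of [λ · Sim1(D_i, Q) − (1 − λ) · max over D_j in S of Sim2(D_i, D_j)]. A greedy re-ranking based on it can be sketched as follows; the similarity values are made up, and `mmr_rerank` is a hypothetical helper rather than a library function.

```python
def mmr_rerank(query_sim, doc_sim, k, lam=0.5):
    """Greedy MMR: query_sim[d] is Sim1(d, Q); doc_sim[d][e] is Sim2(d, e)."""
    selected = []               # S: documents already picked
    remaining = set(query_sim)  # R \ S: documents not yet picked
    while remaining and len(selected) < k:
        def mmr_score(doc):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((doc_sim[doc][s] for s in selected), default=0.0)
            return lam * query_sim[doc] - (1 - lam) * redundancy
        best = max(sorted(remaining), key=mmr_score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical similarities: "a" and "b" are near-duplicates, "c" is different.
query_sim = {"a": 0.9, "b": 0.85, "c": 0.5}
doc_sim = {
    "a": {"b": 0.95, "c": 0.1},
    "b": {"a": 0.95, "c": 0.1},
    "c": {"a": 0.1, "b": 0.1},
}

relevance_only = mmr_rerank(query_sim, doc_sim, k=2, lam=1.0)
balanced = mmr_rerank(query_sim, doc_sim, k=2, lam=0.5)
```

With λ=1 the two near-duplicates "a" and "b" are returned, while λ=0.5 swaps the second slot for the less relevant but more diverse "c".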

Online evaluation

A/B experiments

The objective we train the model for and the offline metrics we measure might not be what we look for in reality. For example, if YouTube creates the most accurate model to predict clicks, it might not mean user churn would decrease. Though the model recommends all the videos a user likes, they might still drop off and not come back the next day.

Secondly, training a model precisely for what one wants is hard. For example, it is more complex to train a model that recommends videos to reduce churn than one that recommends videos based on clicks.

Standard online metrics include:

  • User retention
  • Engagement (likes, bookmarks, follows, etc.)
  • Clicks
  • Purchase
  • Revenue
  • Time-spent
  • Diversity in recommendations.

The ultimate moment of truth for any model is a live A/B test, in which the new variation is tested against the existing model. For example, suppose the hypothesis is that the learning rate should be 10x the current rate. To test this, we launch a model trained with the new learning rate for a random set of users on the platform. Since the current model and the new variation serve the same distribution of users, any change in the online metrics beyond random noise can be attributed to the change in learning rate. One can decide if the new variation is better or worse depending on the net change.
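To judge whether a difference between the two groups is more than random noise, a significance test is commonly applied. The sketch below uses a two-proportion z-test on a click-through-style metric; the choice of test and the numbers are assumptions for illustration, not something the article prescribes.

```python
import math

def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical experiment: control converts 10% of 10,000 users, the variation 11%.
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
```

With these numbers z is about 2.3 and the p-value about 0.02, so at a 5% significance level the lift would be treated as real. Metrics such as retention or revenue may need different tests and longer observation windows.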

Testing recommender systems

Model evaluation vs testing

We saw how we could evaluate a recommendation model using various metrics and analyses so that our experiments and hypotheses hold. However, things can still go wrong when the model is shipped to production. Even minor engineering bugs can lead to unintended recommendations and poor user experience. Hence, testing every step (inference, re-training cycles, dataset creation, and feature ranges) is essential for online deployments.

In this section, we will take a look at some of the ways through which we can test our entire recommender system – from the model’s behavior to the health of the pipeline.

Behavior checks for a recommendation system

Measuring embedding update rate

Since RecSys models are built on embeddings, it is vital to ensure embeddings are trained correctly. With every re-training, the embeddings of a user and item update. An important metric to check is the average drift of different user/item embedding versions. Drift can be checked by measuring cosine similarity. 

For example, if a user A’s embedding yesterday was e1 and after re-training it is e2, the drift is measured as cosine(e1, e2). Ideally, this number should be high, for example above 0.9, but not too close to 1. If it is too small, it indicates that the embedding has not converged. If it is too close to 1, it suggests that the model might not be capturing the new interests of the user.
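The drift check can be sketched as below. The embeddings are toy two-dimensional vectors, and `average_drift` is a hypothetical helper that averages the cosine similarity over all ids present in both versions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average_drift(old_embeddings, new_embeddings):
    """Mean cosine(e1, e2) over the users/items present in both versions."""
    shared = old_embeddings.keys() & new_embeddings.keys()
    return sum(
        cosine_similarity(old_embeddings[key], new_embeddings[key]) for key in shared
    ) / len(shared)

# Toy embeddings before and after a re-training cycle.
yesterday = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
today = {"A": [1.0, 1.0], "B": [0.0, 2.0]}

drift = average_drift(yesterday, today)
```

User A's embedding rotated by 45 degrees (cosine about 0.71) while user B's only changed in length (cosine 1.0), so the average is about 0.85. Tracking this number across re-training cycles, and alerting when it leaves an agreed band, turns the check into an automated test.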

Metrics on different cuts

As mentioned earlier, single-number metrics can be deceptive. For example, 10% of popular items can comprise 80% of the dataset. Measuring AUC on the entire dataset can give optimistic numbers because the model only has to learn that 10% of the items well. This can hide the fact that the model has not learned the long tail of items well. Such negligence can lead to poor diversity and novelty. In such cases, one can analyze item-level metrics and check if all the items are performing reasonably well. The same applies to many other attributes like user, gender, geography, etc.
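A per-slice metric can be computed by grouping predictions before scoring them. The sketch below uses a pairwise AUC and two hypothetical slices, "popular" and "long_tail"; the rows are made up to show how an overall number can hide a weak slice.

```python
from collections import defaultdict

def pairwise_auc(labels, scores):
    """Fraction of positive-negative pairs ranked correctly; None if a class is missing."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_slice(rows):
    """rows are (slice_name, label, score) tuples; returns {slice_name: AUC}."""
    grouped = defaultdict(lambda: ([], []))
    for slice_name, label, score in rows:
        grouped[slice_name][0].append(label)
        grouped[slice_name][1].append(score)
    return {name: pairwise_auc(labels, scores) for name, (labels, scores) in grouped.items()}

rows = [
    ("popular", 1, 0.9), ("popular", 1, 0.8), ("popular", 0, 0.3), ("popular", 0, 0.2),
    ("long_tail", 1, 0.4), ("long_tail", 0, 0.6), ("long_tail", 1, 0.5), ("long_tail", 0, 0.3),
]

overall = pairwise_auc([label for _, label, _ in rows], [score for _, _, score in rows])
per_slice = auc_by_slice(rows)
```

The overall AUC is 0.875, yet the long-tail slice scores only 0.5, which is no better than random ordering.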

Variance tests for session-based models

Session-based models require that new information is instantly consumed by the models to update recommendations. A good session-based recommendation model adapts quickly and accurately to the current interests of a user. 

Consider an RNN-based model that takes in the previous N interactions to recommend items for the (N+1)th position. If the model is biased towards popularity, it is bound to recommend the same popular items after the (N-2)th, (N-1)th, and Nth interactions. A good model, however, will recommend a different set of items after each interaction. Mathematically, one can look at the change in the hidden state after each time-step in the RNN model, just like we compute the embedding drift (explained above).

Similarly, if the user interacts with 10 videos across different topics like AI, comedy, or football, and responds positively to comedy videos and negatively to other topics, the next recommendations should include funny videos. One can measure the affinity to certain topics/genres in the session history and its manifestation in the next set of recommendations.

Software checks for a recommendation system

Apart from the standard unit and integration tests, there are a few RecSys-specific behavioral tests you should look into:

  • Feature consistency: during model training, we might use many features besides embeddings, like user location, age, video length, etc. It is common to apply transformations like scaling to these features before using them. However, this opens up chances of making mistakes during inference due to the mishandling of features. For example, if you scaled a feature in training but not in inference, your model predictions could be wrong.
  • Leaky features: many models, like session-based ones, use near real-time information for each interaction, for example, the number of items a user has interacted with. If a user interacts with six items A, B, C, D, E, and F, the values of this feature would be 0, 1, 2, 3, 4, 5, since the user had interacted with 0 items until they clicked on A, 1 until they clicked on B, and so on. We use only the information available to us before the event occurs. When selecting data from tables for offline training, we should ask whether a feature value includes information from the event itself or from later events, since that would leak into training.
  • Updated embeddings: recommendation models are trained periodically. After each training cycle, the updated embeddings should be used to recommend items. Using older embeddings can lead to inconsistencies and inaccurate recommendations.
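The leaky-feature example above can be turned into a small check. The sketch below builds the "number of prior interactions" feature so that each row only sees events that happened before it; the event format and function name are assumptions.

```python
def prior_interaction_counts(events):
    """events are (timestamp, user, item) tuples; returns (user, item, prior_count) rows."""
    seen = {}
    rows = []
    for _, user, item in sorted(events):
        # Record the count *before* registering the current event, so the
        # feature never includes the interaction it is meant to help predict.
        rows.append((user, item, seen.get(user, 0)))
        seen[user] = seen.get(user, 0) + 1
    return rows

events = [(t, "u1", item) for t, item in enumerate("ABCDEF", start=1)]
rows = prior_interaction_counts(events)
feature_values = [count for _, _, count in rows]
```

For the six interactions A to F this yields 0, 1, 2, 3, 4, 5. A unit test asserting exactly this sequence would fail if the count were taken after the event, which is the off-by-one leak described in the list above.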

Tools for testing recommender systems

Here are some relevant tools for testing different stages of a Recommender System:

1. Dataset creation and feature engineering

Feature distributions and anomalies in feature values are some of the key things to track. Usually, recommendation model training is executed in DAGs via tools like Airflow or Kedro. After the dataset is created, one can write a test suite that compares the expected statistics against those in the data. Depending on an acceptable error margin, it is possible to create alerts. Pytest is a popular tool for writing such unit tests.
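A dataset test of this kind might look like the sketch below. The feature name, reference mean, tolerance, and value range are hypothetical; in a real pipeline the values would be read from the freshly created dataset rather than hard-coded. Pytest collects functions whose names start with `test_`.

```python
import statistics

def feature_violations(values, expected_mean, rel_tolerance, min_value, max_value):
    """Return a list of human-readable problems found in one feature column."""
    problems = []
    mean = statistics.fmean(values)
    if abs(mean - expected_mean) > rel_tolerance * abs(expected_mean):
        problems.append(f"mean {mean:.1f} outside tolerance of {expected_mean}")
    out_of_range = [v for v in values if not (min_value <= v <= max_value)]
    if out_of_range:
        problems.append(f"{len(out_of_range)} values outside [{min_value}, {max_value}]")
    return problems

def test_video_length_distribution():
    # Stand-in for a column loaded from the training dataset.
    video_length_sec = [280, 310, 295, 330, 305]
    # Hypothetical reference: mean of 300 s within 10%, lengths between 0 s and 10 h.
    assert feature_violations(video_length_sec, 300.0, 0.1, 0, 36_000) == []
```

Returning a list of problems instead of failing on the first one makes the alert message more useful when several statistics drift at once.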

2. Training and deployment

Most recommendation models are trained in a deep learning fashion using gradient-descent-based optimization. Naturally, hyper-parameters like learning rate, weight decay, and number of training steps come into play. It is possible to spot abrupt changes in training using the metrics discussed above and training-validation loss curves. Tools like Neptune allow monitoring model training with minimal code changes. It is possible to log curves, metrics, hyper-parameters, and scripts using Neptune’s API.

Open-source tools like RecList provide an easy-to-use interface to compute the most common metrics in recommender model evaluation. Given a dataset and a model, RecList can run specified tests on the target dataset. Beyond metrics, it also produces plots and deep-dive aggregates based on different slices.

3. Inference

Inference requires feature consistency, availability, low latency, and access to updated models. With every code change, data scientists must make sure these requirements still hold. Software engineering practices like code reviews, version control with Git, and automating the testing stage with CI/CD processes (Jenkins, GitHub Actions) help ensure a safe software release.

Conclusion

Among many areas in ML, like NLP, computer vision, and such, recommender systems are relatively under-researched. Yet, they have one of the most impactful applications in modern-day digital applications. Although evaluating them is not straightforward, the above metrics and ideas are an excellent place to start. Remember that one should build a recommendation ‘system’ and not a ‘model’. In the long run, investing in building a solid infrastructure will help more than making a SOTA model.