Neptune Blog

Recommender Systems: Lessons From Building and Deployment

Dhruvil Karani


6th May, 2025


If you look at recommender systems papers, a large number of them come from industry rather than academia. This is because RecSys is, at its core, a practical problem. RecSys for e-commerce could be considerably different from RecSys for social media, as the business objectives differ. In addition, every novel idea needs to be tested in the real world to gain credibility. As a result, learning the practicalities of RecSys is as essential as learning about novel architectures.

*Structure of a recommender system | Source*

This article discusses practical considerations while building a recommender system. Specifically, we are going to talk about my learnings regarding recommender systems in the following areas:

Dataset creation
Objective-design
Model training
Model evaluation
- Offline evaluation
- Detecting and mitigating bias
Checklist for checking model correctness
RecSys architecture
Online MLOps
A/B testing

Note: All views in the article are the author’s own and do not represent the author’s current or past employers.

Recommender systems: dataset creation

This step is not as straightforward for RecSys as it is for text or image classification. For example, suppose we are building a RecSys that predicts clicks for an e-commerce website. We might train our model on all the data if we have a small number of users and items.

However, if we are working at an Amazon or Walmart scale, we have millions of daily active users and items in the catalog. Training even a simple collaborative filtering model on the entire interaction history is expensive: we have to read the data (in TBs, if not PBs) from the data warehouse and spin up a high-capacity VM that will run for weeks. We must ask whether it is worth the cost and what the right way of going about it is.

If we have a billion users in our database but only a few million daily active users, we should train only on these active users, since inactive users are less likely to show up. One can select this subset by putting a threshold on activity in the last N days, for example, users who clicked on >=10 items in the last 10 days. If a user who was not included in training shows up, we can fall back to custom logic, like content-based or popularity-based retrieval. Since RecSys models are retrained periodically, this subset of users will keep changing. Once we select this subset, we train our model on the interactions of these users.
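As a rough illustration of the activity threshold and the fallback described above, here is a minimal sketch. The log schema (user, item, timestamp tuples), the function names, and the popularity fallback list are all assumptions made for the example, not part of any particular system.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def select_active_users(clicks, now, window_days=10, min_items=10):
    """Return users who clicked at least `min_items` distinct items recently.

    `clicks` is an iterable of (user_id, item_id, clicked_at) tuples;
    this log schema is an assumption made for the example.
    """
    cutoff = now - timedelta(days=window_days)
    items_by_user = defaultdict(set)
    for user_id, item_id, clicked_at in clicks:
        if clicked_at >= cutoff:
            items_by_user[user_id].add(item_id)
    return {u for u, items in items_by_user.items() if len(items) >= min_items}


def recommend_with_fallback(user_id, active_users, model_recs, popular_items):
    """Use the trained model for active users, popularity for everyone else."""
    if user_id in active_users and user_id in model_recs:
        return model_recs[user_id]
    return popular_items


now = datetime(2025, 5, 6)
clicks = [("u1", f"item{k}", now - timedelta(days=1)) for k in range(10)]
clicks += [("u2", f"item{k}", now - timedelta(days=30)) for k in range(10)]
active_users = select_active_users(clicks, now)  # only u1 is recent enough
```

At real scale this filter would typically run as a query in the data warehouse rather than in Python; the logic stays the same.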

The next question is, how much data is enough? If we have five years of data, we don’t need all of it. Yes, a model benefits from more data. But in RecSys, the main idea is to best capture a user’s interest, which changes over time. So it makes more sense to have fresh training data. In addition, a simple collaborative filtering model cannot capture too much complexity. One can verify this by plotting a metric vs. number of training steps, which will most likely show diminishing gains.

Next, it helps to detect duplicates in your dataset, like the same video/item posted twice with different IDs. Besides, NLP and CV models can help remove NSFW, harmful, and illegal content from the dataset.

Following these steps can reduce the dataset size considerably. This will help us save costs with minimal loss of quality.

Recommender systems: designing optimal objective

The ultimate goal of RecSys is to give people what they want. Although this is a broad and rather philosophical question, we must narrow it down to a specific signal for which the model must optimize – predicting clicks, likes, shares, etc. When we train a model to predict clicks and use it to serve recommendations, our underlying assumption is that if you click on an item, it is relevant to you. More often than not, this assumption is not entirely true.

To understand this better, let’s use a different example. Say you are building a RecSys for YouTube, which predicts whether a user will click a particular video. This model is used to serve recommendations based on the click probability. However, suppose this model results in users spending less time on the platform. The reason is that clicks are not equivalent to relevance. Most clickbait videos have a high click rate, but viewers stop watching them after a few seconds. A model that is 100% accurate would serve a high number of videos that are clicked but not watched.

Learning from the above, you decide to train a model that predicts if the user will watch at least 75% of the video. So the training examples will include (user, video, label) triplets, where label=1 if >=75% of the video is watched else 0. This is better than the click model because now we consider that the user has done more than just click a video. However, even this has a major problem.

Consider two videos, A and B. A is an entertaining 20-second video, and B is a 60-minute tutorial. To watch 75%, you need to watch 15 seconds of A and 45 minutes of B.

Naturally, A will have a higher positive rate for this label than B. However, watching 15 seconds of A does not necessarily mean that the user liked A (15 seconds is too little time to decide whether you like the content), whereas watching 30 minutes (50%) of B most likely means that B is relevant to the user. Even a highly accurate model would end up serving a disproportionately large number of short videos, which is not optimal.
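The two-video example can be worked through in a few lines. This is only a sketch of the 75% labelling rule from the text; the function name and arguments are made up for illustration.

```python
def watch_label(watched_seconds, duration_seconds, threshold=0.75):
    """Label is 1 if at least `threshold` of the video was watched, else 0."""
    return 1 if watched_seconds / duration_seconds >= threshold else 0


# Video A: 15 seconds watched out of a 20-second clip.
label_a = watch_label(15, 20)
# Video B: 30 minutes watched out of a 60-minute tutorial.
label_b = watch_label(30 * 60, 60 * 60)
```

Here A gets a positive label after 15 seconds of attention, while B gets a negative label despite 30 minutes of watch time, which is the bias described above.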

The point is that one signal rarely defines complete relevance. Each signal has its own bias. It is a good practice to train multiple models on multiple signals, combine their individual scores (weighted addition, for example), and create the final score.
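A minimal sketch of the weighted-addition idea follows. The signal names, the weights, and the per-signal scores are invented for illustration; in practice the weights would be tuned, for example through online experiments.

```python
def combine_scores(signal_scores, weights):
    """Weighted sum of per-signal model scores for one (user, item) pair."""
    return sum(weights[name] * score for name, score in signal_scores.items())


# Hypothetical weights for three separately trained models.
weights = {"click": 0.2, "watch_75": 0.5, "share": 0.3}

# Hypothetical per-signal scores for two candidate videos.
candidates = {
    "clickbait": {"click": 0.9, "watch_75": 0.1, "share": 0.05},
    "tutorial": {"click": 0.4, "watch_75": 0.7, "share": 0.3},
}

ranked = sorted(
    candidates,
    key=lambda item: combine_scores(candidates[item], weights),
    reverse=True,
)
```

With these numbers the tutorial outranks the clickbait video even though its click score is lower, because the other signals pull the final score up.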

Recommender systems: model training

Large NLP or Vision models have billions of parameters distributed among linear, convolution, recurrent, or attention layers. Each of these parameters is involved in the computation of the output. Recommendation models, however, can be much larger than most NLP or CV models, and only a small fraction of their parameters is touched for any single prediction.

Consider matrix factorization, where the model learns an embedding for each user and each item (as in collaborative filtering). Suppose the embedding dimension is 100 and you have 100 million users and 10 million items. The total number of embeddings is 110 million, and each embedding has 100 learnable parameters. Hence the model has 110 million * 100, or ~11 billion, parameters. However, to compute scores for a user, you need to access just one of the 100 million user embeddings at a time. This user embedding is used along with all the item embeddings to score all the items. Hence, recommendation models are memory-intensive but compute-light.
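The arithmetic above, and the fact that scoring one user reads only one row of the user table, can be sketched as follows. The tiny random tables are stand-ins so the example runs anywhere; the dot product is assumed as the scoring function.

```python
import random

NUM_USERS = 100_000_000
NUM_ITEMS = 10_000_000
EMBEDDING_DIM = 100

# Every user and every item owns one EMBEDDING_DIM-sized vector.
total_params = (NUM_USERS + NUM_ITEMS) * EMBEDDING_DIM  # 11 billion


def score_all_items(user_vector, item_table):
    """Dot product of one user vector with every item vector."""
    return [
        sum(u * v for u, v in zip(user_vector, item_vector))
        for item_vector in item_table
    ]


# Tiny stand-in tables: 5 users, 3 items, 4 dimensions.
rng = random.Random(0)
tiny_users = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
tiny_items = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]

# Serving user 2 reads a single row of the user table, not all five.
scores = score_all_items(tiny_users[2], tiny_items)
```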

This is a different challenge because now you can’t and don’t need to load the entire embedding table on a GPU/TPU for a batch of data. However, writing such models on traditional frameworks like TensorFlow or PyTorch is hard because their default behaviour is to load the entire model on GPU/TPUs. Fortunately, many frameworks have built functionality for this very purpose.

TensorFlow has built a library called tensorflow_recommenders, which includes a special embedding layer called TPUEmbedding. Besides, it has implemented versions of many common tasks in RecSys like retrieval and ranking and popular architectures like DCN.

Recently, PyTorch announced torchrec. According to the team:

“TorchRec is a PyTorch domain library built to provide common sparsity & parallelism primitives needed for large-scale recommender systems (RecSys). It allows authors to train models with large embedding tables sharded across many GPUs.”

NVIDIA also has Merlin, which automates common processes in RecSys for faster production-grade systems. It supports TensorFlow and PyTorch and is built on top of cuDF (GPU equivalent of pandas), RAPIDS (GPU-based analytics and data manipulation library), and Triton (high-performance inference server).

Recommender systems: model evaluation

Offline evaluation

A typical classification task optimizes for metrics like accuracy, precision, recall, or F1-score. Evaluating RecSys using these metrics is deceptive. In RecSys, we are less interested in the absolute probabilities than in the ranking. For example, if the predicted scores for videos A and B are 0.9 and 0.8, we will show video A first and then B while serving. Even if the scores for A and B were 0.5 and 0.4, or 0.3 and 0.2, the outcome would be the same. It’s the ordering that matters, not the absolute numbers. Hence, metrics like ROC-AUC, PR-AUC, NDCG, recall@K, and precision@K are better suited.

However, even this evaluation can fall short. Recommender systems are notorious for compounding bias towards certain topics, demographics, or popularity. A recommender system trains on logs generated by itself. If popular content is promoted more by the system, then the incremental logs generated will have more triplets for this popular content. The next version of the model, trained on these new logs, will see a skewed distribution and will learn that recommending popular items is a safe choice. This is called popularity bias.
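To make the point about ordering concrete, here is a small sketch of recall@K and NDCG@K with binary relevance. The score dictionaries are invented; both assign different absolute values but the same order, so the ranking metrics come out identical. The helper assumes at least one relevant item.

```python
import math


def rank_items(scores):
    """Item ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(scores, relevant, k):
    """Fraction of relevant items that appear in the top k."""
    top_k = rank_items(scores)[:k]
    return len(set(top_k) & relevant) / len(relevant)


def ndcg_at_k(scores, relevant, k):
    """NDCG with binary relevance; assumes `relevant` is non-empty."""
    top_k = rank_items(scores)[:k]
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(top_k)
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(p + 2) for p in range(min(k, len(relevant))))
    return dcg / ideal


relevant = {"A", "C"}
high_scores = {"A": 0.9, "B": 0.8, "C": 0.1}
low_scores = {"A": 0.3, "B": 0.2, "C": 0.05}

same_recall = recall_at_k(high_scores, relevant, 2) == recall_at_k(low_scores, relevant, 2)
same_ndcg = ndcg_at_k(high_scores, relevant, 2) == ndcg_at_k(low_scores, relevant, 2)
```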

It is advisable to compute metrics on different slices of the data, for example by user attributes like age, gender, location, etc. This helps us understand whether the model performs well for one set of users and poorly for the rest. Tools like reclist provide an easy interface to deep-dive into your recommender model.
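Slicing a metric by user attribute needs nothing more than a group-by. The sketch below assumes a per-user metric (here a hypothetical `recall_at_10` column) has already been computed; the rows and attribute names are made up.

```python
from collections import defaultdict


def metric_by_segment(rows, segment_key, metric_key="recall_at_10"):
    """Average a per-user metric within each value of a user attribute."""
    values = defaultdict(list)
    for row in rows:
        values[row[segment_key]].append(row[metric_key])
    return {segment: sum(v) / len(v) for segment, v in values.items()}


rows = [
    {"user": "u1", "age_group": "18-24", "location": "IN", "recall_at_10": 1.0},
    {"user": "u2", "age_group": "18-24", "location": "US", "recall_at_10": 0.5},
    {"user": "u3", "age_group": "45-54", "location": "US", "recall_at_10": 0.25},
]
by_age = metric_by_segment(rows, "age_group")
by_location = metric_by_segment(rows, "location")
```

A large gap between segments, like the one between the two age groups here, is a cue to look closer at how the model treats the weaker segment.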

Another useful tool could be neptune.ai, as it provides simple logging APIs for a much more organized, collaborative, and comprehensive analysis. One can create custom dashboards to visualize the logs through interactive visualizations. As discussed above, we are interested in metrics at multiple cuts based on attributes like demographic and location. We can plot ROC/PR AUC, loss curves, and log ranking metrics here and easily compare and determine if the model is really robust or not.

Detecting and mitigating bias

As explained earlier, biases like popularity bias can easily propagate through the system if not taken care of. But how do we measure bias before mitigating it?

One easy way to measure popularity bias is to check how many unique items make up 10%, 20%, 50%, ..., 100% of recommendations. In an ideal case, the number of items should increase with the % of recommendations. However, for a biased model, the number of items will saturate after a certain % (usually on the lower end). This is because the model relies on only a certain subset of recommendable items to make predictions.

But this approach does not take the user’s preference into account. For example, say user U1 interacts with three items A, B, and C, and likes A and B but not C. Similarly, user U2 interacts with A, B, and C, and likes only A. We know that A is a popular item while B and C are not.
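One possible reading of this check is sketched below: count how many of the most-recommended items are needed to cover a given percentage of all served recommendations. The two logs are synthetic, and the function name is an assumption for the example.

```python
from collections import Counter


def items_needed_for_share(recommended_items, percents=(10, 20, 50, 100)):
    """For each percent p, count the most-recommended items needed to
    cover p% of all served recommendations."""
    counts = sorted(Counter(recommended_items).values(), reverse=True)
    total = sum(counts)
    needed = {}
    for percent in percents:
        covered = 0
        n_items = 0
        for count in counts:
            if covered * 100 >= percent * total:
                break
            covered += count
            n_items += 1
        needed[percent] = n_items
    return needed


# Synthetic logs: one spreads recommendations evenly, one leans on a single hit.
balanced_log = [f"item{i}" for i in range(10) for _ in range(10)]
skewed_log = ["hit"] * 90 + [f"item{i}" for i in range(10)]

balanced_curve = items_needed_for_share(balanced_log)
skewed_curve = items_needed_for_share(skewed_log)
```

In the balanced log the item count grows steadily with the share covered, while in the skewed log a single item accounts for 10%, 20%, and 50% of recommendations alike.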

	A (popular)	B (not-popular)	C (not-popular)
U1	1	1	0
U2	1	0	0

Example of a simple biased model

For U1, if the model scores A higher than B, it may be biased, because the user’s response to both items is positive. If the model consistently favours the more popular item in such cases, we have a biased model. However, for U2, it makes sense to rank the popular item higher because U2 does not like the other two non-popular items. Although these examples are very simplistic, there are measures like statistical parity that help you quantify this.

There are a few simple ways to mitigate bias. One way is to introduce negative samples. Consider an e-commerce platform where users interact with a few items out of hundreds shown. We only know what items the user interacted with (positive examples). However, we have no idea about what happened to the other items. To balance this dataset, we introduce negative samples by randomly sampling an item for a user and assigning it a negative label (=0). The assumption is that a user will not like an item picked randomly. Since this assumption is most likely true, adding negative samples actually adds missing information to the dataset.
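Random negative sampling can be sketched as below. Two details are assumptions added for the example rather than something the text prescribes: negatives are never drawn from items the user already interacted with, and the same negative is not drawn twice for a user.

```python
import random
from collections import defaultdict


def add_negative_samples(positives, catalog, per_positive=1, seed=0):
    """Turn (user, item) positives into (user, item, label) triplets,
    adding randomly sampled items with label 0 for each positive."""
    rng = random.Random(seed)
    used = defaultdict(set)  # items already attached to each user
    for user, item in positives:
        used[user].add(item)

    triplets = [(user, item, 1) for user, item in positives]
    for user, _item in positives:
        unused = [item for item in catalog if item not in used[user]]
        for item in rng.sample(unused, min(per_positive, len(unused))):
            triplets.append((user, item, 0))
            used[user].add(item)
    return triplets


positives = [("u1", "A"), ("u1", "B"), ("u2", "A")]
catalog = ["A", "B", "C", "D", "E"]
triplets = add_negative_samples(positives, catalog)
```

Scanning the whole catalog per positive is fine for a toy example; with millions of items one would more likely sample an item id at random and retry on collision.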

Checklist for testing correctness of a recommender system model

Like any piece of software, one should ensure the correctness of the models by writing unit tests. Unfortunately, writing unit tests for ML code is uncommon and tricky. However, for RecSys, let’s focus on a simple CF (collaborative filtering) model. As we know, the model is essentially the set of user embeddings and item embeddings. You can test this model for the following:

Correct scoring – The scoring operation consuming a user and item embedding should produce a score between 0 and 1.
Correct versioning – Since the embeddings are retrained periodically, it is important to version them correctly so that the scores are consistent.
Correct features – Some models, like two-tower models, use features like user activity in the last X hours. One needs to make sure that the feature pipeline that the model consumes does not produce leaky features.
Correct training dataset – The dataset should not have duplicate user-item pairs, the labels should be correct, and the train-test-split should be random.
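A few of the checks above can be written as plain assertion-based tests. This is a sketch under assumptions: the scoring rule is taken to be a sigmoid over the dot product, and the embedding tables are assumed to carry a `version` field; neither is specified in the text.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def score(user_vector, item_vector):
    """Assumed scoring rule: sigmoid of the dot product."""
    return sigmoid(sum(u * v for u, v in zip(user_vector, item_vector)))


def duplicate_pairs(triplets):
    """Return (user, item) pairs that occur more than once."""
    seen, dupes = set(), set()
    for user, item, _label in triplets:
        if (user, item) in seen:
            dupes.add((user, item))
        seen.add((user, item))
    return dupes


def test_score_is_between_0_and_1():
    cases = [
        ([0.0, 0.0], [1.0, 2.0]),
        ([50.0, 50.0], [50.0, 50.0]),    # very large positive dot product
        ([-50.0, 50.0], [50.0, -50.0]),  # very large negative dot product
    ]
    for user_vector, item_vector in cases:
        assert 0.0 <= score(user_vector, item_vector) <= 1.0


def test_embedding_versions_match():
    # Hypothetical table layout with an explicit version tag.
    user_table = {"version": "2025-05-06", "vectors": {"u1": [0.1, 0.2]}}
    item_table = {"version": "2025-05-06", "vectors": {"A": [0.3, 0.4]}}
    assert user_table["version"] == item_table["version"]


def test_dataset_is_clean():
    triplets = [("u1", "A", 1), ("u1", "B", 0), ("u2", "A", 1)]
    assert not duplicate_pairs(triplets)
    assert all(label in (0, 1) for _user, _item, label in triplets)
```

The functions follow the `test_*` naming convention, so a runner such as pytest would pick them up, and they can also be called directly.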

RecSys architecture

Recommender systems have to pick the best set for a user from a set of millions of items. However, this has to be done within strict latency requirements. The more complex the model we train, the more time it takes to process one request. Hence, RecSys follows a multi-stage architecture. Think of it as a funnel that starts with a million items and ends with a handful of recommendations.

The idea is to use a simple, lightweight model at the top of this funnel, like a simple collaborative filtering model. This model should be able to pick the few thousand most relevant items, though not necessarily in the best order; i.e., the relevant items should be present in this set of thousands of items, and it is okay if they are not at the top. Hence, this model optimizes for recall and speed. This model is also called a candidate generator. Even in a simple collaborative filtering model, ensure the embedding dimensions are not too large. Using hundreds of dimensions might give you a slight increase in recall but hurt your latencies.

Then, these thousands of items are sent to another model called the light ranker. As the name suggests, the task of this model is to find the best ranking. The model is trained for high precision and is more complex than the candidate generator (for example, a two-tower model). It also uses more features based on user activity, item metadata, and more. The outcome of this model is a ranked list of the top few hundred items.

Finally, these hundreds of items are sent to the heavy ranker. This ranker has a similar objective to the light ranker, except that it is heavier than the light ranker and uses even more features. Since it operates on hundreds of items only, the latencies involved with such complex architectures are manageable.
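The funnel can be sketched as three successive top-N selections, each with a more expensive scorer applied to fewer items. The scorers below are toy functions invented for the example, and the stage sizes are illustrative. A production candidate generator would usually rely on approximate nearest-neighbour search rather than sorting the whole catalog.

```python
def top_n(candidates, score_fn, n):
    """Keep the n highest-scoring candidates, best first."""
    return sorted(candidates, key=score_fn, reverse=True)[:n]


def funnel_recommend(user, catalog, candidate_score, light_score, heavy_score,
                     sizes=(1000, 100, 10)):
    """Narrow the catalog in three stages; later stages see fewer items."""
    candidates = top_n(catalog, lambda item: candidate_score(user, item), sizes[0])
    shortlist = top_n(candidates, lambda item: light_score(user, item), sizes[1])
    return top_n(shortlist, lambda item: heavy_score(user, item), sizes[2])


# Toy scorers: users and items are integers, and closer means more relevant.
def cheap(user, item):
    return -abs(user - item)


def medium(user, item):
    return -abs(user - item) + 0.5 * (item % 2)


def expensive(user, item):
    return -abs(user - item)


recommendations = funnel_recommend(5000, range(10_000), cheap, medium, expensive)
```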

*Recommender systems architecture | Source*

Online MLOps for recommender systems

One good thing about recommendation models vs. a classification or regression model is that we get real-time feedback or “labels”. Hence, we can set up a comprehensive MLOps pipeline to closely monitor model performance.

There are many metrics we can monitor.

1. Time spent on the platform
2. Engagement
3. Clicks
4. Purchases
5. User churn

Model performance on metrics like engagement is easy to measure in offline experiments. However, you can’t measure something like churn in an offline experiment. It is common to find such discrepancies in real-world RecSys. Usually, we analyze which metrics that are measurable offline (like time spent, engagement, clicks) correlate with churn. This reduces the problem to improving a set of predictable metrics in offline experiments.

Besides model quality and performance, we should monitor things like average, 95th percentile, and 99th percentile latencies, CPU, non-200 status code rates, and memory usage. Unsurprisingly, improving these metrics also improves time spent and reduces churn. Tools like Grafana help set up comprehensive observability dashboards.
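For a sense of what the latency numbers mean, here is a small sketch that computes the average, 95th, and 99th percentile from a list of request latencies using the standard library. The synthetic latencies and the function name are assumptions; in production these values normally come from the monitoring stack rather than hand-written code.

```python
import statistics


def latency_summary(latencies_ms):
    """Average, p95, and p99 of request latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        "mean": statistics.fmean(latencies_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Synthetic latencies: 1 ms, 2 ms, ..., 1000 ms.
summary = latency_summary([float(ms) for ms in range(1, 1001)])
```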

Retraining pipelines can also break down because of problems not related to bugs in code, like not enough pods available in your Kubernetes clusters or not enough GPU resources available. If you are running your DAGs on Airflow, you can set up a failure alert on Slack. Additionally, tune the number of retries and the timeout parameters so that the chances of automatic recovery improve.

Recommender systems: A/B testing

Improving recommender systems is a continuous process. However, this improvement should not worsen the user experience. Even if your team comes up with a novel model that shows amazing gains in offline evaluation, it is still not wise to roll out the model to all users right away. This is where A/B testing comes into play.

Any new target model must be evaluated against the control (existing production) model. In an A/B test, you would randomly select a small percentage of users and serve them using the target model, while the rest receive recommendations from the control model as before. After a few days/weeks, look at which model performed better and quantify it using hypothesis testing. If the test concludes that the new model gives gains over the control, you roll out the new model for all users.
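As one possible way to quantify the comparison, the sketch below runs a two-sided two-proportion z-test on click-through counts. The choice of test, the metric, and all the counts are assumptions made for illustration; the right test depends on the metric being compared.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value), where z > 0 means group B has the higher rate.
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: users who clicked at least one recommendation.
z, p_value = two_proportion_z_test(
    successes_a=9_500, n_a=95_000,  # control model
    successes_b=560, n_b=5_000,     # target model
)
significant = p_value < 0.05
```

A small p-value alone does not justify a rollout; the size of the gain and its effect on the other monitored metrics matter as well.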

However, it is good practice to roll out the new model to only 98-99% of users and let the remaining 1-2% be served by the control model. This 1-2% of users is called the holdout set. The idea is that if the new model starts degrading at some point, the holdout lets us tell whether the cause is a change that impacts all models or something wrong with the new model alone. In RecSys, a target model served to a small set of users is still trained on logs mostly generated by the control model. However, once the new model becomes the control, it starts learning from logs mostly generated by itself and may degrade.
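One common way to keep a stable holdout is to hash the user id into a bucket, so that a user always lands in the same group without storing any assignment. The sketch below assumes this approach; the salt string, the group names, and the 2% default are invented for the example.

```python
import hashlib


def assign_group(user_id, holdout_percent=2, salt="rollout-example"):
    """Deterministically place a user in the holdout or the new-model group."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    return "holdout_control" if bucket < holdout_percent else "new_model"


groups = [assign_group(f"user-{i}") for i in range(10_000)]
holdout_share = groups.count("holdout_control") / len(groups)
```

Changing the salt reshuffles the buckets, which is useful when a fresh holdout is needed for the next rollout.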

Conclusion

RecSys has many moving parts, and each of these parts is a knob that can be tuned to make the system better. Personally, this is what makes RecSys really interesting to me. I hope the article was able to provide new directions of thinking. Each of these topics has a varying amount of literature for you to explore. I have linked some references below. Make sure to check them out!

References

[1] TwHIN: Embedding the Twitter Heterogeneous Information Network for Personalized Recommendation

[3] Lessons Learned Addressing Dataset Bias in Model-Based Candidate Generation at Twitter
