In this first installment of the series “Real-world MLOps Examples,” Jules Belveze, an MLOps Engineer, will walk you through the model development process at Hypefactors, including the types of models they build, how they design their training pipeline, and other details you may find valuable. Enjoy the chat!
Hypefactors provides an all-in-one media intelligence solution for managing PR and communications, tracking trust, product launches, and market and financial intelligence. They operate large data pipelines that continuously stream in the world’s media data in real time. AI is used to automate many tasks that were previously performed manually.
Could you introduce yourself to our readers?
Hey Stephen, thanks for having me! My name is Jules. I am 26. I was born and raised in Paris, and I am currently living in Copenhagen.
Hey Jules! Thanks for the intro. Walk me through your background and how you got to Hypefactors.
I hold a Bachelor’s in statistics and probability and a Master’s in general engineering from universities in France. On top of that, I also graduated in Data Science with a focus on deep learning from the Technical University of Denmark. I’m fascinated by multilingual natural language processing (and therefore specialized in it). I also researched anomaly detection on high-dimensional time series during my graduate studies with Microsoft.
Today, I work for a media intelligence tech company called Hypefactors, where I develop NLP models to help our users gain insights from the media landscape. What currently works for me is having the opportunity to carry models from prototyping all the way to production. I guess you could call me a nerd, at least that’s how my friends describe me, as I spend most of my free time either coding or listening to disco vinyl.
Model development at Hypefactors
Could you elaborate on the types of models you build at Hypefactors?
Even though we also have computer vision models running in production, we mainly build NLP (Natural Language Processing) models for various use cases. We need to cover multiple countries and handle many languages. The multilingual aspect makes developing with “classical machine learning” approaches hard, so we craft deep learning models on top of the transformers library.
We run all sorts of models in production, varying from span extraction or sequence classification to text generation. Those models are designed to serve different use cases, like topic classification, sentiment analysis, or summarisation.
Could you pick one use case from Hypefactors and walk me through your machine learning workflow end-to-end?
All our machine learning projects tend to follow a similar life cycle. We start an ML project either to improve our users’ experience or to add a meaningful feature for our clients, and we then translate that goal into an ML task.
Let me walk you through the process we followed for our latest addition, a named entity recognition model. We started by crafting a POC (proof of concept) using out-of-the-box models, but due to some drift between our production data and the data the models were fine-tuned on, we had to internally label our data following annotation guidelines that we neatly defined. We then started by designing a relatively simple model and iterated over it until we reached a performance comparable to the SOTA. The model was then optimized for inference and tested under real-life conditions.
Based on the outcome of the QA (quality assurance) session, we iterate on the data (e.g., refining the annotation guidelines) as well as the model (e.g., improving its precision) before deploying it to production. Once deployed, our models are continuously monitored and regularly improved using active learning.
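The interview does not say which active learning strategy is used. As a minimal sketch, assuming entropy-based uncertainty sampling, the snippet below ranks unlabeled documents by how unsure the model is about them, so the most uncertain ones can be sent for annotation first. The function names, document ids, and probabilities are hypothetical and only serve as illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, budget):
    """Return the ids of the `budget` most uncertain examples.

    `predictions` maps an example id to the model's class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda ex_id: entropy(predictions[ex_id]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]


# Hypothetical model outputs for three unlabeled documents.
predictions = {
    "doc-1": [0.98, 0.01, 0.01],  # confident prediction
    "doc-2": [0.40, 0.35, 0.25],  # very uncertain prediction
    "doc-3": [0.70, 0.20, 0.10],
}
to_label = select_for_annotation(predictions, budget=2)
print(to_label)
```

Other selection criteria (margin sampling, disagreement between models) would slot into the same `key` function.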
Could you describe your tool stack for model development?
We use several different tools for model development. We recently migrated our codebase to a combination of PyTorch Lightning and Hydra to reduce boilerplate. The former organizes code into four main components:
1. Data
2. Model
3. Optimization
4. Non-essentials
PyTorch Lightning abstracts away all the boilerplate code and engineering logic. Since its adoption, we have noticed a significant speedup when iterating on models or launching new POCs (proofs of concept).
Additionally, Hydra helps you “elegantly” write configuration files. To help us design and implement neural networks, we heavily rely on the transformers library. For experiment tracking and data versioning, we use neptune.ai, which has smooth integration with Lightning. Finally, we picked Metaflow over other tools to design and run our training pipelines.
How does your NLP use case drive the training pipeline design choices?
Running an end-to-end NLP training pipeline requires a lot of computing power. To me, one of the most arduous tasks in natural language processing is data cleaning. This becomes even more relevant when working with textual data directly extracted from the web or social media. Even though big language models like BERT or GPT are fairly robust, data cleaning is a crucial step as this can directly impact a model’s performance. This implies quite heavy preprocessing and thus the need for parallel computing. Also, fine-tuning pre-trained language models requires running training on hardware optimized for computation (e.g., GPU, TPU, or IPU).
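To make the cleaning step concrete, here is a minimal sketch of what cleaning web-scraped text can look like, using only the Python standard library. The specific rules (unescaping HTML entities, dropping tags and URLs, Unicode normalization, whitespace collapsing) are assumptions chosen for illustration, not the actual Hypefactors preprocessing.

```python
import html
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(raw):
    """Apply a few common cleaning rules to a web-scraped string."""
    text = html.unescape(raw)                   # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)                # drop leftover HTML tags
    text = URL_RE.sub(" ", text)                # drop URLs
    text = unicodedata.normalize("NFKC", text)  # unify Unicode variants
    return WHITESPACE_RE.sub(" ", text).strip() # collapse whitespace


sample = "<p>Caf&eacute; &amp; bar  opens!</p> https://example.com/x"
cleaned = clean_text(sample)
print(cleaned)
```

Because `clean_text` is a pure function of one string, it can be mapped over documents independently, which is what makes this kind of preprocessing easy to parallelize.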
Also, we treat the evaluation of our NLP models differently than “regular” ones. Even though evaluation metrics are quite representative of a model’s performance, one cannot rely on them alone. A good illustration of this problem is the ROUGE score, used for abstractive summarization. Even though the ROUGE score gives a good picture of the n-gram overlap between the generated summary and the reference text, manual inspection is needed to assess semantic and factual accuracy. This makes it really hard to have a fully automated pipeline that does not require any human intervention.
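The limitation of overlap metrics is easy to reproduce. The sketch below is a simplified ROUGE-N recall with whitespace tokenization (an assumption for brevity; it is not the official ROUGE implementation). A candidate summary that swaps a single word, and so states the opposite fact, still gets a high score.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    if not ref:
        return 0.0
    # Clip each match by the number of times the n-gram occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())


reference = "the company reported a profit in the third quarter"
wrong = "the company reported a loss in the third quarter"
score = rouge_n_recall(wrong, reference)
print(round(score, 3))
```

Eight of the nine reference unigrams match, even though the candidate is factually wrong, which is why a human still has to read the summaries.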
What tools do you use for your training pipelines, and what are their main components?
We recently started to design reusable end-to-end training pipelines, mainly to save us time. Our pipelines are built with Netflix’s Metaflow, and they all share the same building blocks.
We first fetch freshly manually annotated data from our labeling tool before processing it. Once processed, the dataset is versioned along with a configuration file.
We also save code and git hashes, making it possible to reproduce the exact same experiment. We then start training the desired model.
At the end of the training, the best weights are saved into an in-house tool and a training report is generated, enabling us to compare this run with previous ones. We finally export our checkpoints to ONNX and optimize the model for inference.
The way our pipelines are designed, anyone with a bit of technical knowledge can either reproduce an experiment or train a new version of an existing model with freshly annotated data or a different configuration.
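As a rough illustration of the versioning step described above, the sketch below fingerprints a dataset together with its configuration and records the current git commit. The record layout, configuration keys, and model name are hypothetical, and in the real pipeline this bookkeeping is handled by the tools named in the interview rather than by hand-written code like this.

```python
import hashlib
import json
import subprocess


def fingerprint(records, config):
    """Deterministic SHA-256 over the dataset records and training config."""
    payload = json.dumps({"records": records, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def current_git_hash():
    """Return the current commit hash, or None if git is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


# Hypothetical annotated record and training configuration.
records = [{"text": "Acme opens office in Oslo", "labels": ["ORG", "LOC"]}]
config = {"model": "example-multilingual-encoder", "lr": 3e-5, "epochs": 3}

run_metadata = {
    "data_version": fingerprint(records, config),
    "git_hash": current_git_hash(),
}
print(run_metadata)
```

Storing these two values with every run is enough to tell whether two experiments used the same data, configuration, and code.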
What kinds of tools are easily available out there and what tools are required to be implemented in-house?
Regarding the modeling aspect, we heavily rely on the transformers library. However, due to the specificity of our use cases (web data and multilingual needs), we craft models on top of it. One of the drawbacks of working with such massive models is that they are hard to scale. There are quite a few tools available to shrink transformer-based models (e.g., DeepSpeed, DeepSparse), but they suffer from base-model limitations. We have implemented an in-house tool that enables us to train various early-exiting architectures and perform model distillation, pruning, or quantization.
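The in-house tool itself is not public, so the following is only a generic sketch of one of the techniques mentioned, model distillation. It computes the standard soft-target loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The logits and the temperature value are made up for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    return kl * temperature ** 2


teacher = [1.0, 2.0, 3.0]
print(distillation_loss([1.0, 2.0, 3.0], teacher))  # student matches teacher
print(distillation_loss([3.0, 2.0, 1.0], teacher))  # student disagrees
```

In practice this term is usually combined with the regular cross-entropy loss on the gold labels and computed on tensors rather than Python lists.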
The experiment tracking and metadata store space features plenty of complete tools which are easy to use, so there was no need for us to reinvent the wheel.
The same goes for ML workflow orchestrators. We actually spent quite some time picking one that was mature enough and for which the learning curve was not too steep. We ended up choosing Metaflow over Kubeflow or MLflow because of its ease of adoption, its available features, and its growing community.
In general, there are a plethora of tools available for all the different building blocks of a machine learning workflow, which might also be overwhelming.
What type of hardware do you use to train your models and do you use any kind of parallel computing?
All our training flows run on machines featuring one or more GPUs, depending on the compute power required for the given task. PyTorch Lightning makes it relatively easy to switch from a single GPU to multiple GPUs and also comes with various backends and distributed modes. NLP tasks require relatively heavy preprocessing. We thus use distributed training through PyTorch’s DDP mode, which uses multiprocessing instead of threading to work around Python’s GIL. Along with this, we try to maximize the use of tensor operations when designing models, to fully leverage the GPUs’ capabilities.
As we only fine-tune models, there has not been a need for us to perform sharded training. However, we occasionally train models on TPUs when we need to iterate fast.
When it comes to data processing, we use “datasets,” a Python library built on top of Apache Arrow, enabling faster I/O operations.
What tool(s) do you wish to see coming out in the near future?
I think every Machine Learning Engineer will agree that what is currently missing is one tool to rule them all. People need at least five or six different tools for the training part alone, which makes a stack hard to maintain as well as to pick up. I really hope we will soon see tools emerge that encompass multiple steps.
Closer to the NLP space, I am seeing more and more people focusing on ensuring annotation quality, but we are still quite limited by the nature of the task. Spotting wrong labels is a difficult task, but a solid tool for it could really be a game-changer. I think most data scientists will agree that data inspection is a really time-consuming task.
Also, an important aspect of the workflow is model testing. It is really tricky in NLP to find relevant metrics guaranteeing the faithfulness of a model. There are a couple of tools popping up (e.g., we started using Microsoft’s “checklist”), but having a wider range of tools in this area could, in my opinion, be interesting.
For each task, our data experts come up with a set of behavioral test cases, from relatively simple to more complex, divided into “test aspects.” We then use checklist to generate a summary of the different tests and compare experiments. The same goes for model explainability.
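To show the idea of behavioral test cases grouped into aspects, here is a small plain-Python sketch. It deliberately does not use the checklist library's API; the test cases, aspect names, and the toy keyword model are all hypothetical.

```python
from collections import defaultdict


def run_behavioral_tests(predict, test_cases):
    """Return the failure rate of `predict` for each test aspect."""
    totals, failures = defaultdict(int), defaultdict(int)
    for case in test_cases:
        totals[case["aspect"]] += 1
        if predict(case["text"]) != case["expected"]:
            failures[case["aspect"]] += 1
    return {aspect: failures[aspect] / totals[aspect] for aspect in totals}


def toy_sentiment(text):
    """Toy keyword model that ignores negation (stands in for a real model)."""
    return "positive" if "good" in text.lower() else "negative"


test_cases = [
    {"aspect": "vocabulary", "text": "The launch was good.", "expected": "positive"},
    {"aspect": "vocabulary", "text": "The launch was bad.", "expected": "negative"},
    {"aspect": "negation", "text": "The launch was not good.", "expected": "negative"},
]
summary = run_behavioral_tests(toy_sentiment, test_cases)
print(summary)
```

A per-aspect summary like this makes it easy to compare two experiments: a model can improve on aggregate metrics while still failing every negation case.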
Thanks to Jules Belveze and the team at Hypefactors for working with us to create this article!