Leveraging Unlabeled Image Data With Self-Supervised Learning or Pseudo Labeling With Mateusz Opala
This article was originally an episode of MLOps Live, an interactive Q&A session where ML practitioners answer questions from other ML practitioners.
Every episode is focused on one specific ML topic, and during this one, we talked to Mateusz Opala about leveraging unlabeled image data with self-supervised learning or pseudo-labeling.
The episode is available on YouTube and as a podcast, but if you prefer a written version, here it is!
You’ll learn about:
- What is pseudo-labeling and self-supervised learning
- Pseudo-labeling applications: image and text data
- Challenges, mistakes, and potential issues while applying SSL or pseudo-labeling
- How to solve overfitting with pseudo-labeling
- How to create and enhance datasets
- MLOps architecture for data processing and training when using pseudo-labeling techniques
- And more!
Sabine: With us today, we have Mateusz Opala, who is going to be answering questions about leveraging unlabeled image data with self-supervised learning or pseudo-labeling. Welcome, Mateusz.
Mateusz Opala: Hello, everyone. Happy to be here.
Sabine: It’s great to have you. Mateusz has held a number of leading machine learning positions at companies like Netguru and Brainly. So, Mateusz, you have a background in computer science, but how did you get more into the machine learning side of things?
Mateusz: It all started during my sophomore year at university. One of my professors told me that Andrew Ng was doing his first iteration of the famous course on machine learning on Coursera. I kind of started from there, then did a bachelor thesis on deep unsupervised learning and went to Siemens to work in deep learning, and then all my positions were strictly about machine learning.
Sabine: You’ve been on that path ever since?
Mateusz: Yes, exactly. I worked for some time before as a backend engineer. But for most of the time in my career, I was a machine learning engineer/data scientist.
What is pseudo-labeling?
Sabine: Mateusz, to warm you up: how would you explain pseudo-labeling to us in one minute?
Mateusz: Let’s try.
- Imagine that we have lots of data, but only a small amount of it is labeled and most of it is unlabeled, and we want to train our favorite neural network, let’s call it ResNet-50.
- Put simply, we train a model on the labeled data, and then with that model, we predict labels on the unlabeled data.
- We use the predicted labels as the targets to calculate the loss on the unlabeled data.
- We combine the losses from labeled and unlabeled data to backpropagate through the network and update the weights. This way, we leverage the unlabeled data in the training regime.
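The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the episode: the function names, the toy logits, and the weight `alpha` on the unlabeled loss are all assumptions, and a real implementation would compute the same quantities inside a deep learning framework, treating the pseudo-label as a fixed target.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-probability assigned to the target class index."""
    return -math.log(probs[target])


def pseudo_label_loss(labeled_logits, labels, unlabeled_logits, alpha):
    """Labeled loss plus alpha times the loss on pseudo-labeled examples."""
    labeled = [cross_entropy(softmax(z), y)
               for z, y in zip(labeled_logits, labels)]
    pseudo = []
    for z in unlabeled_logits:
        probs = softmax(z)
        guess = probs.index(max(probs))  # hard pseudo-label: most likely class
        pseudo.append(cross_entropy(probs, guess))
    labeled_term = sum(labeled) / len(labeled)
    unlabeled_term = sum(pseudo) / len(pseudo) if pseudo else 0.0
    return labeled_term + alpha * unlabeled_term


# Toy logits for a two-class problem (made-up numbers).
loss_warmup = pseudo_label_loss([[2.0, 0.0]], [0], [[0.0, 1.0]], alpha=0.0)
loss_mixed = pseudo_label_loss([[2.0, 0.0]], [0], [[0.0, 1.0]], alpha=1.0)
```

With `alpha=0.0` only the labeled example contributes; raising `alpha` lets the pseudo-labeled example influence the total loss.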
Was it one minute or longer?
Sabine: Nice job. I think that definitely fits inside one minute.
Mateusz: I can give you one analogy to the software development process, how one could think about this.
Let’s say we have a software development team, and there are a few senior engineers and a bunch of mid-junior engineers. Senior engineers produce better code quality, obviously, than juniors or mids, but you can hire just a limited number of senior engineers, and you also want to grow the mid and juniors. So you need to construct a team of both and make it efficient.
If you invest in code reviews, best practices, testing, and automated CI/CD, then junior engineers are also able to deliver code to production.
- You can think of the senior engineers as the labeled data here,
- and the junior engineers as the unlabeled, pseudo-labeled ones.
Investing in code review is like scaling the loss function. At the beginning of training, you need to invest more, so you care more about the labeled data. Once the network starts to make good predictions, you also benefit from the unlabeled data, just as you benefit from the junior and mid-level engineers once your development practices are solid.
Sabine: All right. Thank you for that analogy.
What is self-supervised learning?
Sabine: We do have a community question: what is self-supervised? Mateusz, would you mind giving a bit of a summary?
Mateusz: Sure. Self-supervised learning, I would say, is a subset of unsupervised techniques, where you don’t have labels. The “self” means that you use the input image itself to generate the label. In the case of simple contrastive learning, you take an image and apply two different augmentations to it. You know that both views come from the same image, and that’s your label. If you augment two different images and compare them to each other, then your label is that they’re not the same image.
Basically, you generate the labels from your data. You train as in supervised learning, but instead of annotated labels, the labels are generated from your inputs.
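To make "generating labels from the inputs" concrete, here is a small sketch in plain Python. The `augment` function (brightness jitter plus a little noise on a flat list of pixel values) is a hypothetical stand-in for real image augmentations, and all names are illustrative assumptions.

```python
import random


def augment(image, rng):
    """Hypothetical augmentation: random brightness scale plus small noise."""
    scale = rng.uniform(0.8, 1.2)
    return [min(1.0, max(0.0, px * scale + rng.uniform(-0.05, 0.05)))
            for px in image]


def make_pairs(images, rng):
    """Build (view_a, view_b, label) triples without any human annotation.

    The label is 1 when both views come from the same source image and 0
    when they come from different images.
    """
    views = [(augment(img, rng), augment(img, rng)) for img in images]
    pairs = []
    for i, (view_a, _) in enumerate(views):
        for j, (_, view_b) in enumerate(views):
            pairs.append((view_a, view_b, 1 if i == j else 0))
    return pairs


# Three tiny "images", each a flat list of pixel intensities in [0, 1].
images = [[0.1, 0.2], [0.8, 0.9], [0.5, 0.5]]
pairs = make_pairs(images, random.Random(0))
```

Every image yields one positive pair (its two augmented views), and every combination of views from different images yields a negative pair.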
Pseudo-labeling applications: image and text data
Stephen: Awesome. As you mentioned, you’re currently at Brainly as a senior machine learning engineer. Can you walk us through some of the different use cases where you apply pseudo-labeling for image data at Brainly?
I know Snap to Solve is one of the products that probably uses it. You know, you probably have more ideas.
Mateusz: Yes, sure. Snap to Solve is the feature that my team works on the most. Maybe I’ll briefly explain what it’s about.
Basically, on your mobile phone, you can take a quick photo of the question you would like to have answered. Then, as a user, you can adjust the crop to select the question, and it is routed to either text search or our math solver depending on what’s in the image, and you get the answer you need.
Our team works on projects like:
- understanding what’s in the image and understanding the layout of the question,
- detecting quality issues with the image,
- informing users how they could improve the image they took to get a better answer,
- and routing to the specific services that are needed for the question. For example, if there is math, then instead of just searching through the database, it can be solved directly.
Last year, we had a project called VICE, which was about visual content extraction.
In that project, we wanted to understand the layout of the question. It was simply an object detection model that tried to predict classes for everything that’s visible in the question layout.
The thing is that you always have a limited budget for labeling. Even if you have a strong budget, strong company, the company is not a start-up – there’s always a limit. Not only about the money but also about the time – how much time can you actually wait for the labels.
At Brainly, we have lots of images taken by users, and we’d really like to leverage all that unlabeled data. Also, when you want to start labeling for training purposes, you would like to have a more or less balanced distribution. You would like to have a similar amount of text boxes, table boxes, and so on. But your raw data is usually very imbalanced.
Our first approach to using self-supervised learning was to do some unsupervised or semi-supervised classification to pick the data for labeling, that is, to downsample from all of the images we had. That way, we could label only a small subset for training purposes, and it would still be uniform.
In that project, we built on a paper on simple contrastive learning (SimCLR). On top of that paper, there are two frameworks for unsupervised classification called:
- SPICE
- SCAN
Simple contrastive learning is basically about contrasting two images, one against the other. You take the original image and apply data augmentation, a perturbation of the image. You do two perturbations of the same image. As inputs, you have different images, but you know they’re the same, so you learn the similarity of these images, and as a result, you get good embeddings for that image.
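One common way to turn "learn the similarity of these images" into a loss is a softmax over cosine similarities, where the two views of the same image should score higher than views of other images. The sketch below is an assumption-laden illustration in plain Python: the temperature value of 0.5 and the toy 2-D embeddings are made up, embeddings are assumed to be non-zero vectors, and this is not presented as the exact loss used at Brainly.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """Low when the anchor is closer to its positive than to the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Embeddings of two views of the same image agree; the negative points away.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
# Same vectors with the roles swapped: the "positive" now disagrees.
bad = contrastive_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

Minimizing this loss pulls the two views of an image together in embedding space and pushes other images away, which is what produces useful embeddings.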
Based on those embeddings, and having a very small amount of labeled data, we could train weak classifiers and sample very well to finally obtain good candidates for labeling. That was our team’s first approach to self-supervised learning.
Pseudo-labeling is an interesting case in our situation since, in the original paper, it is the same network that generates the pseudo-labels. We go a slightly different way since, in our case, we sometimes have multimodal input, so we have text and image. But we don’t have text at every stage, so sometimes we need to deal with just the image.
However, when creating datasets and when training, we might reuse the historically available text. We use an NLP-based approach to generate a pseudo-label for the model, which then runs inference in production on the image only.
Stephen: I’d like to come back to the Brainly use case now, to Snap to Solve. I want to know:
- did you try out other techniques before the self-supervised learning technique,
- or did you just know that this particular technique was the one that would work, and then you applied it straight away?
- How does it stack up against the other techniques?
Mateusz: In general, most of the techniques we use are still supervised learning, and we label data, but labeling is limited and time-consuming.
The best use case for us for applying self-supervised learning is when we want to downsample from all of the data we have for labeling. We want to make sure that we have different kinds of data in that labeling, and that we cover all the cases that are interesting for us.
We might not have a 50-50 distribution of handwriting and images of textbooks. In some markets, there might be more handwriting, and in some markets, there might be just a little, but in the end, the training works best if the data also contains handwriting.
When the dataset contains different kinds of data, the model can handle them better, and it generalizes better.
We came up with self-supervised learning for clustering or unsupervised image-classification purposes.
There are also the cases I mentioned where we have both text and images. You can imagine a use case, which is not our real use case, where you have an image with some text on it. Not like a question in Brainly, but, for example, banners from shops: there is the image, and there is text.
Let’s imagine that you have some method to extract text from the image. So your data consists of images and text. The text says that the shop is open 24 hours, and the image actually shows that shop. What we would like to do is generate a pseudo-label for the image based on the text, to understand whether it’s, for example, a shop or a stadium.
We can leverage some NLP model for that. We can fine-tune BERT or anything like that, or do zero-shot learning and so on, to generate the labels. We can treat them as smooth labels and then train the model solely on the images.
Currently, the most interesting thing for us is how we can reuse modalities that are not available during inference to generate the labels, so we don’t need to label everything.
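The shop-versus-stadium example can be sketched as follows. To keep the snippet self-contained, a trivial keyword scorer stands in for the NLP model (BERT fine-tuning or zero-shot classification in the discussion above); the keyword lists, class names, and function names are hypothetical. The point is only the data flow: text produces a soft label, and the image model is trained against that soft label.

```python
import math

CLASSES = ["shop", "stadium"]
# Hypothetical keyword lists standing in for a real NLP model.
KEYWORDS = {
    "shop": {"shop", "store", "open", "hours"},
    "stadium": {"stadium", "match", "tickets"},
}


def text_to_soft_label(text, temperature=1.0):
    """Score each class from the text and normalize into a soft label."""
    tokens = text.lower().split()
    scores = [sum(tok in KEYWORDS[c] for tok in tokens) for c in CLASSES]
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(pred_probs, soft_target):
    """Cross-entropy between image-model probabilities and a soft label."""
    return -sum(t * math.log(p)
                for t, p in zip(soft_target, pred_probs) if t > 0)


# The text is only available at dataset-creation time, not at inference.
label = text_to_soft_label("shop open 24 hours")
loss_if_image_model_says_shop = soft_cross_entropy([0.9, 0.1], label)
loss_if_image_model_says_stadium = soft_cross_entropy([0.1, 0.9], label)
```

At inference time only the image model is needed; the text-based labeler is used purely to build the training targets.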
Stephen: Awesome. Thanks. By the way, if you want to learn how VICE and Snap to Solve work, we did a case study with Brainly. I think that’ll shed more light.
Mateusz, before Brainly, did you have any experience working on pseudo-labeling, and how was that for you? What applications were you using at that time?
Mateusz: I had, actually, just when the paper came out (I think the paper is from 2014). In 2014, I worked in a small startup in Kraków, and we did small projects for small startups.
There was a startup that was doing smart dog collars. The smart dog collar was equipped with sensors like an accelerometer, gyroscope, thermometer, and so on. The goal of our machine learning system was to predict the behavior of the dog – whether the dog is eating, drinking, or running. Later on, we could automatically send some tips to the dog owner, the alert would say there is a high temperature and the dog didn’t drink water for a long time.
Getting the data from the sensors is easy because you just put that dog collar on the dog, but labeling that data is the very difficult part. It’s a funny story how we actually labeled it. There are people whose job is to take a lot of dogs out for walks. We got in touch with these people and went on walks with them and the dogs multiple times, and we just noted that from 2:10 to 2:15, the dog was drinking, and so on.
That’s not really a feasible way to gather a lot of annotations, but it was easy to gather a lot of unlabeled data. Since we suffered very much from overfitting, as far as I remember, we explored the pseudo-labeling angle at the time, and it helped a lot to tackle the overfitting problem for that model.
Sabine: Maciej wanted to get the title or link to the paper that was mentioned.
Mateusz: The original pseudo-labeling paper was by Dong-Hyun Lee. I think it’s from 2013.
Sabine: We actually have a question in chat. How did you choose the image augmentation to train your SSL model? Did you use the one from the paper, or did you experiment to find augmentation that suited your data the best?
Mateusz: I started by exploring the data augmentations from the paper, so exactly that scheme, but I also tried different kinds of augmentations. I remember that the setup differed slightly for us since our domain is really different from ImageNet, so it’s reasonable that something different works.
For example, we don’t do flipping because you shouldn’t flip text, at least not in English. I used NVIDIA DALI for data augmentations on the GPU, and I pretty much explored all the typical augmentations that are in that library. I know that, for example, in Albumentations there are many more to explore, but it’s slower, so usually, I stick to NVIDIA DALI.
Challenges while applying self-supervised learning or pseudo-labeling
Stephen: Speaking of the challenges, what were the challenges you encountered when you were applying self-supervised learning or pseudo-labeling in your applications in Brainly?
Mateusz: Simple contrastive learning requires a lot of data, even 1 million images, and I think it’s not really easy to train that algorithm. Obviously, at Brainly, we have more, and we can train on larger amounts of data, but training also takes a lot of time, and the project has its constraints.
In the end, we found that the embeddings pre-trained with simple contrastive learning weren’t really much better than the ones pre-trained on ImageNet. It was more about the task of choosing the candidates for labeling.
The most important part was actually:
- trying something simple like support vector machines on those pre-trained embeddings,
- and tuning them with a hyperparameter search,
and that worked well for the most difficult cases.
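The idea of a simple classifier on top of fixed embeddings, tuned with a hyperparameter search, can be illustrated with a dependency-free toy. This is a sketch under stated assumptions: a tiny linear SVM trained by sub-gradient descent on the hinge loss stands in for a library implementation, the 2-D "embeddings" are made up, and the only hyperparameter searched is the regularization strength `reg`. In practice one would typically use an established library rather than this hand-rolled version.

```python
import random


def score(model, x):
    """Linear decision value w.x + b."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, reg, epochs=200, lr=0.1, seed=0):
    """Sub-gradient descent on the L2-regularized hinge loss; labels are -1/+1."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * score((w, b), X[i])
            w = [wj * (1.0 - lr * reg) for wj in w]  # shrink: L2 penalty
            if margin < 1.0:  # inside the margin: move the boundary away
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def accuracy(model, X, y):
    hits = sum((1 if score(model, x) >= 0 else -1) == t for x, t in zip(X, y))
    return hits / len(X)


def grid_search(train, val, regs):
    """Keep the regularization strength with the best validation accuracy."""
    best = None
    for reg in regs:
        model = train_linear_svm(train[0], train[1], reg)
        acc = accuracy(model, val[0], val[1])
        if best is None or acc > best[0]:
            best = (acc, reg, model)
    return best


# Toy 2-D "embeddings" for two classes (made-up numbers).
train = ([[2, 2], [3, 1], [2, 3], [-2, -2], [-3, -1], [-2, -3]],
         [1, 1, 1, -1, -1, -1])
val = ([[3, 3], [-3, -3]], [1, -1])
best_acc, best_reg, best_model = grid_search(train, val, [0.001, 0.01, 0.1])
```

The embeddings stay frozen throughout; only the small classifier and its hyperparameter are fitted, which is why this kind of search is cheap compared with training the embedding model itself.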
In general, tuning simple contrastive learning requires, I think:
- a lot of computational power,
- a good way to distribute the algorithm,
- and also pretty big batch sizes, from what I remember from the paper.
They originally trained it on a bunch of TPUs, and the paper is, I think, also from Google. It’s not easy to reproduce everything that was done on TPUs within the limits you are constrained to, for example, on a GPU, in terms of memory size and batch size. These are the challenges I see there.
In terms of pseudo-labeling, it’s kind of different. Usually, you have a very small labeled dataset. If it’s too small to learn an underlying cluster structure that separates the initial examples well, even if noisily, then you’re just adding noise to your data as you increase the unlabeled loss coefficient.
- The first problem could be a small labeled dataset.
- The next one is that when you do pseudo-labeling, your loss function is a weighted combination of the loss from labeled data and the loss from unlabeled data. Usually, you start with zero weight on the unlabeled loss, and you warm up your network on the labeled data. The risk is that you start increasing the weight of the unlabeled loss too fast, for example, before the network actually learns the cluster structure.
Also, neural networks usually show a phenomenon of overconfidence: the predictions are very close to one or very close to zero. When you do pseudo-labeling and the prediction is sometimes incorrect, training on it reinforces that phenomenon and adds even more noise to the data. That’s called confirmation bias, and you need some techniques to tackle it.
Usually, it’s done by applying a mixup strategy, which is a strong data augmentation, combined with label smoothing for regularization purposes, and that’s something that can mitigate confirmation bias.
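Both regularizers mentioned here are short to write down. The sketch below uses plain Python lists; the smoothing factor `epsilon=0.1` and the Beta parameter `alpha=0.4` are illustrative assumptions rather than values from the episode.

```python
import random


def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: move epsilon of the mass to a uniform distribution."""
    k = len(one_hot)
    return [(1.0 - epsilon) * v + epsilon / k for v in one_hot]


def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend two examples and their (soft) labels with the same coefficient."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def sample_lambda(rng, alpha=0.4):
    """Mixing coefficient drawn from a symmetric Beta distribution."""
    return rng.betavariate(alpha, alpha)


smoothed = smooth_labels([1.0, 0.0, 0.0])
mixed_x, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], 0.75)
lam = sample_lambda(random.Random(0))
```

In both cases the training target is no longer a hard 0/1 vector, which discourages the network from pushing its predictions all the way to one or zero.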
Stephen: Awesome. Is this particular application something a small team can apply, or does it require tons of resources? Can you walk us through how tedious this will be for a small team to start applying this, especially when they have smaller datasets, because this is way more relevant when they don’t have Google-sized datasets?
Mateusz: I would say that techniques like simple contrastive learning, and self-supervised techniques in general, usually require:
- a lot of computation,
- a lot of GPUs,
and that’s definitely difficult for a small team or an individual working on something if they have no access to the proper infrastructure.
I don’t think that this technique is the best for small teams; pre-trained models probably still work better.
Also, models that are trained in a self-supervised way are sometimes published, and there is actually a great library from Facebook on self-supervised learning under the MIT license. It’s very easy to reuse, and it’s built on top of PyTorch.
But pseudo-labeling is something that’s very easy to implement, and it can be really useful for fighting overfitting, regularizing your network, and making it work when you have a smaller dataset.
Common mistakes when applying pseudo-labeling
Stephen: Have you seen common mistakes that teams make when trying to apply pseudo-labeling or maybe even trying to apply self-supervised learning techniques for their systems?
Mateusz: A typical problem with pseudo-labeling is when your small amount of labeled data is not enough to satisfy the cluster assumption. That is the assumption that the data is well separated, with the decision boundaries lying in low-density regions.
It’s basically the idea that images that are close to each other and in the same cluster share the same label. If you don’t have enough data to quickly learn the underlying cluster structure, maybe not an optimal one, but good enough for pseudo-labeling, then you end up just adding noise to the data.
Also, you might do everything well, but your initial small dataset might be inconsistent, and inconsistency in labeling is something that hugely influences the quality of pseudo-labeling training.
How to solve overfitting with pseudo-labeling
Stephen: You mentioned earlier about pseudo-labeling being this particular technique to use to overcome overfitting. How did you achieve that in your use case? Can you give us details on the scenario where you were battling overfitting, and then pseudo-labeling came to the rescue?
Mateusz: When it comes to overfitting, my use cases were mostly from past experience: the one with dog collars, and also some NLP use cases.
At Brainly, we currently have one use case where we are exploring the possibility of applying pseudo-labeling. Basically, the reason we are tackling overfitting is that the task we are solving is very subjective to define, and we struggle with labeling consistency. Also, we don’t have a good weak classifier, so we need to handle class imbalance, where we don’t have many images of the class we want to detect.
That’s a great case, actually, for the semi-supervised learning techniques and pseudo-labeling, where we need to leverage all that unlabeled data.
How to create and enhance datasets?
Stephen: Cool. Just zooming into this particular one. At some point, you hit this roadblock, right? What do you do? How do you think about enhancing this technique you’re using, or do you just explore other techniques?
Because you talked about smaller datasets being a major challenge with using pseudo-labeling. How do you enhance the quality of your datasets? Do you consider maybe synthetic datasets? Can you walk us through that?
Mateusz: We try to be creative with how we create datasets. We don’t really need to generate data like images because we have so many images. If we have a label for an image, it’s better for us to search for similar images. We have some pre-trained embeddings for similarity, like the ones from simple contrastive learning. If we find similar images, we can mark them as having the same label. That’s one thing.
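Searching for similar images and copying the label over can be sketched as a nearest-neighbor lookup in embedding space. Everything concrete here is an assumption for illustration: the 2-D embeddings, the class names, and the cosine-similarity threshold of 0.9 below which an image is left unlabeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def propagate_labels(labeled, unlabeled, threshold=0.9):
    """Copy a label to each unlabeled embedding that is close enough.

    labeled:   list of (embedding, label) pairs
    unlabeled: list of embeddings
    Returns a list of (index, label, similarity) for confident matches only.
    """
    matches = []
    for idx, emb in enumerate(unlabeled):
        best_label, best_sim = None, -1.0
        for ref, label in labeled:
            sim = cosine(emb, ref)
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_sim >= threshold:
            matches.append((idx, best_label, best_sim))
    return matches


labeled = [([1.0, 0.0], "handwriting"), ([0.0, 1.0], "textbook")]
unlabeled = [[0.9, 0.1], [0.5, 0.5], [0.05, 1.0]]
matches = propagate_labels(labeled, unlabeled)
```

The ambiguous middle example stays unlabeled, which keeps the propagated labels from adding too much noise.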
The other thing, which I like: usually, people think about data augmentation as the augmentation of images or text, so basically of the inputs, not the targets, right?
A few years ago, I was doing pose detection, and it was also time-consuming to label human poses since you need to label something like 12 body joints. We also struggled with overfitting.
We had the idea that if you label a body joint and then move that label just a couple of pixels, it’s basically the same labeling, since you’re labeling, for example, the whole head with a single point. So we did target augmentation. Similarly, you can think of the data augmentation that we sometimes try to do at Brainly, where we change the input images so that they reflect different targets that we lack.
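Target augmentation for keypoints can be as simple as jittering each annotated joint by a few pixels while keeping it inside the image. The function name, the maximum shift of 2 pixels, and the 64x64 image size below are illustrative assumptions.

```python
import random


def jitter_keypoints(keypoints, max_shift, width, height, rng):
    """Shift each (x, y) joint by up to max_shift pixels, clamped to the image."""
    jittered = []
    for x, y in keypoints:
        new_x = x + rng.randint(-max_shift, max_shift)
        new_y = y + rng.randint(-max_shift, max_shift)
        jittered.append((min(width - 1, max(0, new_x)),
                         min(height - 1, max(0, new_y))))
    return jittered


# Three annotated joints on a 64x64 image (made-up coordinates).
joints = [(10, 20), (0, 0), (63, 63)]
augmented = jitter_keypoints(joints, 2, 64, 64, random.Random(0))
```

The input image is unchanged; only the target moves, so each pass over the data sees a slightly different but equally valid annotation.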
That’s also a way to creatively increase the number of images in your datasets. At the end of the day, it’s best just to label your images. Sometimes, that’s what I do personally. I just label more images to:
- improve my model performance,
- or improve my methods for something,
but it’s important to be very creative in the creation of the dataset.
I believe that the creation of a dataset in a production environment, like in a commercial setting, is very important, even more important than the training.
I think Brainly’s approach to machine learning is very data-centric, and we try to build our software in such a way that if we need to change the dataset, we can rerun everything and quickly have the updated model on the new dataset in production. I really believe that being creative and putting emphasis on dataset creation is very important.
Stephen: Speaking of datasets as well, we spoke earlier about small teams that only have access to small labeled datasets. Of course, there are lots of unlabeled datasets out there, and those are most likely inexpensive to get.
How can they find this balance, especially if it’s very crucial for their use case? They have these small labeled datasets, there’s a large amount of unlabeled data out there, and they have to use this particular technique.
How would you advise that they go about finding that balance and applying pseudo-labeling properly, or even self-supervised learning?
Mateusz: I would advise that you need to consider, obviously:
- What infrastructure do you have?
- How much data can you actually train on?
- What resolution of data does your problem need?
- How much time do you have for that?
- Are you paying for the cloud, or is the machine somewhere in your house, where your only constraints are the size of the GPU and the training time?
When you consider all that, I would just start with the smallest labeled dataset on which the model is actually training something.
That is, it’s not working like flipping a coin, but it’s actually learning. I would try to visualize that as early as possible, to see whether some clusters are indeed forming in the dataset and whether they make sense.
- If they do start making sense, then that’s the point when you can add unlabeled data. In the original setting, training on the labeled and unlabeled data is done simultaneously, but obviously, you can start with only a small amount of labeled data, see whether the model performs at least a bit, see whether the visualization makes sense, and then do two-stage training once you figure out that that’s enough data.
- If your data is not enough, you don’t see any clusters, and it’s not training, then you simply need to label more at the beginning. Once you are there, you can start adding the unlabeled data. You can also restart your training procedure from the beginning and try to do it simultaneously. But even if you’re training simultaneously, you start by training on the labeled part: the coefficient for the unlabeled loss is zero at the beginning, then it increases linearly until it reaches its final value, and you keep training for some time after that.
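The schedule described in the last bullet (zero during warm-up, then a linear increase, then constant) fits in one small function. The parameter names and the example values (100 warm-up steps, 500 ramp steps, final weight 3.0) are assumptions chosen only to illustrate the shape.

```python
def unlabeled_weight(step, warmup_steps, ramp_steps, final_weight):
    """Coefficient on the unlabeled loss at a given training step.

    Zero while warming up on labeled data, then a linear ramp,
    then held constant at final_weight.
    """
    if step < warmup_steps:
        return 0.0
    if step >= warmup_steps + ramp_steps:
        return final_weight
    return final_weight * (step - warmup_steps) / ramp_steps


# Weights at a few steps of a hypothetical run.
schedule = [unlabeled_weight(s, 100, 500, 3.0) for s in (0, 100, 350, 600, 1000)]
```

Lengthening `warmup_steps` or `ramp_steps` is the knob to turn if the unlabeled loss starts dominating before the network has learned a sensible cluster structure.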
Potential issues when applying pseudo labeling
Stephen: Beyond the dataset problem, have you found circumstances where other issues affect the efficacy of pseudo-labeling for your image tasks?
Mateusz: Beyond the dataset problem, I would say the issue typically connected to training is the overconfidence of neural network predictions. That’s something that’s very hard to tackle. That’s the thing with confirmation bias. You can use mixup strategies and so on, but at the end of the day, it’s very difficult.
To understand whether our predictions make sense, we also use explainers like SHAP values or the older LIME, but they don’t always work well with images. Sometimes they do, sometimes they don’t.
As for the overconfidence of neural networks: even if you have good metrics on the test set or the validation set for your task, whether it’s precision, recall, F1, whatever, it’s still not great if you see that your predictions are very overconfident. Something might be wrong there. It definitely influences the ability to reuse pseudo-labels as well.
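One way to put a number on overconfidence, beyond precision or recall, is to compare how confident the model is with how often it is right. The sketch below computes expected calibration error (ECE), a standard calibration metric that is not mentioned in the episode and is brought in here only as an illustration; the bin count and the toy predictions are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Average gap between confidence and accuracy, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(n_bins - 1, int(conf * n_bins))  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        acc = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - acc)
    return ece


# An overconfident model: 99% sure every time, right only half the time.
overconfident = expected_calibration_error([0.99] * 4, [True, True, False, False])
# A calibrated model: 75% sure, right three times out of four.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
```

A large value signals that predicted probabilities should not be trusted as they are, which matters when those predictions are about to be reused as pseudo-labels.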
Stephen: Got you. There’s this particular statement, and I don’t know how common it is, that the cluster assumption is a necessary condition for pseudo-labeling to work. What do you make of that?
Mateusz: The cluster assumption basically says that the data, when classified, should form separate clusters, and that the decision boundary, similar to the SVM scenario, needs to lie in a low-density region.
The original paper, actually, had a very interesting experiment on pseudo-labeling. They trained on the well-known MNIST dataset (some experiments were later reproduced on CIFAR and so on, so it’s not only an MNIST setting). On MNIST, they trained the model and visualized the predictions on a 2D plane, using t-SNE for dimensionality reduction.
The separation of the predictions when the model is trained in a purely supervised manner is not as good as when you use pseudo-labels.
When you use pseudo-labels, the clusters are clearly pushed apart from each other, so there is a clear boundary between the clusters. That shows that the pseudo-labeling loss function is simply a form of entropy regularization, which means that we are trying to decrease the overlap between classes. In the end, when you visualize it, the overlap has indeed decreased, and the clusters of classes are really separated.
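The link to entropy can be seen with a few lines of Python: predictions that sit between classes have high entropy, while predictions that commit to one class have low entropy, so a loss that rewards committing to the predicted class pushes the average entropy down. The probability vectors below are made-up examples.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(batch):
    """Average prediction entropy over a batch of probability vectors."""
    return sum(entropy(p) for p in batch) / len(batch)


# Predictions on unlabeled data before and after the classes separate.
overlapping = [[0.5, 0.5], [0.6, 0.4], [0.45, 0.55]]
separated = [[0.98, 0.02], [0.03, 0.97], [0.99, 0.01]]
before = mean_entropy(overlapping)
after = mean_entropy(separated)
```

Lower mean entropy on the unlabeled data corresponds to less overlap between classes, which is what the t-SNE plots described above show visually.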
Stephen: Perfect. In terms of the biases. When using pseudo-labeling, have you found there are ethical issues with using that? If there are, maybe can you let us know?
Mateusz: I think issues are somehow inherited from the dataset you’re using. I don’t think it’s influenced more by the model or the technique of the model.
If the biases are in the dataset, they will be reproduced by the model. If you want to de-bias your model, you need to de-bias your dataset.
Stephen: Perfect. I believe that quite a lot of pseudo-labeling and self-supervised learning is still in active research, right?
Are there particular situations or scenarios where you actually applied these techniques, and they improved the robustness of your model or your model performance, whether it’s at Brainly or at your previous companies? Because we have teams who say, “Hey look, we could try this out, but we need actual numbers to understand how it helps in real-world production.”
Mateusz: In the typical scenario of pseudo-labeling, where you use the labels from the model being trained, in the case of the dog collar, our model was overfitting so much that it was really not deployable. Even if it had good enough performance, for example, for the classification, the gap between the training set and the validation set was huge, so I wouldn’t trust that model. Pseudo-labeling helped in that the gap became small enough that I could see it wasn’t overfitting anymore.
Maybe the metric wasn’t perfect, but the model wasn’t overfitting, so it started to be deployable. That definitely helped, and that was in the original setting. We used the method from the original paper, which is very easy to implement in any framework, whether you use PyTorch or TensorFlow. There are already lots of improvements on it that deal with confirmation bias and use a mixup strategy.
Also, in the original paper, for the pseudo-labels, they take the arg max of the output of the model. They used hard predictions, and especially in the mixup paper, they show that hard predictions are also one of the reasons for the overconfidence of neural networks. Therefore, there is mixup, or just label smoothing, which helps as a regularizer to tackle overfitting.
MLOps architecture for data processing and training when using pseudo-labeling techniques
Stephen: I want to come back to the compute side of things briefly.
Are there specific architectures that you apply at Brainly when using these techniques in terms of your compute architecture?
Do you use distributed computation, especially in the data augmentation, which I believe is going to be distributable?
How do you set up the architecture for both the data processing, which is a huge deal, as well as the training of the models themselves?
Mateusz: For most of the stuff, we use SageMaker. For experiment tracking, we use Neptune. That’s more on the development side, but we also track things like processing jobs there. We try to track everything so that we don’t miss anything during the creation of a dataset. In terms of computation, we just use SageMaker Estimators and SageMaker Pipelines, and both support multi-GPU instances and even multi-node setups.
We also try to run training on a cluster of instances, where each instance is a multi-GPU instance. We mostly use PyTorch, which has a tool called Torch Distributed (torch.distributed) that we use for running distributed PyTorch training. There is also a native SageMaker way to orchestrate that, and we are currently exploring whether it improves anything.
There is also some work to be done, I think, in terms of optimization. The typical setting is Horovod. In the past, I had some experience with distributed algorithms that work better than Horovod, for example Elastic Averaging SGD, which in some use cases actually gave a super-linear speed-up in training convergence. That’s also worth exploring here, but it requires some custom implementation.
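For readers who have not seen Elastic Averaging SGD, the sketch below shows the synchronous update rule on a toy one-dimensional problem. It is a simplified, single-process simulation, not a distributed implementation and not the code Mateusz refers to. The toy loss, the learning rate `lr`, the elasticity `rho`, and the starting points are assumptions picked so the example converges.

```python
def easgd_step(workers, center, grads, lr=0.1, rho=0.5):
    """One synchronous EASGD round on scalar parameters.

    Each worker follows its own gradient plus an elastic pull toward the
    center variable; the center in turn moves toward the workers.
    """
    new_workers = [
        x - lr * (g + rho * (x - center))
        for x, g in zip(workers, grads)
    ]
    new_center = center + lr * rho * sum(x - center for x in workers)
    return new_workers, new_center

def toy_gradient(x, target=3.0):
    """Gradient of the toy loss 0.5 * (x - target) ** 2."""
    return x - target

workers = [0.0, 10.0, -4.0, 6.0]  # four simulated workers, different starts
center = 0.0
for _ in range(200):
    grads = [toy_gradient(x) for x in workers]
    workers, center = easgd_step(workers, center, grads)
# After these rounds, the workers and the center all sit close to the target 3.0.
```

The elastic term lets each worker explore on its own while staying tied to a shared center, instead of averaging gradients on every step. In a real cluster, each worker would run on its own GPU or node and communicate with the center only periodically.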
Stephen: Can you walk us through the data infrastructure you have in place? Where do you store all your datasets, if that’s disclosable, of course, and how do you actively go about it? You mentioned Nvidia DALI, which is very crucial for augmentation. Is there any other stack around that you could share?
Mateusz: Sure. I think I can do it in a simplified manner. Generally, we use S3 on AWS for storing datasets.
We have built our own internal solution for dataset versioning, since we haven’t found anything in the space that suits us well enough so far. We use that solution to fetch datasets whenever we run a job on SageMaker. We also built some of our own tooling to abstract how jobs are run.
Actually, we have the same commands and the same code for running in the local environment and on EC2 in local mode, when you’re connected via SSH. That’s the perfect setup for a data scientist who works in the cloud: you just have a terminal open, you are connected via SSH, and you have that GPU right in front of you to use. The same code also runs in a more reproducible manner via SageMaker, either as a SageMaker Estimator or as a SageMaker Pipeline when there are multiple steps.
Typically, we run production training as a SageMaker Pipeline, so we can include some preprocessing of images there, or just have the training step push to the model registry, for which we also use SageMaker.
When we push something to the model registry, an automated job evaluates the model’s performance on the holdout set. Then, as a data scientist, you look at the run metrics in Neptune. If the metrics are okay, you go to CodePipeline, approve the model, and it’s pushed to production automatically.
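Brainly’s internal versioning tool is not described in the interview, so the following is not their solution. It is only a minimal sketch of one common idea behind dataset versioning: deriving a version identifier from the dataset’s contents, so that any change to a file yields a new version. The function name and the example file names are hypothetical.

```python
import hashlib

def dataset_version(files):
    """Derive a short version id from file paths and contents.

    files: dict mapping a relative path to the file's bytes.
    The result does not depend on the order in which files are listed.
    """
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator between path and content hash
        digest.update(hashlib.sha256(files[path]).digest())
    return digest.hexdigest()[:12]

# Hypothetical in-memory dataset; a real one would be read from storage.
v1 = dataset_version({"img/a.png": b"pixels-a", "labels.csv": b"a,cat\n"})
v2 = dataset_version({"labels.csv": b"a,cat\n", "img/a.png": b"pixels-a"})
v3 = dataset_version({"img/a.png": b"pixels-a", "labels.csv": b"a,dog\n"})
```

Here `v1` and `v2` are identical because only the listing order differs, while `v3` changes because one label changed. A training job can log such an identifier alongside its metrics so that a run can be traced back to the exact data it used.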
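The evaluation step before approval can be pictured as a simple metric gate. The sketch below is an assumption about what such a check might look like, not Brainly’s actual job: the metric names, the thresholds, and the function name are hypothetical, and the human approval step stays outside the code.

```python
def passes_holdout_gate(metrics, thresholds):
    """Compare holdout metrics against minimum thresholds.

    Returns (ok, failures), where failures lists every metric that is
    missing or below its required minimum.
    """
    failures = [
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    ]
    return len(failures) == 0, failures

# Hypothetical thresholds and a candidate model's holdout results.
thresholds = {"precision": 0.90, "recall": 0.85}
candidate = {"precision": 0.93, "recall": 0.81}
ok, failures = passes_holdout_gate(candidate, thresholds)
# ok is False and failures is ["recall"], so this model would not be approved.
```

A gate like this only filters out clearly bad candidates. As described above, a data scientist still reviews the run metrics before approving the model for production.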
Self-supervised learning: research vs production ML
Stephen: I know this particular field is actively being researched. Are there any things currently being researched in self-supervised learning and pseudo-labeling that you can’t yet take into production, or that you’d like to?
Mateusz: Yes, there are.
In a commercial setting, you are limited because you need to balance risky and less risky things. In self-supervised learning, the thing is that training takes a lot of time and costs a lot, so you cannot just run a grid search over the parameters and train 100 variants of the model, because it’s going to cost like GPT training, $2 million or something like that.
It’s something that you need to work on carefully. But generally, using the self-supervised learning approach is something we definitely want to explore at Brainly since we have lots of data. We know that our domain of images is actually much different from ImageNet and even from other domains.
For example, in the VICE project, when we were doing object detection for the question layout, we tried to reuse labeled data from medical publications, which were already labeled with bounding boxes, and from some mathematical papers as well.
The problem was that this data was actually much different from ours. The solutions trained on that data didn’t work well, and even reusing it to detect something on our data gave really random results. It just shows that deep learning, at the end of the day, is just training some hash maps that work very well on your particular use case.
The biggest MLOps challenge
Sabine: Just to wrap things up here, Mateusz. From your perspective, what would you say is your biggest challenge with MLOps right now?
Mateusz: My biggest challenge right now is connecting all the steps in the whole machine learning model lifecycle.
Lots of my challenges right now are around dataset creation.
- First, the data versioning part: we create a lot of datasets using different techniques, and that’s just one piece of the work.
- For creation, you also need automation. Just as we use SageMaker Pipelines for training, you can use SageMaker Pipelines to automate dataset creation.
- At the same time, there is the labeling. How would I know that I have enough labeled data and that I don’t need to label more on my own or pay freelancers to label more? Automated active learning techniques could also be considered here, as they could be useful in automating dataset creation.
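One simple active learning technique that fits the labeling question above is uncertainty sampling: send the images the model is least sure about to the annotators first. The sketch below ranks samples by predictive entropy. It is a minimal illustration under assumed inputs, with hypothetical sample ids, probabilities, and labeling budget.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` most uncertain samples.

    predictions: dict mapping a sample id to its predicted probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda sample_id: entropy(predictions[sample_id]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical model outputs for three unlabeled images.
predictions = {
    "img_001": [0.98, 0.01, 0.01],  # confident, low labeling value
    "img_002": [0.40, 0.35, 0.25],  # very uncertain
    "img_003": [0.70, 0.20, 0.10],  # somewhat uncertain
}
to_label = select_for_labeling(predictions, budget=2)
```

With these numbers, `to_label` contains `img_002` followed by `img_003`. Tracking how the holdout metric changes as each labeled batch is added is one way to judge whether further labeling is still paying off.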
My current challenges in the machine learning model lifecycle are mostly around data creation. We are pretty well organized when it comes to training, pushing to production, and continuous delivery.
Also, I’m very much a machine learning engineer, but I work more on the data science side. The challenges around datasets are currently the hardest ones I face every day.
But there are also production challenges around detecting when your model starts to perform worse in the absence of labels:
- analyzing the prediction shift,
- and the input shift.
These are also things that I’m currently exploring.
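Prediction shift can be monitored without ground-truth labels by comparing the distribution of predicted classes in production against a reference window. The sketch below uses the Population Stability Index for that comparison. It is one possible approach, not necessarily what Brainly uses, and the class names and counts are made up for the example.

```python
import math
from collections import Counter

def class_frequencies(labels, classes, eps=1e-6):
    """Share of each class among predicted labels, floored at eps to avoid log(0)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: max(counts[c] / total, eps) for c in classes}

def population_stability_index(reference, current, classes):
    """PSI between two sets of predicted labels: sum((cur - ref) * ln(cur / ref))."""
    ref = class_frequencies(reference, classes)
    cur = class_frequencies(current, classes)
    return sum((cur[c] - ref[c]) * math.log(cur[c] / ref[c]) for c in classes)

classes = ["text", "formula"]  # hypothetical output classes
reference = ["text"] * 80 + ["formula"] * 20  # predictions at validation time
current = ["text"] * 50 + ["formula"] * 50    # predictions seen in production
psi = population_stability_index(reference, current, classes)
```

A PSI of zero means the two prediction distributions match, and larger values mean a larger shift. The alert threshold is an assumption that has to be tuned per use case. The same comparison can be applied to binned input features to watch for input shift.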
Sabine: I’m sure you’re not going to run out of challenges to solve anytime soon.
Mateusz: Yes, I’m not.
Sabine: Mateusz, there’s one final bonus question. Who in the world of MLOps would you like to take to lunch?
Mateusz: I think there are plenty of interesting people in that world. Maybe I would point to Matei Zaharia, the CTO of Databricks. They are behind MLflow and Spark, which are pretty interesting solutions.
Sabine: Excellent. How can people follow what you’re doing and connect with you online?
Mateusz: I think it’s best to connect with me on LinkedIn and Twitter. On both, my handle is just Mateusz Opala. That’s the best way to reach me on social media.