MLOps Blog

Building Visual Search Engines with Kuba Cieślik

17 min
19th December, 2023

This article was originally an episode of MLOps Live, an interactive Q&A session where ML practitioners answer questions from other ML practitioners.

Every episode is focused on one specific ML topic, and during this one, we talked to Kuba Cieślik, founder and AI Engineer at tuul.ai, about building visual search engines. 

You can watch it on YouTube:

Or listen to it as a podcast on:

But if you prefer a written version, here it is! 

You’ll learn about:

  • What is visual search?
  • How is visual search different from OCR?
  • Things to keep in mind while designing visual search engines
  • Importance of embeddings in visual search
  • Evaluation of visual search engines
  • Integrating visual search engines with other products
  • Basic tool stack for building visual search engines
  • Scaling visual search systems
  • … and more!

All right, let’s get to it.

Sabine: Hello, everybody. Welcome to MLOps Live. My name is Sabine, and I’m joined by my co-host, Stephen. This is an interactive Q&A session with our guest today, who’s an expert in building visual search engines. To get started, it is my pleasure to introduce you to our guest, machine learning and data science engineer Kuba Cieślik. Welcome, Kuba. Would you like to introduce yourself with just a few words?

Kuba: Sure. Hello, everyone. It’s nice to participate in this event. I’m a machine learning engineer with some years of experience in building ML products and ML solutions. As it happens, for the last one or two years, I’ve specialized in building visual search solutions, and in general, computer vision is my main topic of interest in this field.

Sabine: What kinds of things are you working on right now?

Kuba: Currently, I’m working on a pretty interesting topic, at least, it sparks some interest whenever I tell people about it. I’m working on face ID for animals, essentially. You can imagine using your phone, and it recognizes you. All mobile phones now have this functionality where they can recognize you based on the front camera feed, and we are essentially trying to build a similar thing for animals, where tracking animals is becoming more important.

There are certain situations where using other means of technology, like NFC tags or writing things down, is not practical for whatever reason. We have the technology now, or we hope to advance it to the level where we will be able to identify animals on an individual level from a single picture or multiple pictures.

Sabine: All right, very cool. Not to put you on the spot here, but we would love to hear you explain visual search in one minute if possible. It’s a bit of a challenge.

Kuba: For me, I would think of this problem as any search problem.

Whenever we think about searching for something, we want to retrieve information relevant to us, and usually, that’s done via text, at least, that’s how we got used to it. But essentially, visual search is just a subset of this problem that uses specifically visual information. It can be a video, it can be an image, it can be a drawing, whatever it may be, that for humans is a visual thing. That’s pretty much the problem.

Sabine: All right. I think that did fit inside one minute. Good job there.

Visual search engines use cases

Stephen: It’s interesting, Kuba, your work in visual search and search engines. Just at a high level, what are some examples of visual search engines that people typically use today without knowing what the technology behind them is?

Kuba: I think this technology is popping up more and more now, and it also keeps getting better. I think the prime example of this technology that gives amazing results is actually Google Lens, which is the most used and probably most advanced technology for that, and of course, it is a very important sales and marketing channel. You can essentially point at anything and take a picture of it, and Google will try to find it and redirect you to places where you can usually buy it. That’s, of course, a very important reason why this is being developed, but there are other things.

There are definitely all those visual search engines. Pinterest is very good at this, too; they essentially built a product around this immersive feeling of just going and looking. You don’t have to type much, you only have to click whatever is appealing to you, and then it tries to suggest things that you might want, because, in the end, the intent is very tricky.

When you search with an image, it’s very hard to get the intent. Are you interested in patterns? Are you interested in this object itself, or maybe it’s something completely different? Usually, in order to make good searches with visual searches, you often also need some context on what a user wants.

Of course, there are things at the other end of the spectrum, like facial identification, which are completely different. There you essentially have only one valid answer, you have a verification or identification problem. It sits on another part of the problem spectrum, but the technology used to solve it is actually quite similar.

How is visual search different from OCR?

Stephen: Okay. Still at a high level, the way I think about visual search engines, I think of them more as a combination of different technologies, and that could include OCR technology and others like it. Am I wrong in thinking this way, or what are the different technologies that can actually make up a visual search engine?

Kuba: I’m not sure if I get what you mean, but in terms of combining multiple things, often what you want is to actually use multimodal information: you also use the text that is provided as context information for the search, alongside the visual part itself. I’m not sure, maybe you can elaborate more on that question.

Stephen: Okay. What I mean, for example, let’s take Google Lens into account. When I think of Google Lens, as a beginner, of course, I would assume there is some OCR-type technology happening in the background. Would you say that to build a proper visual search engine, there has to be OCR tech in there, and then there has to be some of the visual tech in there?

Kuba: I think OCR is important to many solutions. For instance, maybe some of you know there’s an app called Vivino for wine label scanning. You could approach this both ways: you could probably solve this problem of matching a picture of a bottle of wine with the actual bottle you want to find, and you could help yourself with OCR technology, so actually reading the text that is on the bottle, and that could help you, right?

It probably would if there is some text, but of course, there are labels without text, and having a combined solution, or fallback solutions, is quite important most of the time. That’s what I meant when I said that it’s rarely a purely visual problem; usually, it’s a visual-plus-something problem in most cases, except for those identification and verification problems, where you actually don’t have any other context.

Stephen: All right. Just to be clear, what distinguishes the visual search tech from the OCR tech, is it just understanding the context around the image?

Kuba: Yes. Well, it’s to actually look at the image, because OCR is just used for finding text and then trying to do something with that text. When we train networks to do visual understanding, we want more than that: we want to find features and connections in the images that are not that clear and are not possible to describe with text. More importantly, we don’t have text for most of the images we have in the world.

Learn more

Building Deep Learning-Based OCR Model: Lessons Learned

What is triplet network training?

Sabine: Okay, so I think we have room for a question from the community. This is new to me. These triplet networks: what is your go-to baseline for triplet network training? What might one want to look out for when using it?

Kuba: Triplet training is a special type of metric learning. It’s a part of metric learning where we try to learn a metric function, and this usually goes by means of comparing examples. For instance, for faces, we could train such models by showing a network two pictures of the same person and a third picture of another person, and then the model learns to keep images of the same person close together in some embedding space and to push the other ones farther away.

There are actually a lot of good frameworks for that. The one that comes to my mind, especially in PyTorch, which I use the most, is PyTorch Metric Learning, and it actually implements most of the novelties, the new papers and new loss functions, whatever you need. I really recommend it, it’s quite good. Maybe we’ll add a link to it later on.
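If you want a concrete starting point, here is a minimal sketch of a single training step with the PyTorch Metric Learning library mentioned above. The backbone, embedding size, margin, and optimizer settings are illustrative placeholders, not recommendations from the episode:

```python
import torch
import torchvision
from pytorch_metric_learning import losses, miners

# Placeholder backbone: any network that maps images to embedding vectors works.
model = torchvision.models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 128)  # 128-d embeddings

loss_func = losses.TripletMarginLoss(margin=0.2)
miner = miners.TripletMarginMiner(margin=0.2, type_of_triplets="semihard")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(images, labels):
    """Embed a batch, mine informative triplets, and apply the triplet loss."""
    embeddings = model(images)                      # (batch, 128)
    triplets = miner(embeddings, labels)            # (anchor, positive, negative) indices
    loss = loss_func(embeddings, labels, triplets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```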

Might interest you

Implementing Content-Based Image Retrieval With Siamese Networks in PyTorch

Designing visual search engines: things to keep in mind

Sabine: Do you have any tips for setting that up for the visual searches, like a checklist or something?

Kuba: Yes, a checklist. I would really figure out what the problem actually is, what is being solved, and how to set it up, because some things we might want to learn with triplet networks, and some things we might need to learn in a purely unsupervised fashion or from ranking labels. In the end, it boils down to the problem of learning meaningful features from images, and what those features actually should be is very problem-specific, right?

A lot of the time, search engines are presented as: just pass some images through a pre-trained network, and the features coming out of it will cluster your data. That’s true, but whether it clusters the way you think it should is another story, right? For instance, if the network was trained on 1000 classes, it is pretty good at grouping different types of objects, but it will not be very good at distinguishing items within one type. So a model that uses this pre-trained network might not be the greatest one for, say, ranking or finding similar dresses, because it never really understood the concept of different types of dresses.

It might still be quite useful, but most of the time, for the additional information we want to gain, we need new data and at least additional labels. That’s the tricky part, and that would be the main consideration, but of course, if we knew more about a specific problem, we could say a lot more.
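For reference, the “just pass images through a pre-trained network” baseline discussed above can look like the following sketch, using a torchvision ImageNet model as a feature extractor. Whether these generic features separate your items well enough is exactly the question raised here:

```python
import torch
import torchvision
from torchvision.models import ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = torchvision.models.resnet50(weights=weights)
model.fc = torch.nn.Identity()          # drop the 1000-class head, keep 2048-d features
model.eval()

preprocess = weights.transforms()       # the resize/crop/normalize this backbone expects

@torch.no_grad()
def embed(pil_image):
    """Return a generic ImageNet feature vector for one image."""
    x = preprocess(pil_image).unsqueeze(0)   # (1, 3, 224, 224)
    return model(x).squeeze(0)               # (2048,)
```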

Sabine: But you’d still fine-tune the pre-trained network on new data, right?

Kuba: Yes, certainly, yes. Using a pre-trained network is almost always the way to go, right? Instead of going from scratch, so that’s for sure.

Stephen: I would just love to chime in there, Kuba. I think embeddings are really, really important in this space, of course, while building visual search engines. In your opinion, what is the importance of good and topic-relevant embeddings when dealing with visual search engine technology?

Kuba: That builds a bit on what I said before, about it being problem-specific. For some very generic search engines, it might not seem so, but if we want to build very specific solutions, and I think that’s how this is being used now, we need to focus. For instance, if we want good recommendations for fashion, then we need to focus on that and maybe even focus on a specific garment type to get it right.

The same goes, for instance, in my work: if you just use pre-trained networks to do embeddings of cattle faces, it just doesn’t work at all, because there’s no reason why such a network would ever learn what distinguishes one animal from another, and the same goes for a human face. It’s like the Face ID technology that works so well now.

It only works because it was trained on a billion faces, so that’s why it’s relevant. It’s not like you have a magic way of creating an embedding that will work well in various scenarios, you need to control it somehow.

Stephen: Yes, just following up on this real quick: those are situations where you have labels, you’re training on faces, you have the labels. I think there are quite a number of use cases where you don’t have access to as many labels as you’d want. How would you train a very good embedding in such cases where you don’t have labels to work with?

Kuba: Yes, without labels. That’s a good question. Without labels, of course, you have limited options, right? Let’s say you have a use case that is very different from the images normally found in public data sets, for instance, something from a factory, maybe some use case where there’s a limited amount of data. Then I think the current approaches, especially from the growing field of self-supervised learning, are very helpful here.

You can actually improve and train embeddings in a self-supervised way. Essentially, you only need the data, which can be collected, and then there are different ways in which these networks work, different flavors of self-supervised learning. The most intuitive methods are not being used that much anymore, but I think for explanation purposes, they’re better.

For instance, you train a network on images with a gap cut out and ask it, “What’s in the gap?” Or you let the network colorize an image, which is also something you can self-supervise and hopefully learn something from. But the newer approaches use a bit more elaborate techniques. For instance, you essentially work with pairs of images, one of which is distorted a bit with augmentations, and then you try to learn embeddings that are the same for images that are technically different.

That’s very challenging for the network because it has to understand the content of the image. Training with such embeddings and using them will, for sure, improve the scores on whatever your problem is. But there are downsides as well. Not many people use it on a large scale; these self-supervised networks are usually hard to train and often don’t converge. There’s a lot of trickery going on with them, and I would say it’s more of a last resort.
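To make the augmented-pairs idea concrete, here is a minimal SimCLR-style sketch. The augmentations and temperature are illustrative assumptions; real self-supervised setups need much more care, as noted above:

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Two random augmentations of the same image form a positive pair.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss: each view should be closest to its augmented twin,
    relative to everything else in the batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                 # (2n, d)
    sim = z @ z.t() / temperature                  # pairwise similarities
    sim.fill_diagonal_(float("-inf"))              # never match an image with itself
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # index of each twin
    return F.cross_entropy(sim, targets)
```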

Read also

The Ultimate Guide to Word Embeddings

Training, Visualizing, and Understanding Word Embeddings: Deep Dive Into Custom Datasets

Evaluation of visual search engines

Stephen: It does really seem like a tricky problem to solve. What is the evaluation process? How do you evaluate that this particular model you’ve trained actually meets the requirements of the problem?

Kuba: Yes, well, I think it depends on what problem you have. If it’s a ranking problem, then, of course, you have some ways of assigning scores to what is happening. For instance, I would assume visual searches such as Pinterest or Google Lens can certainly label it and have some ground truth: they know what the user clicked. In fields like identification and re-identification, we usually just use top-1 accuracy. It’s the most important factor because it doesn’t help if you’re one off; in some cases that might be okay, but most of the time it is not. That would be the important thing.
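For the identification case, top-1 retrieval accuracy takes only a few lines to compute. A minimal sketch, assuming you already have query and gallery embeddings plus their identity labels:

```python
import numpy as np

def top1_accuracy(query_emb, query_ids, gallery_emb, gallery_ids):
    """Fraction of queries whose nearest gallery embedding shares the same identity."""
    query_ids, gallery_ids = np.asarray(query_ids), np.asarray(gallery_ids)
    # Cosine similarity: normalize, then dot product.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    nearest = (q @ g.T).argmax(axis=1)          # index of best match per query
    return float(np.mean(gallery_ids[nearest] == query_ids))
```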

Integrating visual search engines with other products

Sabine: Lev wants to know, does Kuba think that integrating visual searches like Google Lens with other apps and products in an intuitive way has happened properly yet? Any thoughts?

Kuba: Integrating things like Google Lens and making a product out of it, I think a lot of companies definitely try to improve their shop experience this way, and that’s certainly happening in fashion. Even big fashion companies such as H&M have it in their apps. Of course, they are not integrating with Google Lens, they have in-built solutions for this problem, but they have solutions for you to scan something and query their inventory, essentially. That’s definitely happening, and it seems like a very valid use case with high monetary value.

Issues with visual search engines in production

Sabine: What about when you take visual search to production? What should one be aware of, and what are some common problems and issues that you face there?

Kuba: Definitely, one of the big problems is that you need to be aware that changing the models can be quite costly, I think a bit more costly than it usually is with normal machine learning. With normal machine learning, if you change the model, let’s say for recommendations, then for future recommendations, you use the new model, and you don’t necessarily have to change or recalculate anything for the past.

For visual search, every time you want to use a new model, you have to recalculate all the embeddings for all the images. Imagine doing this at a big scale; how, for instance, Google does it for their engines, how often they do it, or whether they do it at all, is very hard for me to tell. That’s definitely something you need to be aware of.

If you change the model, it will change the embeddings, and then they become incomparable most of the time. It’s a critical step. Then, of course, there’s the big issue, especially in large-scale apps, of database size. If you’re querying through a million pictures, maybe that’s not a big deal, but querying through tens of millions might become a problem and might get expensive quickly.

There are multiple ways to narrow it down. You usually don’t need or want to compare against everything. I’m sure that if I use Google Lens, there’s a special filter applied to me, based on all the information they already know about me, so that they don’t have to scan against all the pictures. Those are definitely the two things that are super critical.

Building a visual search engine for fashion

Sabine: Do you have some experience with building visual search engines for fashion yourself?

Kuba: Yes. I worked on some solutions where we tried to mine information from Instagram: essentially, understand and build a visual search engine on data that we mined from Instagram. This was a very interesting use case because Instagram is mostly non-searchable. You can do a basic search by tags, and that’s it. Even basic text search doesn’t really work if you want to search through millions or thousands of Instagram profiles. We mined this data and then ran it through some pipelines that enabled visual search on top of it.

For instance, if you wanted to query the database with a certain dress, with an image of a dress, we would like to see whoever posted something that fits it. That’s, of course, important for various reasons, for people responsible for product marketing, buying, and pricing. There are a lot of use cases where social media is important.

We do this in other fields, too. Social media monitoring has been a thing for a while now, and social media monitoring for fashion is also a thing now. There are companies doing only that, for instance.

Sabine: We have a question about this fashion project in the chat from Oluwa Sim. Could you please explain the processes you used for the deployment of that fashion project using visual search? Can you tell us more?

Kuba: Roughly described, there were two components. One part was doing the mining and scraping of data and then ingesting it into some databases; the relevant component now is the indexing part. Whenever a new picture came up, we actually ran detections to detect certain garment types on the image, because if you think about a fashion picture on Instagram, it will usually have multiple items in it. It’s very hard to build systems that have this global understanding of fashion; we don’t have that.

Of course, there are companies working on recommending a whole outfit or whether one item goes with another, but this is a bit of a different problem. In our case, we had models doing detection and segmentation of the items first of all. The items could be pants, dresses, hats, T-shirts, whatever. In the end, there are not so many types of garments; I think around 18 classes covered a very large share of garments.

Then these were extracted from the image. You can essentially think of it as one image getting split into N images. Then you need to keep track of and keep storing those segmented parts. You also need to store the embeddings for those separate items because you never know how you will end up retrieving them. Maybe you will just run queries against a certain type.

For instance, if someone wants to find a similar dress in all those pictures, then you select that you’re looking for dresses, and you only look at embeddings in that index. We can narrow it down, but maybe not; maybe it could be a broader search. Essentially, as you see, once you’ve mined one picture, it might later become even more data than that one picture. That’s something to be aware of.

Of course, there are different strategies for that too. You could combine embeddings from the different items to get a global descriptor. In the end, this is the process of creating a data lake, but for images. You don’t know yet how you will use it, you want easy access to it, and you don’t want to recompute, because running segmentation and detection models is actually one of the most expensive steps in the pipeline. You do it once, and then you think later about what solutions will work.
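A rough sketch of that indexing flow, one picture in, N garment items out. The detector and embedder interfaces here are hypothetical placeholders standing in for whatever models were actually used:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class IndexedItem:
    source_image_id: str
    garment_type: str        # e.g. "dress", "pants", "hat"
    crop: np.ndarray         # the segmented garment region
    embedding: np.ndarray    # feature vector for similarity search

def index_image(image_id, image, detector, embedder):
    """Turn one mined picture into N indexed garment items."""
    items = []
    for det in detector(image):                       # hypothetical detection API
        crop = image[det.y0:det.y1, det.x0:det.x1]    # cut out one garment
        items.append(IndexedItem(
            source_image_id=image_id,
            garment_type=det.label,
            crop=crop,
            embedding=embedder(crop),
        ))
    return items   # stored per garment_type, so queries can be narrowed to one index
```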

Stephen: Still on the fashion industry use case, I think Andrea asks a brilliant question here on YouTube: from the software engineering point of view, how did you approach crawling the different images from Instagram into the training set, and what was that process like for you?

Kuba: Yes, that’s a good question. That project was done quite a while ago, so I’m not sure if this is still valid, but I will be straight about it: Instagram, of course, doesn’t like crawling and mining, and they will fight you back. Usually, the simplest approach is just to use a pool of, how is it called? HTTP proxies. You essentially rent a service that proxies your requests through different places in the world.

From an engineering perspective, that’s actually quite easy because they give you one endpoint that you redirect all the traffic through, and usually, those services do it quite well. You make one request for a specific user, and they redirect this request through a server in China or in whatever country, and they keep doing this, keep balancing, and keep using different servers.

That kind of works unless there’s a complete closure of Instagram’s API. I guess it still works, right? The only thing that doesn’t work is private accounts that you can’t access without logging in; if you need to log in, then you have a problem. Our intuition was that the accounts we wanted data from are open because they belong to companies or big influencers, and they don’t want to have their accounts private.
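Mechanically, the proxy setup is just pointing your HTTP client at the provider’s single gateway. A minimal sketch with the requests library; the endpoint and credentials are placeholders, not a real provider:

```python
import requests

# Placeholder gateway: rotating-proxy providers typically expose one endpoint
# and rotate the exit server behind it on every request.
proxies = {
    "http":  "http://USER:PASSWORD@proxy.example.com:8000",
    "https": "http://USER:PASSWORD@proxy.example.com:8000",
}

response = requests.get(
    "https://www.instagram.com/some_public_profile/",
    proxies=proxies,
    timeout=30,
)
print(response.status_code)
```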

Biggest challenges while training visual search models

Stephen: Awesome. I think we have a really good question from the community, which I’ve also had in mind, and that’s from Mashhad. I hope I pronounced your name right. It’s a really good question: what are the biggest challenges you have in training models for visual search? Are there some new techniques that you use to combat them? What do you look out for in terms of compute, the techniques you use, and so forth?

Kuba: I see your question.

I don’t know if I can answer this because it’s a very, very broad question, but I think the most important thing is, as usual, like in almost all computer vision and machine learning, the data, right? Getting the data right is usually the biggest challenge. If you are collecting data, then, of course, that’s one problem.

If you already have data, then labeling it in the right way, whether it’s semi-automatic labeling or real labeling by human labelers, is the critical step that you really need to think about a lot, because it can be very costly if you do it wrong. That’s the best advice I can give, because the training, later on, depends on which models we’re talking about.

For instance, triplet, quadruplet, or Siamese types of networks are relatively tricky to train because they have the problem of sampling the triplets in the right way, and it grows into a very big problem the more classes you have in the data set. For decently sized data sets used for learning good embeddings, the best you can choose from are essentially the networks that are being used for facial recognition and face ID, so DeepFace, ArcFace, CosFace.

These are networks that are trained like a classification network, but the loss function makes you learn embeddings that are separated by a bigger margin than with regular cross-entropy classification. They work very well, and you can see that, for instance, by checking the recent landmark identification competitions on Kaggle, where they were also used, and the data sets in those competitions are enormous.

The basis was essentially the same technology that was used for facial identification. Of course, as usual on Kaggle, there was a lot of trickery going on on top of that, but the base was facial identification technology applied to this problem of landmark matching, something that, from the outside, seems like a completely different problem, but the solutions are the same and work really, really well. Those ArcFace and CosFace networks are actually relatively easy to train, not as easy as cross-entropy classification, but close to it.
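These margin-based losses are also available off the shelf. A minimal sketch using ArcFaceLoss from the PyTorch Metric Learning library; the class count, embedding size, and hyperparameters are illustrative, not values from the episode:

```python
import torch
from pytorch_metric_learning import losses

num_classes, embedding_size = 10_000, 512   # e.g. 10k identities or landmarks
arcface = losses.ArcFaceLoss(num_classes=num_classes,
                             embedding_size=embedding_size,
                             margin=28.6, scale=64)   # library defaults

# The loss has its own class-weight matrix, so it needs an optimizer as well.
loss_optimizer = torch.optim.SGD(arcface.parameters(), lr=0.01)

def arcface_step(model, optimizer, images, labels):
    """One step of margin-based classification training on embeddings."""
    embeddings = model(images)               # (batch, embedding_size)
    loss = arcface(embeddings, labels)       # large-margin softmax over classes
    optimizer.zero_grad()
    loss_optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    loss_optimizer.step()
    return loss.item()
```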

Check also

Building and Deploying CV Models [Lessons Learned From Computer Vision Engineer]

Stephen: I just have two follow-ups on that, and maybe you can correct me if I’m wrong. How do you select, say, your training architecture, in terms of how you decide you’re going to train, because I assume that you deal with an enormous amount of data? When you’re dealing with large data sets, how do you determine what that training process will look like?

Kuba: I think the limitation when it comes to training, especially when we talk about ArcFace or classification-like networks, is, in the end, the number of labels that you have. Let’s say you have a million IDs, people, or landmarks in your data set. That means that the last layers of the network are getting really big, and there’s not very much you can do about it. If the model doesn’t fit on one device, you split it across multiple devices, and you still keep training it.

That’s what people do if they have to train on millions of IDs, which is already quite significant, I would say. But for the other networks, if you go more with the metric learning approach and train with triplets, then there are no such limitations. The biggest limitation will then be the batch size, which doesn’t have to be crazy big, and you don’t have such a big model. That’s my way of looking at it.

Stephen: We have a few questions from the community. I would just love to ask, are there common transformation methods you use a lot more than others in this space?

Kuba: The transformations and what?

Stephen: In terms of your data augmentation processes and so forth.

Kuba: I don’t think there’s really a rule for that. What you definitely see shining nowadays, for instance on Kaggle, and in my experience as well, is that those cut-out methods got more attention than they ever did. They’re also quite new, essentially cutting out parts of the images, and they’re being used quite a lot, but other than that, augmentation always depends on the data you have.

For instance, with the animal pictures that I use, I need to be very careful about which transformations make sense and which don’t. If you have huge amounts of data, then you also know you can use less augmentation. For instance, for the landmark case, if anyone is interested in what’s new in visual search and matching, those competitions are a really good resource to learn from, and you see that the augmentation in the last visual landmark matching competition was very small. They used very basic, very low-degree augmentations.

Vector similarity tools and solutions

Stephen: Just as we are looking to wrap this up, Kuba, what is the current landscape of vector similarity tools and solutions we should look out for? And what’s the research angle? How is research going in that area currently?

Kuba: For vector similarity, there’s a bunch of solutions now, some of them paid, some free and open source. They all have their quirks, but some of them are built on top of already existing databases. What I would recommend for anyone starting with this is that if you’re not reaching some gigantic scale, like millions of vectors that you need to search through, go with the solutions that use a database you most likely already have, for instance, Postgres.

There are definitely engines built on Postgres; they’re not official, of course, but there are services that use a database you already have as the persistence layer. I would use them if it fits with what you already have in the stack, right? Because it’s always nice to have those embeddings stored in the same place as your other data, since it gives you more options for querying.
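If Postgres is already in your stack, an extension such as pgvector (named here as one example of a Postgres-based engine, not necessarily the one the speaker had in mind) keeps embeddings next to the rest of your data. A minimal sketch with psycopg2; the table and connection string are placeholders:

```python
import psycopg2

conn = psycopg2.connect("dbname=catalog")   # placeholder connection string
cur = conn.cursor()

# One-time setup: pgvector extension and a table holding 512-d embeddings.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS items (
        id        bigserial PRIMARY KEY,
        image_url text,
        embedding vector(512)
    );
""")
conn.commit()

def nearest_items(query_embedding, k=10):
    """Return the k items closest to the query embedding (L2 distance)."""
    vec = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    cur.execute(
        "SELECT id, image_url FROM items ORDER BY embedding <-> %s::vector LIMIT %s;",
        (vec, k),
    )
    return cur.fetchall()
```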

If you have something really big and you know you will grow, then you can buy managed services for that and pay for them. Then you usually pay for each index and for the amount of memory it takes, because most of them will be running from RAM. You need big instances for big indexes.

Basic tool stack for building visual search engines

Stephen: The build vs. buy solution is one of the talking points in the software industry as well. Just talking about your day-to-day workflow, what’s the basic tool stack you use? What is the technology you use when working on building a visual search engine?

Kuba: I use Python and PyTorch, and for similarity, I think I’ve tried most of the big providers now; many of them are quite nice. From the managed solutions, Pinecone is also quite a good choice. Besides that, for day-to-day stuff, Python, PyTorch, and FastAPI help build most of the things I am currently building.

Explore more tools

The Best MLOps Tools and How to Evaluate Them

Best MLOps Tools For Your Computer Vision Project Pipeline

Scaling visual search systems

Stephen: I think another issue that’s regularly discussed is building stuff that scales to whatever problem we are solving. In terms of moving from a baseline system that works to something that scales to a lot of users, maybe not Google, Yahoo, or Bing scale, but something that works for real users: how do you transition from just building that proof of concept to something that scales?

Kuba: Let’s assume that the proof of concept is a simple thing that you hack together in a notebook: your index of embeddings is actually in memory in NumPy, and you just do a dot product with a query, and that’s it. That can be your very, very basic baseline. Then, as you move on, you need to think about things like how you will keep working on the model itself, and you need to keep thinking about persisting and updating embeddings.
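That baseline really is only a handful of lines. A sketch of the in-memory NumPy index described here:

```python
import numpy as np

class InMemoryIndex:
    """Toy baseline: all embeddings live in one NumPy array in RAM."""

    def __init__(self, embeddings, ids):
        # Normalize once so a plain dot product equals cosine similarity.
        self.embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.ids = np.asarray(ids)

    def search(self, query, k=5):
        q = query / np.linalg.norm(query)
        scores = self.embeddings @ q            # one dot product against everything
        top = np.argsort(-scores)[:k]           # indices of the k best matches
        return list(zip(self.ids[top], scores[top]))
```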

Whenever you ship a new model, you ultimately need to rerun the pipelines that you have, and of course, if you reach a very big scale, then you need to think about all the scalability issues of this problem. But there are solutions, managed or not, that help you with that. For instance, there are vector search engines that you can just deploy as a Helm chart, and you’re almost ready to go. Then you have your own service to integrate with, and that’s, of course, one way of doing it.

But then I think the scale, the query volumes, and the size of the data sets are probably even more important because they will determine how much you need to recalculate everything and how often.

Other questions

Sabine: We have a question in the chat from Lev, who would like to know how Kuba feels Bing’s visual search matches up with Google Lens, if you’re familiar with it?

Kuba: That’s a good question. I honestly don’t know. I think Bing’s has been around for a while, but the Google one, for sure, is quite good. It’s actually amazing; recently, I was trying to find a chair that I really liked, in a hotel I think, and it was just spot on, on the first try, out of all the chairs in the world. I think that’s impressive.

Sabine: Yes, we have some users here in chat asking for advice. We have Chimobe asking about building an image validation system for an online marketplace platform.

Kuba: I will try to figure out what this problem might be. What I see as a problem in a marketplace, like, let’s say, eBay or whatever, is that the number of categories is not constant. That is a valid problem because if it’s not constant, you can’t just use simple classification where I would say, “Okay, I’m selling a board, it gets classified as a board,” and that then helps in some UI.

In the case of a new category, you wouldn’t support it right away. If you use a metric learning or similarity approach, you would need only one instance of a new category. If someone adds something and doesn’t find the right category, they put it into a new category, or someone managing the website does it. Then we have one image that is associated with a new category.

That’s the typical problem of one-shot learning, and we could do this with good embeddings. Instead of treating it as a classification problem, in the future, we would compare a new picture with images from all the categories and choose the closest one. Then you have solved the problem of a changing number of categories. That’s the problem I see in the marketplace, but maybe this person can elaborate if that’s what they meant.

Sabine: Yes, we have the person with us here on the call.

Chimobean: All right, what I’m asking is: just like on Amazon, people upload pictures of the products they sell online. Now, some people might make a mistake; maybe they want to upload a laptop, but they label it as something else, which is misleading. Now, how do we validate these uploads, knowing that there is an N number of products, but we don’t know what particular product this person might be uploading at a particular time? Does it mean we have to keep downloading images for every product category just to handle that?

Kuba: Okay, I understand that the problem is said that a user makes a mistake by selecting a category, and then the picture is not reflecting this category. Is this the situation?

Chimobean: Yes, yes. Now, to validate the categories of products using a machine learning model, does it mean we have to download all kinds of images for training our model?

Kuba: I’m trying to figure out if I get the problem. What do you mean by downloading? Because that seems to me like either a classification problem or a comparison, so a similarity problem. One could validate the input against a category if you have the models trained for that task. I’m not sure if I get this question. Maybe someone has a take on that? Maybe someone understood it differently.

Sabine: Is it maybe a question of how to go about building the data set that you would need for training the models in the first place and validating them?

Chimobean: Yes. How are we going to acquire data from the different categories? Tomorrow, somebody might come and upload a product from a category that is maybe not in our data, a particular product that our model wasn’t trained on. That’s what I’m asking. How do we validate this?

Kuba: I think what I said before is still kind of valid for this. In the end, for a fixed number of classes, you can work with classification. For classes that are not fixed, essentially categories in your situation, you would figure out the embeddings for this new image, and you would see that it doesn’t fit any of the categories well, or the distance would be off by a large margin, and then you could maybe make the call that, “Hey, maybe there’s no match between this category and this image.” That would be, I think, a valid approach for that.
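A sketch of that distance-based validation; the threshold value and the idea of using one reference embedding per category (for example, the mean of its existing images) are illustrative assumptions:

```python
import numpy as np

def validate_upload(image_embedding, category_embeddings, threshold=0.7):
    """Flag an upload whose embedding is not close enough to any known category.

    category_embeddings: dict mapping category name -> reference embedding,
    e.g. the mean embedding of the images already in that category.
    """
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = {cat: cosine(image_embedding, emb)
              for cat, emb in category_embeddings.items()}
    best_cat = max(scores, key=scores.get)
    if scores[best_cat] < threshold:
        return None, scores          # no category matches well enough; flag for review
    return best_cat, scores
```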

Wrapping it all up!

Stephen: We’re, of course, running out of time, but this was a really good conversation. I think we can continue it in the MLOps Community. Listeners and viewers, if you’ve not yet joined the MLOps Community, you can join it here.

Sabine: Yes, we’ll be wrapping up here today. No worries if you didn’t get your question answered. You can always reach out to us on Slack in the Computer Vision channel or in the neptune.ai channel. Thank you very much to everyone who asked questions, and thank you very much, Kuba, for joining us and answering all our burning visual search questions. We will return in two weeks with our next episode of MLOps Live, where we will have Jacopo Tagliabue on to discuss all things reasonable-scale MLOps.
