In many machine learning (ML) projects, there comes a point when we have to decide the level of similarity between different objects of interest.

We might be trying to understand the similarity between different images, weather patterns, or probability distributions. With natural language processing (NLP) tasks, we might be checking whether two documents or sentences are semantically or syntactically similar.

** ^{READ ALSO}**How to Structure and Manage Natural Language Processing (NLP) Projects

In order to do this, we need to find a function (or operation) for generating a score which “measures” the level of similarity. This function can take many forms, but one common metric that pops up in many tasks is the Wasserstein distance (WD).

In this post, we’ll explore how WD, also known as the Earth Mover’s Distance (EMD), can be used to measure the similarity between two different text sequences. To do this, we will:

- Review distance metrics, and show why they are important for many ML related problems;
- Describe WD, and show how to apply it to a wide range of problems;
- Show how to apply WD to the issue of textual similarity;
- Finally, compare WD to other distance metrics, such as Cosine similarity, and discuss some of the advantages and disadvantages of WD.

Hopefully, my approach will help you understand if WD can be applied to your specific ML project.

*Note: we won’t dig deep into the mathematics of WD. We’ll provide some resources at the end about the maths behind it, just in case you want to dig deeper.*

## Distance metrics

Wikipedia tells us that “*Wasserstein distance […] is a distance function defined between probability distributions on a given metric space M*”. Why doesn’t it say anything about identifying similarity between sequences of text?

Well, it’s not about the method, but about how we represent the problem. If we can represent words like they’re probability distributions, then we can use WD to identify similarity in two seemingly different domains. But wait, what’s a distance function, anyway?

A distance function is something we use to measure the distance between two or more objects. Simple enough, but is that always the case? Depends on what we’re measuring, some things are easier to measure than others.

Let’s think of some trivial examples. What if you live near a park, and want to get to the opposite end? You need a metric function to answer this. Think about it. You could walk to the opposite end, in which case you can walk an almost perfectly straight line right through the park. Or, you could drive there, but in that case you can’t drive through the park and have to use the nearest road.

So, we can create distance metrics to understand the distance needed to travel in each case. For the walk option, we can use the Pythagorean theorem to create our distance metric, and for the driving option, simple addition will do.

OK, this was a nice, clean, unambiguous question. The problem was easy to represent mathematically, just straight lines and simple shapes. If everything was that easy to represent, life would be much easier for ML engineers.

What if the distance we want to measure is abstract and ambiguous? Say you want to know the distance between Ireland and the US. This points to the key issue that distance metrics need to resolve: how do you “represent” the objects (entities) being compared? Entities like countries are not simple shapes, they can be represented in a number of different ways.

We could represent each country as the closest point between their coastlines, and just draw a line from point to point. Or, we could find a center point in each country and represent the problem and the distance between those points. We can also randomly choose multiple points in each country and measure the distance between each one, and then get the average. We can’t say that any of these are the “right” or “wrong” answers, it all depends on how we represent the problem.

## From distance to similarity

Measuring the distance from US to Ireland is one thing, but what does this have to do with understanding how to use WD with probability distributions and ML problems like identifying textual similarity? Using our previous examples, let’s review what we needed to measure the distance between two points:

**A way to represent the objects in a mathematical (i.e. metric) space**: In the park example, we used straight lines and rectangles in a euclidean space. For the countries example, we represented the countries in a scatter plot graph.**A way to measure the distance between objects in that space**: We used simple addition or the Pythagorean theorem to work out the park distance. For the countries, we used the distance between two points or a sample of points.

Think about a probability distribution where we toss a coin 100 times. This can be represented by a binomial distribution. Now, imagine a scenario where you have two coins, one is biased and will land heads up roughly 70% of the time, while the other is a fair coin and will show heads 50% of the time. We put these numbers into the binomial distribution, and we get two graphs that would something like this:

Now we have:

**A way to represent the events**: We can represent the events by graphing the number of heads we see for 100 coin tosses in both the fair and bias trial;**A way to measure the distance**: We need some way to measure the distance between these two distributions. We can think of it as finding a way to measure the distance between each point on one distribution to each point on the other distribution. If you imagine the above distributions to be made from millions of pieces of dirt, then we can re-imagine the problem as a way to move the pieces of dirt from one pile (distribution) to the other. This way, we can generate a score for how similar these distributions are to each other.

## The Wasserstein distance and moving dirt!

We have two distributions, one representing a series of fair coin tosses, and the other a series of tosses with a bias coin. We want to understand how similar they are to each other. Now, we need a method to measure the distance necessary to move all the points on one graph to the other. You can also think of the graph in a histogram format:

Imagine each bin being made up from a series of blocks. We want to move these blocks from one distribution to another. Each block has a mass, so we need to take that into consideration, as well as the distance moved.

This is a tiny difference from the previous metrics we mentioned, which involved single points of no mass so we just considered the distance we needed to move them. Now we can measure the “work” needed to move a piece of the graph (a block) from one distribution to another.

This is like a cost function, since we’re looking to do the least amount of work. It is in this way that WD is referred to as Earth Mover’s distance (EMD).

Instead of coin tosses, we can imagine each of the above distributions as representing piles of dirt, and we want to calculate how much we need to move one pile so that it’s the same as the other. The amount of “work” we need to do is the measure of similarity between the two piles of dirt. The less work, the more similar the two piles of earth, or distributions, or anything that can be represented in this way.

If we sum up all these individual pieces of work, we can get the following formula:

It’s a rough approximation of what the Wasserstein distance tries to calculate. To see a much better explanation of this formula, see this excellent video (but finish this article first!).

Above is a way to think about this in a discrete space. When you get into the continuous space, you start to see things like integration, and the formula you will see for WD is more likely to look like this:

WD, or EMD, can be applied to a number of ML problems. Once you can represent your problem as a metric space, then you can apply EMD to it. One example of this is image retrieval. EMD is used to “move” pixels, and by doing so – identify which images are most similar.

## Digging deeper into EMD

Before we leave EMD and start looking at how to use WD to identify textual similarity, it’s worth looking at another example to help understand what WD tries to calculate.

EMD can be used in image processing. We can think of an image being represented by pixels, and each pixel having a weight representing its brightness. In the RGB model red, green and blue are combined together to create a broad range of colors. Each pixel has a value between 0 and 255 representing its brightness.

In the diagram below, we’re representing mounds of earth and holes that need to be filled. But we could also think of the points as representing pixels, and the size (value) represents its brightness. So, this problem could be an image or a mound of earth, and we’re trying to figure out how much work is needed to turn one image into another, or move earth from the piles to the holes.

The red points y1to y4represent the holes, and the size is how much earth is needed to fill them. Similarly, the black points x1to x8represent the mounds of earth available to fill those holes. Remember, we want to move the earth the least amount of distance possible, this is our cost function – the amount of work we want to do.

To fill a hole, we need to move some earth from the nearest pile. So, for example, we can put all the dirt from x1, x2 and x3 into hole y1, and we still need 7 units of dirt to fill it.

We can take that from x4. The table below shows the amount of earth we need to move from each pile and how far we need to move it.

Earth | Hole | Amount | Distance | Work |
---|---|---|---|---|

x_{1} |
y_{1} |
1 | 1 | 1 |

x_{2} |
y_{1} |
1 | 1 | 1 |

x_{3} |
y_{1} |
2 | 1 | 2 |

x_{4} |
y_{1} |
7 | 2 | 14 |

x_{4} |
y_{3} |
1 | 3 | 3 |

x_{5} |
y_{2} |
1 | 1 | 1 |

x_{6} |
y_{2} |
2 | 1 | 2 |

x_{7} |
y_{4} |
1 | 1 | 1 |

x_{8} |
y_{4} |
1 | 2 | 2 |

Total | 17 | 27 |

This shows that in total, we need to move 17 units of earth, and our work done was calculated as 27. Thus the EMD calculation is 2717 = 1.58.

In our trivial example, it’s easy to optimize for the most efficient way to move the earth. But imagine in real life, with many more variables, and even problems with many more dimensions, and you can see how it would quickly become impossible to manually work it out.

We would need to do many more calculations, simulate every possible scenario, and then choose the one that requires the minimum amount, or lowest EMD. Nevertheless, this example helps to show a simplified version of WD in action, and sets us up nicely to see how to apply it to textual similarity via embeddings.

## Finding meaning in a pile of dirt

The beauty of distance metrics is that we can apply them to any problem once we find a way to represent it. If we can represent something in a discrete manner where it’s in a metric space, then WD doesn’t care whether it’s moving dirt or parsing prose. It will work exactly the same.

Word embeddings are well-suited to precisely this task, since they’re points in a multidimensional vector space, i.e. exactly what we need to apply WD.

We won’t go into detail about word embeddings here, so you can check out some previous posts I wrote on the concept of embeddings in NLP here and here. Briefly though, word embeddings are just vectors which encode meaning based on the text they were trained on. In an earlier post, we showed how you could visualize the embeddings by training them on a simple dataset and projecting them onto a 3D visualization:

So, if we can represent words as points in a vector space, then we can treat each word as a piece of dirt which we want to move. Again, the key here is to think about it in simple terms as a distance function. How much do I need to “move” a word to make it similar to another word?

The assumption here is that a point in space, whether it represents a point on a probability distribution, a piece of earth or a word, is considered identical if the distance between it and another point is zero.

Instead of holes and piles of earth, we’re moving words from one place to another to understand how they’re related. Remember, in the embedding vector space, an embedding’s location indicates something about its relationship to other words. If words are clustered together, then they’re related.

In the above example there was no pattern to the distribution of points. They were randomly related. When we changed the dataset to ensure that there was a relationship between certain words, we got the following visualization:

The original paper which introduced the idea of using EMD to identify textual similarity, “From Word Embeddings To Document Distances”, shows this relationship with a good example that’s often used to show how word similarity can be identified by the distance between embeddings.

You can think of this as the distance we need to “move” each word in the embedding space so that they occupy the same space. The less you need to move a word (you don’t need to move “Chicago” by much to occupy the same space as “Illinois”), the more similar the words are to each other.

We’re moving words as if we were moving dirt! Which is how this method for textual similarity became known as the Word Mover’s distance (WMD).

At this point we should recognize how EMD is similar to WMD. The words “press” and “media” should not need to be moved by much to present the same word, i.e. to occupy the same point in the embedding vector space.

Conversely, if the sentence was “*the President speaks to his family in chicago*” then the distance we would need to move “media” to occupy the same space as “family” should be more than before.

Hence, by adding up all of these distances, we should be able to compare different sentences. And just like the dirt example, the less “word” we need to do, or distance we need to move, the more similar we can consider the sentences.

## How to use WMD

The good news is that it’s relatively simple to start using WMD to calculate similarity between sentences. It’s part of the Gensim library. You can download a pre-trained Word2Vec model and use that off-the-shelf to test out the distance method.

Alternatively, you could also train your own model and use that instead. For our purposes here, we’ll just use the pre-trained one to show how easy it is. You can find the notebook for this example here.

We will use one of the largest pre-trained models available, “word2vec-google-news-300”, which takes a long time to download but you can also use a smaller one if you just want to test this quickly.

Then getting the WMD score is as simple as:

And then you can compare the results and see what impact minor differences have on the similarity score

Note the differences between sentences like “** Can I change my clothes**” and “

**”. They are not seen as being that different according to the WMD but we know they are more different (or at least as different) as sentences like “**

*Can I change my password***” and “**

*Can I change my clothes***”.**

*I would like to reset my password*## WMD v Cosine?

If you tried to measure the similarity between two sentences or documents, you might have used something like cosine similarity.

For example, in a previous post, to get the similarity between the words in our dataset we used the get_similarity function, which used cosine similarity. Instead of calculating a distance between words, this measures angle size between the two embeddings.

If you look at the diagram above you can see that, according to the cosine of the angle between these vectors, both of these words are deemed to be similar. In other words, cosine similarity ignores the magnitude of the vectors, and focuses only on the direction. Even though there is a large difference in magnitude between the vectors in the first diagram, there’s no difference between the angles in both examples.

In many cases, this doesn’t seem to be an issue, and whatever method you use will depend on your own use case. But, interesting research has shown that the length of these vectors is related to the significance of the word in the text on which word embeddings were trained.

More specifically, “*For words to be used in many different contexts, they must carry little meaning. Prime examples of such insignificant words are high-frequency stop words, which are indeed represented by short vectors despite their high term frequencie*s”.

In other words, the more a word is used in different contexts, the more the embedding represents a weighted average of the word, and this average impacts the length of the vector making it shorter. In theory, the cosine approach would miss this magnitude difference, whereas something like WMD would not.

In practice, this might not make much difference to your results if you’re looking to classify text based on sentiment or things like that. However, for individual cases it could impact individual sentence similarity. The best approach would always be to try both methods (or one of the many other distance metrics we haven’t touched on here), and see which words are best for your approach. You could end up using a weighted average of both if you find that works best in all cases.

When comparing these methods, you should also consider the fact that most embeddings used are contextual, in that models like BERT generate many different embeddings to capture the different meaning of a word depending on its usage. This could also have an impact on which method you use.

Again, the best approach is to try both and see which is the best for your specific use case.

## Final considerations

In this post, we explored WD from an ML perspective, to understand how to use it to measure the similarity between two text sequences. To do this, we spent some time on the general idea of a distance metric, and how we can use simple distance approaches to measure seemingly different concepts, such as the similarity between probability distributions and piles of dirt, as well as words and sentences.

The key was to understand the different ways we can represent each problem. If we can represent it in some form of a metric space, then we can apply EMD to it and, based on the amount of “work” we need to do, understand the similarity of entities in that space.

We discussed how this approach can also be used to measure the similarity of words via WMD, some differences in the WMD method, and more common methods like cosine similarity.

The one thing we didn’t address were any shortcomings with the WMD approach. The main one is that it’s difficult to use on large documents due to its computational complexity. However, this is an area of ongoing research and new methods that aim to significantly speed up WMD calculations, thus enabling it for larger texts.

Similarly, further work has attempted to expand WMD to sentences via the Sentence Movers distance (SMD), which claims to show improved results by using both word and sentence embeddings to accurately extract semantic similarity from sequences of text.

This continues to be an exciting and rapidly developing field, and I can’t wait to see what comes next. Thanks for reading!

Other resources:

- Discrete example of EMD: This is a great post showing a discrete example of the EMD. It was this approach I used as the basis for the RGB diagrams. It’s just a great example of good data visualization to help explain a problem or concept.
- WD math deep dive: I found this video lecture on WD very useful and accessible. If you want to get much more detail behind the math for WD I recommend this video.
- More WD math: This was another video I found very helpful to understand how to derive a simple (discrete) form of the WD with a worked example.

**READ NEXT**

## ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It

10 mins read | Jakub Czakon | Posted November 26, 2020

Let me share a story that I’ve heard too many times.

”… We were developing an ML model with my team, we ran a lot of experiments and got promising results…

…unfortunately, we couldn’t tell exactly what performed best because we forgot to save some model parameters and dataset versions…

…after a few weeks, we weren’t even sure what we have actually tried and we needed to re-run pretty much everything”– unfortunate ML researcher.

And the truth is, when you develop ML models you will run a lot of experiments.

Those experiments may:

- use different models and model hyperparameters
- use different training or evaluation data,
- run different code (including this small change that you wanted to test quickly)
- run the same code in a different environment (not knowing which PyTorch or Tensorflow version was installed)

And as a result, they can produce completely different evaluation metrics.

Keeping track of all that information can very quickly become really hard. Especially if you want to organize and compare those experiments and feel confident that you know which setup produced the best result.

This is where ML experiment tracking comes in.

Continue reading ->