Understanding Vectors From a Machine Learning Perspective

If you work in machine learning, you will need to work with vectors. There’s almost no ML model where vectors aren’t used at some point in the project lifecycle. 

And while vectors are used in many other fields, there’s something different about how they’re used in ML. This can be confusing. The potential confusion with vectors, from an ML perspective, is that we use them for different reasons at different stages of an ML project. 

This means that a strictly mathematical definition of vectors can fail to convey all the information you need to work with and understand vectors in an ML context. For example, if we think of a simple lifecycle of a typical ML project looking like this:

NLP workflow

…then there are three different stages. Each one has a slightly different use case for vectors. In this article, we’ll clear this all up by looking at vectors in relation to these stages:

  1. Input: Machines can’t read text or look at images like you and me. They need input to be transformed or encoded into numbers. Vectors, and matrices (we’ll get to these in a minute) represent inputs like text and images as numbers, so that we can train and deploy our models. We don’t need a deep mathematical understanding of vectors to use them as a way to encode information for our inputs. We just need to know how vectors relate to features, and how we can represent those features as a vector. 
  2. Model: The goal of most ML projects is to create a model that performs some function. It could be classifying text, or predicting house prices, or identifying sentiment. In deep learning models, this is achieved via a neural network where the neural network layers use linear algebra (like matrix and vector multiplication) to tune your parameters. This is where the mathematical definition of vectors is relevant for ML. We won’t get into the specifics of linear algebra in this post, but we’ll look at the important aspects of vectors and matrices which we need to work with these models. This includes understanding vector spaces and why they’re important for ML. 
  3. Output: The output of our ML model can be a range of different entities depending on our goal. If we’re predicting house prices, the output will be a number. If we’re classifying images, the output will be a category of image. The output, however, can be a vector as well. For example, NLP models like the Universal Sentence Encoder (USE) accept text and then output a vector (called an embedding) representing the sentence. You can then use this vector to perform a range of operations, or as an input into another model. Among the operations you can perform are clustering similar sentences together in a vector space, or finding similarity between different sentences using operations like cosine similarity. Understanding these operations will help you know how to work with models that output vectors like these NLP models.

Note: you can find all the code for this post on GitHub.

Scalars, vectors and matrices

Vectors are not the only way to represent numbers for machines to process and transform inputs. While we’re mainly concerned with vectors in this post, we’ll need to define other structures that represent numbers also. 

This will help us in the following sections, where we need to understand how vectors interact with those other structures. For the purposes of this section, let’s take the Kaggle dataset on house prices as our frame of reference, and see how we would represent this data using scalars, vectors and matrices.

Vectors dataset
Our house price data has 18 potential features (21 columns, less the id, the date, and the price we want to predict) that a model could use to help it predict the price. 

Scalars: For our purposes, scalars are just numbers. We can think of them like any regular value we use. Any single value from our dataset would represent a scalar. The number of bedrooms in our house price data, for example, would be a scalar. If we only used one feature as an input from our house price data, then we could represent that as a scalar value.

print(house_price_df['bedrooms'].values[0])
3

Vectors: There is more than one usable feature in our house price data. How would we represent multiple features? The square footage, for example, would be a useful piece of information to have when trying to predict a house price. In its simplest form, we can think of a vector as a 1-D data structure. We’ll define a vector in more detail shortly, but for now it’s ok to think of it as a list that lets us pass more features to our model:

# This is a numpy row vector
print(house_price_df[["bedrooms", "sqft_lot"]].values[:1])
[[   3 5650]]

We can also create a 1-D vector by arranging the data in a column, rather than row, format:

# This is a numpy column vector
print(house_price_df[["bedrooms", "sqft_lot"]].values.T[:2, :1])
[[   3]
 [5650]]

Matrices: So far we’ve just looked at the first house in our dataset. What if we need to pass in batches of multiple houses, with their bedroom and square foot / meter values? This is where matrices come in. You can think of a matrix as a 2-D data structure, where the two dimensions refer to the number of rows and columns:

print(house_price_df[["bedrooms", "sqft_lot"]].values[:2])
[[   3 5650]
 [   3 7242]]

This is now a 2-D data structure with rows and columns:

print(house_price_df[["bedrooms", "sqft_lot"]].values[:2].shape)
(2, 2)

We can add more rows by looking at the first 3 houses in the data, and see how this changes the shape of our matrix:

print(house_price_df[["bedrooms", "sqft_lot"]].values[:3].shape)
(3, 2)

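One thing to note is that the row and column vectors we created earlier were actually stored as matrices with a single row or a single column. In this sense, a vector can also be a matrix, but not vice versa: a matrix with more than one row and more than one column can’t be a vector. 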
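To make the terminology concrete, here is a minimal sketch that checks the number of dimensions and the shape of each structure. It assumes a small NumPy array as a stand-in for the DataFrame, containing only the two rows printed above. The variable names are illustrative and not part of the original code.

```python
import numpy as np

# Stand-in for house_price_df[["bedrooms", "sqft_lot"]].values[:2],
# built from the two rows printed earlier in this section.
features = np.array([[3, 5650],
                     [3, 7242]])

scalar = features[0, 0]         # one value: bedrooms in the first house
flat_vector = features[0]       # 1-D array holding both features of the first house
row_vector = features[:1]       # the "row vector" above, stored with 1 row and 2 columns
column_vector = features[:1].T  # the "column vector" above, stored with 2 rows and 1 column
matrix = features[:2]           # two houses, two features each

for name, value in [("scalar", scalar), ("flat_vector", flat_vector),
                    ("row_vector", row_vector), ("column_vector", column_vector),
                    ("matrix", matrix)]:
    print(f"{name:14} ndim={np.ndim(value)} shape={np.shape(value)}")
```

The row and column vectors report two dimensions here because NumPy stores them as single-row and single-column matrices, while the flat vector reports one.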

Now that we have clarified some terminology, let’s look at how we use vectors as inputs to our deep learning models.

Inputs: Vectors as encoders

As we noted earlier, the way we use vectors in ML is very dependent on whether we are dealing with inputs, outputs, or the model itself. When we use vectors as inputs, the main use is their ability to encode information in a format that our model can process, and then output something useful to our end goal. Let’s look at how vectors are used as encoders with a simple example.

Imagine we want to create an ML model which writes new David Bowie songs. To do this, we would need to train the model on actual David Bowie songs, and then prompt it with an input text which the model will then “translate” into a David Bowie-esque lyric. At a high level, our model would look something like this:

Vectors - Bowie model

How do we create our input vectors?

We need to pass a load of David Bowie lyrics to our model so that it can learn to write like Ziggy Stardust. We know that we need to encode the information in the lyrics into a vector, so that the model can process it, and the neural network can start doing lots of math operations on our inputs to tune parameters. Another way to think of this is that we’re using the vector to represent the features we want the model to learn. 

The model could be trying to predict house prices, using the house price dataset we used earlier to explain the difference between scalars, vectors and matrices. In that case, we might pass it information such as the house size, number of bedrooms, postcode and things like this. Each of these is a feature that might help the model predict a more accurate house price. It’s easy to see how we would create an input vector for our house price model. We already showed how we could encode this data when we created a vector which contained the values for the number of bedrooms and the square ft/m of the house:

# This is a numpy row vector
print(house_price_df[["bedrooms", "sqft_lot"]].values[:1])
[[   3 5650]]

We would describe this as a 2-dimensional vector, since it’s encoding 2 parts of our house price dataset. We shouldn’t confuse this with the fact that a matrix is a 2-D data structure. When we say a matrix is 2-D, we mean it has 2 indices, rows and columns. 

To identify a single element in a matrix, you need to specify its row and column location. A vector, in contrast, has only one index, so you only need to specify one value. Think of it like addressing a list, where you just identify the list index, whereas a matrix is like a 2-D numpy array, where you need to specify both the row and the column.
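The difference between “a 2-dimensional vector” and “a 2-D data structure” can be sketched in a few lines. This assumes the same two houses printed earlier, written out by hand as NumPy arrays.

```python
import numpy as np

house = np.array([3, 5650])      # a 2-dimensional vector: it encodes two features
houses = np.array([[3, 5650],
                   [3, 7242]])   # a 2-D matrix: it has rows and columns

print(len(house), house.ndim)    # number of features vs. number of indices needed
print(house[1])                  # a single index selects sqft_lot
print(houses.ndim)               # a matrix needs two indices
print(houses[1, 1])              # row 1, column 1: sqft_lot of the second house
```

The length of the vector is the number of features it encodes, while `ndim` is the number of indices you need to address one element.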

But what about a sentence?

The house price dataset already contained all the columns we needed to create our feature vectors, which we could use as input to our models. They were described in the dataset itself. As a result it was easy to transform or encode the data into a vector format. Our Bowie lyrics, however, are not as easily transformed into a vector. Instead we need to find some rule to encode the information contained in the lyrics. 

One simple way to do this would be to assign a number to each word, and use that to create a vector. This is known as count vectorizing, and you can find a method for it in scikit-learn.

The exact implementation isn’t important here, just that we want to find a way to represent the sentence in a similar fashion to the house data. As an example, let’s take two lines of lyrics from Bowie’s “Memory of a free festival”:

Vectors - Bowie lyrics

First we need to create a fixed length vector. In the vector we created from the house price data example, we had 2 data points for each house. We could easily have made this an 18 dimensional vector by just adding in more features from the dataset. 

To try and identify the length of our Bowie lyric vector, we can take the above two lines to represent our entire corpus of Bowie lyrics. Then we have 9 unique words or features.

def get_vocab(text):
    vocab = set()
    for line in text:
        vocab.update(word.lower() for word in line.split())
    return vocab

vocab = get_vocab(lyric_list[0:2])
print(f'There are {len(vocab)} unique words in these lyrics:')
print(f'They are: {vocab}')
There are 9 unique words in these lyrics:
They are: {'children', 'in', 'of', 'grass', 'gathered', 'the', 'end', "summer's", 'dampened'}

Our input vectors here will contain 9 features. Obviously, this is a small number in terms of all the possible words you could use at input time. If we try and encode a sentence which contains a word we don’t know, we can just ignore it. If the word does exist, we will count how many times it occurs in the input, and add that value to the vector dimension or feature representing that word in the vocab. 

Our features look like this:

count_vector = vector_lookup(lyric_list[0:2])
print(count_vector)
{0: 'children', 1: 'in', 2: 'of', 3: 'grass', 4: 'gathered', 5: 'the', 6: 'end', 7: "summer's", 8: 'dampened'}

If the word “children” occurs 2 times in the input, then the first dimension of the vector will be 2. If it doesn’t occur, it will be 0. If we come across a word that’s not in the vocab, we’ll do nothing, since it doesn’t have a feature or dimension. This is clearly not ideal if we want to train the best Bowie lyric writing model. We’ll look at better ways of encoding inputs in the next example. 

If our input sentence was “The children played in the grass”, then our input vector representing this sentence would look like this:

print(input_vector(count_vector, "The children played in the grass"))
[1, 1, 0, 1, 0, 2, 0, 0, 0]

The word “children” occurs once, and it’s the first feature of our vector, so it’s 1. “Of” doesn’t appear in the input sentence, so it’s zero. The word “the” occurs twice, so the feature at index 5 has a value of two.
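The helper functions `vector_lookup` and `input_vector` are used above without being shown. The following is a minimal sketch of how they might be written, not the original implementation. It assumes two lyric lines that are consistent with the nine-word vocabulary printed earlier, and it sorts the vocabulary so the index order is reproducible, which means the indices differ from the lookup shown above.

```python
def get_vocab(lines):
    """Collect the unique lower-cased words across all lines."""
    vocab = set()
    for line in lines:
        vocab.update(word.lower() for word in line.split())
    return vocab

def vector_lookup(lines):
    """Map each feature index to a word (hypothetical implementation)."""
    # Sorting is an assumption made here to get a stable index order.
    return dict(enumerate(sorted(get_vocab(lines))))

def input_vector(lookup, sentence):
    """Count how often each known word occurs; unknown words are ignored."""
    word_to_index = {word: index for index, word in lookup.items()}
    vector = [0] * len(lookup)
    for word in sentence.lower().split():
        if word in word_to_index:
            vector[word_to_index[word]] += 1
    return vector

# Assumed lyric lines, chosen to match the vocabulary printed above.
lyrics = ["The children of the summer's end",
          "Gathered in the dampened grass"]

lookup = vector_lookup(lyrics)
vec = input_vector(lookup, "The children played in the grass")
print(lookup)
print(vec)
```

The word “played” is not in the vocabulary, so it contributes nothing to the vector, exactly as described above.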

Now let’s use all the lyrics from “Memory of a free festival”, and see what our input vector for the same sentence looks like:

count_vector = vector_lookup(lyric_list)
full_vec = input_vector(count_vector, "The children played in the grass")
print(f'The new feature size is {len(count_vector)}')
print(full_vec)

The new feature size is 129

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

We see many more zeros, since there are more features, and they don’t all occur in every input sentence. This is what we call a sparse vector – it’s not “densely” populated. Most features encode no information since they’re empty (they’re set to zero). 

This is inefficient, since most of the features don’t contain much information. In practice, we can address this by using methods which create “denser” vectors with fewer zeros.
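One way to see how sparse such a vector is, as a rough sketch: count the non-zero features and compare that with the total size. The vector below is a hypothetical 129-feature count vector with five non-zero entries. The positions are illustrative and not taken from the output above.

```python
import numpy as np

# Hypothetical count vector with 129 features; positions are illustrative.
full_vec = np.zeros(129, dtype=int)
full_vec[[16, 57, 79, 113]] = 1
full_vec[62] = 2

non_zero = np.count_nonzero(full_vec)
density = non_zero / full_vec.size

# A compact alternative: keep only index -> count for the non-zero features.
sparse_form = {int(i): int(full_vec[i]) for i in np.flatnonzero(full_vec)}

print(f"{non_zero} of {full_vec.size} features are non-zero ({density:.1%})")
print(sparse_form)
```

Only a small fraction of the features carry any information, which is the inefficiency a dense representation avoids.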

What does a dense vector look like?

The vector we created for our input into our Bowie lyric model was sparsely populated. Instead of creating our own vectors, we could use a completely separate model to take in a sentence and return a dense vector. We could then use that as an input into our own model to learn Bowie lyrics. 

This is an example of chaining models together, where you use the output of one model as the input to another. It’s also an example of a model outputting a vector, which we’ll get to later in the post. 

The Universal Sentence Encoder (USE) is an example of a model that can take in a textual input and output a vector, just like we need for our Bowie model. The USE will produce output vectors which contain 512 dimensions. These can be considered our new input vectors, instead of our sparsely populated count vectors. 

As we said earlier, these vectors contain 512 features or dimensions. The interesting thing is that the size of the vector doesn’t change depending on the size of the input sentence:

import numpy as np
from absl import logging

# `embed` is assumed to be a Universal Sentence Encoder model that has
# already been loaded (for example from TensorFlow Hub).
sample_sentences = ["We scanned the skies with rainbow eyes",
                    "We looked up into the sky", 
                    "It was raining and I saw a rainbow", 
                    "This sentence should not be similar to anything", 
                    "Erer blafgfgh jnjnjn ououou kjnkjnk"]

# Reduce logging output.
logging.set_verbosity(logging.ERROR)

sentence_embeddings = embed(sample_sentences)

# Print each sentence, the size of its embedding, and the first three values.
for i, sentence_embedding in enumerate(np.array(sentence_embeddings).tolist()):
  print("Lyrics: {}".format(sample_sentences[i]))
  print("Embedding size: {}".format(len(sentence_embedding)))
  sentence_embedding_snippet = ", ".join(
      (str(x) for x in sentence_embedding[:3]))
  print("Embedding: [{}, ...]\n".format(sentence_embedding_snippet))
Vectors - Bowie model

If you looked at these vectors in detail, you’d see that there are very few zeros in any of the features. This is why they’re described as “dense” vectors. The other thing to note is that we don’t know what any of these features mean. In our Bowie lyrics example, we defined the features, so we knew exactly what each of the input vector’s dimensions related to. In contrast, with USE we have no idea whether the 135th dimension relates to capitalisation, length, or a particular word. There may not even be dimensions for these features. We’re just guessing. 

It may seem pointless to create vectors where we know so little about the dimensions. But this is one of the key things about vectors which makes them so powerful in ML. We don’t need to understand anything about the dimensions of the vectors, or how they’re created, to be able to use them. All we need to understand is the operations we can perform on vectors, and the spaces these vectors reside in. But before we look at what happens in our model when we pass it a vector, we need to look at how we can also use matrices as inputs.

A matrix of Bowie lyrics

import numpy as np
np.set_printoptions(threshold=10)
lyrics = []
for lyric in lyric_list[15:18]:
    print(lyric)
    vec = input_vector(count_vector, lyric)
    lyrics.append(vec)

# Turn the list into a multidimensional array or matrix
lyric_matrix = np.array(lyrics)
print(f'Matrix shape: {lyric_matrix.shape}')
print(lyric_matrix)
And fly it from the toppest top 
of all the tops that man has 
pushed beyond his Brain.
Matrix shape: (3, 129)
[[0 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

As we noted, matrices are rectangular arrays of numbers that we can think of as vectors stacked together. They’re data structures that enable us to store information in a 2-D format. For our Bowie model we may want to pass in batches of sentences rather than one at a time as the input. To do this, we would store the information in a matrix and pass it to our model. The model can then process the matrix in a similar way to a vector. In image processing, matrices are used to represent images as inputs to a model. For example, the famous MNIST dataset uses a 28 x 28 matrix to represent each grayscale digit. Each element in the matrix is a value from 0 to 255, representing its grayscale intensity. 
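As a rough illustration of images as matrices, the sketch below builds a random 28 x 28 array as a stand-in for a single grayscale digit (it isn’t real MNIST data), flattens it into a vector, and stacks several vectors into a batch, in the same way the lyric vectors were stacked above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one 28 x 28 grayscale digit (values 0 to 255).
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flattening the matrix gives a single 784-dimensional input vector.
image_vector = image.reshape(-1)

# Stacking several vectors gives a batch matrix: one row per input.
batch = np.stack([image_vector, image_vector, image_vector])

print(image.shape, image_vector.shape, batch.shape)
```

Whether a model expects the 2-D image or the flattened vector depends on the model. The point here is only that both are arrays of numbers with different shapes.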

Model: Vectors as transformers

At this point we’ve represented our input as a vector, and want to use it to train our model. In other words, we want our model to learn a transformation which uses the features in our input to return an output that achieves some goal. We’ve already discussed examples of such goals:

  1. Predict house prices: We showed how we could create a vector which encoded the features needed from our house price data. In this scenario, we wanted to learn a transformation that used those features to output a predicted price for that house.
  2. Create Bowie-esque lyrics: For our Bowie model, output was text, and the transformation was to turn an input sentence into a Bowie-esque lyric.
  3. Sentence encoder: The USE model transforms an input sentence into an output vector. Its goal is to create an output vector which, using the features of the input sentence, represents that sentence in something called a vector space. 

Neural networks are a way to create models which can learn these transformations, so we can turn our house price data into predictions, and our dull sentences into Bowie-like musings. 

Specifically, it’s the hidden layers of neural networks which take in our vectorized inputs, create a weight matrix, and then use that weight matrix to create our desired outputs:

Hidden layers
The hidden layer is where the computation takes place to learn the transformations we need to create our outputs. Source DeepAi.org
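A single hidden layer can be sketched as a matrix-vector multiplication followed by a simple non-linear function. The weights, biases, and input values below are made up for illustration. In a real network the weights are learned during training, and the ReLU activation used here is just one common choice.

```python
import numpy as np

def relu(z):
    """Set negative values to zero, leave positive values unchanged."""
    return np.maximum(z, 0.0)

# Hypothetical input vector: two scaled features for one house.
x = np.array([0.3, 0.565])

# Hypothetical weight matrix and bias for a hidden layer with three units.
W = np.array([[0.5, -1.0],
              [1.5,  0.2],
              [-0.7, 0.4]])
b = np.array([0.1, 0.0, -0.2])

# The layer transforms a 2-dimensional input into a 3-dimensional output.
hidden = relu(W @ x + b)
print(hidden)
```

The weight matrix has one row per hidden unit and one column per input feature, which is why the input has to arrive as a vector of a fixed length.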

We don’t need to get into the specifics of how neural networks work. But we do need to learn more about vectors and matrices, and how they interact, so we can understand why we use them and – more importantly – how we can use the output of these networks. As a result, in this section we’re going to:

  1. Define vectors: We need to briefly define vectors from a mathematical perspective, and show how the context is different from how we used them as inputs. 
  2. Vector spaces: Vectors live in vector spaces. If you understand how these spaces work, then you’ll be able to understand most of the reasons why vectors are important in ML.
  3. Vector and matrix operations: We manipulate vectors in these spaces by doing things like multiplying them with matrices. They squeeze, squash, move and transform vectors until our models have learned something we can use for our outputs. These operations will be important for the final section, where we will look at common operations which we use when our models output vectors, like the USE model did with our Bowie lyrics.

The goal of this section is to help us understand that the way we use vectors to learn functions in our ML models is slightly different from the way we use vectors in our inputs and our outputs. 

This is why the context of the pipeline stage in your ML project is important. The useful knowledge about vectors can be different depending on what you want to do with your vector. Knowing that will, hopefully, help you with your future ML projects.

What is a vector?

Previously we defined a vector as a list of numbers or a 1-D data structure. This helped us understand how we could encode information from our datasets as inputs to pass to our ML models. 

Within these models, once that input is received, it’s important to understand that vectors are objects which have the unusual distinction of having both a magnitude and a direction:

Vector diagram
Most people will be familiar with the definition of a vector as being something that has both magnitude and direction. Source MathInsight.org

We won’t spend time on this definition, since most people are familiar with it, but we’ll note:

  1. Geometry: There’s a geometric aspect to this definition, since it has a direction. This means it inhabits a geometric space, and we can do operations on the vector which can change its direction and/or magnitude. It also means these spaces can have dimensions. The example above shows a vector with two dimensions, but it can have three and even more dimensions which make it difficult for humans to visualize.
  2. Relevance to input: We can also see here that understanding a vector has a magnitude and direction doesn’t really help us when encoding inputs as vectors. We didn’t need to understand the direction of the vectors we were creating from the house price data to create a useful input. 

The MathInsight website has a good tool to explore some important properties of vectors. You can manipulate the vector to move it around the graph, but note that the magnitude and direction of the vector don’t change:

Vector move 1
We can see the vector has a magnitude and direction, and is located in a certain part of the graph.
Vector move 2
We note that we can move the vector to a different part of the graph, but the magnitude and direction are still the same. The point in space to which the vector now points is different.

As we can see from the above examples, a vector inhabits a space known as a vector space, and we can manipulate the vector to move it around that space. Understanding the nature of this vector space is key to using vectors in ML. We need to define what we mean by a vector space.
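The observation that moving a vector around the graph doesn’t change it can be checked numerically. In this sketch, a vector is described by a tail point and a head point (both made up for illustration), and both points are shifted by the same offset.

```python
import numpy as np

tail = np.array([1.0, 1.0])
head = np.array([4.0, 5.0])
vector = head - tail                      # the arrow from tail to head

offset = np.array([2.0, -3.0])            # move the whole arrow elsewhere
moved_vector = (head + offset) - (tail + offset)

magnitude = np.linalg.norm(vector)
direction = np.degrees(np.arctan2(vector[1], vector[0]))

print(vector, moved_vector)               # same components before and after
print(magnitude, direction)               # length and angle in degrees
```

The head now points at a different location in the space, but the components, and therefore the magnitude and direction, are unchanged.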

Vector spaces

We can think of a vector space as the area which contains all of our vectors. We can think of these spaces in relation to the dimensions of the vector. If our above example of a vector had two dimensions, it could be described by X and Y coordinates, like most 2-D graphs you would draw with matplotlib:

Vector space
2-D vector spaces are similar to common graphs we would draw in matplotlib. Notice here that we have drawn the vectors as points without the arrows. This is to show that we can also think of vectors as points in these vector spaces.

The key thing about vector spaces is they’re a way to define a set of vectors within which we can perform operations such as vector addition and multiplication. As a result, we can have 3-D vector spaces where the vectors are defined by 3 coordinates. The only difference between these vector spaces is their dimensions. 

We can perform the same operations on the vector in each of these spaces, such as:

  1. Addition: We add two vectors together to create a third vector.
  2. Multiplication: We can multiply a vector by a scalar to change the magnitude of the vector.

These vector spaces have properties which define how all of these operations can take place. In other words, a vector space has properties including the following:

  1. Addition must create a new vector.
  2. The addition is commutative, i.e. Vector A + Vector B == Vector B + Vector A.
  3. There must be a zero vector, the identity vector which when added to another vector returns that vector, i.e. 0 + Vector A == Vector A + 0 == Vector A.
  4. For each vector there’s an inverse vector which points in the opposite direction, so that adding the two gives the zero vector.
  5. Vector addition is also associative, i.e. Vector A + (Vector B + Vector C) == (Vector A + Vector B) + Vector C.
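The properties listed above can be checked numerically for a few example vectors. This is only a sanity check on three made-up 2-D vectors, not a proof.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])
c = np.array([-2.0, 0.5])
zero = np.zeros(2)

closure = (a + b).shape == a.shape                   # addition gives another 2-D vector
commutative = np.array_equal(a + b, b + a)           # A + B == B + A
has_identity = np.array_equal(a + zero, a)           # 0 + A == A
has_inverse = np.array_equal(a + (-a), zero)         # A + (-A) == 0
associative = np.allclose(a + (b + c), (a + b) + c)  # A + (B + C) == (A + B) + C

scaled = 2 * a                                       # scalar multiplication changes the magnitude

print(closure, commutative, has_identity, has_inverse, associative)
print(np.linalg.norm(a), np.linalg.norm(scaled))
```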

Vector spaces are important for ML, since they’re the spaces within which all of our transformations take place, where input vectors are manipulated to create the weights we need for our outputs. We don’t need to get into the details of these transformations, but we do need to mention some of the operations that will be important in the next section, where we try and perform operations on our outputs to realise some end goal. 

Vector and matrix operations

There are key vector operations that will pop up frequently in most ML projects, so we need to cover them here. In the next section, we’ll look at some worked examples of these operations, using the USE model to produce 512-dimensional sentence vectors. The operations we’ll look at are:

  1. Vector norms: In the first section, we noted that you can think of vectors as lists of numbers which represent features. The number of features relates to the dimension of the vector and, as we’ve just seen, the vector space within which the vector is located. Since these vectors have a magnitude, we can get the size of that using vector norms. These norms then represent the size or length of that vector, and can be used for things like simple comparison between different vectors. Since a norm represents the size of a vector, it’s never negative, even when all the values in the vector are negative (it’s zero only for the zero vector). There are a number of ways to get the norm of a vector; the most common are:
    1. L1 norm = sum of the absolute values of the vector, also called the Manhattan distance.
Vector - Manhattan distance

This post is a good overview on the difference between these norms and their use in distance metrics.

    2. L2 norm = square root of the sum of the squared values of the vector, also known as the Euclidean distance (the red line in the example below).
 Euclidean distance

This is another post that talks about the L2 norm in more detail.

  2. Inner product: The inner product is another way to perform an operation on two vectors and return one value. Mathematically, we multiply each element in Vector A by its corresponding element in Vector B, and sum the results together. Unlike the norm, the inner product can be a negative number. You might hear this referred to as the dot product, which is just a specific case of the inner product in Euclidean space. Geometrically, we can think of the inner product as the product of the magnitude of the two vectors and the cosine of the angle between them. 
 Euclidean vector
 Euclidean space
Wikipedia definition of dot product in Euclidean space
  3. Cosine: You can think of the dot product as the cosine of the angle between both vectors multiplied by the length of both vectors. If the magnitude of the vector is important, i.e. you think it carries some information, then you should use the dot product. If, in contrast, you just want to know how close the two vectors are in terms of direction, then you’d use the cosine similarity measure. The cosine is just a manipulation of the dot product to ignore the magnitude, i.e. the cosine of the angle between two vectors is the same, even if the size of the vectors is different:
Cosine
DeepAI.org Cosine similarity 
  4. Vector normalisation: We know now that vectors have magnitudes, and different vectors can have different sizes. Sometimes we don’t care about the size of a vector, and are only really interested in its direction. If we don’t care about magnitude at all, then we can just make each vector the same size. We do this by dividing each vector by its magnitude, thus making each vector have a magnitude of 1, or transforming them to a unit vector. Now the cosine similarity and dot product will return the same value:
Vector normalization
Source: Example of vector normalization 
  5. Dimensionality reduction: Our house price vectors had only two dimensions, since they contained only two feature values. The USE vectors we generate, in contrast, have a whopping 512 dimensions. It was easy for us to draw our 2-D graph, but what about a 512-D graph? Not so easy. As a result, we need to use certain techniques that reduce the dimensionality of our vectors without losing too much important information. We will look at one of the most common ways to perform this reduction:

Principal Component Analysis (PCA): PCA attempts to identify hyperplanes which explain most of the variation in the data, similar to the way we try to find a trend line in linear regression. The principal components in PCA are the axes of these hyperplanes. For now, just think of it as a way of finding values which represent most of the variation in the vector with fewer dimensions:

Principal Component Analysis
Source: A great post on PCA if you want to understand it in more detail
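The operations in the list above can be sketched with NumPy. The example vectors are made up so the arithmetic is easy to follow, and the PCA step is one possible implementation, using a singular value decomposition on random data as a stand-in for real USE embeddings. It’s meant as an illustration rather than the approach used in the worked examples.

```python
import numpy as np

a = np.array([3.0, -4.0])
b = np.array([6.0, -8.0])    # same direction as a, twice the length
c = np.array([-3.0, 4.0])    # opposite direction to a

# Vector norms
l1 = np.linalg.norm(a, 1)    # sum of the absolute values
l2 = np.linalg.norm(a)       # square root of the sum of the squared values

# Inner (dot) product: multiply matching elements and sum the results
dot_ab = np.dot(a, b)
dot_ac = np.dot(a, c)        # negative, because the vectors point in opposite directions

# Cosine similarity: the dot product with the magnitudes divided out
def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cos_ab = cosine_similarity(a, b)

# Normalisation: divide each vector by its magnitude to get unit vectors
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_units = np.dot(a_unit, b_unit)   # matches cos_ab for unit vectors

# Dimensionality reduction: PCA via SVD on hypothetical 512-dimensional vectors
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 512))          # stand-in for five sentence embeddings
centered = vectors - vectors.mean(axis=0)    # PCA works on mean-centred data
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ components[:2].T        # keep the first two principal components
explained = singular_values**2 / np.sum(singular_values**2)

print(l1, l2, dot_ab, dot_ac, cos_ab, dot_units)
print(reduced.shape, explained[:2])
```

Vectors `a` and `b` have different magnitudes but the same direction, so their dot product is large while their cosine similarity only reflects the shared direction.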

We covered a lot of ground in this section, so before we look at the final section, let’s do a little recap. So far we:

  1. Defined vectors in more detail and showed how they can be represented in a vector space.
  2. Showed that ML models need to learn from the features of our input vectors to create transformations that turn input data into the outputs we need.
  3. Defined the vector spaces within which these transformations occur.
  4. Looked at some of the common operations we can perform on vectors in these spaces.

Now we can look at the result of all of this work, namely our outputs. We’ll show worked examples of all of the operations we discussed, and show how they can be used in many situations as part of your next ML project. 

While you don’t need to understand the mathematics of all of the operations we just discussed, it will help to have a broad conceptual underpinning of why they’re used. This will let you choose the right operations for the relevant task when it comes to your outputs.

Outputs: Using vector operations 

As we have been noting throughout this post, how you use vectors in ML changes depending on the pipeline stage. In the last section, we saw that vector math, operations, and transformations are key to understanding what’s going on “under the hood” of deep learning neural networks. These computations are all taking place in the “hidden layers” between the input and the output. 

But, for most ML projects, you won’t need this level of detail. Sure, it’s definitely good to understand the maths at the heart of deep learning algorithms, but it’s not critical to getting started with these models. 

For example, as we showed in the input section, you don’t need detailed knowledge of vector maths to be able to encode inputs in a vector format. The magnitude of your vector isn’t important when you’re figuring out how best to encode a sentence so that your model can process it. Instead, it’s important to know the problem you’re solving with encoding, whether you need a sparse or dense vector, or what features you want to capture. 

And now, similarly, in relation to the output of your model, we’ll look at the aspect of vectors which is most impactful for this stage of the ML pipeline. Remember, the output of these models might not even be a vector. It could be an image, or some text, or a category, or a number like a house price. In these cases you don’t have any vectors to deal with, so you can carry on as before.

But in case you do have a vector output, you will primarily be concerned with two goals:

  1. Input to other models: You may want to use the vector output of one model as the input to another model, like we did with the USE and our Bowie model earlier. In these cases you can refer to the input section, and think of the vector as a list of values and an encoding that represents the information in question. You may also use the output to build a classifier on top of this input, and train it to differentiate between some domain specific data. Either way, you’re using the vector as an input into another model. 
  2. Input to vector functions: If the output is a vector and we’re not using it as an input in another model, then we need to use it in conjunction with a vector function. The USE model outputs an embedding (i.e. a vector), but we can’t interpret this array of 512 numbers in isolation. We need to perform some function on it to generate a result which we can interpret. These functions, as we mentioned in the last section, can do things like identify the similarity between two vectors, or reduce the dimensions of a vector so we can visualize it. It can be confusing to know which operation you need for a given purpose. As a result, we will look at some worked examples of these operations in this section.
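As a rough sketch of the first goal, the snippet below trains a small classifier on top of fixed-length vectors. The vectors are random stand-ins for real model outputs (an assumption, so that the example runs without downloading a model), and the helper name `make_fake_embeddings` and the two classes are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_fake_embeddings(n_per_class=50, dim=512, seed=0):
    """Create stand-in 'embeddings' for two classes (hypothetical data, not USE output)."""
    rng = np.random.default_rng(seed)
    # Each class is centred on a different point so the classes are separable.
    class_a = rng.normal(loc=0.5, scale=1.0, size=(n_per_class, dim))
    class_b = rng.normal(loc=-0.5, scale=1.0, size=(n_per_class, dim))
    X = np.vstack([class_a, class_b])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

# The classifier only sees a list of numbers per example; it does not care
# which model produced them.
X, y = make_fake_embeddings()
clf = LogisticRegression(max_iter=1000).fit(X, y)
train_accuracy = clf.score(X, y)
print(f"Training accuracy: {train_accuracy:.2f}")
```

With real embeddings you would replace `X` with the vectors from the upstream model and `y` with your own domain-specific labels.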

Generating vectors for our operations

We’re going to need some example vectors to use in our vector operations. We can do this by creating some sample sentences, passing them to the USE, and storing the vector output. This way we can compare the result of each operation to the sentences themselves. We will create a mix of sentences, some which seem similar and others which appear very different semantically:

  1. We scanned the skies with rainbow eyes – Bowie lyrics from “Memory of a free festival”
  2. We looked up into the sky – A non Bowie lyric that seems semantically similar
  3. It was raining and I saw a rainbow – A sentence that seems more different from the lyric
  4. This sentence should not be similar to anything – This sentence should have no semantic overlap
  5. Erer blafgfgh jnjnjn ououou kjnkjnk – A complete gibberish sentence 
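One way this setup could look is sketched below. The names `sample_sentences` and `sentence_embeddings` match the code in the rest of this section, but the helper `embed_sentences` is hypothetical, and the TF Hub handle (version 4 of the Universal Sentence Encoder) is an assumption about which model version is used.

```python
sample_sentences = [
    "We scanned the skies with rainbow eyes",
    "We looked up into the sky",
    "It was raining and I saw a rainbow",
    "This sentence should not be similar to anything",
    "Erer blafgfgh jnjnjn ououou kjnkjnk",
]

# Assumed TF Hub handle for the Universal Sentence Encoder (version 4).
USE_URL = "https://tfhub.dev/google/universal-sentence-encoder/4"

def embed_sentences(sentences, model_url=USE_URL):
    """Return one embedding per sentence as a NumPy array (hypothetical helper)."""
    # Imported lazily so the sentence list can be used without TensorFlow installed.
    import tensorflow_hub as hub
    embed = hub.load(model_url)
    return embed(sentences).numpy()

# This downloads the model on first use, so it is left commented out here:
# sentence_embeddings = embed_sentences(sample_sentences)
```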

Now let’s see what our vector operations tell us about these sentences.

Vector norms

We saw earlier that vector norms are a way to identify the ‘size’ of a vector. After generating the embeddings for our sentences, we can get the L1 and L2 norms as follows:

from numpy.linalg import norm

# Get the L1 norm (sum of absolute values) of each embedding
for sentence, vector in zip(sample_sentences, sentence_embeddings):
  l1 = norm(vector, 1)
  print(f"L1 norm of '{sentence}' is {l1:.4}")
L1 norm code
# Get the L2 norm (Euclidean length) of each embedding
for sentence, vector in zip(sample_sentences, sentence_embeddings):
  l2 = norm(vector)
  print(f"L2 norm of '{sentence}' is {l2:.4}")
L2 norm code

We can see two things about these norms:

  1. The L1 norm values are close together,
  2. The L2 norm values are all (approximately) equal to one.

This may seem strange, but given what we now know about vectors, let’s see what they say about the USE embeddings on the Tensorflow page:

Semantic textual similarity task

It seems that the embeddings are already normalized (approximately normalized; I’m not sure what they mean by approximately, but let’s just assume it means they’re normalized). This means that each vector has been divided by its magnitude, so its L2 norm will always be 1, i.e. each embedding is a unit vector. The L1 norm isn’t fixed by this, which is why those values are only close together rather than identical.
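To make this concrete, here is a small worked example with a two-element vector (chosen purely for illustration, not taken from the USE output). It shows that dividing a vector by its L2 norm gives a vector whose L2 norm is 1, while its L1 norm can still be something else.

```python
import numpy as np
from numpy.linalg import norm

v = np.array([3.0, 4.0])

l1 = norm(v, 1)  # |3| + |4| = 7
l2 = norm(v)     # sqrt(3^2 + 4^2) = 5

# Normalize: divide every element by the vector's magnitude (its L2 norm).
unit_v = v / l2

print(f"L1 norm: {l1}, L2 norm: {l2}")
print(f"Normalized vector: {unit_v}")
print(f"L2 norm after normalizing: {norm(unit_v):.4f}")
print(f"L1 norm after normalizing: {norm(unit_v, 1):.4f}")
```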

As a different example, let’s get the L2 norm of our vectors using a different model, which doesn’t normalize the vectors in the same way. There’s a great library to do this, called Fast Sentence Embedding (FSE). The L2 norms of these vectors are not all 1:

# Get the L2 norm
for idx, sentence in enumerate(sample_sentences):
  l2 = norm(model.sv[idx])
  print(f"L2 norm of '{sentence}': is {l2:.4}")
L2 norm code

What does this tell us about the vectors? Not much, actually. The norms don’t track the semantic differences between our sentences, which suggests that the magnitude of a vector is not a reliable indicator of semantic similarity. 

Inner Product

If the magnitude alone doesn’t tell us much about the vector, let’s try the inner product. Remember, this was the sum of the element-wise products of two vectors:

import numpy as np
import pandas as pd

# Inner product of every pair of sentence embeddings
res = {}
for sen in sample_sentences:
    res[sen] = []

for i, s1 in enumerate(sample_sentences):
    for j, s2 in enumerate(sample_sentences):
        dot_prod = np.dot(sentence_embeddings[i], sentence_embeddings[j])
        res[s1].append(round(float(dot_prod), 3))

# One row and one column per sentence
pd.DataFrame(res, index=sample_sentences)
Vectors - inner product

Getting the inner product of each pair of sentences, we can see that we get 1 when they’re identical, and then other values when they’re not. If you look at the score of our gibberish sentence, you can see that we get a negative score for some pairs. We also get a low score for the “this sentence should not be similar to anything” sentence. Our Bowie lyric sentence and our less eclectic attempt do generate a relatively high score, in fact the highest score of all the comparisons. Maybe we have some lyrical creativity in us after all.
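To see why identical unit vectors score 1 and why negative scores are possible, here is a tiny sketch using hand-picked two-dimensional unit vectors (illustrative values only, not real embeddings):

```python
import numpy as np

a = np.array([1.0, 0.0])
same = np.array([1.0, 0.0])         # identical to a
orthogonal = np.array([0.0, 1.0])   # at right angles to a
opposite = np.array([-1.0, 0.0])    # points the other way

# The inner product written out by hand: multiply element-wise, then sum.
manual = sum(x * y for x, y in zip(a, same))

print(manual)                  # same value as np.dot(a, same)
print(np.dot(a, same))         # identical unit vectors
print(np.dot(a, orthogonal))   # unrelated directions
print(np.dot(a, opposite))     # opposite directions give a negative score
```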

Cosine similarity

An alternative score is the cosine similarity between two vectors. As we noted before, this ignores the magnitude of the vectors, so in general it gives a different result to the inner product. But remember, our embeddings are normalized, which means the cosine similarity and the inner product should give the same score. This can be good to know, since the inner product skips the step of computing each vector’s magnitude, which makes it the cheaper operation if you’re performing large numbers of similarity comparisons.

import pandas as pd
import tensorflow as tf

# Cosine similarity with TF: normalize each embedding, then take the inner product
def get_cos_sim(sen1, sen2):
  sts_encode1 = tf.nn.l2_normalize(embed(tf.constant([sen1])), axis=1)
  sts_encode2 = tf.nn.l2_normalize(embed(tf.constant([sen2])), axis=1)
  cosine_similarities = tf.reduce_sum(tf.multiply(sts_encode1, sts_encode2), axis=1)
  return cosine_similarities

res = {}
for sen in sample_sentences:
    res[sen] = []

# Compare every sentence with every other sentence
for s1 in sample_sentences:
    for s2 in sample_sentences:
        cosine = get_cos_sim(s1, s2)
        res[s1].append(round(float(cosine.numpy()[0]), 3))

pd.DataFrame(res, index=sample_sentences)
Vectors - cosine

And it does look like the scores are the same!
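The same point can be checked on a toy example. The sketch below uses two small, non-normalized vectors (made-up values) and a hypothetical `cosine_similarity` helper written in plain NumPy. Before normalization the inner product and the cosine similarity differ; after normalization they agree.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(u, v):
    """Inner product divided by the product of the two magnitudes."""
    return np.dot(u, v) / (norm(u) * norm(v))

u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

raw_inner = np.dot(u, v)           # 3*4 + 4*3 = 24
raw_cos = cosine_similarity(u, v)  # 24 / (5 * 5) = 0.96

# Normalize both vectors, then take the plain inner product.
u_hat, v_hat = u / norm(u), v / norm(v)
unit_inner = np.dot(u_hat, v_hat)

print(f"Inner product (raw vectors):  {raw_inner}")
print(f"Cosine similarity:            {raw_cos:.4f}")
print(f"Inner product (unit vectors): {unit_inner:.4f}")
```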

Vector Normalization

So we know that our embeddings are already normalized or, as they were described on the TF page, approximately normalized. In that case, normalizing a raw embedding should leave it practically unchanged. How can we measure this? Well, why not take the difference between the two and use its magnitude, i.e. the L2 norm, as we did earlier.

non_normed_vector = embed(tf.constant([sample_sentences[0]]))

normed_vector = tf.nn.l2_normalize(non_normed_vector, axis=1)

# Magnitude of the difference; a value near 0 means the embedding was already normalized
tf.norm(non_normed_vector - normed_vector)

And yes, the magnitude of the difference between these two vectors is tiny, so it does look like these are already normalized. When we normalize a vector, we just divide each element by the vector’s magnitude. So you can also do it like this:

import numpy as np
from numpy.linalg import norm

x = np.array([7, 6, 10])

nrm = norm(x)
print(f'Norm of vector {x} is {nrm}')
normal_array = x / nrm
print(f'The new normalized vector is {normal_array}')
print(f'And the norm of this new vector should be 1.... {norm(normal_array)}')
Vectors - normalize code

Dimensionality Reduction

When we talked about vector spaces, we noted that we could reduce the dimensions of our 512-dimensional vectors so that we could visualize them. This is another way to compare the similarity of our sentences. We could do things like clustering to identify which sentences are similar to each other. This can be a better way to find similarity between larger numbers of sentences, since we don’t need to do a pairwise comparison between all of the possible combinations.
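As a quick aside on the clustering idea, here is a minimal sketch using k-means from scikit-learn. The vectors are synthetic stand-ins for sentence embeddings (an assumption, so that the example runs on its own), and the choice of two clusters is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Stand-in embeddings: two groups of 10 vectors around different centres (hypothetical data).
group_a = rng.normal(loc=1.0, scale=0.1, size=(10, 512))
group_b = rng.normal(loc=-1.0, scale=0.1, size=(10, 512))
fake_embeddings = np.vstack([group_a, group_b])

# Each vector gets a cluster label without comparing every pair by hand.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(fake_embeddings)
print(labels)
```

With real embeddings the groups will rarely be this cleanly separated, so the number of clusters usually takes some experimentation.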

The code to reduce dimensions is available in scikit-learn, so you can reduce the dimensions of our embeddings relatively easily:

import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA

def get_3d_viz(X, sentences):
    # Project the embeddings down to 3 principal components
    pca = PCA(n_components=3)
    pca_embed = pca.fit_transform(X)

    # One row per sentence, holding its 3-D coordinates and its text
    df = pd.DataFrame(pca_embed, columns=['x', 'y', 'z'])
    df['sentence'] = sentences
    fig = px.scatter_3d(df, x='x', y='y', z='z', color='sentence')
    return fig

get_3d_viz(sentence_embeddings, sample_sentences)

And this results in a nice visualization like this:

Vectors - visualization

If you look at the above visualization, you may think that the sentence “This sentence should not be similar to anything” and our gibberish sentence seem close together, and are similar to each other. However, let’s swivel our visualization around, and show how a 3-D image can help us better differentiate between these sentences:

Vectors - visualization

Doing this, we can see that there’s some distance between these two sentences, while our Bowie lyric and our own attempt to mimic it are closest together, just as our cosine similarity and inner product scores indicated. 

This shows the benefit of being able to reduce the dimensionality of the vector to something we can visualize in 3-D.
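One caveat worth checking is how much information survives the reduction. As a small sketch, PCA in scikit-learn exposes `explained_variance_ratio_`, which reports the share of the original variance captured by each component. The data here is random and only stands in for five sentence embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 512))  # stand-in for 5 sentence embeddings

pca = PCA(n_components=3)
reduced = pca.fit_transform(fake_embeddings)

# Fraction of the original variance kept by the 3 components together.
retained = pca.explained_variance_ratio_.sum()
print(reduced.shape)
print(f"Variance retained: {retained:.2f}")
```

If the retained share is low, points that look close in the 3-D plot may still be far apart in the original 512 dimensions.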

Summary

So what have we learned from our whirlwind tour of vectors? Hopefully you can see that how we use vectors in ML is different to how you may have learned about vectors in school or college, although that knowledge is still important. The mathematical definition of a vector is key to what’s going on under the hood of our deep learning algorithms. 

The key difference, however, is that we use vectors in a different way in the inputs and outputs of these models. Don’t be confused when you see something about eigenvectors or the linear independence of a vector. Instead, know that this information is important, but to encode your inputs and use your output vectors you don’t really need this information. 

You can dip into the knowledge as you build up your experience with vectors but, in the context of ML, knowing what stage of the pipeline you’re interested in is key to working with vectors. 

References

The following list provides some great resources if you want to dig deeper into the details of how vectors are used in ML:

  • You can find a good overview of vectors on MathInsight which provides some geometric descriptions of how to perform common vector operations such as addition and subtraction
  • This is a good post if you want a better description of vector norms and normalization which also covers how to do this in Python via numpy.  
  • If you want to understand dimensionality reduction then check out this post which has some great visualizations on how PCA works.
  • This is a lecture from the University of Toronto Computer Science department which talks about the different types of neural networks and how they transform input vectors to identify patterns. 
  • This is another great post about transformations in neural networks focusing specifically on feed forward networks and describing how vectors and matrices are involved in these operations.
