
Training, Visualizing, and Understanding Word Embeddings: Deep Dive Into Custom Datasets

10 min
23rd August, 2023

One of the most powerful trends in Artificial Intelligence (AI) development is the rapid advance in the field of Natural Language Processing (NLP). This advance has mainly been driven by the application of deep learning techniques to neural network architectures, which has enabled models like BERT and GPT-3 to perform better in a range of linguistic tasks previously thought to be beyond the scope of most NLP models.

While the full potential of these models may be disputed (for more context see our post on whether these models can actually understand language) there is little doubt that their impact on both the academic and business world is only just beginning to be felt.

As a result, it is important to understand how you can test and train these models for your own use cases. One key part of all these deep learning models is their ability to encode linguistic information in vectors known as embeddings.

In this post, we will look at different techniques you can use to better understand how well a language model captures the contextual relationship between words. We will do this by: 

  1. Looking at the dataset we need to train these models to see if we can come up with a simple one that helps us visualize how these models “learn” the relationship between different words
  2. Looking at the tools and techniques you can use to track the progress of these models and monitor the results while they process our simplified dataset
  3. Showing how you can re-use that template for more complex models with some real-life datasets

There are a number of factors that make this type of project difficult. Firstly, language itself is tricky so it is hard to know how closely one word is related to another. Is the model right to say that “cat” is closer in meaning to “dog” or “lion”? And secondly, how can we identify this contextual relationship? Are there scores we can use to understand similarity and methods we can use to understand this relationship visually? 

We will look at how you can tailor your experiment to control some of these variables to better understand what the models are actually doing when they process text and output some really large vectors. Speaking of vectors …

The wor(l)ds of vectors

The information encoded in an embedding can include both semantic and syntactic aspects of a particular word. While each word can be considered a categorical variable, i.e. an independent entity, it always has some relation to other words. For example, Dublin is the capital city of Ireland. The word Dublin thus has some relationship with: 

  • things that are cities, 
  • things that are capitals, 
  • things that are in Ireland 
  • and so on… 

As you can see, it is difficult to fully define all the relationships one particular word may have. Nevertheless, this is exactly what we are trying to do when we create an embedding.
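To make the idea of "closeness" between embeddings concrete, here is a minimal sketch. The four-dimensional vectors are hand-made for illustration (the dimensions loosely stand for "is a city", "is a capital", "is in Ireland" and "is a fruit"); real embeddings are learned, have many more dimensions, and their dimensions are rarely this interpretable.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-picked values: (city, capital, in Ireland, fruit)
toy_embeddings = {
    "dublin": [1.0, 1.0, 1.0, 0.0],
    "cork":   [1.0, 0.0, 1.0, 0.0],
    "paris":  [1.0, 1.0, 0.0, 0.0],
    "banana": [0.0, 0.0, 0.0, 1.0],
}

for other in ("cork", "paris", "banana"):
    score = cosine_similarity(toy_embeddings["dublin"], toy_embeddings[other])
    print(f"dublin vs {other}: {score:.3f}")
```

In this toy setup "dublin" shares two properties with both "cork" and "paris" and none with "banana", so the first two scores are equal and the last is zero.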

The latest models such as BERT and GPT-3 excel at this task since they can process vast amounts of text and encode many of the relationships they learn in these embeddings. Their predecessors were “not so deep” neural networks like Word2Vec, which used similar techniques to learn associations between words and encode that information in embeddings. 

From Jay Alammar’s amazing blog: These models, like GPT-3 here, turn the word into a vector and then do their “magic” to learn the relationship between it and other words depending on the context. (Note: you should check out Jay’s blog for all things deep learning NLP.)

While the techniques used by Word2Vec are similar, they are easier to understand than some of the more complex and advanced methods used in the latest Language Models (LMs). As a result, Word2Vec is a good place to start to showcase how you can train embeddings and understand the information they contain as they are being trained. 

Hopefully, this will help you apply these techniques to more advanced LMs in your own ML projects, track your progress, and visualize the outcome.

Project description

You may have already come across a wide range of tutorials and examples of how to create your own word embeddings. Many of these resources are great ways to get to grips with the topic. 

From a personal perspective, one thing I struggled with was understanding how well these embeddings performed given the underlying training data. For example, many examples use the works of Shakespeare to train their embeddings.  While I am a fan of “the Bard” it can be hard to know how closely related words like “love”, “family” and “war” should be after reading all of his works. 

Similarly, the most famous example from the original paper on Word2Vec is the relationship between “king” and “queen”. Instead, I think it will be better to create our own simplified language dataset to train our word embeddings. In this way, we can control the relationship between the words in the vocabulary and better understand whether the embeddings are indeed capturing the relationship between the relevant words. 

Hopefully, this will be a dataset you can use and improve on to train more advanced models like those that employ the attention architecture used by the latest NLP models like BERT and GPT-3. 

Also, we will utilize Neptune to track the progress of our embeddings and visualize their relationships. Again, you should then be able to use these techniques to easily track the progress of other NLP projects where you are training your own embeddings.  You can check out my public project on Neptune here to look at some of the results as you follow the code below.

The structure of the project will be as follows:

  1. Dataset: We will define our own simple linguistic dataset for the purposes of being able to understand the relationship between our “words”.
  2. Model: We will use the Word2Vec code from the TensorFlow tutorial series. There are many ways to use Word2Vec, such as through the gensim library. The benefit of the TF tutorial series is you get to see all the gory internals and can change anything you like to tinker and test the model.
  3. Experiment Tracking: We will set up Neptune to track the progress of the model during training. We can use it to track things like loss and accuracy but also show how the similarity between “words” is changing during training.
  4. Visualisation: We will also show how to understand the relationship between “words” in our toy dataset by viewing 2-D and 3-D visualisations in Neptune.
Loss and accuracy charts available in Neptune

So let’s get started and create our dataset.

A perfect language

To better showcase how models like Word2Vec learn relationships between words, we need to create our own dataset. The goal is to create a dataset within which the proximity of words is perfectly defined. As an example think of any word that pops into your mind. 

The word “computer” for instance is a common word and has a relationship or “proximity” to many other words. “Computer” can, in common usage, be used interchangeably or in a similar context to PC, mac, IBM, laptop and desktop. It is often used in conjunction with the screen, program, operating system, software, hardware, and so on. It is related to things like professions when someone refers to a “computer” scientist or “computer” technician. 

But hey! There were people whose profession was called “computer”, who were responsible for complex calculations. The profession is mentioned as early as the 17th century, when scientists hired “computers” to help them with calculations, and Johannes Kepler was one of these computers before advancing his own scientific projects. The profession thrived until the end of the Second World War and the introduction of the ENIAC computer, when human computers became the first programmers. 

Considering all the above, there is much more humanity in computers than one might expect. 

We could go on and on here and the specific relationships are often domain-specific. If you worked in healthcare there may be a different relationship between terms like “computer” than if you worked in IT or as a software engineer. 

What’s the point here? Well, it can be difficult to know how similar “computer” should be to these terms. Should it be more similar to “operating system” than to “laptop” or “scientist”? Without knowing this we will be forever talking about Word2Vec and its “king” and “queen”. Instead, let’s create our own simple language so we can:

  1. Make the vocabulary size small so that it can be easy to train quickly
  2. Define the length of each sentence in our test dataset so we don’t need any padding or added complexity of sentences of different lengths.
  3. Define a strict relationship between the “words” in our dataset so we know exactly what the hierarchy should be
  4. Not include any conjunctions or exceptions or any of that messy irregular linguistic magic that makes learning a new language so difficult!

The easiest way I can think of doing this is with a simple vocabulary where we have 26 three letter words based on the alphabet, e.g. “aaa”, “bbb”, “ccc” and so on.

Then we can define the “proximity” we want between these “words”. “aaa” (there are no capitals in this simple language!), for example, should only appear next to either “bbb” or “ccc”. “bbb” should only appear next to either “ccc” or “ddd” and so on. In this way, when we look at the similarity between words, we would like to see that “aaa” is always close to “bbb” and “ccc” in terms of similarity but very far away from “xxx” or “zzz”. 

The beauty with this is you can start to make your own language as simple or complex as you like. You can start including more complex relationships or conjunction words and so on. Include a word similar to the usage of “and” so that it appears next to any word in your new language. Then look at how similar it is to different words. For now we will create some simple examples to get you started.

A random language

Let’s start with a random language dataset where we would not expect any relationship between the words in your new language. You can find all the relevant code for generating your own language here.

import string
from random import choice

# Map an index (0-25) to a three-letter "word": 0 -> "aaa", 1 -> "bbb", ...
vocab = {}
for c, l in enumerate(string.ascii_lowercase):
    vocab[c] = l*3

sen_num = 30000
sen_len = 7
vocab_len = 26

# Each sentence is sen_len words drawn uniformly at random, one sentence per line
with open("random_vocab.txt", "a") as text_file:
    for s in range(sen_num):
        sentence = []
        for w in range(0, sen_len):
            i = choice(range(0, vocab_len))
            sentence.append(vocab[i])
        text_file.write(" ".join(sentence) + "\n")

A simple language

Now let’s create a simple language with some relationships. As we noted above, our language will have identical length sentences where the “words” will always occur in close proximity to each other. It’s like a simple extension of the alphabet where “aaa”s will always be close to “bbb”s and “ccc”s.

import string
from random import choice

# Map an index (0-25) to a three-letter "word": 0 -> "aaa", 1 -> "bbb", ...
vocab = {}
for c, l in enumerate(string.ascii_lowercase):
    vocab[c] = l*3

sen_num = 30000
sen_len = 7
vocab_len = 26

prev_word = None
with open("proximity_vocab.txt", "a") as text_file:
    for s in range(sen_num):
        sentence = []
        # Highest start is 17, so the last position can reach at most index 25 ("zzz")
        start_word = choice(range(0, 18))
        for w in range(0, sen_len):
            # Position w picks one of three consecutive words, never repeating the previous word
            i = choice([x for x in range(start_word+w+0, start_word+w+3) if x not in [prev_word]])
            prev_word = i
            sentence.append(vocab[i])
        text_file.write(" ".join(sentence) + "\n")
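Before training on this data, it can help to check which words actually end up next to each other. The sketch below re-implements the sampling rule from the generator above in a hypothetical `make_sentence` helper (one simplification: the "previous word" is reset for each sentence) and tallies the index gap between adjacent words.

```python
import random
import string
from collections import Counter

vocab = [letter * 3 for letter in string.ascii_lowercase]

def make_sentence(rng, sen_len=7, max_start=18):
    """Return word indices where position w is drawn from start+w .. start+w+2."""
    start = rng.randrange(max_start)
    prev, indices = None, []
    for w in range(sen_len):
        candidates = [x for x in range(start + w, start + w + 3) if x != prev]
        prev = rng.choice(candidates)
        indices.append(prev)
    return indices

rng = random.Random(0)
gaps = Counter()
for _ in range(2000):
    indices = make_sentence(rng)
    # Gap between each word and the word that follows it
    gaps.update(b - a for a, b in zip(indices, indices[1:]))

for gap in sorted(gaps):
    print(f"gap {gap:+d}: {gaps[gap]} adjacent pairs")
```

Under this rule the gap between neighbouring words can only be -1, +1, +2 or +3, so “aaa” can sit next to “bbb”, “ccc” or “ddd” but never next to “xxx”. That is slightly looser than the “two nearest letters” description, which is worth keeping in mind when reading the similarity scores later.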

Now let’s get the Word2Vec code from the TensorFlow tutorial series to use to train our dataset and then we will be able to track and visualize the results via Neptune.

Word2Vec Model

The TensorFlow tutorial series has a text section which includes a number of examples such as BERT and Word2Vec. As we noted earlier, we will use the Word2Vec model to show how to use Neptune to log accuracy and loss data, track how the similarity between words changes, and save 3D visualisations. 

We will use a stripped-down version of the Word2Vec code so that it is easy to run. The original contains far more information, so make sure to check that out; it is well worth a read. The great thing about this code is that you can modify nearly every available parameter. And with the tracking we are setting up, you will be able to monitor and compare those changes. 

The Word2Vec model will read in the simple dataset we created and then create training examples based on the Skip-Gram approach used in Word2Vec. This approach attempts to predict the context of a word given the word itself.

The Skip-Gram approach tries to predict the surrounding words given the word itself

You can change all the parameters, such as window_size, which sets how many context (neighbouring) words on each side you are trying to predict at each step. 
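To see what the window size does, here is a minimal pure-Python sketch of how (target, context) pairs are formed. The `skipgram_pairs` helper is hypothetical and only illustrates the pairing; unlike the Keras `skipgrams` utility it works on raw words, does no subsampling of frequent words and does not shuffle.

```python
def skipgram_pairs(tokens, window_size):
    """Return (target, context) pairs for each word and every neighbour
    at most window_size positions away on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "aaa bbb ccc ddd".split()
pairs = skipgram_pairs(sentence, window_size=1)
for target, context in pairs:
    print(target, "->", context)
```

With `window_size=1` each word is paired only with its immediate neighbours; a larger window also pairs “aaa” with “ccc”, which blurs the strict proximity we designed into the dataset.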

The code examples below can be found in a single notebook here.

import tensorflow as tf
import tqdm

# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  # Elements of each training example are appended to these lists.
  targets, contexts, labels = [], [], []

  # Build the sampling table for vocab_size tokens.
  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  # Iterate over all sequences (sentences) in dataset.
  for sequence in tqdm.tqdm(sequences):

    # Generate positive skip-gram pairs for a sequence (sentence).
    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
          sequence,
          vocabulary_size=vocab_size,
          sampling_table=sampling_table,
          window_size=window_size,
          negative_samples=0)

    # Iterate over each positive skip-gram pair to produce training examples 
    # with positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(
          tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

      # Build context and label vectors (for one target word)
      negative_sampling_candidates = tf.expand_dims(
          negative_sampling_candidates, 1)

      context = tf.concat([context_class, negative_sampling_candidates], 0)
      label = tf.constant([1] + [0]*num_ns, dtype="int64")

      # Append each element from the training example to global lists.
      targets.append(target_word)
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels

The code you need to run is available in a notebook which you can use. Alternatively, since this code is also available on the TensorFlow website, you can grab it straight from there and copy in the parts you need. 
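The shape of one training example can be easier to see without TensorFlow. The sketch below uses a hypothetical `add_negative_samples` helper: the true context word comes first with label 1, followed by `num_ns` "negative" words with label 0. As a simplifying assumption it draws negatives uniformly and excludes the true context word, whereas the tutorial code uses TensorFlow's log-uniform candidate sampler.

```python
import random

def add_negative_samples(target, context, vocab_size, num_ns, rng):
    """Build one training example from a positive (target, context) pair.

    Assumes vocab_size > num_ns so that enough distinct negatives exist.
    """
    negatives = []
    while len(negatives) < num_ns:
        candidate = rng.randrange(vocab_size)
        # Skip the true context word and anything already drawn
        if candidate != context and candidate not in negatives:
            negatives.append(candidate)
    contexts = [context] + negatives
    labels = [1] + [0] * num_ns
    return target, contexts, labels

rng = random.Random(42)
# Word indices: 1 = "bbb" (target), 2 = "ccc" (true context)
target, contexts, labels = add_negative_samples(1, 2, vocab_size=26, num_ns=4, rng=rng)
print(target, contexts, labels)
```

The model is then trained to score the first context entry highly and the remaining entries low, which is how it learns embeddings without a labelled dataset.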

Experiment tracking

It can be difficult to identify the changes that occur within your word embeddings. Ideally, we would want to be able to see how the similarity of our words changes as training progresses. We can track things like the loss and accuracy for each epoch but it would be great to track the actual similarity between our words. This is a good example of how we can use the callback functionality of TensorFlow. This will enable us to log exactly what we want and then to track it in a Neptune experiment. 

To set up the Neptune experiment you just need to set up your credentials and initialize your Neptune module. Then all you need to do is start your experiment and you are ready to go!

import neptune
from neptune.types import Files

run = neptune.init_run(project='choran/sandbox',
             api_token='your token',)

This should output the URL for your experiment -><user>/sandbox/e/SAN-1

To check the similarity between our word embeddings we can compare the similarity for one example word and track how that changes between epochs.

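import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def get_similarity(word, X, X_vocab, vocab):
    # Create a dict to find word index
    vocab_d = {}
    for i, w in enumerate(X_vocab):
        vocab_d[w] = i

    # Get the similarity between the word and every row of X
    y = X[vocab_d[word]].reshape(1, -1)
    res = cosine_similarity(X, y)
    df = pd.DataFrame(columns=['word', 'sim'])
    # vocab[1:] skips the first vocabulary entry; res is flattened from (n, 1) to (n,)
    df['word'], df['sim'] = vocab[1:], res.ravel()
    return df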

Then all we need to do is call this function from our custom callback function. You can read more about the different callback features which you can use here. And for some examples of a custom callback, check out the examples in TensorFlow on writing your own callbacks here.

You can log your results in the form of a Pandas DataFrame and then save that to your experiment and track how it changes over time.
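As a rough illustration of what such a similarity table contains, here is a small NumPy sketch. The `similarity_to_word` helper and the four 2-d vectors are hypothetical stand-ins for the trained embedding matrix; the check word itself is kept in the ranking, so it appears first with a score of 1.

```python
import numpy as np

def similarity_to_word(word, X, words):
    """Cosine similarity between one word's row of X and every row of X,
    returned as (word, score) pairs sorted from most to least similar."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalise each row
    scores = unit @ unit[words.index(word)]
    order = np.argsort(-scores)
    return [(words[i], float(scores[i])) for i in order]

# Hypothetical embeddings: "aaa", "bbb", "ccc" point roughly the same way, "xxx" does not
words = ["aaa", "bbb", "ccc", "xxx"]
X = np.array([[1.0, 0.1], [0.9, 0.3], [0.7, 0.6], [-0.8, 0.5]])

ranking = similarity_to_word("bbb", X, words)
for other, score in ranking:
    print(f"{other}: {score:.3f}")
```

Saving a table like this every few hundred epochs makes it possible to see whether the nearest neighbours of a check word settle on the words you designed to be close to it.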


This shows the cosine similarity between ‘bbb’ and other words after 900 epochs. As expected, it is close to ‘aaa’ and ‘ccc’. But how has that changed over time and how does it compare to the random dataset or a larger dataset?



3D Visualisation

For our 3D visualisation we can use plotly and PCA to save a 3D plot of our embeddings and visually compare how they relate to each other. For example, look at the difference between our “random” dataset and our more “structured” dataset.

We can see here that the clustering of our “words” in the random dataset seems fairly chaotic.

Whereas here, for the structured dataset, you can see a “nicer” relationship between our “words”: it seems far more structured, and there seems to be a pattern you can investigate by interacting with the visualisation.
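The projection step behind plots like these can be sketched with NumPy alone. This is an assumption-laden illustration, not the code used for the figures: `pca_project` is a hypothetical helper that performs PCA via a singular value decomposition, and the random matrix stands in for a 26-word embedding matrix.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project the rows of X onto their first n_components principal axes."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(26, 16))  # stand-in for 26 words with 16-d embeddings
coords = pca_project(X)
print(coords.shape)
```

Each row of `coords` gives the x, y and z position of one word, which can then be passed to a 3D scatter plot (for example plotly's `scatter_3d`) with the words as labels.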

You can also see how simple our callback function is to log and track this info to Neptune.

class TrackLossAndSimilarity(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None):
        # Log the similarity to an example word at regular intervals
        if epoch%100 == 0:
            vocab = vectorize_layer.get_vocabulary()
            # NOTE: this call is truncated in the source; passing vocab as the
            # second argument is an assumption.
            X, X_vocab = get_data(word2vec.get_layer('w2v_embedding').get_weights()[0],
                                  vocab)
            check_word = 'bbb'
            sim_df = get_similarity(check_word, X, X_vocab, vocab)
            sim_fig = get_3d_viz(X, X_vocab)
            log_chart(f'3d-plot-epoch{epoch}', sim_fig)
            log_table(f'similarity-{check_word}-epoch{epoch}', sim_df.sort_values('sim', ascending=False))



Summary and next steps

To close this out, let’s review what we have just done here and look at how you can use it to train and test more advanced models. 

  1. We created our own “linguistic” dataset: The goal here was to create a dataset within which we could know (and design) the relationship between terms. That way, when we create embeddings and check their similarity, we know that if `aaa` is more similar to `mmm` than to `ccc`, something is wrong: the embeddings are not picking up the relationship we know exists. 
  2. We found a model to create embeddings: We used some example code for the Word2Vec model to help us understand how to create tokens for the input text and used the skip-gram method to learn word embeddings without needing a supervised dataset. The output of this model was an embedding for each term in our dataset. These are “static” embeddings since there is one embedding per term.
  3. We explored the embeddings: The ultimate goal was to understand how a model can encode meaning into an embedding. So we looked at what information the embeddings contained when trained on a random dataset and compared that to a structured dataset where terms were only used in “proximity” to certain other terms. We did this by tracking the similarity between example terms and showing how that changed over time and differed between the two experiments. We also showed this by visualizing the relationship between all embeddings. In our simple example, we were able to see a difference between the random and the proximity dataset.

How can I use this to better understand models like BERT?

Remember, no matter how advanced and complicated models like BERT may seem, they are ultimately just encoding information into an embedding like we did here. The difference is that the people who designed the Transformer architecture (which is the basis for most new NLP models like BERT and GPT-3) found a “better” way to encode information into embeddings. The approach we used generated one embedding per term. 

But, as we know, real language is messy and the same words can have different meanings depending on context. This is a blog post, but you can post mail, in basketball you can post up, in soccer you can hit the goal post, you can display information in public by posting it on the door and so on. This is a problem for static embeddings like we just created but the Transformer architecture tries to address this by creating embeddings based on “context”. So it produces different embeddings based on the context in which the word is used. 

You can test this, for example, by creating a new dataset in which one term is used in proximity to very different terms. The term `ccc` could be close to `bbb` and `ddd` as currently designed but also you could add code to ensure it is also close to `xxx` and `yyy`. How would you think the similarity values for `ccc` would work in this case? Would they show equal similarity to `ddd` and `xxx` for example?
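One possible way to build such a dataset is sketched below. Everything here is an assumption layered on the earlier generator: `proximity_sentence` re-implements the same sampling rule, and the hypothetical `far_context_sentence` plants `ccc` in a sentence drawn from the end of the alphabet, so `ccc` is seen both near `bbb`/`ddd` and near `xxx`/`yyy`.

```python
import random
import string

vocab = [letter * 3 for letter in string.ascii_lowercase]
CCC = vocab.index("ccc")

def proximity_sentence(rng, start, sen_len=7):
    """Word indices where position w is drawn from start+w .. start+w+2."""
    prev, indices = None, []
    for w in range(sen_len):
        prev = rng.choice([x for x in range(start + w, start + w + 3) if x != prev])
        indices.append(prev)
    return indices

def far_context_sentence(rng):
    """A sentence from the end of the alphabet with ccc planted at position 5."""
    indices = proximity_sentence(rng, start=17)
    indices[5] = CCC  # its neighbours now come from vvv..zzz
    return indices

rng = random.Random(0)
corpus = []
for _ in range(1000):
    # Roughly half the sentences follow the usual rule, half contain the planted ccc
    if rng.random() < 0.5:
        indices = proximity_sentence(rng, start=rng.randrange(18))
    else:
        indices = far_context_sentence(rng)
    corpus.append(" ".join(vocab[i] for i in indices))

print(corpus[0])
```

Training the same Word2Vec model on this corpus and logging the similarity table for `ccc` would show how a single static embedding copes with a word that lives in two very different contexts.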

The key here is that you can apply the same techniques here to try and understand how other models work. You could train a BERT model on this dataset and try and see what is the difference between static embeddings and context-based embeddings. Here are some resources you can use if you want to do something like that:

  • Train BERT from scratch on Esperanto: This guide from Hugging Face shows you how to train a BERT model from scratch on a new language. You can try and train this on your custom dataset or some more complicated variation of it. Then try and track the embeddings like we did for Word2Vec. Remember, BERT embeddings are context-based so you won’t have a lookup dictionary like you have with Word2Vec. You will feed it an input sentence and get back embeddings for each word or a “pooled” embedding for the entire sequence. Find out more here.  
  • Start experimenting with sentence embeddings: Instead of word embeddings you can embed entire sentences and compare their similarity. You could play around with this by using a simple custom dataset and then getting the average embedding for your input sequence and using that to compare entire sequences of data and identify their similarity. You can find some examples of these models here.
  • Context v static embeddings: Do we really need context embeddings? How do context embeddings differ from static embeddings? And can we generate better static embeddings from context embeddings? Check out this great post if you want to find answers to some of these questions. 
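The averaging idea mentioned in the sentence-embedding suggestion above can be sketched in a few lines. The `sentence_embedding` helper and the 2-d word vectors are hypothetical; in practice the vectors would come from your trained model.

```python
def sentence_embedding(sentence, word_vectors):
    """Mean of the word vectors for the whitespace-separated words in sentence."""
    vectors = [word_vectors[w] for w in sentence.split()]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Hypothetical 2-d word vectors
word_vectors = {"aaa": [1.0, 0.0], "bbb": [0.0, 1.0], "ccc": [1.0, 1.0]}

emb = sentence_embedding("aaa bbb ccc", word_vectors)
print(emb)
```

Two sentences can then be compared with the same cosine similarity used for individual words. Note that averaging ignores word order, so "aaa bbb" and "bbb aaa" get identical embeddings.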
