AI Limits: Can Deep Learning Models Like BERT Ever Understand Language?

Posted November 11, 2020
AI limits

It’s safe to assume a topic can be considered mainstream when it is the basis for an opinion piece in the Guardian. What is unusual is when that topic is a fairly niche area that involves applying Deep Learning techniques to develop natural language models. What is even more unusual is when one of those models (GPT-3) wrote the article itself!

Understandably, this caused a flurry of apocalyptic Terminator-esque social media buzz (and some criticism of the Guardian for being misleading about GPT-3’s ability).

For a domain like NLP it is a rare and unexpected time to be front and centre of the “Artificial Intelligence (AI) v Human beings” debate. This burden usually falls on robots (most recently self-driving cars) since it’s easier to imagine being run over or attacked by an AI-powered mechanical monster. NLP models that can generate text seemed, initially, somewhat less scary than a red-eyed Terminator. 

Nevertheless, the rapid progress made in recent years in this field has resulted in Language Models (LMs) like GPT-3. Many claim that these LMs understand language due to their ability to write Guardian opinion pieces, generate React code, or perform a series of other impressive tasks.

But how well do these models really perform on simple NLP tasks? 

Is there really any evidence that they “understand” what they’re talking about? 

Did the GPT-3 model writing the Guardian article “understand” what it was saying? 

Could it defend the piece like a human? 

How good is it at learning new tasks?

Even from a purely practical business perspective, it is important to understand the potential limits of these models.

  • If they really are as good as the hype indicates, then it is vital that your business starts to adopt these technologies immediately, since they will have an even more transformative impact than technologies such as the telegraph, electricity or the railway. 
  • Conversely, if they are over-hyped then it may change how you view and use these models in your future plans.

To understand NLP, we need to look at three aspects of these Language Models:

  • Conceptual limits: Is it possible to understand language by reading lots and lots of text? If we try to understand how humans learn and use language, it seems there may be implicit limits to how much a machine can learn from text alone.
  • Technical limits: Even if it is possible for these models to develop human-like skills in language tasks, are the current models the right ones for the job? Does the underlying architecture of these models make it impossible for them to realize their full potential?
  • Evaluation limits: Maybe the problem is simply that we do not have the capability to properly evaluate these models? Is the current hype related to the fact that the NLP tasks we use to test these models are outdated and too simple given the rapid advances seen recently in the field?

Conceptual limits: What can we learn from text?

The big problem with training any DL model is data. You typically need lots of it. How much? The more the better, and this is the trend that most recent LMs have followed. 

Without going too much into the design specifics of each model (we’ll do that in the next section), we can think of the general approach as being able to understand language by reading more and more text. 

The key is that the text does not need to be labelled. Instead, these models can read a book or a blog post and try to understand the meaning of words within the context in which they’re used. For example, the term “deep learning” will be used mostly in relation to things like “machine learning”, “neural networks” or “artificial intelligence”. So the models will start to see these terms as having a somewhat related context. With more and more data they will start to learn more nuances, such as the differences in usage and meaning between these related terms. At least, that is the theory.
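To make the co-occurrence intuition concrete, here is a minimal Python sketch. This is a toy illustration of the idea that related terms share context, not how LMs are actually trained (they learn dense vectors, not raw counts):

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each "document" is a short snippet of unlabelled text.
corpus = [
    "deep learning is a branch of machine learning",
    "neural networks power modern deep learning",
    "machine learning and artificial intelligence overlap",
    "the river bank was muddy after the rain",
]

# Count how often pairs of distinct words appear in the same snippet.
cooccurrence = Counter()
for doc in corpus:
    words = set(doc.split())
    for pair in combinations(sorted(words), 2):
        cooccurrence[pair] += 1

# "learning" shows up alongside ML-related terms, never river terms.
print(cooccurrence[("learning", "machine")])  # 2
print(cooccurrence[("learning", "river")])    # 0
```

With enough text, these raw counts start to separate related terms from unrelated ones, which is the same signal modern LMs exploit at a vastly larger scale.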

As an example of the amount of data needed, take BERT. Published in 2018, it is one of the most influential models of recent years: it combined 2.5 billion words of Wikipedia data with 800 million words of book corpus data, and its largest version used 340 million parameters. 

GPT-2 (the model initially deemed too dangerous to release) followed BERT in early 2019 and was trained on 8 million web pages (~40 GB of text data) and contained 1.5 billion parameters. For comparison, the most recent version of OpenAI’s GPT (the Guardian-writing model), GPT-3, contains a whopping 175 billion parameters, and was trained on a total dataset of 45TB drawn from a wide array of different text sources. 

GPT-3

Models like GPT-3 show that performance on certain tasks improves with more parameters (and, in this example, with more task demonstrations or instructions; the zero, one or few shot in the diagram). But does this mean these models are beginning to “understand” language?
Source: GPT-3 Paper

From a high level it’s easy to see the trend here: create models with more parameters, get them to consume more and more text data, and the models will begin to “understand” language at a human level. 

The evidence shows that this approach seems to be working. GPT-3 appears to be one of the most advanced models: it can perform well on a wide range of different language tasks without requiring much further training. 

However, a recent paper raised some interesting concerns about the feasibility of this approach.

Do they pass the octopus test?

In their paper “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data”, Emily Bender and Alexander Koller consider whether LMs such as GPT-3 or BERT can ever learn to “understand” language – no matter how much text they have access to or how many parameters they have available to process that information. The key issue they raise is the relationship between form and meaning. 

According to their paper, form is the identifiable, physical part of a language: the marks and symbols that represent it, such as the symbols on a page or the pixels and bytes on a webpage. 

The meaning is how these elements relate to things in the external world. It’s important to note here that the authors assume the models in question are trained only on text, not on any combination of text and images or other representations of the external world. In this sense, models like GPT-3 and BERT are trying to learn meaning from form alone. 

Think of it like Searle’s Chinese Room thought experiment (which the authors reference in the paper): learning from form alone would be like trying to communicate in a language you know nothing about, given only textbooks and dictionaries written in that language. This is similar to what current LMs are trying to do by looking through mountains of textual data.

NLP limitations

Could a machine learn the intent of the Napoleon pose from form alone? Locked in a room, reading only text, could the machine identify that the statement referred to a particular type of pose?

What does this all have to do with octopuses, you may ask? 

The octopus test is an interesting thought experiment used in the paper to show how the current LMs will never be able to truly “understand” language. You should check out the paper for a more detailed description of the experiment; it’s a great example of the power of thought experiments and a nice change from DL papers full of crazy-looking equations. But here’s the gist of it:

Imagine an octopus, O, placed between two people, A and B, both stranded on remote desert islands, who can only communicate via an underwater telegraph-like system. The octopus, like the LMs, can listen in on the conversations between A and B. Imagine it does this for long enough to hear almost every possible word, phrase or sentence that A and B are capable of using. Could O ever communicate with A or B in a way that shows O “understands” what they’re talking about?

NLP octopus visualization

We can easily imagine scenarios where trivial conversations between O and A or B would look like perfectly valid and reasonable conversations. Neither A nor B would know they’re talking to an octopus. GPT-3 seems to be able to do precisely this – communicate with people in a human-like fashion. 

However, this only works up to a point. Imagine a different task, where A or B ask O to build an important item (like a coconut catapult), report back on how it works, and to offer suggestions for potential improvements. We can start to see here that O has no way to “understand” how to build this, or what the items needed even look like. There is no connection between the form and the meaning. 

Similarly, as the nature of the tasks changes, the connection between meaning and form becomes ever more important, and this is precisely where O will start to show linguistic limitations.

When we imagine these scenarios, it’s not difficult to come up with tasks where LMs such as BERT or GPT-3 would struggle to “understand” what they’re saying since they lack the link between form and meaning. They are chaining things together like a jigsaw, identifying patterns they have seen in the past. But they don’t really understand what they’re doing, or why.

People may claim this is not important for many NLP tasks, or that we don’t really need to care whether these models “understand” things, but whether they’re able to perform human-like tasks. Maybe it’s a purely academic discussion, of no relevance to whether these models are useful in a business sense? 

Even if we assume these models can learn enough from form alone to perform at a near-human level, that may still not be enough. If the core architecture of these models limits their ability to learn even form alone, then it’s a moot point whether they “understand” what they’re saying. And that is what we will look at next.

Technical limits: Are LMs “cheating”?

They’re not taking performance-enhancing drugs, but it’s possible that LMs like BERT and GPT-3 can gain an “unfair” edge. To understand how this is possible, we need to dive into the detail of the underlying architecture of models like BERT: the Transformer architecture. 

It is this architecture that is claimed to help LMs learn “context” from the vast datasets of text on which they’re trained. But what if they are not really learning “context” at all? What if LMs are simply finding cues hidden in the text data? 

Using these cues, a Language Model could perform well on a particular task, such as question answering, entity recognition or sentiment analysis, while actually having very limited linguistic insight. 

The issue arises when we make subtle changes to the underlying text data – changes that would not impact a human’s performance, but render LMs like BERT virtually “speechless”. If this is the case, then these models may struggle to even learn key parts of form or language semantics, which they will need to excel at many important linguistic tasks.

NLP cues

Table from a paper published in 2012, which created a rule-based structure that uses cues, such as negation, to perform better on NLP tasks. This is an example of the “cues” that LMs can use to “cheat” at NLP tasks. If Transformers are using these simple mechanisms, it raises questions about their potential to “understand” more complex aspects of human language. 

The science of Context

On May 6th, US President Donald Trump, during an event at the Oval Office, stated the following:

“This is really the worst attack we’ve ever had. This is worse than Pearl Harbor. This is worse than the World Trade Center. There’s never been an attack like this. And it should have never happened.” 

What was he talking about? A new war? A new terrorist attack? Maybe his next remark helps clarify it a little: 

“It could have been stopped at the source. It could have been stopped in China. It should have been stopped right at the source, and it wasn’t.” 

Still not clear? Well, the President’s remarks in an earlier paragraph should provide the clarity needed:

“…this virus is going to disappear. It’s a question of when. Will it come back in a small way? Will it come back in a fairly large way? But we know how to deal with it now much better.”

The key takeaway here is that context is important. If you cannot remember the earlier paragraph, then you would have very little idea the President was talking about the Coronavirus pandemic. Language is tricky, it can be messy, and in any linguistic setting we need to constantly update our “cache” of context so that we can infer meaning from the words we are processing.

nlp context

Spilling coffee is “messy” in both a linguistic and a practical sense! The verb “spill” can have different meanings depending on the context, so we constantly need to update our context “cache” to know what “spill” refers to in each scenario. The example is from the book “The Reading Mind: A Cognitive Approach to Understanding How the Mind Reads” by Daniel T. Willingham.

In 2017 a new paper was published, “Attention is all you need”, which changed the DL NLP landscape forever. It’s still at the forefront of any headline you read about a new model shattering performance benchmarks in an NLP task. 

One of the main reasons for this is that the neural network design it introduced, known as the Transformer, allowed models to capture context more easily when parsing text. This used to be difficult, since earlier architectures parsed text sequentially.

Processing text word by word, sentence by sentence made these models very slow to train on a large text corpus. It also made maintaining any form of long-term context computationally very expensive. In reference to the President’s comments above, it would have been difficult for these models to store the earlier context indicating that the “attack” referred to the Coronavirus. 

Without getting into too much detail, the Transformer architecture uses key, query, and value parameters which enable it to learn which parts of the text are most relevant in a specific context. Think of the classic “river bank” v “money bank” scenario. In earlier models, a word’s meaning was static; it didn’t change based on context. So the word “bank” would have the same meaning in both “I left my fishing rod by the river bank” and “I lodged my money in the bank today”. Similarly, without context, President Trump’s comments could read very differently.
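The static-embedding limitation is easy to demonstrate. The sketch below uses random placeholder vectors (not trained values) to show that a Word2Vec-style lookup table assigns “bank” exactly the same vector in both sentences, no matter the context:

```python
import numpy as np

rng = np.random.default_rng(0)

# A static embedding table (Word2Vec-style): one fixed vector per word.
# The vectors here are random placeholders, not trained embeddings.
vocab = ["i", "left", "my", "fishing", "rod", "by", "the", "river",
         "bank", "lodged", "money", "in", "today"]
static_embeddings = {w: rng.standard_normal(4) for w in vocab}

def embed(sentence):
    # Look up each word's single, context-free vector.
    return [static_embeddings[w] for w in sentence.lower().split()]

river = embed("I left my fishing rod by the river bank")
money = embed("I lodged my money in the bank today")

# "bank" gets an identical vector in both sentences -- a static model
# has no way to distinguish the two senses from context.
print(np.array_equal(river[-1], money[-2]))  # True
```

A contextual model like BERT instead recomputes each word’s vector from its surroundings, so the two occurrences of “bank” end up with different representations.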

The Transformer in all its beauty. Don’t worry if this diagram scares you; it’s not important to know exactly how it all works for now. The interesting thing to take away from this diagram is that the Transformer is really made up of two components, the encoder and the decoder. Different models use different parts of this architecture. BERT, for example, uses the encoder part, while the GPT models use the decoder part. Source: Attention is all you need

It’s all about Attention

The Transformer architecture leveraged a mechanism known as “attention” to address the context problem in NLP. Attention had been used many times before by other neural networks, but the unique aspect of the Transformer was that it used only attention to learn from text (hence the paper title “Attention is all you need”). 

Previous models used attention as part of their approach, usually in a minor way. The Transformer really doubled down on the idea of attention and packed individual attention elements, known as “attention heads”, into multi-head attention modules. It then stacked these multi-head modules together to form attention layers. 

In simple terms, think of an attention head as being able to “focus” on a word (or part of a word), and it can tell the model how relevant or important that word is to understand the current word being parsed. 

More attention heads mean your model can look back (or forward) at more words in a sentence or paragraph. More layers of attention mean that your model can then learn higher levels of both syntactic structure and semantic meaning. 

NLP attention heads

The bold lines indicate words which the attention heads identify as being more relevant to the meaning of the word currently being processed. So “rag” and “coffee” influence the meaning of “spilled” more than “and” or “get”.  

Any neural network is basically a lot of matrix multiplications, and the attention mechanism is no different here. The table below shows a toy example of what the output for the word “spilled” from the attention layers might look like. 

It does this by multiplying different weight matrices together; each matrix “learns” weights that identify which words in the sentence the network should “pay attention to” because they are important to the context of that particular word. 

|         | Trisha | spilled | her | coffee | and | Dan | jumped | up  | to  | get | a   | rag  |
|---------|--------|---------|-----|--------|-----|-----|--------|-----|-----|-----|-----|------|
| spilled | 0.05   | 0.6     | 0.0 | 0.15   | 0.0 | 0.0 | 0.05   | 0.0 | 0.0 | 0.0 | 0.0 | 0.15 |

(The rows for the other words in the sentence are omitted here.)

The final vector representing the word “spilled” is made up from the weights of all the other words, as shown in the table. Note that the raw scores are put through a softmax function so they all add up to 1. Thus, the final vector representing “spilled” is composed mostly of its own meaning, but also draws on the weighted vectors representing “Trisha”, “coffee” and so on. In a static word embedding model like Word2Vec, the vector for “spilled” would come 100% from its own single meaning, i.e. it would not use any context at all.
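The mechanism sketched above can be written in a few lines of NumPy. This is a toy version of scaled dot-product attention with random matrices standing in for trained weights; it is meant to show the shape of the computation, not to reproduce BERT:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, as in 'Attention is all you need'."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance of every word to every other word
    weights = softmax(scores)        # each row sums to 1, like the table above
    return weights @ V, weights      # each output is a weighted mix of value vectors

# 12 words: "Trisha spilled her coffee and Dan jumped up to get a rag"
rng = np.random.default_rng(42)
n_words, d_k = 12, 8
Q = rng.standard_normal((n_words, d_k))  # queries (random stand-ins)
K = rng.standard_normal((n_words, d_k))  # keys
V = rng.standard_normal((n_words, d_k))  # values

output, weights = attention(Q, K, V)
print(round(weights[1].sum(), 6))  # the weights for "spilled" sum to 1.0
```

In a real Transformer, Q, K and V are produced by multiplying the input word vectors with learned weight matrices, and many such heads run in parallel inside each layer.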

I know, this was a bit of a whirlwind overview of the Transformer architecture (a great resource to understand it in more detail is Jay Alammar’s excellent blog post on the topic). But it leaves us with two key assumptions we can make, which would support the claim that LMs can “understand” language:

  1. Models like BERT and GPT-3 use the attention mechanisms of the Transformer architecture to learn context from textual based datasets. 
  2. By learning context these models develop some level of language “skills” which enables them to perform better on a range of language tasks.

But if we can show that there are doubts relating to both of these assumptions, then it seems difficult to claim that these models are capable of developing any ability to “understand” language.

BERTology – What does BERT learn?

In their 2019 paper, “Revealing the Dark Secrets of BERT”, the authors delve deep into the inner workings of BERT. One of their key findings is that BERT is massively overparameterized. 

The authors investigated this by disabling one or more attention heads, and then comparing the results. What they found was surprising – not only did removing attention heads not impact the results, but in some cases it improved the performance. 

It should be noted that this was not the case on every NLP task, and the removal of some attention heads did negatively impact performance. But it occurred in enough cases for the authors to raise questions about the relevance of so many attention heads in BERT.

NLP attention heads

Experiments have shown that you can remove individual heads from an attention layer in BERT and it will perform the same or better on certain tasks. Even more surprisingly, they show you can remove entire layers, i.e. all of a layer’s attention heads, without severely impacting performance. 

This raises some important questions. How can these models learn the complexities and nuances of language via a small number of attention heads? Are the other heads simply storing information that they use at a later point, rather than learning rules and structures via context? 

You could claim this means that attention is so powerful, BERT is able to perform well on NLP tasks by using only a small amount of its potential. 

In the next section, we will look at this claim in more detail, since that is also related to the structure of the evaluation datasets. At the very least, these findings make us question whether simply loading up more and more attention heads will lead to models that “understand” language. 

Instead, we may need to look at pruning and re-designing these networks if we want to develop models that truly understand language. As evidence of this, the OpenAI team behind the parameter behemoth that is GPT-3 note in their own paper that we may be hitting the limit of what Language Models can learn from more parameters and more training:

“A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the pretraining objective.”

Are these models cheating?

And what about the cheating? This relates to our claim that these models are capable of learning something about language, via context, which helps them perform better in NLP tasks. 

By being able to parse different sentences, look at all the words, and identify the important ones in context-specific ways, these models should be able to identify that Trump is talking about the Coronavirus and not a terrorist attack in our earlier example. This would help them perform well on a range of NLP tasks that were previously beyond NLP model capacity. 

A series of recent papers claim that models like BERT don’t really understand language in any meaningful way. They show this in a creative way, by changing some of the evaluation datasets and then looking at the results. First, they analyze datasets on which models like BERT have performed so well that they outperform humans at the task. They then alter these datasets in a way that should make no difference to how the results are interpreted. 

For example, they identified that many phrases in the datasets contain negations such as “not”, “will not” or “cannot”. Using simple rules to “key” off these identifiers would result in high overall scores. The paper authors then altered these datasets to remove these “cues” while maintaining the overall structure of the dataset. 

For a human, or anyone who “properly” reasoned about the task originally, the scores should not vary significantly. It’s the equivalent of saying:

“it is not raining therefore I can go for a run”, 

and 

“it is raining therefore I cannot go for a run”,

i.e. we change the initial premise, but it should not make it any more difficult to infer the correct answer. 

If we’re not “cheating”, and we understand that we cannot run in the rain, then we should make the correct inference in both cases. The fact that one statement is negated should not cause us to make a false inference, but this is exactly what happens with BERT. Instead of performing at a human-like level, it immediately dropped to nothing better than random performance. 
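The dynamic is easy to reproduce with a deliberately naive sketch. The toy “model” below (a hypothetical classifier for illustration, not BERT itself) predicts “contradiction” whenever a negation word appears, which works on a dataset where negation correlates with the label and collapses on one where it doesn’t:

```python
def cue_model(premise, hypothesis):
    # Key off the negation cue alone -- a crude heuristic, not real reasoning.
    words = set((premise + " " + hypothesis).lower().split())
    return "contradiction" if words & {"not", "cannot", "never"} else "entailment"

def accuracy(model, dataset):
    return sum(model(p, h) == label for p, h, label in dataset) / len(dataset)

# Toy dataset where negation happens to correlate with the label:
biased = [
    ("the cat is asleep", "the cat is not asleep", "contradiction"),
    ("the actor ran",     "the actor moved",       "entailment"),
]
# The same task with the spurious correlation removed:
debiased = [
    ("the cat is not asleep", "the cat is awake", "entailment"),
    ("the actor ran",         "the actor slept",  "contradiction"),
]

print(accuracy(cue_model, biased))    # 1.0 -- looks impressive
print(accuracy(cue_model, debiased))  # 0.0 -- the cue was doing all the work
```

The toy examples are made up, but the pattern mirrors what the adversarial-evaluation papers report: high scores on the original dataset, near-random (or worse) once the cues are removed.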

Now we know that BERT:

  • doesn’t use all its attention heads to learn from context, 
  • does not seem to use what it has learned to “reason” about or “understand” language, and
  • seems to use statistical “cues”, like the negation terms “not” and “cannot”, as a crude heuristic to get better results. 

Can we blame the models themselves, or does the fault lie with the testers?

Evaluation limits: How good are models like BERT?

So far we have considered a broad philosophical question: can current Deep Learning NLP models learn to understand language via text alone? 

Even if we assume these models can potentially learn a high level of linguistic knowledge from text alone, we looked at the internal structures underpinning the new Transformer architecture –  the key to the latest advances. We showed that there are questions about whether these models can scale to a level where they would be capable of developing human-like linguistic knowledge.

All of this was based on the assumption that we can somehow test these models to see how well they perform. We assume that there are datasets and benchmarks which will tell us whether these models really have learned transferable, human-like language skills. 

We’ve already seen that models like BERT can “cheat” on some tests, but is this an outlier or are modern NLP datasets easy picking for the suite of current DL LMs? If there are simple tricks that these models can use to get high scores, then we will struggle to know whether these models are really improving their linguistic abilities.

Asking the right questions

There are many NLP tasks on which a model can be evaluated. Some, such as Named Entity Recognition (NER) and Parts-Of-Speech (POS), look at the ability of a model to understand the syntactic and hierarchical structure of language. 

They represent the core parts of a language, the foundation on which the higher level of semantics evolves. If we want to claim that the new Language Models understand language, then we want to see how they perform on more complex, higher level tasks such as Question-and-Answer (Q&A). 

Here, a model needs to understand things like context, inference, and semantic similarity. As we noted earlier, models like BERT have performed at a human like level on a number of these higher level complex tasks. 

But we have also seen that these models can cheat. So are these models simply improving at a pace faster than the benchmarks can keep up with, or are they showing real signs of linguistic knowledge?

A new dataset released by Google is a good example of how we need to develop new benchmarks and avoid the pitfalls of previous approaches. The Natural Questions (NQ) dataset is a Q&A dataset which aims to evaluate how well an LM can understand a question and parse a page of text, such as a Wikipedia page, to find a potential answer. 

What’s interesting about this dataset are the measures the authors took to make it difficult for LMs to cheat. These measures show how earlier benchmarks and datasets may have made it easy for models like BERT to cheat.

The first step the authors took was to ensure that the questions they chose were “real” questions. “Real” in the sense that these questions were asked by people in Google searches. They were reviewed to make sure they were well-formed, reasonably long and coherent. 

Previously, Q&A datasets like SQuAD would have asked contributors to create questions for a given answer. So, given a piece of text, create a question for which this paragraph represents the answer. This can lead to “priming”, which is where the contributor would see the answer first and create questions that closely resemble the answer. This makes it easy for the model to use “cues” to find an answer. 

After choosing the questions, NQ contributors were given a page of text and asked to identify:

  • a long answer, 
  • a short answer, 
  • or if it was not possible to find an answer in the given text at all. 

In some cases, a question may have both a long answer, which covers every aspect of the question, and a short answer, which answers it succinctly. 

The short answer is a short text which includes one or more named entities. The option of not including an answer is another key differentiating step for the NQ dataset. Earlier Q&A datasets, including the first version of SQuAD, only included questions for which there were corresponding answers. 

Once a model learns that there is always an answer, it can exploit this information to find one without really testing its higher-level language skills. 


Better benchmarks, better models

Luckily, the NLP community seems to accept that we need to put as much effort into creating datasets and benchmarks as we do into creating the Language Models themselves. 

There are a number of recent papers focused on identifying how models like BERT can exploit weaknesses in some classic NLP datasets. 

For example, in “Right for the Wrong Reasons”, the authors identify three heuristics that allow LMs to achieve high scores on NLP tasks without really understanding the basic rules of language. These heuristics reveal a lack of understanding, yet can still produce high scores on poorly constructed datasets:

  1. Lexical Overlap: assuming “the doctor was paid by the actor” is the same as “the doctor paid the actor”,
  2. Subsequence: assuming “the doctor near the actor danced” is the same as “the actor danced”, 
  3. Constituent: assuming “if the artist slept, the actor ran” is the same as “the artist slept”
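The lexical overlap and subsequence heuristics are simple enough to write down. The toy function below (an illustration, not a real model) predicts entailment whenever every hypothesis word appears in the premise, and so fires on exactly the kinds of examples above:

```python
def lexical_overlap_model(premise, hypothesis):
    # Predict "entailment" whenever every hypothesis word appears in the
    # premise -- the lexical overlap heuristic, with no real reasoning.
    return set(hypothesis.lower().split()) <= set(premise.lower().split())

# Lexical overlap: fires even though the roles are reversed.
print(lexical_overlap_model("the doctor was paid by the actor",
                            "the doctor paid the actor"))   # True, but wrong

# Subsequence: fires even though "the actor danced" is not entailed.
print(lexical_overlap_model("the doctor near the actor danced",
                            "the actor danced"))             # True, but wrong
```

A dataset in which word overlap correlates with entailment rewards exactly this shortcut, which is why the adversarial examples in the paper are built to break the correlation.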

As a result of this research, we are seeing better benchmarks like SuperGLUE and XTREME, on which it’s difficult for models like BERT to achieve human-like results. These advances are as important as advances in model technology and will force these models to “work harder” to achieve high scores. 

So how good are these models? 

It might seem strange – we looked at the theoretical, technical and evaluation limits of LMs, and now we’ll praise their accomplishments. 

The thing is, by asking these questions we are speculating about the ultimate potential of these models. And this is a high bar, since we are considering whether these models will ever gain a form of general AI, where they can learn new tasks without further training, build on their current linguistic skills, and communicate with humans in a way that shows they understand what they’re talking about. This is heady stuff; cue the Terminator-like GIFs.

The key thing to clarify is that while we may question their ability to understand language like a human being, there is very little doubt that current Transformer models like BERT have pushed the frontiers of DL NLP further and faster than anyone would have predicted even four or five years ago. The fact that these models can “cheat”, and seem to use only a tiny portion of their attention heads to perform well on NLP tasks, shows how far they have come. 

But this also raises the danger of over-hyping these models. Maybe they will never attain a level of language understanding similar to you or me. Maybe they don’t need to. 

Maybe these models just need to develop more statistical cues, and they will be good enough to transform the business landscape with a range of chatbot and automated NLP applications that change the way we search and use information. They might not be good enough to win a Pulitzer anytime soon, but we are still only scratching the surface of their untapped potential.

Machine Learning Engineer at Intercom