MLOps Blog

Transformer NLP Models (Meena and LaMDA): Are They “Sentient” and What Does It Mean for Open-Domain Chatbots?

16 min
2nd August, 2023

First of all, this is not a post about whether Google’s latest Deep Learning Natural Language Processing (NLP) model LaMDA is the real-life version of HAL 9000, the sentient Artificial Intelligence (AI) computer in 2001: A Space Odyssey. This is not to say that it is a pointless question to ask. Quite the opposite: it is a discussion that will likely dominate much of the future research in AI.

Right now, however, a more pressing question is what these claims mean for the current state of NLP.

  • For example, what are the key technical differences between these models and their earlier ancestors?
  • Can these models be used for different NLP tasks?
  • If yes, then what data is needed to train these models?

These questions are now no longer debated exclusively in Machine Learning (ML) blogs, thanks to the rapid progress achieved by earlier NLP models such as BERT. Instead, they are now driving clicks in non-technical newspapers such as The Economist. This means that more people will see and hear about these models. As a result, it is more important than ever to understand the technological innovations driving this latest round of developments.

The true impact of “sentient” language models

How well can we expect bots to work in the future
The latest neural network models have raised questions about just how well we can expect bots to work in the near future | Source

It seems widely accepted that LaMDA is not sentient, but the mere fact that this question is being raised shows how far the Transformer, the deep learning architecture that underpins much of the recent progress in NLP, has pushed the capabilities of chatbots in a relatively short span of time. This means the discussion is no longer the sole realm of ML specialists and has much broader implications.

Consequently, it also means that expectations may be misaligned. If chatbots are “almost sentient” now, someone not really familiar with the technology may believe that ML applications such as question answering, summarization, text generation, and semantic search have now been solved comprehensively. This train of thought may take us ahead of ourselves.

To clear the air around this debate, we need to understand the recent advances and what they mean in terms of the current and future capabilities of these models.

  • Are these latest advances only relevant to academic institutions and mega-corporations like Google?
  • Or are there any tangible benefits that small and medium businesses can utilize right now by adopting some of these latest developments?

To address these questions, we will take a look at the chain of events that led us here, the current state of hype around such models, and simultaneously, we will also discuss the practical aspects of these advancements. So let’s begin with identifying the core problem that each of these models is trying to solve.

Read also

Can GPT-3 or BERT Ever Understand Language?⁠—The Limits of Deep Learning Language Models

Building Machine Learning Chatbots: Choose the Right Platform and Applications

Having a conversation is difficult!

Difficulties in conversation
The nuances of a back-and-forth dialogue are varied and complex | Source

We rarely think about the many complexities involved in a simple conversation. Whether it is with someone you barely know, a close friend or relative, or a customer service agent, the nuances of a back-and-forth dialogue are varied and complex. Depending on the situation, your conversation can change from topic to topic on a whim, employ metaphors, jokes, or irony, assume certain common sense knowledge or specify external, verifiable facts. In short, there is a lot going on!

This is why, until recently, it was not clear whether an “end-to-end” neural network such as the large Language Models (LMs) built on top of the Transformer architecture could be trained to perform such a task. 

You pass these models lots and lots and LOTS of textual data, and they output either more text or a large dense vector (embedding). In either case, the output can represent:

  • an answer to a question,
  • a summary of the input text,
  • or a similarity score based on the input,

to name a few everyday use cases, but none of these experiments really came close to solving the problem of creating open-ended dialogue models.

May interest you

Transformer Models for Textual Data Prediction

Google’s Meena: The arrival of the open-ended dialogue models

Chatbots in competitions like the “Turing Test”, where contestants attempt to convince human judges that they are not, in fact, speaking to a bot, tend to be made from a combination of rule-based components and machine learning algorithms. That is why the paper “Towards a Human-like Open-Domain Chatbot”, published by Google in 2020, was a milestone.

In it, the authors claim that their proposed model, Meena, by achieving State Of The Art performance on a wide range of NLP tasks and human evaluation tests, had answered the open research question: a large end-to-end model, without any hard-coded rules, can generate almost humanlike chat responses in an open-domain setting.

That is a very big claim and, at face value, may seem like it just changed the game. But a very basic follow-up question might be: how did they do it?

An example chat between Meena and a person
An example chat between Meena (left) and a person (right) | Source: Google blog

The authors created a new metric, the “Sensibleness and Specificity Average (SSA)”, with which human evaluators can gauge whether Meena sounds human or not. This attempts to measure two things:

  1. Sensibleness: This seems obvious. Responses need to make sense in the context of the conversation. If you ask me, “What time is it?” and I say, “There are three Fjords in Ireland”, then that is not a sensible response. It seems like a random fact.
  2. Specificity: This is a measure of how well the response relates to the question. For example, in response to being asked what time it is, I could say, “I don’t know”. This is a sensible response; I might not know the time right now. But it is not specific. If, instead, I said something like, “I don’t know, my watch is charging, and I can’t find my phone”, this is both sensible and specific: I am providing information pertaining to your current question.
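Since the SSA is just the average of these two per-response label rates, it is easy to compute once you have human labels. A minimal sketch in Python (the tuple-based label format is an illustrative assumption, not the paper’s actual data format):

```python
def ssa(labels):
    """labels: list of (sensible, specific) tuples, each 0 or 1,
    one tuple per bot response as judged by a human evaluator."""
    n = len(labels)
    sensibleness = sum(s for s, _ in labels) / n   # share of sensible responses
    specificity = sum(p for _, p in labels) / n    # share of specific responses
    return (sensibleness + specificity) / 2        # SSA is the simple average

# Four judged responses: 75% sensible, 50% specific -> SSA of 0.625
print(ssa([(1, 1), (1, 0), (1, 1), (0, 0)]))
```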

What does it take to be human?

The Meena authors claim that one of the issues with the “Turing Test” type evaluations is that it looks mainly for sensible answers. To show this, the authors created a simple bot called GenericBot which only has two responses:

  1. I don’t know: it responds with this whenever the input is a question.
  2. Ok: it responds with this whenever the input is a statement.

With these simple rules, the bot was able to generate high sensibleness scores when interacting with humans. It may be the world’s most boring chatbot (don’t get stuck with it at a party), but none of its responses could be said to be illogical or nonsensical. You might, however, consider it strange if these were the only responses you received in a conversation with another human being.

This highlights an important contribution of the Meena paper, namely, what do we consider to be a humanlike response? 

Mechanics of the Meena language model

We will look at two particularly interesting technical aspects of the Meena chatbot model that determine the answer to the question asked above:

  1. Perplexity: Meena uses an automatic metric, perplexity, that aligns well with the human judgment of dialogue quality.
  2. Training data: Meena is trained on a large amount of dialogue data, which sets it apart from previous models.

Meena uses perplexity to improve human judgments of dialogues
Meena attempts to use perplexity as a way to improve human judgments of dialogues against their Sensibleness and Specificity Average (SSA) score. The dashed lines offer a baseline comparison for different approaches, and the dotted line is the regression line showing the relationship between perplexity (as measured by different Meena model sizes) and SSA scores. | Source

Perplexity: What comes next?

One problem with previous dialogue models was the question of how to properly train the models so that they produced results that correlated well with human evaluators. Previous research showed that metrics such as BLEU, when used to evaluate dialogue response generation systems, correlated poorly with human judgments.

This is important since if we cannot find an automatic metric to evaluate how well the model is performing during training, then all the data in the world won’t help us learn good responses. Meena tackles this problem by using the perplexity metric. Simply put, perplexity measures how confident the model is about predicting the next token.

Remember, auto-regressive LMs generally take the previous tokens as inputs and try to predict the next token. These models will generate a probability distribution over possible next tokens. Perplexity, in this scenario, measures how well the model predicts the correct next token (or, in Meena’s case, the next word in the conversation).

Previous models struggled to align their automatic metrics with human evaluations. What Meena showed was that if you try and improve perplexity scores, i.e., a lower score is better for perplexity, then this correlates well with human ratings of the dialogue.
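To make the metric concrete: perplexity is the exponential of the average negative log-likelihood the model assigns to the correct next tokens. A short sketch (real implementations work on logits over a full vocabulary; here we assume we already have the probability the model gave each correct token):

```python
import math

def perplexity(token_probs):
    """token_probs: the probability the model assigned to each correct
    next token in a sequence. Lower values mean a more confident model."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that is confident about the right tokens scores far lower
# than one that is not.
print(perplexity([0.9, 0.8, 0.95]))  # close to 1
print(perplexity([0.2, 0.1, 0.3]))   # much higher
```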

Dialogue training data

We noted that previous models such as BERT were based on word embeddings. The training data for BERT is organized so that the model can learn to generate an embedding that represents a word and its relevant context. This is important since we can feed the model lots of data from the web in any format once it can parse the data word by word, or more accurately, token by token. The model can then learn to predict masked words, and this has proven to lead to very impressive results. 

Building up from the word level, models like Sentence BERT use sentences as inputs instead of single words. In these cases, the output of the model is not a word embedding but a sentence embedding. Sentences are more varied and complex than words, so this is still an area of active research.

Now imagine the extra complexity of trying to understand multi-turn conversational dialogues. We need to understand sentences, how they relate to previous sentences, predict the next topic, response, or open question, and keep in mind what has gone before. We cannot learn that nuance by just feeding a model some random sentences.

To address this, the Meena authors create training data from the web and social media. The first message in the data is treated like the root, and all responses are like child nodes. Each step or message on this tree is considered a turn. This allows Meena to use previous turns when trying to predict the next response. 

For Meena, the training data consists of context-response pairs where the context is the last number of turns up to a maximum of 7. Thus Meena is able to parse the previous turns and try and predict the response. It can then compare the predicted response with the actual response to learn to predict a better response in much the same way as BERT learns from comparing the actual word with the predicted word. This is where the perplexity score we mentioned earlier comes into play. A low perplexity means the model is more confident of the result, and this should correlate with a good response.
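The tree-to-training-data idea above can be sketched in a few lines. The nested-dict thread format is an assumption for illustration; the real pipeline operated at a vastly larger scale:

```python
MAX_TURNS = 7  # Meena's context window, in turns

def extract_pairs(node, path=()):
    """node: {'text': str, 'replies': [nodes]}. Walks the message tree and
    emits (context, response) pairs, where the context is at most the last
    MAX_TURNS turns on the path from the root to the response."""
    pairs = []
    path = path + (node["text"],)
    for reply in node.get("replies", []):
        pairs.append((list(path[-MAX_TURNS:]), reply["text"]))
        pairs.extend(extract_pairs(reply, path))
    return pairs

thread = {"text": "What's up?",
          "replies": [{"text": "Not much, just training.",
                       "replies": [{"text": "Same here.", "replies": []}]}]}
for context, response in extract_pairs(thread):
    print(context, "->", response)
```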

The authors also filter the dataset, removing any messages that are unsafe, contain offensive language, or have other characteristics that limit their usefulness as a response, e.g., if the message is mostly made up of numbers. The sub-tree under a blocked message is also removed. The final dataset is a total of 341GB of text.
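A hedged sketch of one of these filters, the “mostly numbers” check, together with the sub-tree removal (the actual pipeline also used safety and offensiveness classifiers that are not reproduced here):

```python
def mostly_numbers(text, threshold=0.5):
    """Flag messages whose non-whitespace characters are mostly digits."""
    chars = [c for c in text if not c.isspace()]
    return bool(chars) and sum(c.isdigit() for c in chars) / len(chars) > threshold

def prune(node):
    """Return the tree with blocked messages removed. A blocked message
    takes its whole sub-tree with it, as in the Meena dataset cleanup."""
    if mostly_numbers(node["text"]):
        return None
    kept = (prune(r) for r in node.get("replies", []))
    node["replies"] = [r for r in kept if r is not None]
    return node

thread = {"text": "Any score updates?",
          "replies": [{"text": "103 99 87 102",      # blocked: mostly digits
                       "replies": [{"text": "Thanks!", "replies": []}]}]}
print(prune(thread)["replies"])  # [] -- the reply and its sub-tree are gone
```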

Utility of Meena for businesses and Machine Learning teams

Is it in sync with business requirements?

As we mentioned earlier, Meena is designed to specifically try and avoid saying things like “I don’t know” in a dialogue. “I don’t know” is not a “specific” enough response. However, in most business cases, this may be precisely what you want from your bot: 

Customer: “Hi, I am seeing the following error when I log onto my account, I have attached a screenshot”

Bot: “Thanks for the info, that’s super helpful, I haven’t seen that error before, and I have seen a lot of errors, would you like to see some of the other ones?”

Customer: “Not really, can you just fix this issue as it seems to be a bug?”

Bot: “Do you know where the term software bug comes from? It is an interesting story: it originated when an insect flew into one of the world’s earliest computers”

Customer: “Can we just focus on this issue?”

This bot might be an interesting character, but that is not what you need when you are trying to solve your issue. If the bot cannot answer the customer’s query, you may want to escalate that interaction to a human. You do not want your bot trying to be interesting and entertaining to your solution-seeking customer. The academic goal of creating an open-ended bot may be very different from the requirements of a business bot. This calculus changes, of course, if a business wants to deploy bots that customers believe are human.

In most cases, however, businesses do want the customer to know when they are talking to a bot and when they are talking to a human. If you would like the customer to be unable to discern between the two, then an open-ended bot may be what you want. The key thing to note here is that the goal of creating Meena may not be aligned with the business problems you are trying to solve with a bot, as most businesses want their bot to know something about their specific business domain.

Interesting findings for ML teams

While Meena may not be something that can be used “out-of-the-box” to address specific NLP tasks, it does have a number of interesting technical aspects which might be of interest to different ML applications:

  1. The identification that perplexity correlates well with human evaluations is useful. If, for example, you are thinking about building your own generative dialogue bots, then perplexity is a metric that is easily available to you to train your own model on your own data. If it does correlate well with human judgments, then you could potentially avoid expensively labeled data and instead use this automatic metric as your guiding star.
  2. Perplexity also potentially opens up the floodgates in terms of scaling for dialogue models. If you can use unlabeled data, then you can just point your dialogue model at your own data and hope to improve its accuracy simply with more data. This was not something easily available to previous dialogue models.
  3. The approach toward dataset creation is also very interesting since it seems relatively easy to create a dataset in the same context-response format from your own, domain-specific data. The only issue would be the amount of data needed: for comparison, GPT-2 was trained on 40GB of data, and that was considered a massive amount at the time.

As we noted earlier, dialogue models are difficult because human conversations are a complex linguistic scaffolding building up from words to sentences to paragraphs to dialogues. Meena was a breakthrough model because it showed that this complexity could be addressed, and dialogue models were possible. If this trend continues, we would expect that we can train models on larger and more complex forms of conversational data.

In fact, this is exactly what happened with LaMDA since it represents the next advance in dialogue models. It is built on the findings of Meena and took open-domain chatbots to a new level in a very short time. But what is so special about LaMDA that it stole all the attention from Meena? Why is no one wondering if Meena is a conscious entity? Poor Meena, let’s find out why LaMDA seems so much smarter.

Google’s LaMDA: A better Meena

LaMDA broke new ground in trying to generate specific and interesting responses | Source

One of the innovative aspects of Meena was that it did not just require bot responses to be sensible, but they also needed to be specific. This meant that the bot had to “try harder” to generate a meaningful dialogue. However, this may incentivize a bot to “hallucinate often, or lack faithfulness to underlying sources” as recent research indicated.

From a business perspective, this could pose a serious problem if the “hallucination” is related to a customer issue. It is also a problem in open-domain dialogues where the model can just start “making things up”. This was one of the issues LaMDA was designed to address. Building on the early work of Meena, LaMDA introduced a number of new approaches to dialogue models, which resulted in impressive results. Namely:

  1. Interestingness: LaMDA added a new evaluation metric to the sensibleness and specificity ones already introduced by Meena. This metric tried to further improve dialogue quality by ensuring LaMDA tried to be witty and insightful.
  2. Groundedness: As mentioned, bots can tend to have a “flexible” relationship with the truth, so LaMDA introduced a groundedness metric and gave the model access to external sources to verify claims where possible.
  3. Pretraining + fine-tuning: LaMDA showed that scaling models via pretraining did help to improve nearly all metrics. However, scaling alone did not improve metrics such as bias and groundedness. To move those metrics, the model needed fine-tuning. But the interesting thing was that fine-tuning, when combined with a technique called “prompting”, helped improve all metrics and achieve SOTA results.

These are some interesting advances that will have some far-reaching implications. So, with that in mind, let’s go through them.

An interesting dinner guest

We’ve all been there, whether at a friend’s wedding or on a long-haul flight. You pull the short straw and are seated next to a somewhat less than interesting stranger whose hobbies include the intricacies of the tax code. While the conversation is perfectly sensible and specific, it just isn’t interesting. But again, what is interesting?

The LaMDA authors claim that “interestingness” is where a dialogue response is likely to “catch someone’s attention” or “arouse their curiosity” or where the response could be considered “unexpected, witty or insightful”. Unfortunately, knowing exactly when someone is witty or interesting can be both difficult and subjective. To address this, the authors needed people to manually score LaMDA’s responses for qualities like interestingness.

As we will see with fine-tuning, people are also required to label responses in multi-turn conversations in a similar way to create datasets that can be used to improve LaMDA’s performance in these areas.

For evaluation purposes, LaMDA’s quality metric consists of three individual metrics, each scored as either a 1 or a 0:

  1. Sensible: This is the same as the Meena requirement that the response is logically and semantically coherent and relevant to the preceding query.
  2. Specific: Again, this is the same as the Specific metric from Meena, where the response also needs to be contextually relevant.
  3. Interestingness: This is the new metric that LaMDA adds to the requirements.

Similar approaches were taken for groundedness and safety. These scores enabled the authors to measure the performance of LaMDA in potentially subjective areas, such as how interesting the model seems in a given conversation. 

As noted, the human evaluators also created datasets that allowed the authors to fine-tune LaMDA (which we will discuss shortly). Instead of generating scores for things like interestingness, the human evaluators instead generated labels such as “yes”, “no”, or “maybe”. To reduce subjectivity as much as possible in these labeled training datasets, the authors had 5 people label each response and then only used a response if 3 out of 5 agreed.
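This 3-of-5 agreement rule translates directly into code. A small sketch (the label values and quorum follow the description above; the data format is assumed):

```python
from collections import Counter

def consensus(labels, required=3):
    """labels: one label per annotator, e.g. 'yes' / 'no' / 'maybe'.
    Returns the majority label if at least `required` annotators agree,
    otherwise None (the example is discarded from the training set)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= required else None

print(consensus(["yes", "yes", "no", "yes", "maybe"]))  # 'yes' -- kept
print(consensus(["yes", "no", "maybe", "no", "yes"]))   # None -- discarded
```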

Along with these human-labeled datasets, which were specifically created for LaMDA, the authors also used many commonly available datasets that other LMs were trained on. This resulted in a two-phase training approach:

  1. Pretraining: LaMDA was first trained like a regular LM on a wide range of dialogue and non-dialogue textual data. In this way, the model can be used as a general language model prior to the next stage of training, which is fine-tuning. The total size of the pretraining dataset was 1.56T words.
  2. Fine-tuning: We will talk about this more shortly, but in the second stage of training, the model was fine-tuned on specific datasets like the quality-labeled dataset mentioned above. Along with quality, there were datasets for bias, safety, and groundedness. It was this fine-tuning step that produced the impressive SOTA results and was likely the reason the model was thought to be mimicking the classic brain-in-a-vat scenario.

It is interesting to note that the authors added an “interestingness” metric as a training step. This builds on the trend we saw with Meena, where the goal is to make a more human-like bot. But, as we stressed with Meena earlier, is this a goal you want for your business bot? Does “interesting” equate to general common sense and an ability to learn from experience and perform tasks for which the model has not been trained? 

While being interesting may not be top of the requirements, being grounded is likely to be a key skill set that is needed. LaMDA is trained with this in mind as well.

Keeping the feet grounded

One interesting aspect of neural networks that has come to light with the development of large LMs is that they can “hallucinate”. As we mentioned earlier in relation to Meena, when an LM generates text as part of an NLP task, it can be prone to simply making stuff up. This is what is referred to as hallucinating in the context of an LM. 

This is not a noticeable issue if the task is generating text in a GPT-3 like format. In these cases where the LM completes a sentence or paragraph, it can be interesting to see how “creative” the model can be when it generates new text. 

However, in dialogue models, where there is a question-and-answer type interaction, hallucination poses a serious problem. If the bot is asked a specific question, for example, “how many Tour de France green jerseys did Irish cycling legend Sean Kelly win?”, then the answer is something that can be verified externally. Creativity here is not important. When a model like LaMDA is trained to be interesting and specific, it could create an “incentive” for the model to hallucinate.

This is what LaMDA would have been prone to if the authors had not added the groundedness requirement. The goal of groundedness is to increase the likelihood that LaMDA will produce responses that are grounded in external and verifiable sources. This can be considered similar to an information retrieval approach. It enables people to judge the veracity of a response based on the reliability of its source.

The fact that LaMDA has access to external knowledge sources reduces its tendency to hallucinate, and while the exact reasons for this are unknown, the authors postulate that the external source frees up LaMDA from using parameters to memorize information. It can, in a way, “outsource” this work to the external knowledge base much in the same way as you or I would make notes on a pad or reference information in a book. 

Groundedness as a metric is defined as the percentage of responses making claims about the external world that can be verified by external sources, as a share of all responses making claims about the external world.
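That definition translates into a simple ratio. A sketch under an assumed record format, where each evaluated response carries two human judgments:

```python
def groundedness(responses):
    """responses: dicts with 'makes_claim' (does the response assert
    something about the external world?) and 'verified' (could that
    claim be checked against an external source?)."""
    claims = [r for r in responses if r["makes_claim"]]
    if not claims:
        return None  # the metric is undefined when nothing is claimed
    return sum(r["verified"] for r in claims) / len(claims)

responses = [
    {"makes_claim": True,  "verified": True},
    {"makes_claim": True,  "verified": False},
    {"makes_claim": False, "verified": False},  # pure chit-chat, excluded
    {"makes_claim": True,  "verified": True},
]
print(groundedness(responses))  # 2 of 3 claim-making responses verified
```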

As mentioned above, metrics like groundedness and safety are not improved by pretraining alone. The final stage of training, where LaMDA is fine-tuned to the human-curated datasets, yields improved results across all metrics. Fine-tuning may be of the most interest from a business perspective. If you can use a certain amount of information to easily tailor a model to a specific domain or task, then it may be useful in your particular business application.

Fine-tuning your way to consciousness

Okay, so what do we know so far? We can approach the open-ended dialogue problem as an LM problem and just feed the model data like we did with BERT and let it learn. If we include dialogue-type data in the training set, then we will see some improvement in dialogue-specific metrics. So training a model similar to a general LM does improve some, but not all, dialogue metrics.

The quality metrics do show improvements from pretraining alone, but metrics like groundedness and safety do not improve. We cannot just scale the LM approach to improve these metrics, so the LaMDA authors looked for an alternative approach, namely – fine-tuning.

Firstly, what is fine-tuning in the context of LaMDA? LaMDA takes a two-step process to fine-tuning. The steps are:

  1. Generator: In the first part of fine-tuning, the LaMDA pretrained model is used to generate responses given a dialogue context. LaMDA is a decoder-only model, so it needs to be fed tokens one by one in sequence (unlike BERT, for example, which is an encoder-based model and uses masking so it can be fed tokens in any order). The generator is trained with examples like:

DIALOGUE: What’s up? 
RESPONSE: not much.

In this way, it is trained to predict the next token given the context. 

  2. Discriminator: The second step is a classifier trained to predict the labels of the manually created datasets we mentioned earlier for tasks like quality and safety. These labels enable the classifier to predict the correct score for a response produced by the generator.

A two-step process to fine-tuning
LaMDA uses an innovative approach by separating the generation and selection of responses into two separate tasks | Source

The important point to note here is that both steps (generator and discriminator) are trained on the one model, which means there is an “efficient combined generate-and-discriminate procedure”. For example, given an input, the generator will produce not just one but multiple potential responses. It then passes those responses to the discriminator, which acts as a classifier to predict the quality and safety scores. Based on those predictions, responses whose safety scores fall below a given threshold are removed, the remaining responses are ranked by quality, and the highest-ranked one is used as the final response.
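The selection logic just described can be sketched as a filter-then-rank loop. The scoring functions below are toy stand-ins for the fine-tuned discriminator, and the threshold value is an assumption, not a published number:

```python
SAFETY_THRESHOLD = 0.8  # assumed cut-off for the sketch

def select_response(candidates, safety_score, quality_score):
    """Drop candidates the discriminator rates as unsafe, then return
    the highest-quality survivor (None if everything was filtered)."""
    safe = [c for c in candidates if safety_score(c) >= SAFETY_THRESHOLD]
    if not safe:
        return None  # better to refuse than to emit an unsafe response
    return max(safe, key=quality_score)

# Toy stand-ins: longer answers count as "higher quality", one is "unsafe".
candidates = ["Ok.", "I checked: the race starts at 9am.", "Rude reply"]
safety = lambda c: 0.1 if c == "Rude reply" else 0.95
quality = len
print(select_response(candidates, safety, quality))
```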

This is a really interesting approach given that we know, from the over-hyped media attention, that the model performs pretty impressively as an open-ended bot. It is a series of simple steps combined to create a potentially complex system.

The authors could have, for example, just trained a generator to produce the output and use that. But it is likely that the generator would not be able to produce both high-quality and high-safety responses. Or at least it would require an awful lot of data to fine-tune it. By creating separate tasks and using the same model, the authors reduce the amount of data and resources needed to train the model to a high level.

But we know models are lazy, right? Can they just get around some of these filters by hallucinating and making up high-quality data? Sure they can, but that is where the grounded metric we spoke about earlier comes into play.

During fine-tuning (and normal operation), the model has access to what the authors call a toolset. This toolset is a combination of an information retrieval system, a calculator, and a translator. To train the model to verify potential claims during fine-tuning, a grounded dialogue is created using two approaches:

  1. Static: In the static mode, the human evaluator reads over a LaMDA-generated dialogue and decides whether or not it makes claims that need to be verified. If verification is needed, the evaluator queries the toolset in the exact same way LaMDA would, i.e., using a text-based query.
  2. Interactive: In the interactive mode, the evaluators carry out a dialogue directly with LaMDA and rate it as above. However, during the interactive mode, the evaluator has the opportunity to directly edit the LaMDA response. They can change it so that it includes well-sourced and verifiable claims and provides URLs or external references to those claims.
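
A minimal sketch of how such a text-based toolset query might be dispatched. The query format, the tool names, and the tool implementations are all illustrative assumptions; the paper only describes the toolset as an information retrieval system, a calculator, and a translator:

```python
def calculator(expr):
    """A deliberately restricted arithmetic evaluator for the sketch."""
    if not set(expr) <= set("0123456789+-*/. ()"):
        raise ValueError("unsupported expression")
    return str(eval(expr))  # acceptable here: input is digits/operators only

TOOLS = {
    "calc": calculator,
    "search": lambda q: f"[top documents for: {q}]",  # stand-in for IR
    "translate": lambda q: f"[translation of: {q}]",  # stand-in
}

def run_tool(query):
    """Dispatch a text query like 'calc: 365 * 24' to the named tool."""
    name, _, arg = query.partition(":")
    return TOOLS[name.strip()](arg.strip())

print(run_tool("calc: 365 * 24"))  # '8760'
```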

Using these approaches, fine-tuning helped to create a new benchmark for open-domain dialogue bots. LaMDA, as we noted, is getting so much attention precisely because it responds really well, or at least better than previous bots, to conversational-like human dialogues. 

Utility of LaMDA for businesses and Machine Learning teams

Uncharted territory for businesses

We noted that Meena had some goals which might not be in sync with some business applications. For example, Meena tries to avoid saying something like “I don’t know” and is designed to try and provide some specific information in its responses. While LaMDA builds on this by trying to be “interesting,” it also adds the groundedness approach to try and ensure that the model’s responses will, in some cases, be accurate, sourced, and verifiable.

This does make it seem more in line with business requirements such as answering a customer query, but, as the authors noted, LaMDA relies on a form of prompting during fine-tuning, where the human evaluator interacts with the model to try and get it to generate certain responses. This will be crucial for preconditioning the model to perform business-like tasks. However, much research is needed in this area before we can safely and predictably use techniques like prompting to apply these models in a business context.

HuggingFace recently released a model that uses prompting to enable the model to perform tasks for which it was not trained. You can test the HuggingFace model and see that it can be difficult to know the exact type of prompt needed to get the model to perform your specific task. 

New avenues for ML teams

There is no doubt that, similar to the Meena paper, the LaMDA research is groundbreaking in terms of the open-domain dialogue models. The results themselves show that these models have raised the bar for what we expect from future models. But from an ML perspective, it can be easy to get carried away with the latest “shiny” new developments while losing sight of some of the key technical aspects of these models. In particular, I think there are two that seem the most relevant to LaMDA:

  1. External databases

The use of an external database and a standard method for querying that DB are key techniques that could be used in other models. This solves a lot of potential problems with the static nature of knowledge in models like BERT, which still thinks that Barack Obama is President.

It is hard to update these models for things like Covid-19 and other new trends that change how people search for information. Having a queryable external DB means you could update the DB, and the model would be able to search for the latest information without having to re-train the model.
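The benefit is easiest to see in a toy sketch: the “model” below answers from an external store, so updating the store changes its answers with no retraining. Everything here (the store, the lookup rule) is an illustrative assumption:

```python
# Facts live outside the model in an updatable store.
knowledge = {"us_president": "Barack Obama"}  # stale entry

def answer(question):
    """A toy 'model' that defers factual lookups to the external store."""
    if "president" in question.lower():
        return f"The US president is {knowledge['us_president']}."
    return "I don't know."

print(answer("Who is the US president?"))  # stale answer

knowledge["us_president"] = "Joe Biden"    # update the store, not the model
print(answer("Who is the US president?"))  # fresh answer, no retraining
```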

  1. Fine-tuning

LaMDA is a two-stage model. The first stage is a generally trained LM, as we have seen with many previous Transformer-based models. However, the second, fine-tuning stage is what sets LaMDA apart. LaMDA’s fine-tuning involved interaction with people, who trained it by editing responses and adding search queries and sources. This is a very different form of fine-tuning than anything we have seen before and opens up a lot of possibilities for future models.

Maybe these models will be trained like new employees in the future where they sit next to an experienced person and learn from them by seeing them in action. Expect to see more advances as people start to figure out how to perform this interactive fine-tuning mode efficiently. 

Key takeaways from LaMDA and Meena

As we noted earlier, when evaluating any model from a business perspective, it is always important to think about the goal of that model and what task it is trained for. This is important because that enables you to understand the business domain or problem for which the model is best suited. In this post, we looked at two models, namely Meena and LaMDA, which are both dialogue models, and highlighted some of their key technical innovations. 

Real-world applications of LaMDA and Meena

We noted that there are some aspects of these models, such as interestingness, which might not be aligned with some business goals, but are there other areas that would benefit from that type of approach? As an example, here are two areas where these models could have an immediate impact:


Education

Children benefit from a more focused education. As class sizes grow or the number of teachers declines, the student-teacher ratio increases, and the more students in a class, the less attention any particular student receives. A model like LaMDA, as was shown in the paper when it impersonated Mount Everest, can provide 1:1 educative dialogues to help children understand facts about science or the world around them. This offers a unique, interactive way to help someone discover new knowledge without the need for individual human tutors.

Health care

In many areas of the world, there is still a lack of mental health care. It can be expensive to deliver these kinds of services to areas where they are badly needed. New health care bots such as Woebot attempt to combine technology and clinical knowledge to deliver mental health care so people can access it whenever it is needed. Models such as LaMDA could help improve the quality of these services by making the dialogues seem more human and specific. It will be interesting to see if they get utilized in this area.

Challenges of using LaMDA and Meena in real-world applications – falling short of expectations

These models, however, at present, may not be suited to more traditional business-related NLP tasks such as question and answering, summarization, or entity recognition. These tasks may be better served by rule-based approaches or specially trained Transformer-based models like those available on platforms such as HuggingFace. 

The reasons these models may not be suited to these tasks include:

  1. Availability 

These models are not available in the same way as previous Transformer models. Resources like HuggingFace, SentenceBERT, or TensorFlow Hub allow anyone to download a whole range of models and use them off the shelf. That is not likely to happen with these models; access may instead be limited via an API, much as OpenAI restricts access to GPT-3. 

  2. Business problem 

Previous Transformer models were trained to generate word embeddings, find answers to questions, or summarize documents. These problems have very specific business use cases. Meena and LaMDA, as we discussed, are trained to produce responses that are specific and interesting. 

That is not necessarily what you want from a business bot. The ultimate goal of these dialogue models may therefore not align as neatly with your business use case as a model like BERT does, and you may have to invest more resources trying to make them “fit” your domain-specific application.

  3. Humans needed 

As we noted, both Meena and LaMDA required humans to manually label data and evaluate responses. LaMDA also required people to interact with it during fine-tuning to edit responses and generate search queries to improve groundedness. 

While the authors do note that “less than 0.001% of pre-training data” was human-annotated data, they still accept that “it is an expensive, time consuming, and complex process”. From a business perspective, training such models on business-specific datasets would require people with knowledge and skill in that domain. So that still represents a big obstacle to the easy application of these dialogue models.

But fine-tuning does offer hope

While much of this post, and the above points, do seem to indicate that it may not be worth investing significant resources in these models, I think there is enough evidence to show that it is worth keeping abreast of future progress in this area. 

The technique of grounding is very relevant from a business perspective, since it resembles a question-and-answer bot with access to a domain-specific knowledge base. And the method of fine-tuning in which people interact with the model, editing its responses or prompting it with suggestions, could be very useful in future applications if the training requirements are minimal. 

Imagine, for example, a bot that you could interact with in a few dozen dialogues to train it for the type of questions it might receive from prospective customers. If there are enough safeguards to prevent bias and safety issues and what the LaMDA authors called the “long-tail of inappropriate responses”, then that is something that could bear business fruit in the near future. 

Resources

  1. Meena Google Blog: A great overview of the Meena model
  2. Meena paper: For a more detailed look at the Meena model
  3. Meena github repo: Resource with some examples of Meena to Meena and humans to Meena conversations
  4. Original LaMDA Google blog: A good introduction to LaMDA
  5. LaMDA Google blog: This blog goes into the model in more detail
  6. LaMDA paper: While the paper is long, it does contain many interesting examples and a lot of helpful appendices 
  7. Measuring attribution in Natural Language Generation Models: This paper talks about the tendency of LLMs to hallucinate
