Transformer Models for Textual Data Prediction

15 min
8th August, 2023

Transformer models such as Google’s BERT and OpenAI’s GPT-3 continue to change how we think about Machine Learning (ML) and Natural Language Processing (NLP). Look no further than GitHub’s recent launch of a predictive programming support tool called Copilot. It’s trained on billions of lines of code, and claims to understand “the context you’ve provided and synthesizes code to match”. Your virtual pair programmer buddy!

We’ve written about these models and the underlying Transformer architecture in recent posts. We’ve also looked at the recent research on different Transformer models to understand why the underlying architecture of BERT and GPT-3 learns context so much better than other models.

Now, like Copilot, we want to implement some of the features of these models in a real-life scenario. So, in this post we’ll look at the practical ways we can use Transformers to predict different features of text. Before we do that, we need to identify what types of features we can predict, and understand why predicting text is different from other forms of prediction.

To predict or not to predict? 

The image shows a page full of definitions from the dictionary.
How does predicting text differ from other forms of prediction? | Source

Predicting textual data is different from other forms of predictions. If you want to predict the price of an asset such as a stock or a house, you feed your model all the data available and predict the number which you think best represents the asset in question. The price is a number and while it can be right or wrong, it doesn’t have as many potential “dimensions of uncertainty” as text. 

Take Copilot for example. It’s predicting code snippets, so it could misspell the code, use the wrong coding language, use the incorrect variable names, and so on. Even if it gets all of that right, the code itself might not work.

Similarly, a sentence could be grammatically correct, but semantically meaningless. Noam Chomsky famously composed the sentence “Colorless green ideas sleep furiously” to show that a sentence can be syntactically well formed yet meaningless, a distinction between syntax and semantics that is difficult for machines to make.

This is what we mean when we say there are more “dimensions of uncertainty” in predicting text than prices. There are more ways in which text can be “wrong” than other forms of data, such as the price of an asset. To control for this level of uncertainty, we try to identify specific features of textual data we want to predict. 

These features should also be useful in most business-focused NLP pipelines. Taking this into consideration, the features we’ll look at in this post are:

  1. Predicting misspelling: One of the easiest ways text can be wrong is by simply misspelling one or more words in a sentence. The word is either correctly spelt or it isn’t. We see this every day with auto-correct features in apps like Gmail, so we can look at how you can easily implement features like this using Transformer-based models. 
  2. Predicting grammar errors: Something a little more difficult is identifying grammatical errors in a sentence. Can we use Transformer-based models to predict when a sentence contains grammatical errors or not? This could be something you use when processing data to identify if it needs to be changed or removed from your pipeline, since it may increase the potential for downstream errors in your system.
  3. Predicting paraphrased sentences: Often when we look at text data, we’re trying to understand if certain sentences are related. One way to think of it is whether we can predict when one sentence is paraphrasing another. There’s more nuance to this task than simply looking for spelling errors or a grammar faux-pas. The reason is that it involves identifying the context of both sentences and predicting if it seems reasonable that one is describing the same intent as the other. This is useful when processing large amounts of text data, and you want to search for or identify text which may be related without being identical.

As we look at these features, we’ll also survey a number of different NLP libraries that use Transformers to perform a range of NLP functionality. These libraries will be something you can use in your NLP pipeline, and they should also help you better understand how you can use Transformers for your specific tasks and use-cases.

The code described in this post can be found in the related Github repo.

Predicting text data with Transformers

The dataset

For predicting text data we want to use a dataset that represents “real” queries and conversations. What we mean by “real” is that it contains all the messiness of something you would type on Slack or in an email. That is, it should contain grammatical errors and spelling errors, use abbreviations, and all the other less than perfect ways we can find to craft a sentence. 

One great resource of this type of data is the Amazon Customer Support QA dataset. Specifically, we’ll use the Software category of questions, as we want to see if our approaches can identify topics within a related domain. This dataset is great because:

  • It contains real questions, with all the messiness we want, such as misspelling, questions as statements, questions with and without negation, and every other way that we care to ask the same questions in different ways.
  • All the questions have associated answers, some of which are long answers and some are short yes/no answers.
  • The dataset is separated into different categories. Some datasets mix general questions on a variety of different topics, which doesn’t replicate the typical business customer dataset. That kind of dataset is usually tied to a particular business or technical domain.
  • It contains code examples on how best to transform it into different formats. The authors helpfully provide code snippets to assist in transforming the data into whatever format you find the most useful to parse. 

Following the steps outlined by the dataset authors, we end up with a Pandas DataFrame as shown below:

The image shows Pandas DataFrame by the author.
We use a DF here but you can transform your data into whatever data structure you find easiest to work with. | Source: Author
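
As a rough sketch of that loading step, the snippet below assumes the data arrives as a gzipped text file with one Python-style dictionary per line. The file name, the field names, the sample answers and the helper `load_qa_records` are all illustrative assumptions, not the dataset authors' code. It writes a tiny sample file first so that it runs on its own.

```python
import ast
import gzip
import os
import tempfile


def load_qa_records(path):
    """Read one dict literal per line from a gzipped text file."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                # literal_eval parses dict literals without executing code
                records.append(ast.literal_eval(line))
    return records


# Build a tiny stand-in file (hypothetical name, fields and answers)
sample_lines = [
    "{'questionType': 'yes/no', 'question': 'do you need a procceser', 'answer': 'No'}",
    "{'questionType': 'open-ended', 'question': 'What are options for sound editing?', 'answer': 'Several.'}",
]
sample_path = os.path.join(tempfile.mkdtemp(), "qa_Software_sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write("\n".join(sample_lines) + "\n")

records = load_qa_records(sample_path)
questions = [record["question"] for record in records]
print(questions)
```

From here, passing `records` to `pandas.DataFrame` would give one column per field, which matches the DataFrame layout used in the rest of this post.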

Predicting misspellings

The image shows a word made from scrabble letters.
Can we use transformers to better predict when a word is misspelled? | Source

As the first test of our prediction skills, we’ll try to identify misspellings in our dataset. We see this feature so often now in nearly every application that we may think it’s a simple problem. The key difference here, however, is that we’re performing a prediction task and not just suggesting a potential spelling to a human operator.

For example, if you have a pipeline where you’re doing things like cleaning new or incoming data, you will likely have a downstream task that consumes that data. The downstream task could be a model that uses the data for training purposes, or uses it for clustering, or a question and answer type model. In these cases, it’s important that domain specific data or specific terminology or slang isn’t incorrectly identified as a misspelt word and changed. This could lead to serious consequences for your downstream tasks.

Transformers have made many of these types of applications easier to implement with the availability of pre-trained models. Yet, as amazing as these models are, you still need to focus on your specific use-case. Maybe you don’t need all the bells and whistles of a pre-trained model and a simple spell check will do. Alternatively, you may want to take a look at some of the latest transformer model applications and see if they add any new and unique value to your use case. 

A good example of this approach is the neural spelling correction library, NeuSpell. NeuSpell trains neural models to correct spelling using the context of the neighbouring words rather than just character perturbation. To learn this context, it uses models like ELMo and BERT as well as other neural networks. You can find the GitHub repo here.

Another option is a context-based approach like the one available for spaCy via Contextual Spell Check. This uses BERT to leverage the context of a word when identifying errors or suggesting corrections.

Let’s first take a look at some examples of both approaches and see how they look in action. We will take a sample of 20 sentences from our test set and look at both models and see what they think is a misspelling. 

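# Assumes `nlp` is a spaCy pipeline with the contextual spell checker added
# and `checker` is an already loaded NeuSpell checker
for q in original_queries:
    doc = nlp(q)
    spacy_misspell = doc._.performed_spellCheck
    print('-------- spaCy -------- ')
    if spacy_misspell:
        print('spaCy> spelling error detected')
        print(doc._.suggestions_spellCheck)  # misspelt token -> suggested replacement
        print(doc._.outcome_spellCheck)      # sentence with the replacement applied
        for i in doc._.score_spellCheck:
            for s in doc._.score_spellCheck[i]:
                print(s)                     # (candidate, score) pairs
    else:
        print('spaCy> No spelling error detected')
    neuspell_res = checker.correct(q)
    print('-------- NeuSpell -------- ')
    print(neuspell_res)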

In the code above, you can see that the spaCy model enables us to check if there was in fact a misspelling detected in a given string. You can also look at the most likely suggested corrections with a corresponding score. These are very useful features and enable you to see if there was in fact an error and how confident the model is of the potential alternative replacement words. 

For example, the query “do you need a procceser” seems to have a clearly identifiable spelling error. Yet the spaCy model seems to look at this as more of a context task than a spelling error:

do you need a procceser
spaCy: spelling error detected
{procceser: 'coffee'}
do you need a coffee
('?', 0.85842)
('.', 0.04466)
('...', 0.00479)
(';', 0.00453)
('-', 0.00416)
('coffee', 0.00275)
('!', 0.00234)
('phone', 0.00203)
('>', 0.00153)
('room', 0.00138)

Whereas the NeuSpell model does identify the correct misspelling:

do you need a processor

The spaCy model also shows more context related corrections with this example:

Does this work well for canada?
spaCy: spelling error detected
{canada: 'anyone'}
Does this work well for anyone?
('you', 0.58046)
('him', 0.11493)
('me', 0.09218)
('her', 0.06952)
('us', 0.0262)
('them', 0.02586)
('anyone', 0.00408)
('everyone', 0.00391)
('ya', 0.00098)
('now', 0.00094)

The potential corrections are interesting since they seem more related to a contextually relevant change than to a spelling error. No spelling error is detected if you capitalize “canada”, i.e. “Does this work well for Canada?” doesn’t result in any change using the spaCy model. The NeuSpell model doesn’t change the original sentence even with the lower case spelling of Canada.
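
One way to see why these context-driven candidates look wrong as spelling fixes is to compare each one to the original token by string similarity. The sketch below uses Python's standard `difflib` for that. The helper name `closest_spelling` and the 0.7 cut-off are assumptions chosen for illustration, not part of either library.

```python
from difflib import SequenceMatcher


def closest_spelling(word, candidates, min_ratio=0.7):
    """Return the candidate most similar in spelling to `word`, or None."""
    best, best_ratio = None, 0.0
    for candidate in candidates:
        ratio = SequenceMatcher(None, word.lower(), candidate.lower()).ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    # Reject everything if even the best candidate is too far from the word
    return best if best_ratio >= min_ratio else None


# Candidates taken from the contextual output shown above for "procceser"
context_candidates = ["?", ".", "...", ";", "-", "coffee", "!", "phone", ">", "room"]

print(closest_spelling("procceser", context_candidates))
print(closest_spelling("procceser", context_candidates + ["processor"]))
```

With these inputs, none of the contextual candidates is close enough to “procceser” to pass the cut-off, while “processor” is accepted once it is in the list. A filter like this could sit between a contextual model and any automatic replacement step.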

You can run through the dataset to see some more examples of how these models compare to each other. Even from a cursory look it seems like we might have problems with both models. In general:

  1. The spaCy model seems to focus on context: The spaCy model has a lot of nice properties such as providing a confidence score for potential replacements. However, its suggestions are more based on context than misspelling. In a few simple examples, it generated some suggestions that seemed questionable if our main focus was correcting potential misspelt words. This was just one example and there may be ways to tune it using other spaCy tools, or change it and make it more suited to your use-case. But for now, it doesn’t seem like it would be useful to predict corrections for misspelt words.
  2. The NeuSpell model doesn’t provide confidence scores: The NeuSpell model seems to perform much better as a model that predicts corrections for misspelt words. It generated more reasonable suggestions than the spaCy model, so it’s something you could consider using. Whether it’s better than other standard (i.e. non neural network based) models is open to question. Also, the fact that it doesn’t seem to offer a confidence score for its suggestions may limit its prediction capabilities. It would be great to be able to look at a confidence score and choose whether or not to accept the suggestion. That would make it easier to automate the process in an NLP pipeline and be confident that you aren’t “correcting” a domain specific term like a product name.
  3. A traditional spell checker may be more suited to your task: It looks like the transformer based spell checks may not offer much advantage over a traditional type approach. But it’s always good to keep an eye on new applications like this which may be improved by new NLP models like BERT or GPT.
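
To make the point about confidence scores and domain terms concrete, here is a minimal sketch of a guard you could place in front of any automatic correction. The function name, the protected word list and the 0.9 threshold are assumptions for illustration. The first two confidence values are invented, and the third is rounded from the spaCy output above.

```python
def accept_correction(original, suggestion, confidence, protected_terms, min_confidence=0.9):
    """Decide whether an automatic spelling correction should be applied."""
    if original.lower() in protected_terms:
        return False  # never rewrite known product names or domain terms
    if suggestion.lower() == original.lower():
        return False  # nothing to change
    return confidence >= min_confidence


# Hypothetical list of domain terms that must not be "corrected"
protected = {"quicken", "mcafee", "ilife", "efile"}

print(accept_correction("procceser", "processor", 0.95, protected))
print(accept_correction("efile", "exile", 0.97, protected))
print(accept_correction("canada", "anyone", 0.004, protected))
```

Only the first call is accepted: the second touches a protected term and the third falls below the confidence threshold. A guard like this needs a model that reports a confidence per suggestion, which is the gap noted for NeuSpell above.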

Predicting grammar errors

The image shows a word made from scrabble letters.
Do Transformer models like T5 make it easier to predict grammar errors in a sentence? | Source

While our spelling models might not have been a roaring success, hopefully we’ll have better luck with identifying grammatical errors in our dataset. Again, this is useful if you want to use your data on a downstream task. 

Or you may want to use a sentence to find other similar sentences. In this case, it can be useful to know if the original text was a good, clean, well-constructed sentence. If it did contain some grammatical errors, then that might be something you want to consider when looking at things like a similarity matching score.

Models like Google’s T5 Text to Text Transformer, which was released in 2020, are examples of transformer-based models that can perform multiple tasks such as identifying correct or incorrect grammar. The interesting thing about the T5 model is that:

  • It’s trained on many NLP tasks: The model is trained on a wide range of NLP tasks from identifying similarity, summarizing text to identifying grammar errors. All of these tasks use the same model, loss function and hyperparameters. In theory, it’s similar to the way you or I learn about languages, i.e. we learn the alphabet, we learn to read and write and spell, answer questions, and then we build all this knowledge together to be able to complete very high-level language tasks. Now, we’re not saying these models understand language in the same way we do but they’re beginning to perform at or close to humans at some NLP tasks.
  • It uses text for both input and output: The other key thing about this model is that you tell it what you want via text, and its result is always text too. This is very different from models like BERT, where the input may be a sentence but the output is a high-dimensional vector. With T5 you state the task in text, provide the text for that task, and it returns the result as text. This is a big jump in terms of abstraction. Think about the difference between languages such as C or Java and Python. Python offers a higher level of abstraction, which enables more people to program and opens up many more potential use cases. Similarly, while people marveled at the power of GPT-3, I was impressed by the simplicity of its interface. The real genius wasn’t in the zillions of parameters, but in the ease of use, which enabled people who would otherwise not have been able to test an advanced transformer-based model to generate text, answer questions, and be amazed by what they saw.
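
The text-in, text-out idea can be sketched without loading a model at all. The snippet below only builds the prefixed input string and interprets a textual answer. It assumes the CoLA task prefix used later in this post and the usual "acceptable" / "unacceptable" answers; the helper names are made up for illustration and no model is called.

```python
def build_t5_input(task_prefix, text):
    """Prepend the task description to the text, as T5-style models expect."""
    return f"{task_prefix.strip()} {text.strip()}"


def parse_cola_output(output_text):
    """Map the textual CoLA answer to a boolean (True means grammatical)."""
    answer = output_text.strip().lower()
    if answer == "acceptable":
        return True
    if answer == "unacceptable":
        return False
    raise ValueError(f"unexpected model output: {output_text!r}")


prompt = build_t5_input("cola sentence:", "is this good for a opening a restaurant")
print(prompt)

# The strings below stand in for model output; no model is called here
print(parse_cola_output("unacceptable"))
print(parse_cola_output("acceptable"))
```

Switching task is then a matter of changing the prefix string rather than swapping the model or its output head.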

You can use T5 directly via its GitHub page or from the HuggingFace library. Alternatively, and this is the approach I like to use first, you can try to find another library with a layer of abstraction that makes it easier to use. This way you can quickly test the model and see if it suits your use case.

If it does, you can then invest time deploying and testing the “raw” model which may enable you to train and tune it on an even more customized basis. Lately, many of these “abstraction” type libraries let you perform these tasks as well, so you might not need to look any further.

Identifying sentences with grammatical errors

If you look through our Amazon dataset you can see sentences which aren’t grammatically correct, e.g. “Did user have any problem downloading?” or “is this good for a opening a restaurant”. 

This is to be expected, since it’s a real dataset and we all know that these types of errors can happen easily when chatting online. As we noted earlier, if you’re using your customer data for downstream tasks, or just cleaning your data as a best practice, then you will want to identify when there are grammatical errors in the sentence. Combined with spelling errors, you can create a type of quality score for your pipeline where you can rank higher quality data for whatever downstream task you have in mind.
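
A quality score of this kind could be as simple as the sketch below. The penalty weights and the function name are arbitrary assumptions, and the spelling and grammar flags are assigned by hand here rather than produced by a model.

```python
def quality_score(spelling_errors, grammatical, spelling_penalty=0.2, grammar_penalty=0.5):
    """Score in [0, 1]; 1.0 means no issues were flagged."""
    score = 1.0 - spelling_penalty * spelling_errors
    if not grammatical:
        score -= grammar_penalty
    return max(0.0, round(score, 2))


# (text, number of misspelt words, judged grammatical): flags assigned by hand
checked = [
    ("do you need a procceser", 1, True),
    ("Will this work on a 64bit system?", 0, True),
    ("is this good for a opening a restaurant", 0, False),
]

# Rank the cleanest sentences first
ranked = sorted(checked, key=lambda row: quality_score(row[1], row[2]), reverse=True)
for text, errors, grammatical in ranked:
    print(quality_score(errors, grammatical), text)
```

In a real pipeline the two flags would come from the spelling and grammar models discussed in this post, and the weights would be tuned to whatever the downstream task tolerates.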

T5 performs these types of tasks, and a good “abstraction” library that builds on T5 is the John Snow Labs Spark NLP library. Spark NLP is an NLP library, similar to spaCy or NLTK, which offers a range of NLP functions under one roof. John Snow Labs is a company that provides NLP services to industries like healthcare but also provides a range of freely available NLP utilities, which, needless to say, is why we’re looking at them here.

To set up the task you need to first identify the model you want to use. John Snow Labs provide other models but for now we will use T5.

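from sparknlp.base import DocumentAssembler
from sparknlp.annotator import T5Transformer

# Read raw text from the "text" column into Spark NLP's document format
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Can take in document or sentence columns
t5 = T5Transformer.pretrained(name='t5_base', lang='en').setInputCols(["document"]).setOutputCol("t5")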

Then you need to tell it the task you want to perform.

# Set the task on T5
t5.setTask('cola sentence:')

# Build pipeline with T5 (Pipeline comes from pyspark.ml)
from pyspark.ml import Pipeline

pipe_components = [documentAssembler, t5]
pipeline = Pipeline().setStages(pipe_components)

Now we just need some sample sentences from our test data.

# Get some test sentences
sentences = test_dataset_df['question'].sample(n=20).tolist()
# Lets just use short sentences for now
sentences = [[x] for x in sentences if len(x) < 90]
df = spark.createDataFrame(sentences).toDF("text")

# Predict on text data with T5
model = pipeline.fit(df)
annotated_df = model.transform(df)
annotated_df.select('text', 't5.result').show(truncate=False)

The image shows an example output from identifying which sentences are grammatically correct.
This is the example output from identifying which sentences are grammatically correct | Source: Author

You can see from the above example that it identifies both a grammar error and a misspelling. We could use this in conjunction with our spelling correction libraries above to see which sentences we need to spell check. That might be another way to get around the problem we noted earlier when we didn’t have a confidence score for the likelihood of a word being spelt incorrectly. Once we know that there’s a grammar error, we could then run it through the spell check as we know there’s something wrong with the sentence.
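
The routing idea described above, where only sentences flagged by the grammar check are sent on to the spell checker, might look like the sketch below. Both checkers are stand-in functions with made-up names and hard-coded behaviour; in practice they would wrap the T5 and spelling models.

```python
def clean_sentences(sentences, is_grammatical, correct_spelling):
    """Spell check only the sentences that the grammar check flags."""
    results = []
    for sentence in sentences:
        if is_grammatical(sentence):
            results.append((sentence, sentence, False))
        else:
            results.append((sentence, correct_spelling(sentence), True))
    return results


# Stand-in checkers (hypothetical): replace with real model calls
flagged = {"do you need a procceser"}
fixes = {"procceser": "processor"}


def fake_grammar_check(sentence):
    return sentence not in flagged


def fake_spell_check(sentence):
    return " ".join(fixes.get(word, word) for word in sentence.split())


results = clean_sentences(
    ["Will this work on a 64bit system?", "do you need a procceser"],
    fake_grammar_check,
    fake_spell_check,
)
for original, cleaned, was_flagged in results:
    print(was_flagged, "|", original, "->", cleaned)
```

Passing the checkers in as functions keeps the routing logic independent of whichever libraries end up doing the actual checks.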

Again, this is another “string to our bow” of things we can do to check the quality of our data. You can check out the vast array of examples and demos available with the John Snow Labs library here.

You can find plenty of other examples of predicting text using T5 and other transformer models. You can also use the library’s pipeline structure to easily create your own ML pipeline to perform all of these tasks in sequence. For a list of all the available tasks, you can read this introductory post listing all the T5 tasks and explanations.

Predicting paraphrased sentences

The image shows a pile of scrabble letters.
Can BERT help us predict paraphrased sentences? | Source

So far we’ve tried to predict if our test sentences contain misspelt words or grammatical errors. Now, imagine that we add these “checks” to our NLP pipeline, the output of which would be a predictor of “clean” sentences. These “clean” sentences may represent correctly formed customer queries. A final step on this pipeline may be to try and predict if a sentence is a paraphrase of an already existing sentence in your dataset.

This could be a useful tool to see if you need to process or group any new incoming sentences. If they already exist, then you may have the information already. If you’re creating a training dataset then you may want to group and label sentences accordingly. Predicting paraphrased sentences in this way has two main benefits:

  1. It’s quick: Instead of predicting a paraphrased sentence, you could find similar sentences. This would involve comparing each sentence to every other sentence to find the most similar. This can be very slow if you have a lot of data.
  2. It can use pre-trained models: Since we’re only looking for paraphrased versions of sentences, we can use the pre-trained models. Alternatively, if we were comparing sentences directly, we might need to fine-tune the model to understand things like medical or legal terminology if we were in that domain. When we use paraphrasing the model should, hopefully, mainly be concerned with looking at different ways of phrasing the question. For example, think of a sentence like “How much does deefee cost?”. This is a sentence with a made up word* which means nothing but that doesn’t stop us paraphrasing it like:
    1. What is the price of deefee?
    2. What does it cost to participate in deefee?
    3. What is the cost of deefee?
    4. Is paragoinment deefee?
    5. Is paragoinment deefee?
    6. Does deefee cost a lot?

*deefee – I generated this word via a cool website called this word does not exist which is based on GPT2 and generates, well, words that don’t exist, but it also generates a definition of them. It’s fun to see what the model comes up with. For example, it defines deefee as “a gambling event”.

Predicting paraphrasing with Sentence BERT

Sentence BERT is a great library of models that initially started using BERT with a siamese network to generate semantically meaningful sentence embeddings which you can then use for things like classification or sentence similarity. It now has a range of different models available that you can use for a number of NLP tasks, and it’s also now available as part of the HuggingFace library.

We’ll use their paraphrasing model to identify paraphrased sentences in our Amazon dataset. Note that you can use the model via the HuggingFace library or as part of the Sentence Transformers library itself.

We first need to download the model, which is easy now since it’s part of HuggingFace:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Then we need to get a sample of sentences from our dataset to test:

sentences = test_dataset_df['question'].sample(n=10).tolist()
for i, s in enumerate(sentences, 1):
    print(f'{i}: {s}')
1: Will this work on a 64bit system?
2: If I buy print shop 23 can I access my old files from print shop 22?
3: does this work on a macbook mid 2010 laptop??? (Macintosh HD OS X version 10.9.4)
4: how can i play it on my mac 0s x?
5: upgrade from FMP 6? Will FMP 10 upgrade from FMP 6?
6: Can you send me only the activation code?
7: Can you use this to renew an existing subscription to McAfee Internet Security?
8: What are options for sound editing?
9: Is this useful for walking directions?
10: does the instrumentation include a generous offering from the indian orchestra plus the hammered dulcimer?

It doesn’t look like there are many similar sentences here, so let’s see what the model thinks.

import numpy as np
import pandas as pd

paraphrases = util.paraphrase_mining(model, sentences, top_k=1)
para_list = []
for paraphrase in paraphrases[0:100]:
    score, i, j = paraphrase
    para_list.append([round(score, 2), sentences[i], sentences[j]])
para_df = pd.DataFrame(para_list, columns=['Paraphrase Likelihood', 'Sentence 1', 'Sentence 2'])
para_df.index = np.arange(1, len(para_df) + 1)
para_df.index.name = 'Result'

The image shows how to predict textual data with transformers.
Predicting paraphrasing with Sentence BERT | Source: Author

We can see that the likelihood that any of these sample sentences are paraphrases of each other is low. The highest score we see here is 42%, for “does this work on a macbook mid 2010 laptop??? (Macintosh HD OS X version 10.9.4)” being a paraphrase of “how can i play it on my mac 0s x?”.

The beauty of this is that we can easily apply it to a large number of examples. In our Amazon dataset we have over 7,500 examples. Let’s see what we can predict as likely paraphrases. It takes just under 14s to process all 7,588 examples. This time may vary depending on where you’re running your notebook, but in general, it’s fast considering what it’s doing under the hood.

paraphrases = util.paraphrase_mining(model, all_sentences, top_k=1)
CPU times: user 5min 34s, sys: 27.4 s, total: 6min 2s
Wall time: 13.6 s
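
To give a feel for what is happening under the hood, here is a toy version of the same idea: score every pair of sentence vectors by cosine similarity and sort the pairs. This is a conceptual sketch only. The three-dimensional vectors are hand-made stand-ins for real sentence embeddings, the function names are made up, and the library itself works in chunks with matrix operations instead of Python loops.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(embeddings):
    """Score every unordered pair and return [score, i, j], best first."""
    pairs = [
        [cosine(embeddings[i], embeddings[j]), i, j]
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Hand-made 3-d vectors standing in for real sentence embeddings
toy_embeddings = [
    [0.9, 0.1, 0.0],  # "can i use this on all three of my macs?"
    [0.8, 0.2, 0.1],  # "Can this be installed into 3 Macs?"
    [0.0, 0.1, 0.9],  # "Is this useful for walking directions?"
]
for score, i, j in mine_pairs(toy_embeddings):
    print(round(score, 2), i, j)

# Number of unordered sentence pairs in the full dataset
n = 7588
print(n * (n - 1) // 2)
```

For 7,588 sentences that is 28,785,078 pairs, which is why a wall time of a few seconds is impressive. It also shows that the score is a similarity measure, not a calibrated probability.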

Our top results are almost identical sentences so they’re not really paraphrases of each other.

The image shows how to predict textual data with transformers.
Predicting paraphrasing with Sentence BERT | Source: Author

But, if we look at some lower scores, we can see that it does capture some interesting linguistic nuances.

para_df.query('0.75 <= `Paraphrase Likelihood` <= 0.85')

The image shows how to predict textual data with transformers.
Predicting paraphrasing with Sentence BERT | Source: Author

For a better view let’s take a look at a random sample of the results.

for row in sample_prar_df.sample(n=20).itertuples():
    print(row[2])  # Sentence 1
    print(f'------------------ {row[1]}------------------')
    print(row[3])  # Sentence 2
Is this version a subscription product that has to be renewed every year?
------------------ 0.82------------------
Is this a subscription for one year?
Is it a renewal or not؟
------------------ 0.77------------------
Is this a 1 yr renewal. It didn't say anything on the listing. Thanks
Is this original installation software or only an update? I do not have iLife on my computer.
------------------ 0.76------------------
Is ilife 9 a stand alone program or an upgrade of previous ilife software?
Do you have to use Dragon with a microphone?
------------------ 0.75------------------
Can it work with Dragon Naturally Speaking?
Can I transfer from Quicken for PC to quick book for Mac
------------------ 0.76------------------
Is this version of Quicken for Mac ok to buy
can i use this on all three of my macs?
------------------ 0.83------------------
Can this be installed into 3 Macs?
Will this version allow for 1 user/ 2 computers?
------------------ 0.81------------------
Can I install on two computers for 1 user?
My MAC does not have a CD drive - can this be downloaded?
------------------ 0.78------------------
Does this come with a CD? Or is it a download? My new MacBook doesn't have a CD slot.
you send me the codes via email?
------------------ 0.76------------------
if i buy this one you will send the code to my e-mail ?
can it be put on computer
------------------ 0.75------------------
can i install this on my computer can my sister put it on her computer also
Better than Quicken? Is there something better than Quicken for Mac?
------------------ 0.81------------------
Is this version of Quicken for Mac ok to buy
is efile included? -- nothing is specifically mentioned other in the comparison of products
------------------ 0.79------------------
Does it include free efile
Can I just simply rip DVDs, including those infused with copy protection?
------------------ 0.84------------------
can I rip dvd's, cd's with this download?
Can you please tell me if this is compatible with Mac Mountain Lion or Mavericks?
------------------ 0.81------------------
Will this version work with mac Mavericks?
Will this work on a windows 8 tablet ?
------------------ 0.83------------------
will this work on Windows 8?
what is the license deal? What's the deal here?
------------------ 0.76------------------
Can you be more specific about license specifications? Thanks for info
Can I use this software on my macbook and if so how?
------------------ 0.83------------------
Can this software work on a Mac?
can you do business cards and is there clipart
------------------ 0.82------------------
Can you make business cards?
What are the main improvements of Manga Studio EX 5 to the previous EX version? Is it worth the upgrade?
------------------ 0.75------------------
What is the main difference between Manga Studio 5 and Manga Studio EX5?
can I install it on my laptop?
------------------ 0.85------------------
Can I install it on my desktop and laptop?

There are some interesting nuances here due to the messy and real way people phrase questions like “What is the license deal?”. Above we can see this phrased as:

  • what is the license deal? What’s the deal here?
  • Can you be more specific about license specifications? Thanks for info

These examples are seen as paraphrases of each other with a score of 0.76. That is good, since the sentences are structured quite differently. You or I would know they’re similar but it would be difficult to code up rules to know that these types of sentences are similar. 

Similarly, if we look at these examples:

  • is efile included? -- nothing is specifically mentioned other in the comparison of products
  • Does it include free efile

We see these as being predicted as paraphrases with a score of 0.79. Again, the first sentence is much longer than the second and includes information that isn’t really useful in identifying whether these are related to the same topic, i.e. stating that it’s not mentioned in the product is extra information but doesn’t serve to change the nature of the question. The answer to the query “Does it include free efile?” would likely be the same answer as the first query.

In the above examples, we predicted the best example of a paraphrased sentence in the dataset. In other words, for each sentence the model will find the best example of a paraphrased alternative in the dataset. We may want to find more than one example, since, in datasets like this, there are likely to be multiple potential paraphrased examples. We can do this by changing the top_k parameter.

paraphrases = util.paraphrase_mining(model, all_sentences, top_k=5)
para_list = []
# For this example lets sort the results via the sentence index
# This way we can list all the potential paraphrase examples together 
# Rather than sorting by score which would make it more difficult to find the same examples
for paraphrase in sorted(paraphrases, key=lambda x: x[1], reverse=True):
    score, i, j = paraphrase
    para_list.append([round(score, 2), all_sentences[i], all_sentences[j]])
para_df = pd.DataFrame(para_list, columns=['Paraphrase Likelihood', 'Sentence 1', 'Sentence 2'])
para_df.index = np.arange(1, len(para_df) + 1)
para_df.index.name = 'Result'

The image shows how to predict textual data with transformers.
Predicting paraphrasing with Sentence BERT | Source: Author
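
Once there are several candidates per sentence, it can help to group them. The sketch below does that for a list of `[score, i, j]` triples in the shape used above. The helper name, the 0.75 cut-off and most of the scores are made up for illustration; only the sentences come from the dataset samples shown earlier.

```python
from collections import defaultdict


def group_by_sentence(paraphrases, sentences, min_score=0.75):
    """Collect, for each sentence, its candidate paraphrases and scores."""
    grouped = defaultdict(list)
    for score, i, j in paraphrases:
        if score >= min_score:
            grouped[sentences[i]].append((round(score, 2), sentences[j]))
    # Highest scoring candidate first within each group
    return {key: sorted(values, reverse=True) for key, values in grouped.items()}


# Stand-in for the [score, i, j] triples; most scores here are invented
sentences = [
    "Can this be installed into 3 Macs?",
    "can i use this on all three of my macs?",
    "Can this software work on a Mac?",
    "Is this useful for walking directions?",
]
triples = [[0.83, 0, 1], [0.78, 0, 2], [0.21, 0, 3], [0.76, 1, 2]]

grouped = group_by_sentence(triples, sentences)
for sentence, matches in grouped.items():
    print(sentence)
    for score, match in matches:
        print("   ", score, match)
```

A grouping like this makes it easier to label clusters of related customer questions in one pass, instead of reviewing the pairs one row at a time.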


In this post we’ve looked at some of the ways in which transformer-based models are being used to predict different features of text. We looked at three main areas that increased in complexity, from being word based, to grammar based, to looking at whole sentences. The main areas we looked at were:

  1. Predicting spelling errors: We used a number of libraries such as spaCy and NeuSpell, and showed how they use the underlying transformer models to try and predict spelling errors. Ultimately, the transformer based models may not add much more value to correcting spelling errors than traditional algorithms. But the key here is that we looked at some useful libraries and now better understand the problem, and we may use these libraries for some other features now that we’re aware of them.
  2. Predicting grammatical errors: We then looked at the John Snow Labs NLP library which provides a whole range of nifty features. We used the grammatical error features to try and predict when a sentence in our dataset was grammatically correct or not. Transformer-based models did seem more suited to this task and it appeared to correctly identify when there were some errors in phrasing in a sample of sentences. Definitely something to keep in mind when implementing your next NLP pipeline. 
  3. Predicting paraphrased sentences: Interestingly, as the linguistic complexity of our task increased, the transformer models seemed to provide more value. We used the Sentence Transformer library to find sentences in our Amazon dataset which were paraphrases of other sentences in the dataset. This seemed very useful given that the model has no specific knowledge of the technical terms, versions, and phrases that were used in our examples. This seemed like a feature that you could use to identify if a new customer query was actually just a rephrasing of a more common query.

It seems clear that there are more and more applications being built on top of transformer-based models, which will make it easier to find and use one for your specific use case. Ideally, you can find one library that performs all of the functionality you need but, as we saw here, you can use different libraries together if you can’t find all you need in one place.


Below is a list of the libraries we used in this post. You can find many more features in these libraries, so once you get started with them I encourage you to check out any of the other features you think might be useful to your use case:

  • spaCy projects: The spaCy projects page is a good place to check for cool new models. You can upload your own model here if you like!
  • NeuSpell: We used this library for a different example of spell checking.
  • John Snow Labs: For the grammar checks we used examples from this library. As I noted, there are so many cool NLP features and models made available for free here that you’re very likely to find something useful for your use case.
  • Sentence Transformers: Finally, for the paraphrasing, we used the Sentence Transformer library. As with John Snow Labs, there are lots of great examples on their website of different applications of transformer-based models.
