Neptune Blog

Deploying Conversational AI Products to Production With Jason Flaks

Stephen Oladele

23 min

19th August, 2024

MLOps Natural Language Processing

This article was originally an episode of MLOps Live, an interactive Q&A session where ML practitioners answer questions from other ML practitioners.

Every episode is focused on one specific ML topic, and during this one, we talked to Jason Flaks about deploying conversational AI products to production.

You can watch the episode on YouTube or listen to it as a podcast. But if you prefer a written version, here you have it!

In this episode, you will learn about:

1 How to develop products with conversational AI
2 The requirements for deploying conversational AI products
3 Whether it’s better to build products on proprietary data in-house or use off-the-shelf
4 Testing strategies for conversational AI
5 How to build conversational AI solutions for large-scale enterprises

Sabine: Hello everyone, and welcome back to another episode of MLOps Live. I’m Sabine, your host, and I’m joined, as always, by my co-host Stephen.

Today, we have Jason Flaks with us, and we’ll be talking about deploying conversational AI products to production. Hi, Jason, and welcome.

Jason: Hi Sabine, how’s it going?

Sabine: It’s going very well, and looking forward to the conversation.

Jason, you are the co-founder and CTO of Xembly. It’s an automated chief of staff that automates conversational tasks. So it’s a bit like an executive assistant bot, is that correct?

Jason: Yeah, that’s a great way to frame it. The CEOs of most companies have people assisting them, maybe an executive assistant, maybe a chief of staff. That way, the CEO can focus their time on really important and meaningful tasks that power the company. The assistants are there to help handle some of the other tasks in their day, like scheduling meetings or taking meeting notes.

We are aiming to automate that functionality so that every worker in an organization can have access to that help, just like a CEO or someone else in the company would.

Sabine: Awesome.

We’ll be digging into that a bit deeper in just a moment. So just to ask a little bit about your background here, you have a pretty interesting one.

You had a bit of education in music composition, math, and science before you got more into the software engineering side of things. But you started out in software design engineering, is that correct?

Jason: Yeah, that’s right.

As you mentioned, I did start out earlier in my life as a musician. I had a passion for a lot of the electronic equipment that came from music, and I was good at math as well.

I started in college as a music composition major and a math major and then was ultimately looking for some way to combine those two. I landed in a master’s program that was an electrical engineering program exclusively focused on professional audio equipment, and that led me to an initial career in signal processing, doing software design.

That was kind of my out-of-the-gate job.

Sabine: So you find yourself in the intersection of different interesting areas, I guess.

Jason: Yeah, that’s right. I’ve really always tried to stay a little bit close to home around music and audio and engineering, even to this day.

While I’ve drifted a little bit away from professional audio, music, and live sound toward speech and natural language, it’s still tightly coupled to the audio domain, so that’s remained kind of a piece of my skill set throughout my whole career.

Sabine: Absolutely. And on the topic of equipment, you were involved in developing the Kinect, right? (For the Xbox.)

Was that your first touch with speech recognition, a machine learning application?

Jason: That’s a great question. The funny thing about speech recognition is it’s really a two-stage pipeline:

The first component of most speech recognition systems, at least historically, is extracting features. That’s very much in the audio signal processing domain, something that I had a lot of expertise in from other parts of my career.

While I wasn’t doing speech recognition, I was familiar with fast Fourier transforms and a lot of the componentry that goes into the front end of the speech recognition stack.

But you’re correct to say that when I joined the Kinect camera team, it was kind of the first time that speech recognition was really put in front of my face. I naturally gravitated towards it because I deeply understood that early part of the stack.

And I found it was really easy for me to transition from the world of audio signal processing, where I was trying to make guitar distortion effects, to suddenly breaking down speech components for analysis. It really made sense to me, and that’s where I kind of got my start.

It was a super compelling project to get my start on because the Kinect camera was really the first consumer commercial product that did open-microphone, no-push-to-talk speech recognition. At that point in time, there were no products on the market that allowed you to talk to a device without pushing a button.

You always had to push something and then speak to it. We all have Alexa or Google Home devices now. Those are common, but before those products existed, there was the Xbox Kinect camera.

You can go traverse the patent literature and see how the Alexa device references back to those original Kinect patents. It was truly an innovative product.

Sabine: Yeah, and I remember I once had a lecturer who said that about human speech, that it’s the single most complicated signal in the universe, so I guess there is no shortage of challenges in that area in general.

Jason: Yeah, that’s really true.

What is conversational AI?

Sabine: Right, so, Jason, to kind of warm you up a bit… In 1 minute, how would you explain conversational AI?

Jason: Wow, the 1 minute challenge. I’m excited…

So human dialogue or conversation is basically an unbounded, infinite domain. Conversational AI is about building technology and products that are capable of interacting with humans in this unbounded conversational domain space.

So how do we build things that can understand what you and I are talking about, partake in the conversation, and actually transact on the dialogue as it happens as well.

Sabine: Awesome. And that was very well condensed. It was like, well, within the minute.

Jason: I felt a lot of pressure to go so fast that I overdid it.

What aspects of conversational AI is Xembly currently working on?

Sabine: I wanted to ask a little bit about what your team is working on now. Are there any particular aspects of conversational AI that you’re working on?

Jason: Yeah, that’s a really good question. So there are really two sides of the conversational AI stack that we work on.

Chatbot

This is about enabling people to engage with our product via conversational speech. As we kind of mentioned at the start of this conversation, we are aiming to be an automated chief of staff or an executive assistant.

The way you interact with someone in that role is generally conversationally, and so our ability to respond to employees via conversation is super helpful.

Automated note-taking

The question becomes, how do we sit in a conversation like this over Zoom or Google Meet or any other video conference provider and generate well-written prose notes that you would immediately send out to the people in the meeting, explaining what happened in the meeting?

So this is not just a transcript. This is how we extract the action items and decisions and roll up the meeting into a readable summary such that if you weren’t present, you would know what happened.

Those are probably the two big pieces of what we’re doing in the conversational AI space, and there’s a lot more to what makes that happen, but those are kind of the two big product buckets that we’re covering today.

Sabine: So if you could sum it up on a high level, how do you go about developing this for your product?

Jason: Yeah, so let’s talk about notetaking. I think that’s an interesting one to walk through…

The first step for us is to break down the problem.

Meeting notes is actually a really complicated thing on some level. There’s a little nuance to how every human being sends different notes, so it required us to take a step back to figure out –

What’s the nugget of what makes meeting notes valuable to people and can we quantify it into something that’s structured that we could repeatedly generate?

Machines don’t deal well with ambiguity. You need to have a structured definition around what you’re trying to do so your data annotators can label information for you.

If you can’t give them really good instructions on what they’re trying to label, you’re going to get wishy-washy results.

But also just because in general, if you really want to build a crisp concrete system that produces repeatable results, you really need to define the system, so we spend a lot of time upfront just figuring out what is the structure of proper meeting notes.

In our early days, we definitely landed on the notion that there are really two critical pieces to all meeting notes.

1 The actions that come out of the meeting that people need to follow up on.
2 A linear recap that summarizes what happened in the meeting – ideally topic bounded so that it covers the sections of the meetings as they happened.

Once you have that framing, you have to make the next leap and define what those individual pieces look like, so that you understand which models you need to build in the pipeline to actually achieve it.

Scope of the conversational AI problem statements

Sabine: Was there anything else you wanted to add to that?

Jason: Yeah, so let’s think a little bit about something like action items. How does one go about defining that space so that it’s something tractable for a machine to find?

A good example is that in almost every meeting, people say things like “I’m going to go and walk my dog” because they’re just conversing with people in the meeting about non-work-related things they’re going to do.

So you have things in a meeting that are non-work related. You have things that are actually being transacted on in the meeting at that moment, like “I’m going to update that row in the spreadsheet.” And then you have true action items: things that are actually work, that must be initiated after the meeting happens, and that someone on that call is accountable for.

So how do you scope that and really refine that into a very particular domain that you can teach a machine to find?

Turns out to be a super challenging problem. We’ve spent a lot of effort doing all that scoping and then initiating the data collection process so that we can start building these models.

On top of that, you have to figure out the pipeline to build these conversational AI systems. It’s actually twofold:

1 Understanding the dialogue itself, which is just understanding the speech.
2 Normalizing that data into something a machine understands, which, in a lot of cases, you need in order to transact on it. A good example is just dates and times.

Part one of the system is understanding that someone said, “I’ll do that next week,” but that’s insufficient to transact on, on its own. If you want to transact on next week, you have to actually understand in computer language what next week actually means.

That means you have some reference to what the current date is. You need to actually be clever enough to know that next week actually means some time range, that is, in the following week from the current week that you’re in.

There’s a lot of complexity and different models you have to run to be able to do all of that and be successful at it.
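To make the "next week" example concrete, here is a minimal Python sketch of that normalization step. It is an illustration rather than Xembly's implementation: the function names are hypothetical, and it assumes that weeks run Monday to Sunday and that the only phrase handled is "next week".

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def next_week_range(today: date) -> Tuple[date, date]:
    """Return (Monday, Sunday) of the week after the one containing `today`.

    Assumption: weeks run Monday to Sunday.
    """
    # date.weekday() is 0 for Monday and 6 for Sunday.
    start_of_this_week = today - timedelta(days=today.weekday())
    start = start_of_this_week + timedelta(days=7)
    return start, start + timedelta(days=6)


def normalize_due_phrase(phrase: str, today: date) -> Optional[Tuple[date, date]]:
    """Map a spoken phrase to a concrete date range, or None if unrecognized."""
    if phrase.strip().lower() == "next week":
        return next_week_range(today)
    return None  # a real system would need to cover many more expressions
```

The key point is that the function needs the reference date as an input: the same phrase maps to a different range depending on when the meeting took place.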

Getting a conversational AI product ready

Stephen: Awesome… I’m looking at digging deeper into the note-taking product you talked about.

I’m going to be coming from the angle of production, of course, getting that out to real users, and the ambiguity stems from there.

So before I go into that complexity, I want to understand how do you deploy such products? I want to know whether there are specific nuances or requirements you put in place or if this is just typical pipeline deployment and then workflow, and then that’s it.

Jason: Yeah, that’s a good question.

I’d say, first and foremost, probably one of the biggest differences in conversational AI deployments in this notetaking stack, perhaps from the larger traditional machine learning space that exists in the world, relates to what we were talking about earlier because it’s an unbounded domain.

Fast, iterative data labeling is absolutely critical to our stack. If you think about how conversation or dialogue or just language in general works, you and I can make up a word right now, and as far as even the largest language model in the world is concerned (take GPT-3 today), that’s an undefined token.

We just created a word that’s out of vocabulary, they don’t know what it is, and they have no vector to support that word. And so language is a living thing. It’s constantly changing. And so, if you want to support conversational AI, you really need to be prepared to deal with the dynamic nature of language constantly.

That may not sound like it’s a real problem (that people are creating words on the fly all the time), but it really is. Not only is it a problem in just the general two friends chatting in a room, but it’s actually an even bigger problem from a business perspective.

Every day, someone wakes up, creates a new branded product, and invents a new word, like Xembly, to put on top of their thing. You need to make sure that you understand that.
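One simple way to notice new words like this in live data is to count tokens that fall outside a known vocabulary. The sketch below is only an illustration of that idea, not a description of Xembly's tooling; the function name, the tokenization rule, and the `min_count` threshold are all assumptions.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple


def find_oov_terms(
    utterances: Iterable[str], vocabulary: Set[str], min_count: int = 2
) -> List[Tuple[str, int]]:
    """Return (token, count) pairs for tokens missing from `vocabulary`.

    Only tokens seen at least `min_count` times are returned, so that
    one-off transcription noise does not flood the labeling queue.
    """
    counts: Counter = Counter()
    for text in utterances:
        # Naive tokenizer: lowercase runs of letters, digits, and apostrophes.
        for token in re.findall(r"[a-z0-9']+", text.lower()):
            if token not in vocabulary:
                counts[token] += 1
    return [(token, n) for token, n in counts.most_common() if n >= min_count]
```

Terms that surface repeatedly can then be routed to annotators, which is where fast labeling tooling comes in.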

So a lot of our stack, first of all, out of the gate, is making sure that we have good tooling for data labeling. We do a lot of semi-supervised type learning, so we need to be able to collect data quickly.

We need to be able to label it quickly. We need to be able to produce metrics on the data that we’re getting just off of the live data feeds so that we can use some unlabeled data with our labeled data mix in there.

I think another huge component, as I was mentioning earlier, is that conversational AI tends to require large pipelines of machine learning. You usually cannot do a one-shot “here’s a model” that handles everything, no matter what you’re reading today in the world of large language models.

There are generally a lot of pieces to make an end-to-end stack work. And so we actually need to have a full pipeline of models. We need to be able to quickly add pipelines into that stack.

It means you need good pipeline architecture such that you can interject new models anywhere in that pipeline as needed to make everything work.
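As a rough illustration of what "interjecting a model anywhere in the pipeline" can look like in code, here is a small in-process Python sketch. The `Pipeline` class and the three toy stages are hypothetical stand-ins; a production system like the one described would pass messages between separate services rather than call functions in one process.

```python
from typing import Callable, Dict, List, Tuple

Payload = Dict[str, str]
Stage = Callable[[Payload], Payload]


class Pipeline:
    """An ordered list of named stages; each stage maps a payload to a new one."""

    def __init__(self) -> None:
        self._stages: List[Tuple[str, Stage]] = []

    def add(self, name: str, stage: Stage) -> None:
        """Append a stage to the end of the pipeline."""
        self._stages.append((name, stage))

    def insert_after(self, existing: str, name: str, stage: Stage) -> None:
        """Insert a new stage directly after the stage called `existing`."""
        for index, (stage_name, _) in enumerate(self._stages):
            if stage_name == existing:
                self._stages.insert(index + 1, (name, stage))
                return
        raise KeyError(f"no stage named {existing!r}")

    def names(self) -> List[str]:
        return [name for name, _ in self._stages]

    def run(self, payload: Payload) -> Payload:
        """Pass the payload through every stage in order."""
        for _, stage in self._stages:
            payload = stage(payload)
        return payload


# Toy stand-ins for real models (hypothetical).
def clean(payload: Payload) -> Payload:
    return {**payload, "text": payload["text"].strip()}


def punctuate(payload: Payload) -> Payload:
    return {**payload, "text": payload["text"] + "."}


def capitalize(payload: Payload) -> Payload:
    text = payload["text"]
    return {**payload, "text": text[:1].upper() + text[1:]}


pipeline = Pipeline()
pipeline.add("clean", clean)
pipeline.add("punctuate", punctuate)
# Later, a new model is slotted in between the two existing stages.
pipeline.insert_after("clean", "capitalize", capitalize)
```

Because every stage shares the same payload-in, payload-out contract, adding a stage does not require changing its neighbors.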

Solving different conversational AI challenges

Stephen: Could you walk us through your end-to-end stack for the note-taking product?

Let’s just sort of see how much of a challenge each one actually poses and maybe how your team solves them as well.

Jason: Yeah, the stack consists of multiple models.

Speech recognition

It starts at the very beginning with basically converting speech to text. It’s the foundational component, so traditional speech recognition.

We want to answer the question, “how do we take the audio recording that we have here and get a text document out of that?”

Speaker segmentation

Since we’re dealing with dialogue, and in many cases, dialogue and conversation where we don’t have distinct audio channels for every speaker, there’s another huge component to our stack – speaker segmentation.

For example, I might wind up in a situation where I have a Zoom recording, where there are three independent people on channels and then there are six people in one conference room talking on a single audio channel.

To ensure the transcript that comes from the speech recognition system maps to the dialog flow correctly, we need to actually understand who’s distinctly speaking.

It’s not good enough to say, well, that was conference room B, and there were six people there, but I only understand it’s conference room B. I really need to understand every distinct speaker because part of our solution requires that we actually understand the dialogue – the back-and-forth interactions.

I need to know that this person said “no” to this request made by another person over here. In parallel with the text, we net out with a speaker assignment for who we think is speaking. We start a little bit with what we call “blind speaker segmentation.”

That means we don’t necessarily know who is whom, but we do know there are different people. Then we subsequently try to run audio fingerprinting type algorithms on top of it so that we can actually identify specifically who those people are if we’ve seen them in the past. Even after that, we kind of have one last stage in our pipeline. We call it our “format stage.”
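The identification step that follows blind segmentation can be pictured as a nearest-match lookup. The Python sketch below assumes each blind segment has already been reduced to a fixed-length embedding vector and that reference vectors exist for previously seen speakers. The embedding model, the function names, and the 0.8 threshold are all assumptions made for illustration, not details from the interview.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(
    embedding: Sequence[float],
    known_speakers: Dict[str, Sequence[float]],
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the best-matching known speaker, or None if nobody is close enough.

    None means the segment stays an anonymous "blind" speaker.
    """
    best_name, best_score = None, threshold
    for name, reference in known_speakers.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

Returning `None` below the threshold keeps the two stages separable: the system still knows the speakers are distinct even when it cannot name them.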

Format stage

We run punctuation algorithms and a bunch of other small pieces of software so that we can net out with what looks like a well-structured transcript. At this stage, we know that Sabine was talking, then Stephen, then Jason. We have the text that’s allocated to those bounds. It’s reasonably well-punctuated. And now we have something that is hopefully a readable transcript.

Forking the ML pipeline

From there, we fork our pipeline. We run in two parallel paths:

1 Generating action items
2 Generating recaps.

For action items, we run proprietary models in-house that are basically attempting to find spoken action items in that transcript. But that turns out to be insufficient because a lot of times in a meeting, what people say is, “I can do that”. If I gave you meeting notes at the end of the meeting and you got something that said action item, “Stephen said, I can do that,” that wouldn’t be super useful to you, right?

There are a bunch of things that have to happen once I’ve found that phrase to make it into well-written prose, as I mentioned earlier:

We have to dereference the pronouns.
We have to go back through the transcript and figure out what “that” was.
We reformat it.

We try to restructure that sentence into something that’s well-written. It’s like starting with the verb and replacing all those pronouns, so “I can do that” turns into “Stephen can update the slide deck with the new architecture slide.”

The other thing we do in that pipeline is run components for what we call owner extraction and due date extraction. Owner extraction is understanding that the owner of a statement was “I,” then knowing who “I” refers to back in the transcript, and then assigning the owner correctly.

Due date detection, as we mentioned, is how do I find the dates in that system? How do I normalize them so that I can present them back to everyone in the meeting?

Not that it was just due on Tuesday, but Tuesday actually means January 3, 2023, so that perhaps I can put something on your calendar so that you can get it done. That’s the action item part of our stack, and then we have the recap portion of our stack.

Along that part of our stack [recap portion], we’re really trying to do two things.

One, we’re trying to do blind topic segmentation, “How do we draw the lines in this dialogue that roughly correlate to kind of sections of the conversation?”

When we’re done here, someone would probably go back and listen to this meeting or this podcast and be able to kind of group it into sections that seem to align with some sort of topic. We need to do that, but we don’t really know what those topics are, so we use some algorithms.

We like to call these change point detection algorithms. We’re looking for a kind of systemic change in the flow of the nature of the language that tells us this was a break.
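The interview does not say which change point detection algorithm is used. As one simple stand-in, the Python sketch below compares the vocabulary of adjacent windows of sentences and marks a boundary wherever their word overlap (Jaccard similarity) drops below a threshold. The window size, the threshold, and the function names are illustrative assumptions.

```python
from typing import List, Sequence, Set


def _words(sentences: Sequence[str]) -> Set[str]:
    """Lowercased set of whitespace-separated words across the sentences."""
    return {word for sentence in sentences for word in sentence.lower().split()}


def topic_boundaries(
    sentences: Sequence[str], window: int = 2, threshold: float = 0.1
) -> List[int]:
    """Return indices i where a new topic appears to start at sentences[i].

    A boundary is flagged when the `window` sentences before i and the
    `window` sentences from i onward share almost no vocabulary.
    """
    boundaries = []
    for i in range(window, len(sentences) - window + 1):
        left = _words(sentences[i - window:i])
        right = _words(sentences[i:i + window])
        union = left | right
        similarity = len(left & right) / len(union) if union else 1.0
        if similarity < threshold:
            boundaries.append(i)
    return boundaries
```

A real system would likely compare sentence embeddings rather than raw words, but the shape of the problem is the same: look for a sudden drop in similarity between what came before and what comes after.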

Once we do that, we then basically do abstractive summarization. We use some of the modern large language models to generate well-written recaps of those segments of the conversation. When that part of the stack is done, you net out with two sections, action items and recaps, all with nicely written statements that you can hopefully immediately send out to people right after the meeting.

Build vs. open-source: which conversational AI model should you choose?

Stephen: It seems like a lot of models and sequences. It feels a little complex, and there’s a lot of overhead, which is exciting for us as we can slice through most of these things.

You mentioned most of these models being in-house proprietary.

Just curious, where do you leverage those state-of-the-art strategies or off-the-shelf models, and where do you feel like this has already been solved versus the things that you think can be solved in-house?

Jason: We try not to have the “not invented here” problem. We’re more than happy to use publicly available models if they exist and they help us get where we’re going.

There’s generally one major problem in conversational speech that tends to necessitate you build your own models versus using off-the-shelf. That’s because the domain we talked about earlier is so big – you actually can net out having a reverse problem by using very large models.

And statistically, language at scale may not reflect the language of your domain, in which case using a large model can net out with not getting the results you’re looking for.

We see this very often in speech recognition; a good example would be a proprietary speech recognition system from, let’s just say, Google for example.

One of the problems we’ll find is Google has had to train their systems to deal with transcribing all of YouTube. The language of YouTube does not actually generally map well to the language of corporate meetings.

It doesn’t mean they’re not right for the larger general space; they are. What I mean is YouTube is probably a better representation of language in the macro domain space.

We’re dealing in the sub-domain of business speech. This means if you’re probabilistically, like most machine learning models are trying to do, predicting words based on the general set of language versus the kind of constrained domain of what we’re dealing with in our world, you’re often going to predict the wrong word.

In those cases, we found it’s better to build something – if not proprietary, at least trained on your own proprietary data – in-house versus using off-the-shelf systems.

That said, there are definitely cases like summarization. I mentioned that we do recap summarization, and I think we’ve reached a point where you would be silly not to use a large language model like GPT-3 to do that.

It has to be fine-tuned, but I think you’d be silly to not use that as a base system because the results just exceed what you’re going to be able to do.

Summarizing text well, such that it’s extremely readable, is difficult, and the amount of text data you would need to acquire to train something that would do that well is just not conceivable anymore for a small company.

Now, we have these great companies like OpenAI that have done it for us. They’ve gone out and spent ridiculous sums of money training large models on amounts of data that would be difficult for any smaller organization to do.

We can just leverage that now and get some of the benefits of these really well-written summaries. All we now have to do is adapt and fine-tune it to get the results that we need out of it.

Challenges of running complex conversational AI systems

Stephen: Yeah, that’s quite interesting, and maybe I’d love us to go deeper into these challenges you face because running a complex system means it can range from the team setup to problems with computing and then you talk about quality data.

In your experience, what are the challenges that “break the system” and then you’ll go back there and fix them to get them up and running again?

Jason: Yeah, so there are a lot of problems in running these types of systems. Let me try to cover a few.

Before getting into the live inference production side of things, one of the biggest problems is what we call “machine learning technical debt” when you’re running these daisy chain systems.

We have a cascading set of models that are dependent or can become dependent on each other, and that can become problematic.

This is because when you train your downstream algorithms to handle errors coming from further upstream algorithms, introducing a new system can cause chaos.

For example, say my transcription engine makes a ton of mistakes in transcribing words. I have a gentleman on my team whose name always gets transcribed incorrectly (it’s not a traditional English name).

If we build our downstream language models to try to mask that and compensate for it, what happens when I suddenly change my transcription system or put a new one in place that actually can handle it? Now everything falls to pieces and breaks.

One of the things we try to do is not bake the error from our upstream systems into our downstream systems. We always try to assume that our models further down the pipeline are operating on pure data so that they’re not coupled. That allows us to independently upgrade all of our models and all of our systems, ideally without paying that penalty.

Now, we’re not perfect. We strive to do that, but sometimes you run into a corner where, to really get quality results, you have no choice but to do that.

But ideally, we strive for complete independence of the models in our system so that we can update them without then having to go update every other model in the pipeline – that’s a danger that you can run into.

Suddenly, when I updated my transcription system, I was getting that word I wasn’t transcribing before, but now I have to go upgrade my punctuation system because that changed how punctuation works. I have to go upgrade my action item detection system. My summarization algorithm doesn’t work anymore. I have to go fix all that stuff.

You can really trap yourself in a dangerous hole where the cost of making changes becomes extreme. That’s one component of it.

The other thing we found is when you’re running a daisy chain stack of machine learning algorithms, you need to be able to quickly rerun systems through your pipeline in any component of your pipeline.

Basically, to come down to the root of your question, we all know things break in production systems. It happens all the time. I wish it didn’t, but it does.

When you’re running queued daisy chain machine learning algorithms, if you’re not super careful, you can run into systems where data starts backing up and you have huge latency. Or, if you don’t have enough storage capacity wherever you’re keeping that data along the pipeline, things can start to implode. You can lose data. All sorts of bad things can happen.

If you properly maintain data across the various states of your system and you build good tooling so that you can constantly quickly rerun your pipelines, then you can find that you can get yourself out of trouble.

We built a lot of systems internally so that if we have a customer complaint or they didn’t receive something they expected to receive, we can go quickly find where it failed in our pipeline and quickly reinitiate it from precisely that step in the pipeline.

Maybe we had a small bug that we accidentally deployed, maybe it was just an anomaly, or we had some weird memory spike or something that caused the container to crash mid-pipeline.

After we’ve fixed any issue we uncovered, we can quickly just hit that step, push it through the rest of the system, and exit it out the end to the customer without the systems backing up everywhere and having a catastrophic failure.
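The idea of keeping data at each state of the pipeline and restarting from the failed step can be sketched in a few lines of Python. This is a simplified, in-memory illustration: `run_pipeline`, the `checkpoints` dictionary, and the step names are hypothetical, and a real system would persist intermediate results in durable storage rather than a dict.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, Callable[[str], str]]


def run_pipeline(
    steps: List[Step],
    initial_input: str,
    checkpoints: Dict[str, str],
    start_at: Optional[str] = None,
) -> str:
    """Run `steps` in order, saving each step's output in `checkpoints`.

    If `start_at` names a step, earlier steps are skipped and that step is
    fed the saved output of the step before it.
    """
    names = [name for name, _ in steps]
    start = names.index(start_at) if start_at else 0
    data = initial_input if start == 0 else checkpoints[names[start - 1]]
    for name, func in steps[start:]:
        data = func(data)
        checkpoints[name] = data  # saved so later reruns can resume from here
    return data
```

If the second of three steps crashes, the first step's output is already saved, so after a fix the run can be resumed with `start_at` set to the second step's name instead of reprocessing the recording from the beginning.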

Stephen: Right, and are these pipelines running as independent services, or are there different architectures for how they run?

Jason: Yeah, so almost all of the models in our system run as individual, independent services. We use:

Kubernetes and Containers: to scale.
Kafka: our pipelining solution for passing messages between all the systems.
Robinhood’s Faust: helps to orchestrate the different machine learning models down the pipeline. We’ve leveraged that system as well.

How did Xembly set up the ML team?

Stephen: Yeah, that’s a great point.

In terms of the ML team set-up, does the team leverage language experts in some sense, and if so, how? And even on the operations side of things, is there a separate operations team, and then you have your research or ML engineers doing these pipelines and stuff?

Basically, how’s your team set up?

Jason: In terms of the ML side of our house, there are really three components to our machine learning team:

Applied research team: they are responsible for the model building, the research side of “what models do we need,” “what types of model,” “how do we train and test them.” They generally build the models, constantly measuring precision and recall and making changes to try to improve the accuracy over time.
Data annotation team: their role is to label some sets of our data on a continuous basis.
Machine learning pipeline team: this team is responsible for doing the core software development engineering work to host all these models, figure out how the data looks on the input, the output side, how it wants to be exchanged between the different models across the stack and just the stack itself.

For example, take all of those pieces we talked about: Kafka, Faust, and MongoDB databases. They care about how we get all that stuff interacting together.
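For readers less familiar with the precision and recall that the applied research team tracks, here is a minimal Python sketch. It assumes, for illustration only, that a model flags transcript sentences as action items by index and that annotators have labeled the true ones. Precision is the share of flagged sentences that are correct, and recall is the share of true action items that were found.

```python
from typing import Iterable, Tuple


def precision_recall(predicted: Iterable[int], actual: Iterable[int]) -> Tuple[float, float]:
    """Compute (precision, recall) for predicted vs. annotated item indices."""
    predicted_set, actual_set = set(predicted), set(actual)
    true_positives = len(predicted_set & actual_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(actual_set) if actual_set else 0.0
    return precision, recall


# The model flagged sentences 1, 4, 7, and 9; annotators marked 1, 4, and 8.
precision, recall = precision_recall([1, 4, 7, 9], [1, 4, 8])
```

In this example two of the four flagged sentences are correct (precision 0.5), and two of the three annotated action items were found (recall of about 0.67).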

Compute challenges and large language models (LLMs) in production

Stephen: Nice. Thanks for sharing that. So I think another major challenge we associate with deploying large language models is in terms of the compute power whenever you get into production, right? And this is the challenge with GPT, as Sam Altman would always tweet.

I’m just curious, how do you sort of navigate that challenge of the compute power in production?

Jason: We do have compute challenges. Speech recognition, in general, is pretty compute-heavy. Speaker segmentation, anything that’s generally dealing with more of the raw audio side of the house, tends to be compute-heavy, and so those systems usually require GPUs to do that.

First and foremost, let’s say that we have some parts of our stack, especially the audio componentry, that tend to require heavy GPU machines to operate. Some of the pure language side of the house, such as the natural language processing models, can be handled purely on CPU processing. Not all, but some.

For us, one of the things is really understanding the different models in our stack. We must know which ones have to wind up on different machines and make sure we can procure those different sets of machines.

We leverage Kubernetes and Amazon (AWS) to ensure our machine learning pipeline has different sets of machines to operate on, depending on the types of those models. So we have our heavy GPU machines, and then we have our more kind of traditional CPU-oriented machines that we can run things on.

In terms of just dealing with the cost of all of that and handling it, we tend to try to do two things:

1 Independently scale our pods within Kubernetes
2 Scale the underlying EC2 hosts as well.

There’s a lot of complexity in doing that, and doing it well. Again, just talking to some of the earlier things we mentioned in our system around pipeline data and winding up with backups and crashing, you can have catastrophic failure.

You can’t afford to over- or under-scale your machines. You need to make sure that you’re effective at spinning up machines and spinning down machines and doing that hopefully right before the traffic comes in.

Basically, you need to understand your traffic flows. You need to make sure that you set up the right metrics, whether you’re doing it off CPU load or just general requests.

Ideally, you’re spinning up your machines at the right time such that you’re sufficiently ahead of that inbound traffic. But it’s absolutely critical for most people in our space that you do some type of auto-scaling.
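To make the metric-driven scaling idea concrete, the sketch below applies the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). It is a simplification that ignores the autoscaler's tolerance band and stabilization windows, and the default bounds are assumptions chosen for illustration.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    """Scale the replica count in proportion to how far the metric is from target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp so we never scale to zero or beyond the allowed ceiling.
    return max(min_replicas, min(max_replicas, raw))


# 4 pods running at 90% CPU against a 60% target: scale up.
print(desired_replicas(4, 90, 60))
# 10 pods running at 20% CPU against a 60% target: scale down.
print(desired_replicas(10, 20, 60))
```

The metric can be CPU load or request rate, as mentioned above; the formula is the same either way.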

At various points in my career doing speech recognition, we’ve had to run hundreds and hundreds and hundreds of servers to operate at scale. It can be very, very expensive. Running those servers at 3:00 in the morning, if your traffic is generally domestic US traffic, is just flushing money down the toilet.

If you can bring your machine loads down during that period of night, then you can save yourself a ton of money.
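A back-of-the-envelope calculation shows why overnight scale-down matters. All numbers below (server counts, hourly rate, the 22:00 to 06:00 quiet window) are made up for illustration and are not figures from the interview.

```python
def daily_cost(servers_by_hour: list[int], hourly_rate: float) -> float:
    """Cost of one day, given the server count for each of its 24 hours."""
    if len(servers_by_hour) != 24:
        raise ValueError("expected one server count per hour of the day")
    return sum(servers_by_hour) * hourly_rate


HOURLY_RATE = 3.0  # hypothetical cost per server-hour
PEAK_SERVERS = 300  # hypothetical daytime fleet size
OVERNIGHT_SERVERS = 30  # hypothetical fleet size during the quiet window

always_on = [PEAK_SERVERS] * 24
scaled_down = [
    OVERNIGHT_SERVERS if (hour >= 22 or hour < 6) else PEAK_SERVERS
    for hour in range(24)
]

savings = daily_cost(always_on, HOURLY_RATE) - daily_cost(scaled_down, HOURLY_RATE)
print(f"daily savings: {savings}")
```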

How do you ensure data quality when building NLP products?

Stephen: Great. I think we’ll just jump right into some questions from the community straight away.

Right, so the first question this person asks, quality data is a key requirement for building and deploying conversational AI and general NLP products, right?

How would you ensure that your data is high-quality throughout the life cycle of the product?

Jason: Pretty much, yeah. That’s a great question. Data quality is critical.

First and foremost, I’d say we actually strive to collect our own data. We found in general that a lot of the public datasets that are out there are actually insufficient for what we need. This is particularly a really big problem in the conversational speech space.

There are a lot of reasons for that. One, just again coming back to the size of the data: I once did a little bit of an estimate of what the rough size of conversational speech was, and I came up with some number like 1.25 quintillion utterances that you’d need to roughly cover the entire size of conversational speech.

That’s because speech suffers from more than just a large number of words: the words can be infinitely strung together. They can be infinitely strung together because, as you guys will probably find when you edit this podcast when we’re done, a lot of us speak incoherently. It’s okay, we’re capable of understanding each other in spite of that.

There’s not a lot of actual grammatical structure to spoken speech. We try, but it actually generally does not follow grammatical rules like we do for written speech. So the written speech domain is this big.

The conversational speech domain is really infinite. People stutter. They repeat words. If you’re operating on trigrams, for example, you have to actually accept “I I I,” the word “I” three times in a row stuttered as a viable utterance, because that happens all the time.

Now expand that out to the world of all words and all combinations, and you’re literally in an infinite data set. So you have the scale problem where there really isn’t sufficient data out there in the first place.
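The trigram point can be illustrated with a few lines of code. This is a minimal sketch using whitespace tokenization, which is far cruder than a real speech pipeline, and the 10,000-word vocabulary is an arbitrary assumption used only to show how quickly the space grows.

```python
def trigrams(utterance: str) -> list[tuple[str, ...]]:
    """Return every run of three consecutive tokens in the utterance."""
    tokens = utterance.lower().split()
    return [tuple(tokens[i : i + 3]) for i in range(len(tokens) - 2)]


def possible_trigrams(vocab_size: int) -> int:
    """Upper bound on distinct trigrams if any word can follow any other."""
    return vocab_size**3


# A stuttered utterance still has to be treated as valid input.
print(trigrams("I I I think we should meet"))
print(possible_trigrams(10_000))
```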

But you have some other problems just around privacy and legality. There are all sorts of reasons why there aren’t large conversational data sets out there. Very few companies are willing to take all their meeting recordings and put them online for the world to listen to.

That’s just not something that happens out there. There’s a limit to the amount of data. If you look for conversational data sets that are out there, like actual live audio recordings, some of them were manufactured and some of them are conference data that doesn’t really relate to the real world.

You can sometimes find government meetings, but again, those don’t relate to the world that you’re dealing with. In general, you wind up having to not leverage data that’s out there on the internet. You need to collect your own.

And so the next question is, once you have your own, how do you make sure that the quality of that data is actually sufficient? And that’s a really hard problem.

You need a good data annotation team to start with, and very, very good tooling. We’ve made use of Label Studio, which is open source (I think there’s a paid version as well). We make good use of that tool to quickly label lots and lots of data. You need to give your data annotators good tools.

I think people underappreciate how important the tooling for data labeling actually is. We also try to apply some metrics on top of our data so that we can analyze the quality of the data set over time.

We constantly run what we call our “mismatch file.” This is where we take what our annotators have labeled and then run it through our model, and we look where we get differences.

When that’s finished, we do some hand evaluation to see if the data was correctly labeled, and we repeat that process over time.

Essentially, we’re constantly checking new data labeling against what our model predictions are over time so that we are sure that our data set remains of high quality.
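The “mismatch file” process described above could be sketched roughly as follows. The label names, the toy keyword model, and the record layout are all hypothetical; the interview does not describe the actual file format or model.

```python
from typing import Callable


def build_mismatch_file(
    annotations: dict[str, str], predict: Callable[[str], str]
) -> list[dict[str, str]]:
    """Collect every example where the model disagrees with the annotator."""
    mismatches = []
    for text, human_label in annotations.items():
        model_label = predict(text)
        if model_label != human_label:
            mismatches.append(
                {"text": text, "human": human_label, "model": model_label}
            )
    return mismatches


def toy_predict(text: str) -> str:
    """Stand-in for a real model: flags sentences containing the word 'will'."""
    return "action_item" if "will" in text.lower().split() else "other"


annotations = {
    "I will send the deck tomorrow": "action_item",
    "That was a great quarter": "other",
    "Can you follow up with legal": "action_item",
}

# Each mismatch is then reviewed by hand: either the label or the model is wrong.
for row in build_mismatch_file(annotations, toy_predict):
    print(row)
```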

What domains does the ML team work on?

Stephen: Yeah, I think we forgot to ask this in the earlier part of the episode. I was curious, what domains does the team work on? Is it like a business domain or just a general domain?

Jason: Yeah, I mean, it’s generally the business domain, generally corporate meetings. That domain is still fairly large in the sense that we’re not particularly focused on any one business.

There are a lot of different businesses in the world, but it’s mostly businesses. It’s not consumer-to-consumer. It’s not me calling my mother, it’s employees in a business talking to each other.

Testing conversational AI products

Stephen: Yeah, and I’m curious about this next question, which, by the way, some companies want to ask: what’s your testing strategy for conversational AI and NLU products in general?

Jason: We have found testing in natural language really difficult in terms of model building. We do obviously have a train and test data set. We follow the traditional rules of machine learning model building to ensure that we have a good test set that’s evaluating the data.

We have at times tried to allocate kind of golden data sets, golden meetings for our notetaking pipeline, that we can at least check to kind of get a gut check: “hey, is this new system doing the right thing across the board?”

But because the system is so big, often we found that those tests are nothing other than a gut check. They’re not really viable for true evaluation at scale, so we generally test live – it’s the only way we found to sufficiently do this in an unbounded domain.

It works in two different ways depending on where we are in development. Sometimes we deploy models and run them against live data without actually exposing the results to customers.

We’ve structured all of our systems to support this. We have this well-built daisy-chain machine learning system where we can inject ML steps anywhere in the pipeline and run parallel steps, and that allows us to sometimes say, “hey, we’re going to run a model in silent mode.”

We have a new model to predict action items, we’re going to run it, and we’re going to write out the results. But that’s not what the rest of the pipeline is going to operate on. The rest of the pipeline is going to operate on the old model, but at least now we can do an A/B test and look at what both models produced and see if it looks like we’re getting better results or worse results.
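A pipeline step that runs a candidate model in silent mode might look something like the sketch below. The function names and the two toy keyword models are assumptions for illustration; the point is only that the shadow output is logged for comparison while the rest of the pipeline keeps consuming the live model's output.

```python
from typing import Callable

Model = Callable[[str], list[str]]


def run_step(
    transcript: str, live_model: Model, shadow_model: Model, shadow_log: list[dict]
) -> list[str]:
    """Return the live model's output; record the shadow model's output on the side."""
    live_output = live_model(transcript)
    try:
        shadow_output = shadow_model(transcript)
        shadow_log.append(
            {"input": transcript, "live": live_output, "shadow": shadow_output}
        )
    except Exception as exc:
        # A failing shadow model must never break the live pipeline.
        shadow_log.append({"input": transcript, "live": live_output, "error": repr(exc)})
    return live_output


def old_model(transcript: str) -> list[str]:
    """Toy live model: sentences containing 'will'."""
    return [s.strip() for s in transcript.split(".") if " will " in f" {s} "]


def new_model(transcript: str) -> list[str]:
    """Toy candidate model: also catches 'need to'."""
    return [
        s.strip()
        for s in transcript.split(".")
        if " will " in f" {s} " or " need to " in f" {s} "
    ]


log: list[dict] = []
transcript = "I will send the notes. We need to book a room."
result = run_step(transcript, old_model, new_model, log)
print(result)
print(log[0]["shadow"])
```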

But even after that, very often, we’ll push a new model out into the wild on only a percentage of traffic and then evaluate some top-line heuristics or metrics to see if we’re getting better results.

A good example in our world would be that we hope that customers will share the meeting summaries we send them. And so it’s very easy for us, for example, to change an algorithm in the pipeline and then go see, “hey, are our customers sharing our meeting notes more often?”

Because that sharing of the meeting notes tends to be a pretty good proxy for the quality of what we delivered to the customer. And so there’s a good heuristic that we can just track to say, “hey, did we get better or worse with that?”
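One common way to send only a percentage of traffic to a new model is to hash a stable identifier into a bucket, then compare a top-line metric such as share rate between the two groups. The sketch below assumes a hypothetical `shared` flag on each event and says nothing about statistical significance, which a real comparison would need.

```python
import hashlib


def in_treatment(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the new-model group for `percent`% of traffic."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def share_rate(events: list[dict]) -> float:
    """Fraction of delivered meeting summaries that the customer shared."""
    if not events:
        return 0.0
    return sum(1 for event in events if event["shared"]) / len(events)


# Hypothetical outcomes for summaries produced by the old and new model.
control_events = [{"shared": True}, {"shared": False}, {"shared": False}, {"shared": False}]
treatment_events = [{"shared": True}, {"shared": True}, {"shared": False}, {"shared": False}]

print(share_rate(control_events), share_rate(treatment_events))
```

Hashing the user ID keeps each user in the same group across meetings, so their experience does not flip between models.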

That’s generally how we test: a lot of live, in-the-wild testing. Again, that’s mostly just due to the nature of the domain. If you’re dealing in a nearly infinite domain, there’s really no test set that’s probably going to ultimately quantify whether you got better or not.

Maintaining the balance between ML monitoring and testing

Stephen: And where’s your fine line between monitoring in production versus actual testing?

Jason: I mean, we’re always monitoring all parts of our stack. We’re constantly looking for simple heuristics on the outputs of our model that might tell us if something’s gone astray.

There are metrics like perplexity, which is something that we use in language to detect whether or not we’re producing gibberish.
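For reference, perplexity is the exponential of the average negative log-probability a language model assigns to each token. The sketch below computes it from a list of per-token probabilities; the example probabilities are invented, and how Xembly obtains them is not described in the interview.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-probability; lower means more predictable text."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must be in (0, 1]")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


fluent = [0.5, 0.4, 0.6, 0.5]  # hypothetical probabilities for sensible output
gibberish = [0.01, 0.02, 0.005, 0.01]  # hypothetical probabilities for garbled output

print(perplexity(fluent))
print(perplexity(gibberish))
```

A sudden jump in perplexity on production outputs is the kind of signal that would suggest the system is producing gibberish.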

We can do simple things like just count the number of action items that we predict in a meeting. We constantly track metrics like that, which kind of tell us whether we’re going off the rails. That’s along with all sorts of monitoring that we have around the general health of the system.

For example:

Are all the Docker containers running?
Are we eating up too much CPU or too much memory?
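The output heuristics mentioned above, such as the number of action items predicted per meeting, can be checked against recent history with a simple deviation test. This is only a sketch: the three-sigma threshold and the sample history are assumptions, and a production monitor would more likely live in a metrics and alerting system.

```python
from statistics import mean, stdev


def is_off_the_rails(
    history: list[float], todays_value: float, max_sigma: float = 3.0
) -> bool:
    """Flag a value that sits more than `max_sigma` standard deviations from the recent mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return todays_value != mu
    return abs(todays_value - mu) / sigma > max_sigma


# Hypothetical average action items per meeting over the past week.
history = [4, 5, 6, 5, 4, 6, 5]

print(is_off_the_rails(history, 5))
print(is_off_the_rails(history, 25))
```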

That’s one side of the stack, which I think is a little bit different from the model building side of the house, where we’re constantly building models, running them against the training data we produce, and sending out the results as part of a daily build for our models.

We’re constantly seeing our precision-recall metrics as we’re labeling data off the wire and ingesting new data. We can constantly test the model builds themselves to see if our precision-recall metrics are perhaps going off the rails in one direction or another.
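Precision and recall for a binary label (for example, “is this sentence an action item?”) can be computed from a labeled set as shown below. The sample labels and predictions are invented; the sketch just shows the quantities a daily build would track.

```python
def precision_recall(labels: list[bool], predictions: list[bool]) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y and p)
    false_pos = sum(1 for y, p in pairs if not y and p)
    false_neg = sum(1 for y, p in pairs if y and not p)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


labels = [True, True, False, True, False]
predictions = [True, False, True, True, False]

precision, recall = precision_recall(labels, predictions)
print(f"precision={precision:.2f} recall={recall:.2f}")
```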

Open-source tools for conversational AI

Stephen: Yeah, that’s interesting. All right, let’s jump right into the next question this person asked: Can you recommend open-source tools for conversational AI?

Jason: Yeah, for sure. In the speech recognition space, there are speech recognition systems like Kaldi. I highly recommend it; it’s been one of the backbones of speech recognition for a while.

There are definitely newer systems, but you can do amazing things with Kaldi for getting up and running with speech recognition systems.

Clearly, systems like GPT-3, I would strongly recommend to people. It’s a great tool. I think it needs to be adapted. You’re going to get better results if you fine-tune it, but they’ve done a great job of providing APIs and making it easy to update those as you need.

We make a lot of use of systems like spaCy for entity detection. If you’re trying to get up and running in natural language processing in any way, I strongly recommend you get to know spaCy well. It’s a great system. It works amazingly well out of the box. There are all sorts of models. It has gotten consistently better over the years.

And as I mentioned earlier, just for data labeling, we use Label Studio. That’s an open-source tool for data labeling that supports labeling of all different types of content: audio, text, and video. It’s really easy to get going out of the box and just start labeling data quickly. I highly recommend it to people who are trying to get started.

Building conversational AI products for large-scale enterprises

Stephen: All right, thanks for sharing. Next question.

The person asks, “How do you build conversational AI products for large-scale enterprises?” What considerations would you put in place at the start of the project?

Jason: Yeah, I would say with large-scale organizations where you’re dealing with very high traffic loads, I think, for me, the biggest problem is really cost and scale.

You’re going to wind up needing a lot, a lot of server capacity to handle that type of scale in a large organization. And so, my recommendation is you really need to think through the true operation side of that stack. Whether or not you’re using Kubernetes, whether or not you’re using Amazon, you need to think about those auto-scaling components:

What are the metrics that are going to trigger your auto-scaling?
How do you get that to work?

Scaling pods in Kubernetes on top of auto-scaling EC2 hosts underneath the covers is actually nontrivial to get to work quickly. We also talked before about the complexity around some types of models that generally tend to need GPUs for compute, while others don’t.

So how do you distribute your systems onto the right type of nodes and scale them independently? And I think it also winds up being a consideration of how you allocate those machines.

What machines do you buy depending on the traffic? Which machines do you reserve? Do you buy spot instances to reduce costs? These are all the considerations in a large-scale enterprise that you must consider when getting these things up and running if you want to be successful at scale.
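One way to reason about the reserve-versus-burst question is to reserve capacity for the traffic floor and cover everything above it with cheaper-to-release capacity. The sketch below uses invented demand figures and rates, and it ignores the fact that spot capacity can be reclaimed by the provider, so treat it as a starting point for thinking rather than a purchasing model.

```python
def split_capacity(hourly_demand: list[int]) -> tuple[int, list[int]]:
    """Reserve the minimum demand as baseline; the rest is burst capacity per hour."""
    if not hourly_demand:
        raise ValueError("need at least one hour of demand")
    baseline = min(hourly_demand)
    burst = [demand - baseline for demand in hourly_demand]
    return baseline, burst


def blended_cost(
    hourly_demand: list[int], reserved_rate: float, burst_rate: float
) -> float:
    """Total cost when baseline runs at the reserved rate and the rest at the burst rate."""
    baseline, burst = split_capacity(hourly_demand)
    reserved_cost = baseline * len(hourly_demand) * reserved_rate
    return reserved_cost + sum(burst) * burst_rate


# Hypothetical machines needed in each of six consecutive hours.
demand = [20, 20, 50, 80, 50, 20]

print(split_capacity(demand))
print(blended_cost(demand, reserved_rate=1.0, burst_rate=1.5))
```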

Deploying conversational AI products on edge devices

Stephen: Awesome. Thanks for sharing that.

So let’s jump right into the next one. How do you deal with deployment and general production challenges with on-device conversational AI products?

Jason: When we say on device, are we talking about onto servers or onto more like constrained devices?

Stephen: Oh yeah, constrained devices. So edge devices and devices that don’t have that compute power.

Jason: Yeah, I mean, in general, I haven’t dealt with deploying models onto small compute devices in some years. I can just share historically, for things like the connected camera, when I worked on that, for example.

We distributed some load between the device and the cloud. For fast response, low latency things, we would run small-scale components of the system there but then shovel the more complex components off to the cloud.

I don’t know how much this relates to the question that this user was asking, but this is something that I have dealt with in the past, where basically you run a very lightweight, small speech recognition system on the device to maybe detect a wake word or just get the initial system up and running.

But then, once it’s going, you funnel all large-scale requests off to a cloud instance because you just generally can’t handle the compute of some of these systems on a small, constrained device.
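The device-and-cloud split described above can be sketched as a cheap on-device gate followed by a hand-off to a heavier cloud service. To keep the example runnable, text stands in for audio, and the wake phrase and the `fake_cloud` recognizer are hypothetical placeholders.

```python
from typing import Callable, Optional

WAKE_WORD = "hey device"  # hypothetical wake phrase


def on_device_detect(fragment: str) -> bool:
    """Cheap local check; stands in for a small on-device wake-word model."""
    return WAKE_WORD in fragment.lower()


def handle_audio(
    fragment: str, cloud_recognizer: Callable[[str], str]
) -> Optional[str]:
    """Only forward to the cloud once the wake word has been heard locally."""
    if not on_device_detect(fragment):
        return None  # nothing leaves the device
    request = fragment.lower().split(WAKE_WORD, 1)[1].strip()
    return cloud_recognizer(request)


def fake_cloud(request: str) -> str:
    """Stand-in for the heavier cloud-side recognition and understanding stack."""
    return f"cloud handled: {request}"


print(handle_audio("some background chatter", fake_cloud))
print(handle_audio("Hey device schedule a meeting", fake_cloud))
```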

Discussion on ChatGPT

Stephen: I think it would be a crime to end this episode without discussing ChatGPT. And I’m just curious, this is a common question, by the way.

What’s your opinion on ChatGPT and how people are using it today?

Jason: Yeah. Oh my god, you should have asked me that at the start because I can probably talk for an hour and a half about that.

ChatGPT and GPT, in general, are amazing. We’ve already talked a lot about this, but because it’s been trained on so much language, it can do really amazing things and write beautiful text with very little input.

But there are definitely some caveats with using those systems.

One is, as we mentioned, it is still a fixed training set. It’s not dynamically updated, so one thing to think about is that it can only maintain some state within a session. If you invent a new word while having a dialogue with it, it will generally be able to leverage that word later in the conversation.

But if you end your session and come back to it, it has no knowledge of that ever again. Another thing to be concerned about, again because it’s fixed: it really only knows about things from, I think, 2021 and before.

The original GPT-3 was from 2018 and before, so it’s unaware of modern events. But I think maybe the biggest thing that we’ve determined from using it is that it’s a large language model: it functionally is predicting the next word. It’s not intelligent, it’s not smart in any way.

It’s taken human encoding of data, which we’ve encoded as language, and then it’s learned to predict the next word, which winds up being a really good proxy for intelligence but is not intelligence itself. What happens because of that is GPT-3 or ChatGPT will make up data, because it is just predicting the next likely word. Sometimes the next likely word is not factually correct, but is probabilistically correct from the standpoint of predicting the next word.
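A toy bigram model makes the “probabilistically correct but factually wrong” point tangible. This is nothing like the architecture of GPT-3 or ChatGPT; it is a deliberately tiny stand-in with an invented corpus, showing that a next-word predictor returns whatever followed most often in training, regardless of what is true today.

```python
from collections import Counter, defaultdict
from typing import Optional


def train_bigrams(corpus: list[str]) -> dict[str, Counter]:
    """Count how often each word follows each other word."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(counts: dict[str, Counter], word: str) -> Optional[str]:
    """Return the most frequent follower of `word`, or None if it was never seen."""
    following = counts.get(word.lower())
    if not following:
        return None
    return following.most_common(1)[0][0]


# Invented training text: the meeting was usually on Monday.
corpus = [
    "the meeting is on monday",
    "the meeting is on monday",
    "the meeting is on friday",
]
counts = train_bigrams(corpus)

# Even if this week's meeting is actually on Friday, the model says "monday".
print(most_likely_next(counts, "on"))
```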

What’s a little scary about ChatGPT is that it writes so well that it can spew falsehoods in such a convincing way that, if you don’t pay really close attention, you can actually miss them. That’s maybe the scariest part.

It can be something as subtle as a negation. If you’re not really reading what it spits back, it might have done something as simple as negate what should have been a positive statement. It might have turned a yes into a no, or it might have added an apostrophe to the end of something.

If you read quickly, your eyes will just glance over it and will not notice it, but it might be completely factually wrong. In some way, we’re suffering from an abundance of greatness. It’s gotten so good, it’s so amazing at writing, that we actually now have the risk that the human evaluating it might miss that what it wrote is factually incorrect, just because it reads super well.

I think these systems are amazing; I think they’re fundamentally going to change the way a lot of machine learning and natural language processing work for a lot of people, and it’s just going to change how people interact with computers in general.

I think the thing we should all be mindful of is that it’s not a magical thing that just works out of the box, and it’s dangerous to actually assume that it is. If you want to use it for yourself, I strongly suggest that you fine-tune it.

If you’re going to try to use it out of the box and generate content for people or something like that, I strongly suggest you recommend to your customers that they review and read it, and don’t just blindly share what they’re getting out of it, because there is a reasonable chance that what’s in there may not be 100% correct.

Wrap up

Stephen: Awesome. Thanks, Jason. So that’s all from me.

Sabine: Yeah, thanks for the extra bonus comments on what is, I guess, still convincing but just fabrication for now. So let’s see where it goes. But yeah, thanks, Jason, so much for coming on and sharing your expertise and your tips.

It was great having you.

Jason: Yes, thanks, Stephen. It was really great. I enjoyed the conversation a lot.

Sabine: Before we let you go, how can people follow what you’re doing online? Maybe get in touch with you?

Jason: Yeah, so you can follow Xembly online at www.xembly.com. You can reach out to me at just my first name, jason@xembly.com. If you want to ask me any questions, I’m happy to answer. Yeah, and just check out our website, see what’s happening. We try to keep people updated regularly.

Sabine: Awesome. Thanks very much. And here at MLOps Live, we’ll be back in two weeks, as always. Next time, we’ll have with us Silas Bempong and Abhijit Ramesh, and we will be talking about doing MLOps for clinical research studies.

So in the meantime, see you on socials and the MLOps community Slack. We’ll see you very soon. Thanks and take care.


