Learnings From Building the ML Platform at Mailchimp

Piotr Niedźwiedź, Aurimas Griciūnas

19th December, 2023

This article was originally an episode of the ML Platform Podcast, a show where Piotr Niedźwiedź and Aurimas Griciūnas, together with ML platform professionals, discuss design choices, best practices, example tool stacks, and real-world learnings from building ML platforms.

In this episode, Mikiko Bazeley shares her learnings from building the ML Platform at Mailchimp.

The episode is also available on YouTube and as a podcast. If you prefer a written version, here it is!

In this episode, you will learn about:

1 ML platform at Mailchimp and generative AI use cases
2 Generative AI problems at Mailchimp and feedback monitoring
3 Getting closer to the business as an MLOps engineer
4 Success stories of ML platform capabilities at Mailchimp
5 Golden paths at Mailchimp

Who is Mikiko Bazeley

Aurimas: Hello everyone and welcome to the Machine Learning Platform Podcast. Today, I’m your host, Aurimas, and together with me, there’s a cohost, Piotr Niedźwiedź, who is a co-founder and the CEO of neptune.ai.

With us today on the episode is our guest, Mikiko Bazeley. Mikiko is a very well-known figure in the data community. She is currently the head of MLOps at FeatureForm, a virtual feature store. Before that, she was building machine learning platforms at Mailchimp.

Nice to have you here, Miki. Would you tell us something about yourself?

Mikiko Bazeley: You definitely got the details correct. I joined FeatureForm last October, and before that, I was with Mailchimp on their ML platform team. I was there before and after the big $14 billion acquisition (or something like that) by Intuit – so I was there during the handoff. Quite fun, quite chaotic at times.

But prior to that, I’ve spent a number of years working as a data analyst, a data scientist, and even in a weird MLOps/ML platform data engineer role for some early-stage startups, where I was trying to build out their platforms for machine learning and realized that’s actually very hard when you’re a five-person startup – lots of lessons learned there.

So I tell people honestly, I’ve spent the last eight years working up and down the data and ML value chain effectively – a fancy way of saying “job hopping.”

How to transition from data analytics to MLOps engineering

Piotr: Miki, you’ve been a data scientist, right? And later, an MLOps engineer. I know that you are not a big fan of titles; you’d rather talk about what you actually can do. But I’d say what you do is not a common combination.

How did you manage to jump from a more analytical, scientific type of role to a more engineering one?

Mikiko Bazeley: Most people are really surprised to hear that my background in college was not computer science. I actually did not pick up Python until about a year before I made the transition to a data scientist role.

When I was in college, I studied anthropology and economics. I was very interested in the way people worked because, to be frank, I didn’t understand how people worked. So that seemed like the logical area of study.

I was always fascinated by the way people made decisions, especially in a group. For example, what are cultural or social norms that we just kind of accept without too much thought? When I graduated college, my first job was working as a front desk girl at a hair salon.

At that point, I didn’t have any programming skills.

I think I had like one class in R for biostats, which I barely passed. Not because of intelligence or ambition, but mainly because I just didn’t understand the roadmap – I didn’t understand the process of how to make that kind of pivot.

My first pivot was to growth operations and sales hacking – it was called growth hacking at that time in Silicon Valley. And then, I developed a playbook for how to make these transitions. So I was able to get from growth hacking to data analytics, then data analytics to data science, and then data science to MLOps.

I think the key ingredients of making that transition from data science to an MLOps engineer were:

Having a really genuine desire for the kinds of problems that I want to solve and work on. That’s just how I’ve always focused my career – “What’s the problem I want to work on today?” and “Do I think it’s going to be interesting like one or two years from now?”

The second part was very interesting because there was one year I had four jobs. I was working as a data scientist, mentoring at two boot camps, and working on a real estate tech startup on the weekends.

I eventually left to work on it full-time during the pandemic, which was a great learning experience, but financially, it might not have been the best solution to get paid in sweat equity. But that’s okay – sometimes you have to follow your passion a little bit. You have to follow your interests.

Piotr: When it comes to decisions, in my context, I remember when I was still a student. I started in tech; my first job was an internship at Google as a software engineer.

I’m from Poland, and I remember when I got an offer from Google to join as a regular software engineer. The monthly salary was more than I was spending in a year. It was two or three times more.

It was very tempting to follow where money was at that moment. I see a lot of people in the field, especially at the beginning of their careers, thinking more short-term. The concept of looking a few steps, a few years ahead, I think it’s something that people are missing, and it’s something that, by the end of the day, may result in better outcomes.

I always ask myself when there is a decision like that: “What would happen if in a year it’s a failure and I’m not happy? Can I go back and pick up the other option?” And usually, the answer is “yes, you can.”

I know that decisions like that are challenging, but I think that you made the right call and you should follow your passion. Think about where this passion is leading.

Resources that can help bridge the technical gap

Aurimas: I also have a very similar background. I switched from analytics to data science, then to machine learning, then to data engineering, then to MLOps.

For me, it was a little bit of a longer journey because I kind of had data engineering and cloud engineering and DevOps engineering in between.

You shifted straight from data science, if I understand correctly. How did you bridge that – I would call it a technical chasm – that is needed to become an MLOps engineer?

Mikiko Bazeley: Yeah, absolutely. That was part of the work at the early-stage real estate startup. Something I’m a very big fan of is boot camps. When I graduated college, I had a very bad GPA – very, very bad.

I don’t know how they score a grade in Europe, but in the US, for example, it’s usually out of a 4.0 system, and I had a 2.4, and that is just considered very, very bad by most US standards. So I didn’t have the opportunity to go back to a grad program and a master’s program.

It was very interesting because by that point, I had approximately six years of experience working with executive-level leadership at companies like Autodesk, Teladoc, and other companies that are either very well known globally – or at least very, very well known domestically, within the US.

I had C-level people saying: “Hey, we will write you those letters to get into grad programs.”

And grad programs were like, “Sorry, nope! You have to go back to college to redo your GPA.” And I’m like, “I’m in my late 20s. College is expensive, I’m not gonna do that.”

So I’m a big fan of boot camps.

What helped me both in the transition to the data scientist role and then to the MLOps engineer role was doing a combination of boot camps. When I was moving to the MLOps engineer role, I also took a pretty well-known workshop called Full Stack Deep Learning. It’s taught by Dimitri and Josh Tobin, who went off to start Gantry. I really enjoyed it.

I think sometimes people go into boot camps thinking that’s gonna get them a job, and it just really doesn’t. It’s just a very structured, accelerated learning format.

What helped me in both of those transitions was truly investing in my mentor relationship. For example, when I first pivoted from data analytics to data science, my mentor at that time was Rajiv Shah, who is the developer advocate at Hugging Face now.

I’ve been a mentor at boot camps since then – at a couple of them. A lot of times, students will kind of check in and they’ll be like “Oh, why don’t you help me grade my project? How was my code?”

And that’s not a high-value way of leveraging an industry mentor, especially when they come with such credentials as Rajiv Shah came with.

With the Full Stack Deep Learning course, there were some TAs there who were absolutely amazing. What I did was show them my project for grading. But for example, when moving to the data scientist role, I asked Rajiv Shah:

How do I do model interpretability if marketing, if my CMO is asking me to create a forecast, and predict results?
How do I get this model in production?
How do I get buy-in for these data science projects?
How do I leverage the strengths that I already have?

And I coupled that with the technical skills I was developing.

I did the same thing with the ML platform role. I would ask:

What is this course not teaching me right now that I should be learning?
How do I develop my body of work?
How do I fill in these gaps?

I think I developed the skills through a combination of things.

You need to have a structured curriculum, but you also need to have projects to work with, even if they are sandbox projects – that kind of exposes you to a lot of the problems in developing ML systems.

Looking for boot camp mentors

Piotr: When you mention mentors, did you find them during boot camps or did you have other ways to find mentors? How does it work?

Mikiko Bazeley: With most boot camps, it comes down to picking the right one, honestly. For me, I chose Springboard for my data science transition, and then I used them a little bit for the transition to the MLOps role, but I relied more heavily on the Full Stack Deep Learning course – and a lot of independent study and work too.

I didn’t finish the Springboard one for MLOps, because I’d gotten a couple of job offers by that point from four or five different companies for an MLOps engineer role.

Piotr: And was it because of the boot camp? Because you said, many people use boot camps to find jobs. How did it work in your case?

Mikiko Bazeley: The boot camp didn’t put me in contact with hiring managers. What I did instead is where having public branding comes into play.

I definitely don’t think I’m an influencer. For one, I don’t have the audience size for that. What I try to do, very similar to what a lot of the folks here right now on the podcast do, is to try to share my learnings with people. I try to take my experiences and then frame them like “Okay, yes, these kinds of things can happen, but this is also how you can deal with it”.

I think building in public and sharing that learning was just so crucial for me to get a job. I see so many of these job seekers, especially on the MLOps side or the ML engineer side.

You see them all the time with a headline like: “data science, machine learning, Java, Python, SQL, or blockchain, computer vision.”

It’s two things. One, they’re not treating their LinkedIn profile as a website landing page. But at the end of the day, that’s what it is, right? Treat your landing page well, and then you might actually retain visitors, similar to a website or a SaaS product.

But more importantly, they’re not actually doing the important thing that you do with social networks, which is you have to actually engage with people. You have to share with folks. You have to produce your learnings.

So as I was going through the boot camps, that’s what I would essentially do. As I learned stuff and worked on projects, I would combine that with my experiences, and I would just share it out in public.

I would just try to be really – I don’t wanna say authentic, that’s a little bit of an overused term – but there’s the saying, “Interesting people are interested.” You have to be interested in the problems, the people, and the solutions around you. People can connect with that. If you’re just faking it like a lot of ChatGPT and Gen AI folks are – faking it with no substance – people can’t connect.

You need to have that real interest, and you need to have something with it. So that’s how I did that. I think most people don’t do that.

Piotr: There is one more factor that is needed. I’m struggling with it when it comes to sharing. I’m learning different stuff, but once I learn it, then it sounds kind of obvious, and then I’m kind of ashamed that maybe it’s too obvious. And then I just think: Let’s wait for something more sophisticated to share. And that never comes.

Mikiko Bazeley: The impostor syndrome.

Piotr: Yeah. I need to get rid of it.

Mikiko Bazeley: Aurimas, do you feel like you ever got rid of the impostor syndrome?

Aurimas: No, never.

Mikiko Bazeley: I don’t. I just find ways around it.

Aurimas: Everything that I post, I think it’s not necessarily worth other people’s time, but it looks like it is.

Mikiko Bazeley: It’s almost like you just have to set up things to get around your worst nature. All your insecurities – you just have to trick yourself like a good diet and workout.

What is FeatureForm, and the different types of feature stores

Aurimas: Let’s talk a little bit about your current work, Miki. You’re the Head of MLOps at FeatureForm. Once, I had a chance to talk with the CEO of FeatureForm and he left me with a good impression about the product.

What is FeatureForm? How is FeatureForm different from other players in the feature store market today?

Mikiko Bazeley: I think it comes down to understanding the different types of feature stores that are out there, and even understanding why a virtual feature store is maybe just a terrible name for what FeatureForm is category-wise; it’s not very descriptive.

There are three types of feature stores. Interestingly, they roughly correspond to the waves of MLOps and reflect how different paradigms have developed.

The three types are:

1 Literal,
2 Physical,
3 Virtual.

Most people understand literal feature stores intuitively. A literal feature store is literally just a feature store. It will store the features (including definitions and values) and then serve them. That’s pretty much all it does. It’s almost like a very specialized data storage solution.

For example, Feast. Feast is a literal feature store. It’s a very lightweight option you can implement easily, which means implementation risk is low. There’s essentially no transformation, orchestration, or computation going on.

Piotr: Miki, if I may, why is it lightweight? I understand that a literal feature store stores features. It kind of replaces your storage, right?

Mikiko Bazeley: When I say lightweight, I mean kind of like implementing Postgres. So, technically, it’s not super lightweight. But if we compare it to a physical feature store and put the two on a spectrum, it is.

A physical feature store has everything:

It stores features,
It serves features,
It orchestrates features,
It does the transformations.

In that respect, a physical feature store is heavyweight in terms of implementation, maintenance, and administration.

Piotr: On the spectrum, the physical feature store is the heaviest?

And in the case of a literal feature store, the transformations are implemented somewhere else and then saved?

Mikiko Bazeley: Yes.
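
To make the "literal" idea concrete, here is a minimal sketch of a store that only keeps definitions and values and serves them back. The class and method names (`LiteralFeatureStore`, `write`, `serve`) are hypothetical and are not Feast's API; the point is only that the transformation runs outside the store and the result is saved afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class LiteralFeatureStore:
    """Toy 'literal' store: keeps definitions (metadata) and values, then serves them."""

    definitions: dict = field(default_factory=dict)  # feature name -> description
    values: dict = field(default_factory=dict)       # (feature name, entity id) -> value

    def write(self, feature, entity_id, value, description=""):
        # Values arrive already computed; the store runs no transformation itself.
        self.definitions.setdefault(feature, description)
        self.values[(feature, entity_id)] = value

    def serve(self, feature, entity_id):
        return self.values[(feature, entity_id)]


# The transformation lives outside the store (here, a plain function).
def total_spend(order_amounts):
    return sum(order_amounts)


store = LiteralFeatureStore()
store.write(
    "total_spend",
    "customer_1",
    total_spend([20.0, 35.5]),
    description="Sum of order amounts",
)
print(store.serve("total_spend", "customer_1"))
```

A physical feature store, by contrast, would also own the code path that runs `total_spend` and schedules it.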

Aurimas: And the feature store itself is just a library, which is basically performing actions against storage. Correct?

Mikiko Bazeley: Yes, well, that’s almost an implementation detail. But yeah, for the most part. Feast, for example, is a library. It comes with different providers, so you do have a choice.

Aurimas: You can configure it against S3, DynamoDB, or Redis, for example. The lightness, I guess, comes from it being just a thin library on top of this storage, and you manage the storage yourself.

Mikiko Bazeley: 100%.
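
The "thin library on top of storage you manage" idea can be sketched as an interface with interchangeable backends. Everything below is an illustrative assumption: `OnlineStore`, `InMemoryStore`, `PrefixedStore`, and `FeatureServer` are made-up names standing in for real backends such as Redis or DynamoDB, and no real client library is used.

```python
from typing import Protocol


class OnlineStore(Protocol):
    """Interface the library codes against; any backend that matches can be plugged in."""

    def put(self, key: str, value: float) -> None: ...

    def get(self, key: str) -> float: ...


class InMemoryStore:
    """Stand-in for one storage backend (hypothetical)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class PrefixedStore:
    """Stand-in for a second backend with a different key layout (hypothetical)."""

    def __init__(self, prefix):
        self._prefix = prefix
        self._data = {}

    def put(self, key, value):
        self._data[f"{self._prefix}:{key}"] = value

    def get(self, key):
        return self._data[f"{self._prefix}:{key}"]


class FeatureServer:
    """Depends only on the OnlineStore interface, so the backend can be swapped."""

    def __init__(self, store: OnlineStore):
        self.store = store

    def materialize(self, feature, entity_id, value):
        self.store.put(f"{feature}/{entity_id}", value)

    def lookup(self, feature, entity_id):
        return self.store.get(f"{feature}/{entity_id}")


# The calling code is identical for both backends.
for backend in (InMemoryStore(), PrefixedStore("prod")):
    server = FeatureServer(backend)
    server.materialize("total_spend", "customer_1", 55.5)
    print(type(backend).__name__, server.lookup("total_spend", "customer_1"))
```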

Piotr: So there is no backend? There’s no component that stores metadata about this feature store?

Mikiko Bazeley: In the case of the literal feature store, all it does is store features and metadata. It won’t actually do any of the heavy lifting of the transformation or the orchestration.

Piotr: So what is a virtual feature store, then? I understand physical feature stores, this is quite clear to me, but I’m curious what a virtual feature store is.

Mikiko Bazeley: Yeah, so in the virtual feature store paradigm, we attempt to take the best of both worlds.

There is a use case for the different types of feature stores. The physical feature stores came out of companies like Uber, Twitter, Airbnb, etc. They were solving really gnarly problems when it came to processing huge amounts of data in a streaming fashion.

The challenge with physical feature stores is that you’re pretty much locked into your provider or the providers they choose. You can’t actually swap them out. For example, if you wanted to use Cassandra or Redis as what we call the “inference store” or the “online store,” you can’t do that with a physical feature store. Usually, you just take whatever providers they give you. It’s almost like a specialized data processing and storage solution.

With the virtual feature store, we try to take the flexibility of a literal feature store where you can swap out providers. For example, you can use BigQuery, AWS, or Azure. And if you want to use different inference stores, you have that option.

What virtual feature stores do is focus on the actual problems that feature stores are supposed to solve, which is not just versioning, not just documentation and metadata management, and not just serving, but also the orchestration of transformations.

For example, at FeatureForm, we do this because we are Kubernetes native. We’re assuming that data scientists, for the most part, don’t want to write transformations elsewhere. We assume that they want to do stuff they normally would, with Python, SQL, and PySpark, with data frames.

They just want to be able to, for example, wrap their features in a decorator or write them as a class if they want to. They shouldn’t have to worry about the infrastructure side. They shouldn’t have to provide all this fancy configuration and have to figure out what the path to production is – we try to make that as streamlined and simple as possible.
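
As an illustration of "wrapping a feature in a decorator," here is a minimal sketch. The `feature` decorator and `REGISTRY` dictionary are hypothetical and are not FeatureForm's API; they only show how ordinary Python logic can be recorded as a named definition without the author touching infrastructure.

```python
REGISTRY = {}


def feature(name, description=""):
    """Hypothetical decorator: records the function as a named feature definition."""

    def wrap(fn):
        REGISTRY[name] = {"fn": fn, "description": description}
        return fn  # the function stays usable as ordinary Python

    return wrap


@feature(
    "first_90_day_spend",
    description="Money spent in the customer's first three months",
)
def first_90_day_spend(orders):
    # orders: list of (days_since_signup, amount) tuples
    return sum(amount for day, amount in orders if day <= 90)


orders = [(3, 20.0), (45, 30.0), (120, 99.0)]
print(first_90_day_spend(orders))
print(sorted(REGISTRY))
```

In a real system, the registry would live in a shared service rather than a module-level dictionary, so that teammates can discover the definition and its description.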

The idea is that you have a new data scientist that joins the team…

Everyone has experienced this: you go to a new company, and you basically just spend the first three months trying to look for documentation in Confluence. You’re reading people’s Slack channels to be clear on what exactly they did with this forecasting and churn project.

You’re hunting down the data. You find out that the queries are broken, and you’re like “God, what were they thinking about this?”

Then a leader comes to you, and they’re like, “Oh yeah, by the way, the numbers are wrong. You gave me these numbers, and they’ve changed.” And you’re like, “Oh shoot! Now I need lineage. Oh God, I need to track.”

The part that really hurts a lot of enterprises right now is regulation. Any company that does business in Europe has to obey GDPR; that’s a big one. And a lot of companies in the US, for example, are under HIPAA, which applies to medical and health companies. So for a lot of them, lawyers are very involved in the ML process. Most people don’t realize this.

In the enterprise space, when lawyers are faced with a lawsuit or a new regulation comes out, they are the ones who need to ask, “Okay, can I track what features are being used and in what models?” Those kinds of workflows are the things that we’re really trying to solve with the virtual feature store paradigm.

It’s about making sure that when a data scientist is doing feature engineering, which is really the most heavy and intensive part of the data science process, they don’t have to go to all these different places and learn new languages when the feature engineering is already so hard.

Virtual feature store in the picture of a broader architecture

Piotr: So Miki, let’s look at it from two perspectives. First, from an administrator’s perspective: let’s say we are going to deploy a virtual feature store as a part of our tech stack. I need to have storage, like S3 or BigQuery. I would need to have the infrastructure to perform computations. It can be a cluster run by Kubernetes or maybe something else. And then, the virtual feature store is an abstraction on top of the storage and compute components.

Mikiko Bazeley: Yeah, so we actually did a talk at Data Council. We had released what we call a “market map,” but that’s not actually quite correct. We had released a diagram of what we think the ML stack, the architecture should look like.

The way we look at it is that you have computation and storage, which are just things that run across every team. These are not what we call layer zero, layer one. These are not necessarily ML concerns because you need computation and storage to run an e-commerce website. So, we’ll use that e-commerce website as an example.

The layer above that is where you have the providers or, for a lot of folks – if you’re a solo data scientist, for example – maybe you just need access to GPUs for machine learning models. Maybe you really like to use Spark, and you have your other serving providers at that layer. So here’s where we start seeing a little bit of the differentiation for ML problems.

Underneath that, you might also have Kubernetes, right? Because that also might be doing the orchestration for the full company. So the virtual feature store goes above your Spark, your Ray, and your Databricks offering, for example.

Now, above that though, and we’re seeing this now with, for example, the midsize space, there’s a lot of folks who’ve been publishing amazing descriptions of their ML system. For example, Shopify published a blog post about Merlin. There are a few other folks, I think DoorDash has also published some really good stuff.

But now, people are also starting to look at what we call these unified MLOps frameworks. That’s where you have your ZenML, and a few others that are in that top layer. The virtual feature store would fit in between your unified MLOps framework and your providers like Databricks, Spark, and all that. Below that would be Kubernetes and Ray.

Virtual feature stores from an end-user perspective

Piotr: All this was from an architectural perspective. What about the end-user perspective? I assume that when it comes to the end-users of the feature store, at least one of the personas will be a data scientist. How will a data scientist interact with the virtual feature store?

Mikiko Bazeley: So ideally, the interaction would be, I don’t wanna say it would be minimal. But you would use it to the extent that you would use Git. Our principle is to make it really easy for people to do the right thing.

Something I learned when I was at Mailchimp from the staff engineer and tech lead for my team was to assume positive intent – which I think is just such a lovely guiding principle. I think a lot of times there’s this weird antagonism between ML/MLOps engineers, software engineers, and data scientists where it’s like, “Oh, data scientists are just terrible at coding. They’re terrible people. How awful are they?”

Then data scientists are looking at the DevOps engineers or the platform engineers going, “Why do you constantly create really bad abstractions and really leaky APIs that make it so hard for us to just do our job?” Most data scientists just do not care about infrastructure.

And if they do care about infrastructure, they are just MLOps engineers in training. They’re on the step to a new journey.

Every MLOps engineer can tell a story that goes like, “Oh God, I was trying to debug or troubleshoot a pipeline,” or “Oh God, I had a Jupyter notebook or a pickled model, and my company didn’t have the deployment infrastructure.” I think that’s the origin story of every caped MLOps engineer.

In terms of the interaction, ideally, the data scientists shouldn’t have to be setting up infrastructure like a Spark cluster. What they do need is just the credential information, which should be, I don’t wanna say fairly easy to get, but if it’s really hard for them to get it from their platform engineers, then that is maybe a sign of some deeper communication issues.

But all they would need to do is get the credential information and put it in a configuration file. At that point, we use the term “registering” at FeatureForm, but essentially it’s mostly through decorators. They just need to kind of tag things like “Hey, by the way, we’re using these data sources. We’re creating these features. We’re creating these training datasets.” Since we offer versioning and we say features are a first-class immutable entity or citizen, they also provide a version and never have to worry about writing over features or having features of the same name.

Let’s say you have two data scientists working on a problem.

They’re doing a forecast for customer lifetime value for our e-commerce example. And maybe it’s “money spent in the first three months of the customer’s journey” or what campaign they came through. If you have two data scientists working on the same logic, and they both submit, as long as the versions are named differently, both of them will be logged against that feature.

That allows us to also provide the tracking and lineage. We help materialize the transformations, but we won’t actually store the data for the features.
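
The "two data scientists, same feature name, different versions" scenario can be sketched as a registry keyed by a (name, variant) pair that refuses overwrites. This is a toy under stated assumptions: `FeatureRegistry`, `VariantExistsError`, and the `variant` and `owner` parameters are hypothetical names, not FeatureForm's interface.

```python
class VariantExistsError(Exception):
    """Raised when someone tries to overwrite an existing (name, variant) pair."""


class FeatureRegistry:
    def __init__(self):
        self._entries = {}  # (name, variant) -> {"fn": ..., "owner": ...}

    def register(self, name, variant, owner):
        def wrap(fn):
            key = (name, variant)
            if key in self._entries:
                # Immutability: an existing variant is never silently replaced.
                raise VariantExistsError(f"{name}:{variant} already exists")
            self._entries[key] = {"fn": fn, "owner": owner}
            return fn

        return wrap

    def variants(self, name):
        return sorted(v for (n, v) in self._entries if n == name)

    def get(self, name, variant):
        return self._entries[(name, variant)]["fn"]


registry = FeatureRegistry()


@registry.register("early_spend", variant="first_90_days", owner="data_scientist_a")
def early_spend_90(orders):
    return sum(amount for day, amount in orders if day <= 90)


@registry.register("early_spend", variant="first_30_days", owner="data_scientist_b")
def early_spend_30(orders):
    return sum(amount for day, amount in orders if day <= 30)


# Both submissions are kept side by side under the same feature name.
print(registry.variants("early_spend"))
```

Because nothing is overwritten, every variant remains available for lineage queries later on.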

Dataset and feature versioning

Piotr: Miki, a question because you used the term “decorator.” The only decorator that comes to my mind is a Python decorator. Are we talking about Python here?

Mikiko Bazeley: Yes!

Piotr: You also mentioned that we can version features, but when it comes to that, conceptually a data set is a set of samples, right? And a sample consists of many features. Which leads me to the question of whether you would also version datasets with a feature store?

Mikiko Bazeley: Yes!

Piotr: So what is the glue between versioned features? How can we represent datasets?

Mikiko Bazeley: We don’t version datasets. We’ll version sources, which also include features, with the understanding that you can use features as sources for other models.

You could use FeatureForm with a tool like DVC. That has come up multiple times. We’re not really interested in versioning full data sets. For example, for sources, we can take tables or files. If people made modifications to that source or that table or that file, they can log that as a variation. And we’ll keep track of those. But that’s not really the goal.

We want to focus more on the feature engineering side. And so what we do is version the definitions. Every feature consists of two components. It’s the values and the definition. Because we create these pure functions with FeatureForm, the idea is that if you have the same input and you push it through the definitions that we’ve stored for you, then we will transform it, and you should ideally get the same output.
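
One way to picture "versioning the definition rather than the data" is to give each definition a stable fingerprint and evaluate it with a pure function. The sketch below is an assumption-laden illustration: the `fingerprint` and `evaluate` helpers and the dictionary layout of a definition are invented here and do not describe how FeatureForm stores definitions.

```python
import hashlib
import json


def fingerprint(definition: dict) -> str:
    """Stable ID for a definition: hash of its canonical JSON form."""
    canonical = json.dumps(definition, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]


def evaluate(definition: dict, orders):
    """Pure function: the output depends only on the definition and the input rows."""
    window = definition["window_days"]
    return sum(amount for day, amount in orders if day <= window)


v1 = {"name": "early_spend", "window_days": 90}
v2 = {"name": "early_spend", "window_days": 30}
orders = [(3, 20.0), (45, 30.0), (120, 99.0)]

# Same definition plus same input gives the same output on every call.
assert evaluate(v1, orders) == evaluate(v1, orders)

# Changing the logic changes the fingerprint, so the two versions stay distinguishable.
print(fingerprint(v1), evaluate(v1, orders))
print(fingerprint(v2), evaluate(v2, orders))
```

Note that this says nothing about versioning the input rows themselves; that is the part a tool like DVC would cover.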

Aurimas: If you plug a machine learning pipeline after a feature store and you retrieve a dataset, it’s already a pre-computed set of features that you saved in your feature store. For this, you’d probably need to provide a list of entity IDs, just like all other feature stores require you to do, correct? So you would version this entity ID list plus the computation logic, such that the feature you versioned plus the source equals a reproducible chunk.

Would you do it like this, or are there any other ways to approach this?

Mikiko Bazeley: Let me just repeat the question back to you:

Basically, what you’re asking is, can we reproduce exact results? And how do we do that?

Aurimas: For a training run, yeah.

Mikiko Bazeley: OK. That goes back to a statement I made earlier. We don’t version the dataset or the data input. We version the transformations. In terms of the actual logic itself, people can register individual features, but they can also zip those features together with a label.

What we guarantee is that whatever you write for your development features, the same exact logic will be mirrored for production. And we do that through our serving client. In terms of guaranteeing the input, that’s where we as a company say, “Hey, you know, there’s so many tools to do that.”
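
A rough sketch of "zip features together with a label" and of mirroring the same logic at serving time is shown below. The function names (`build_training_set`, `serve`) and the `FEATURES` mapping are hypothetical; the sketch only demonstrates that both paths call the same definitions, so there is no second implementation to drift.

```python
def early_spend(orders):
    return sum(amount for day, amount in orders if day <= 90)


def order_count(orders):
    return len(orders)


# One shared set of definitions, used by both training and serving.
FEATURES = {"early_spend": early_spend, "order_count": order_count}


def build_training_set(raw, labels):
    """Offline path: zip the feature values for each entity with its label."""
    rows = []
    for entity_id, orders in raw.items():
        row = {name: fn(orders) for name, fn in FEATURES.items()}
        row["label"] = labels[entity_id]
        rows.append(row)
    return rows


def serve(entity_id, raw):
    """Online path: reuses the exact same functions as the offline path."""
    return {name: fn(raw[entity_id]) for name, fn in FEATURES.items()}


raw = {
    "customer_1": [(3, 20.0), (45, 30.0), (120, 99.0)],
    "customer_2": [(10, 5.0)],
}
labels = {"customer_1": 1, "customer_2": 0}

training_set = build_training_set(raw, labels)
print(training_set)
print(serve("customer_1", raw))
```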

That’s kind of the philosophy of the virtual feature store. A lot of the early waves of MLOps were solving the lower layers, like “How fast can we make this?”, “What’s the throughput?”, “What’s the latency?” We don’t do that. For us, we’re like, “There’s so many great options out there. We don’t need to focus on that.”

Instead, we focus on the parts that we’ve been told are really difficult. For example, minimizing train and serve skew, and specifically, minimizing it through standardizing the logic that’s being used so that the data scientist isn’t writing their training logic once and then having to rewrite it in Spark, SQL, or something like that. I don’t want to say that this is a guarantee for reproducibility, but that’s where we try to at least help out a lot.

With regard to the entity ID: We get the entity ID, for example, from the front end team as an API call. As long as the entity ID is the same and the feature or features they’re calling are the right version, they should get the same output.

And that’s some of the use cases people have told us about. For example, if they want to test out different kinds of logic, they could:

create different versions of the features,
create different versions of the training sets,
feed one version of the data to different models.

They can do ablation studies to see which model performed well and which features did well and then roll it back to the model that performed best.
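
A toy version of such an ablation study might look like the following. All of it is illustrative: the data is made up, the training set variants are just feature subsets, and a leave-one-out 1-nearest-neighbour classifier stands in for whatever model a team would actually train.

```python
def loo_accuracy(rows, feature_names):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the chosen features."""
    correct = 0
    for i, row in enumerate(rows):
        others = [r for j, r in enumerate(rows) if j != i]
        nearest = min(
            others,
            key=lambda r: sum((r[f] - row[f]) ** 2 for f in feature_names),
        )
        correct += nearest["label"] == row["label"]
    return correct / len(rows)


# Made-up rows: early_spend separates the labels, order_count does not.
rows = [
    {"early_spend": 10, "order_count": 5, "label": 0},
    {"early_spend": 12, "order_count": 1, "label": 0},
    {"early_spend": 15, "order_count": 4, "label": 0},
    {"early_spend": 80, "order_count": 2, "label": 1},
    {"early_spend": 90, "order_count": 5, "label": 1},
    {"early_spend": 95, "order_count": 1, "label": 1},
]

# Each training set variant is a different selection of features.
training_set_variants = {
    "spend_only": ("early_spend",),
    "count_only": ("order_count",),
    "both": ("early_spend", "order_count"),
}

scores = {
    name: loo_accuracy(rows, features)
    for name, features in training_set_variants.items()
}
best = max(scores, key=scores.get)
print(scores)
print("keep variant:", best)
```

Because every variant is stored rather than overwritten, rolling back is a matter of pointing the model at the variant that scored best.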

The value of feature stores

Piotr: To sum up, would you agree that when it comes to the value that a feature store brings to the tech stack of an ML team, it brings versioning of the logic behind feature engineering?

If we have versioned logic for a given set of features that you want to use to train your model and you would save somewhere a pointer to the source data that will be used to compute specific features, then what we are getting is basically dataset versioning.

So on one hand you need to have the source data, and you need to version it somehow, but also you need to version the logic to process the raw data and compute the features.

Mikiko Bazeley: I’d say there are three or four main points to the value proposition. The first is definitely versioning of the logic. The second part is documentation, which is a huge part. I think everyone has had the experience where they look at a project and have no idea why someone chose the logic that they did. For example, logic to represent a customer or a contract value in a sales pipeline.

So versioning, documentation, transformation, and orchestration. The way we say it is you “write once, serve twice.” We offer that guarantee. And then, along with the orchestration aspect, there’s also things like scheduling. But those are the three main things:

Versioning,
Documentation,
Minimizing train-serve skew through transformations.

Those are the three big ones that people ask us for.

Feature documentation in FeatureForm

Piotr: How does documentation work?

Mikiko Bazeley: There are two types of documentation. There is, I don’t want to say incidental documentation, but there is documenting through code and assistive documentation.

Assistive documentation is, for example, docstrings. You can explain, “Hey, this is the logic of the function, this is what the terms mean,” etc. We offer that.

But then there is also documenting through code as much as possible. For example, you have to list the version of the feature or the training set, or the source that you’re using. We also try to break out the type of the resource that’s being created. At least for the managed version of FeatureForm, we also offer governance, user access control, and things like that. We also offer lineage of the features, for example, linking a feature to the model that’s being used with it. We try to build in as much documentation through code as possible.
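A small sketch of what “documenting through code” could look like: a feature cannot be registered without its version, source, description, and owner, and lineage is kept as a list of models that use it. The `FeatureDefinition` class, the registry, and all field names are hypothetical, not FeatureForm's actual interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureDefinition:
    name: str
    version: str
    source: str
    description: str
    owner: str
    used_by_models: List[str] = field(default_factory=list)  # lineage

    def __post_init__(self) -> None:
        # Refuse to create a feature that is missing required metadata.
        for attr in ("name", "version", "source", "description", "owner"):
            if not getattr(self, attr).strip():
                raise ValueError(f"feature metadata field '{attr}' must not be empty")


REGISTRY: Dict[str, FeatureDefinition] = {}


def register(feature: FeatureDefinition) -> None:
    """Store the feature under a 'name:version' key."""
    REGISTRY[f"{feature.name}:{feature.version}"] = feature


def features_for_model(model: str) -> List[str]:
    """Lineage lookup: which registered features does a model depend on?"""
    return sorted(key for key, feat in REGISTRY.items() if model in feat.used_by_models)


register(FeatureDefinition(
    name="avg_order_value", version="v2", source="orders_table",
    description="Mean order value over the last 90 days.",
    owner="data-science", used_by_models=["churn-model"],
))
register(FeatureDefinition(
    name="open_rate", version="v1", source="email_events",
    description="Share of delivered emails that were opened.",
    owner="data-science", used_by_models=["churn-model", "send-time-model"],
))
```

With this shape, the question “what features are being used and with what models” becomes a lookup rather than an investigation.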

We’re always looking at different ways we can continue to expand the capabilities of our dashboard to assist with the assistive documentation. We’re also thinking of other ways that different members of the ML lifecycle or the ML team – both the ones that are obvious, like the MLOps engineer, data scientists, but also the non-obvious people, like lawyers, can have visibility and access into what features are being used and with what models. Those are the different kinds of documentation that we offer.

ML platform at Mailchimp and generative AI use cases

Aurimas: Before joining FeatureForm as the head of MLOps, you were a machine learning operations engineer at Mailchimp, and you were helping to build the ML platform there, right? What kind of problems were the data scientists and machine learning engineers solving at Mailchimp?

Mikiko Bazeley: There were a couple of things. When I joined Mailchimp, there was already some kind of a platform team there. It was a very interesting situation, where the MLOps and the ML Platform concerns were roughly split across three teams.

There was the team that I was on, where we were very intensely focused on making tools and setting up the environment for development and training for data scientists, as well as helping out with the actual productionization work.
There was a team that was focused on serving the live models.
And there was a team that was constantly evolving. They started off as doing data integrations, and then became the ML monitoring team. That’s kind of where they’ve been since I left.

Generally speaking, across all teams, the problem that we were trying to solve was: How do we provide passive productionization for data scientists at Mailchimp, given all the different kinds of projects they were working on?

For example, Mailchimp was the first place I had seen where they had a strong use case for business value for generative AI. Anytime a company comes out with generative AI capabilities, the company I benchmark them against is Mailchimp – just because they had such a strong use case for it.

Aurimas: Was it content generation?

Mikiko Bazeley: Oh, yeah, absolutely. It’s helpful to understand what Mailchimp is for additional context.

Mailchimp is a 20-year-old company. It’s based in Atlanta, Georgia. Part of the reason why it was bought out for so much money was because it’s also the largest… I don’t want to say provider. They have the largest email list in the US because they started off as an email marketing solution. But what most people, I think, are not super aware of is that for the last couple of years, they have been making big moves into becoming sort of like the all-in-one shop for small, medium-sized businesses who want to do e-commerce.

There’s still email marketing. That’s a huge part of what they do, so NLP is very big there, obviously. But they also offer things like social media content creation, e-commerce virtual digital websites etc. They essentially tried to position themselves as the front-end CRM for small and medium-sized businesses. They were bought by Intuit to become the front-end of Intuit’s back-of-house operations, such as QuickBooks and TurboTax.

With that context, the goal of Mailchimp is to provide the marketing stuff. In other words, the things that the small mom-and-pop businesses need to do. Mailchimp seeks to make it easier and to automate it.

One of the strong use cases for generative AI they were working on was this: Let’s say you’re a small business owner running a t-shirt or a candle shop. You are the sole proprietor, or you might have two or three employees. Your business is pretty lean. You don’t have the money to afford a full-time designer or marketing person.

You can go to Fiverr, but sometimes you just need to send emails for holiday promotions.

Although that’s low-value work, if you were to hire a contractor to do that, it would be a lot of effort and money. One of the things Mailchimp offered through their creative studio product or services, I forgot the exact name of it, was this:

Say, Leslie of the candle shop wants to send that holiday email. What she can do is go into the creative studio and say, “Hey, here’s my website or shop or whatever, generate a bunch of email templates for me.” The first thing it would do is to generate stock photos and the color palettes for your email.

Then Leslie goes, “Hey, okay, now, give me some templates to write my holiday email, but do it with my brand in mind,” so her tone of voice, her speaking style. It then lists other kinds of details about her shop. Then, of course, it would generate the email copy. Next, Leslie says, “Okay, I want several different versions of this so I can A/B test the email.” Boom! It would do that…

The reason why I think this is such a strong business use case is because Mailchimp is the largest provider. I intentionally don’t say provider of emails because they don’t provide emails, they –

Piotr: … the sender?

Mikiko Bazeley: Yes, they are the largest secure business for emails. So Leslie has an email list that she’s already built up. She can do a couple of things. Her email list is segmented out – that’s also something Mailchimp offers. Mailchimp allows users to create campaigns based on certain triggers that they can customize on their own. They offer a nice UI for that. So, Leslie has three email lists. She has high spenders, medium spenders, and low spenders.

She can connect the different email templates with those different lists, and essentially, she’s got that end-to-end automation that’s directly tied into her business. For me, that was a strong business value proposition. A lot of it is because Mailchimp had built up a “defensive moat” through the product and their strategy that they’ve been working on for 20 years.

For them, the generative AI capabilities they offer are directly in line with their mission statement. It’s also not the product. The product is “we’re going to make your life super easy as a small or medium sized business owner who might’ve already built up a list of 10,000 emails and has interactions with their website and their shop”. Now, they also offer segmentation and automation capabilities – you normally have to go to Zapier or other providers to do that.

I think Mailchimp is just massively benefiting from the new wave. I can’t say that for a lot of other companies. Seeing that as an ML platform engineer when I was there was super exciting because it also exposed me early on to some of the challenges of working with not just multi-model ensemble pipelines, which we had there for sure, but also testing and validating generative AI or LLMs.

For example, if you have them in your system or your model pipeline, how do you actually evaluate it? How do you monitor it? The big thing that a lot of teams get super wrong is actually the data product feedback on their models.

Companies and teams really don’t understand how to integrate that to further enrich their data science machine learning initiatives and also the products that they’re able to offer.

Piotr: Miki, the funny conclusion is that the greetings we are getting from companies during holidays are not only not personalized, but also even the body of the text is not written by a person.

Mikiko Bazeley: But they are personalized. They’re personalized to your persona.

Generative AI problems at Mailchimp and feedback monitoring

Piotr: That’s fair. Anyways, you said something very interesting: “Companies don’t know how to treat feedback data,” and I think with generative AI type of problems, it is even more challenging because the feedback is less structured.

Can you share with us how it was done at Mailchimp? What type of feedback was it, and what did your teams do with it? How did it work?

Mikiko Bazeley: I will say that when I left, the monitoring initiatives were just getting off the ground. Again, it’s helpful to understand the context with Mailchimp. They’re a 20-year-old, privately owned company that never had any VC funding.

They still have physical data centers that they rent, and they own server racks. They had only started transitioning to the cloud a relatively short time ago – maybe less than eight years ago or closer to six.

This is a great decision that maybe some companies should think about. Rather than moving the entire company to the cloud, Mailchimp said, “For now, what we’ll do is we’ll move the burgeoning data science and machine learning initiatives, including any of the data engineers that are needed to support those. We’ll keep everyone else in the legacy stack for now.”

Then, they slowly started migrating shards to the cloud and evaluated that. Since they were privately owned and had a very clear north star, they were able to make technology decisions in terms of years as opposed to quarters – unlike some tech companies.

What does that mean in terms of the feedback? It means there’s feedback that’s generated through the product data that is surfaced back up into the product itself – a lot of that was in the core legacy stack.

The data engineers for the data science/machine learning org were mainly tasked with bringing over data and copying data from the legacy stack over into GCP, which was where we were living. The stack of the data science/machine learning folks on GCP was BigQuery, Spanner, Dataflow, and AI Platform Notebooks, which is now Vertex. We were also using Jenkins, Airflow, Terraform, and a couple of others.

But the big role of the data engineers there was getting that data over to the data science and machine learning side. For the data scientists and machine learning folks, there was a latency of approximately one day for the data.

At that point, it was very hard to do things. We could do live service models – which was a very common pattern – but a lot of the models had to be trained offline. We created a live service out of them, exposed the API endpoint, and all that. But there was a latency of about one to two days.

With that being said, something they were working on, for example, was… and this is where the tight integration with product needs to happen.

One piece of feedback that had been given was about creating campaigns – what we call the “journey builder.” A lot of owners of small and medium-sized businesses are the CEO, the CFO, the CMO; they’re doing it all. They’re like, “This is actually complicated. Can you suggest how to build campaigns for us?” That was feedback that came in through the product.

The data scientist in charge of that project said, “I’m going to build a model that will give a recommendation for the next three steps or the next three actions an owner can take on their campaign.” Then we all worked with the data engineers to go, “Hey, can we even get this data?”

Once again, this is where legal comes into play and says, “Are there any legal restrictions?” And then it’s essentially about getting that into the datasets that could be used in the models.

Piotr: This feedback is not data but more qualitative feedback from the product based on the needs users express, right?

Mikiko Bazeley: But I think you need both.

Aurimas: You do.

Mikiko Bazeley: I don’t think you can have data feedback without product and front-end teams. For example, a very common place to get feedback is when you share a recommendation, right? Or, for example, Twitter ads.

You can say, “Is this ad relevant to you?” It’s yes or no. This makes it very simple to offer that option in the UI. And I think a lot of folks think that the implementation of data feedback is very easy. When I say “easy”, I don’t mean that it doesn’t require a strong understanding of experimentation design. But assuming you have that, there are lots of tools like A/B tests, predictions, and models. Then, you can essentially just write the results back to a table. That’s not actually hard. What is hard a lot of times is getting the different engineering teams to sign on to that, to even be willing to set that up.

Once you have that and you have the experiment, the website, and the model that it was attached to, the data part is easy, but I think getting the product buy-in and getting the engineering or the business team on board with seeing there’s a strategic value in enriching our datasets is hard.
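The “write the results back to a table” step can be sketched with nothing more than the standard library. The example below stores yes/no relevance answers together with the experiment and model version they belong to; the table layout and the names `record_feedback` and `relevance_rate` are assumptions for illustration, not a description of Mailchimp's setup.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the feedback table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS feedback (
               experiment    TEXT NOT NULL,
               model_version TEXT NOT NULL,
               item_id       TEXT NOT NULL,
               relevant      INTEGER NOT NULL CHECK (relevant IN (0, 1))
           )"""
    )


def record_feedback(conn: sqlite3.Connection, experiment: str,
                    model_version: str, item_id: str, relevant: bool) -> None:
    """Write one 'Is this relevant to you?' answer back to the table."""
    conn.execute(
        "INSERT INTO feedback VALUES (?, ?, ?, ?)",
        (experiment, model_version, item_id, int(relevant)),
    )


def relevance_rate(conn: sqlite3.Connection, model_version: str) -> float:
    """Share of 'yes' answers for one model version (0.0 if it has none)."""
    row = conn.execute(
        "SELECT AVG(relevant) FROM feedback WHERE model_version = ?",
        (model_version,),
    ).fetchone()
    return row[0] if row[0] is not None else 0.0


conn = sqlite3.connect(":memory:")
init_db(conn)
record_feedback(conn, "ad-relevance", "model-a", "ad-1", True)
record_feedback(conn, "ad-relevance", "model-a", "ad-2", False)
record_feedback(conn, "ad-relevance", "model-b", "ad-1", True)
```

Keeping the experiment and model version on every row is what later lets the feedback be joined back to training data.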

For example, when I was at Data Council last week, they had a generative AI panel. What I got out of that discussion was that boring data and ML infrastructure matter a lot. They matter even more now.

A lot of this MLOps infrastructure is not going to go away. In fact, it becomes more important. The big discussion there was like, “Oh, we are running out of the public corpus of data to train and fine-tune on.” And what they mean by that is we’re running out of high-quality academic data sets in English to use our models with. So people are like, “Well, what happens if we run out of data sets on the web?” And the answer is it goes back to first-party data – it goes back to the data that you, as a business, actually own and can control.

It was the same discussion that happened when Google said, “Hey, we’re gonna get rid of the ability to track third-party data.” A lot of people were freaking out. If you build that data feedback collection and align it with your machine learning efforts, then you won’t have to worry. But if you’re a company where you’re just a thin wrapper around something like an OpenAI API, then you should be worried because you’re not delivering value no one else could offer.

It’s the same with the ML infrastructure, right?

Getting closer to the business as an MLOps engineer

Piotr: The baseline just went up, but to be competitive, to do something on top, you still need to have something proprietary.

Mikiko Bazeley: Yeah, 100%. And that’s actually where I believe MLOps and data engineers think too much like engineers…

Piotr: Can you elaborate more on that?

Mikiko Bazeley: I don’t want to just say they think the challenges are technical. A lot of times there are technical challenges. But, a lot of times, what you need to get is time, headroom, and investment. A lot of times, that means aligning your conversation with the strategic goals of the business.

I think a lot of data engineers and MLOps engineers are not great with that. I think data scientists oftentimes are better at that.

Piotr: That’s because they need to deal with the business more often, right?

Mikiko Bazeley: Yeah!

Aurimas: And the developers are not directly providing value…

Mikiko Bazeley: It’s like public health, right? Everyone undervalues public health until you’re dying of a water contagion issue. It’s super important, but people don’t always surface how important it is. More importantly, they approach it from a “this is the best technical solution” perspective as opposed to “this will drive immense value for the company.” Companies really care only about two or three things:

1 Generating more revenue or profit
2 Cutting costs or optimizing them
3 A combination of both of the above.

If MLOps and data engineers can’t align their efforts with those goals, especially around building an ML stack, a business person or even the head of engineering is going to be like, “Why do we need this tool? It’s just another thing people here are not gonna be using.”

The strategy to kind of counter that is to think about what KPIs and metrics they care about. Show the impact on those. The next part is also offering a plan of attack, and a plan for maintenance.

The thing I’ve observed extremely successful ML platform teams do is the opposite of the stories you hear about. A lot of stories you hear about building ML platforms go like, “We created this new thing and then we brought in this tool to do it. And then people just used it and loved it.” This is just another version of, “if you build it, they will come,” and that’s just not what happens.

You have to read between the lines of the story of a lot of successful ML platforms. What they did was to take an area or a stage of the process that was already in motion but wasn’t optimal. For example, maybe they already had a path to production for deploying machine learning models but it just really sucked.

What teams would do is build a parallel solution that was much better and then invite or onboard the data scientists to that path. They would do the manual stuff associated with adopting users – it’s the whole “do things that don’t scale,” you know. Do workshops. Help them get their project through the door.

The key point is that you have to offer something that is actually truly better. When data scientists or users have a baseline of, “We do this thing already, but it sucks,” and then you offer them something better – I think there’s a term called “differentiable value” or something like that – you essentially have a user base of data scientists that can do more things.

If you go to a business person or your CTO and say, “We already know we have 100 data scientists that are trying to push models. This is how long it’s taking them. Not only can we cut that time down to half, but we can also do it in a way where they’re happier about it and they’re not going to quit. And it’ll provide X amount more value because these are the initiatives we want to push. It’s going to take us about six months to do it, but we can make sure we can cut down to three months.” Then you can show those benchmarks and measurements as well as offer a maintenance plan.

A lot of these conversations are not about technical supremacy. It’s about how to socialize that initiative, how to align it with your executive leaders’ concerns, and do the hard work of getting the adoption of the ML platform.

Success stories of the ML platform capabilities at Mailchimp

Aurimas: Do you have any success stories from Mailchimp? What practices would you suggest in communicating with machine learning teams? How do you get feedback from them?

Mikiko Bazeley: Yeah, absolutely. There’s a couple of things we did well. I’ll start with Autodesk for context.

When I was working at Autodesk I was in a data scientist/data analyst hybrid role. Autodesk is a design-oriented company. They make you take a lot of classes like design thinking and about how to collect user stories. That’s something I had also learned in my anthropology studies: how do you create what they call ethnographies, which is like, “How do you go to people, learn about their practices, understand what they care about, speak in their language?”

That was the first thing that I did there on the team. I landed there and was like, “Wow, we have all these tickets in Jira. We have all these things we could be working on.” The team was working in all these different directions, and I was like, “Okay, first off, let’s just make sure we all have the same baseline of what’s really important.”

So I did a couple of things. The first was to go back through some of the tickets we had created. I went back through the user stories, talked to the data scientists, talked to the folks on the ML platform team, and created a process to gather this feedback: let’s all independently score or group the feedback, and let’s “t-shirt size” the efforts. From there, we could establish a rough roadmap or plan.

One of the things we identified was templating. The templating was a little bit confusing. More importantly, this is around the time the M1 Mac was released. It had broken a bunch of stuff for Docker. Part of the templating tool was essentially to create a Docker image and to populate it with whatever configurations based on the type of machine learning project they were doing. What we wanted to get away from was local development.

All of our data scientists were doing work in our AI Platform notebooks. Then they would have to pull down the work locally, then push that work back to a separate GitHub instance, and all that sort of stuff. We wanted to simplify this process as much as possible, and specifically wanted to find a way to connect the AI Platform notebook.

You would create a template within GCP, which you then could push out to GitHub, which then would trigger the CI/CD, and then also eventually trigger the deployment process. That was a project I worked on. And it looks like it did help. I worked on the V1 of that, and then additional folks took it and matured it even further. Now, data scientists ideally don’t have to go through that weird push-pull from remote to local during development.

That was something that to me was just a really fun project because I kind of had this impression of data scientists, and even from my own work, that you develop locally. But it was a little bit of a disjointed process. There were a couple of other things too, but that back-and-forth between remote and local development was the big one. That was a hard process too, because we had to think about how to connect it to Jenkins and then how to get around the VPC and all that.

A book that I’ve been reading recently that I really love is called “Kill It With Fire” by Marianne Bellotti. It’s about how to update legacy systems, how to modernize them without throwing them away. That was a lot of the work I was doing at Mailchimp.

Up until this point in my career, I was used to working at startups where the ML initiative was really new and you had to build everything from scratch. I hadn’t understood that when you’re building an ML service or tool for an enterprise company, it’s a lot harder. You have a lot more constraints on what you can actually use.

For example, we couldn’t use GitHub Actions at Mailchimp. That would have been nice, but we couldn’t. We had an existing templating tool and a process that data scientists already were using. It existed, but it was suboptimal. So how would we optimize an offering that they would be willing to actually use? There were a lot of learnings from it, but the pace in an enterprise setting is a lot slower than what you could do either at a startup or even as a consultant. So that’s the one drawback. A lot of times, the number of projects you can work on is about a third of what it would be someplace else, but it was very fascinating.

Team structure at Mailchimp

Aurimas: I’m very interested to learn whether the data scientists were the direct users of your platform or if there were also machine learning engineers involved in some way – maybe embedded into the product teams?

Mikiko Bazeley: There’s two answers to that question. Mailchimp had a design- and engineering-heavy culture. A lot of the data scientists who worked there, especially the most successful ones, had prior experience as software engineers. Even if the process was a little bit rough, a lot of times they were able to find ways to kind of work with it.

But, in the last two or three years, Mailchimp started hiring data scientists that were more on the product and business side. They didn’t have experience as software engineers. This meant they needed a little bit of help. Thus, each team that was involved in MLOps or the ML platform initiatives had what we called “embedded MLOps engineers.”

They were kind of close to an ML engineering role, but not really. For example, they weren’t building the models for data scientists. They were literally only helping with the last mile to production. The way I usually like to think of an ML engineer is as a full-stack data scientist. This means they’re writing up features and developing the models. We had folks that were just there to help the data scientists get their project through the process, but they weren’t building the models.

Our core users were data scientists, and they were the only ones. We had folks that would help them out with things such as answering tickets, Slack questions, and helping to prioritize bugs. That would then be brought back to the engineering folks that would work on it. Each team had this mix of people that would focus on developing new features and tools and people that had about 50% of their time assigned to helping the data scientists.

Intuit had acquired Mailchimp about six months before I left, and it usually takes about that long for changes to actually start kicking in. I think what they have done is to restructure the teams so that a lot of the enablement engineers were now on one team and the platform engineers were on another team. But before, while I was there, each team had a mix of both.

Piotr: So there was no central ML platform team?

Mikiko Bazeley: No. It was essentially split along training and development, and then serving, and then monitoring and integrations.

Aurimas: It’s still a central platform team, but made up of multiple stream-aligned teams. They’re kind of part of a platform team, probably providing platform capabilities, like in Team Topologies.

Mikiko Bazeley: Yeah, yeah.

Piotr: Did they share a tech stack and processes, or did each ML team with data scientists and support people have their own realm, own tech stack, own processes? Or did you have initiatives to share some basics? For example, you mentioned templates being used across teams.

Mikiko Bazeley: Most of the stack was shared. I think the Team Topologies way of describing teams in organizations is actually fantastic. It’s a fantastic way to describe it. Because there are four team types, right? There are the stream-aligned teams, which in this case is data science and product. You have complicated-subsystem teams, which are the Terraform team or the Kubernetes team, for example. And then you have enablement and platform.

Each team was a mix of platform and enablement. For example, the resources that we did share were BigQuery, Spanner, and Airflow. But the difference is, and this is something that I think a lot of platform teams actually miss: the goal of the platform team isn’t always to own a specific tool or a specific layer of the stack. A lot of times, if you are so big that you have those specializations, the goal of the platform team is to piece together not just the existing tools, but occasionally also bring new tools into a unified experience for your end user – which for us were the data scientists. Even though we shared BigQuery, Airflow, and all that great stuff, other teams were using those resources as well. But they might not be interested, for example, in deploying machine learning models to production. They might not actually be involved in that aspect at all.

What we did was to say, “Hey, we’re going to essentially be your guides to enable these other internal tools. We’re going to create and provide abstractions.” Occasionally, we would also bring in tools that we thought were necessary. For example, a tool that was not used by the serving team was Great Expectations. They didn’t really touch that because it’s something that you would mostly use in development and training – you wouldn’t really use Great Expectations in production.

There were a couple of other things too… Sorry. I can’t think of them all off the top of my head, but there were three or four other tools the data scientists needed to use in development and training, but they didn’t need them for production. We would incorporate those tools into the paths to production.

The serving layer was a thin Python client that would take the Docker containers or images that were being used for the models. It was then exposed to the API endpoint so that teams up front could route any of the requests to get predictions from the models.
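To make the routing idea concrete, here is a deliberately simplified sketch of a thin layer that maps a model name to whatever produces its predictions. It is not Mailchimp's serving code: the `ModelRouter` class is hypothetical, and plain Python callables stand in for the containerized models.

```python
from typing import Callable, Dict, List

Predictor = Callable[[List[float]], float]


class ModelRouter:
    """Maps a model name to the callable that produces its predictions."""

    def __init__(self) -> None:
        self._models: Dict[str, Predictor] = {}

    def register(self, name: str, predictor: Predictor) -> None:
        self._models[name] = predictor

    def predict(self, name: str, features: List[float]) -> Dict[str, object]:
        """Route one request; unknown model names get a 404-style response."""
        if name not in self._models:
            return {"status": 404, "error": f"unknown model '{name}'"}
        prediction = self._models[name](features)
        return {"status": 200, "model": name, "prediction": prediction}


router = ModelRouter()
# Stand-ins for containerized models; a real deployment would call out to them.
router.register("churn", lambda features: sum(features) / len(features))
router.register("open_rate", lambda features: max(features))
```

Front-end teams only need to know the model name and the request shape; which image or container answers the request stays behind the router.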

The pipelining stack

Piotr: Did you use any pipelining tools? For instance, to allow automatic or semi-automatic retraining of models. Or would data scientists just train a model, package it into a Docker image and then it was kind of closed?

Mikiko Bazeley: We had projects that were in various stages of automation. Airflow was a big tool that we used. That was the one that everyone in the company used across the board. The way we interacted with Airflow was as follows: With Airflow, a lot of times you have to go and write your own DAG and create it. Quite often, that can actually be automated, especially if it’s just running the same type of machine learning pipeline that was built into the cookiecutter template. So we said, “Hey, when you’re setting up your project, you go through a series of interview questions. Do you need Airflow? Yes or no?” If they said “yes”, then that part would get filled out for them with the relevant information on the project and all that other stuff. And then it would substitute in the credentials.
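The “Do you need Airflow? Yes or no?” step can be sketched as a project generator that only emits a DAG file when the answer is yes. This is an assumption-laden illustration using only the standard library: the `render_project` function, the file layout, and the Airflow 2-style DAG text inside the template are invented here and are not the actual cookiecutter template used at Mailchimp.

```python
from string import Template
from typing import Dict

# Text of a minimal Airflow 2-style DAG; it is only rendered, never imported here.
DAG_TEMPLATE = Template('''\
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="${project_slug}_training", start_date=datetime(2023, 1, 1), catchup=False) as dag:
    train = BashOperator(task_id="train", bash_command="python -m ${project_slug}.train")
''')


def render_project(answers: Dict[str, str]) -> Dict[str, str]:
    """Turn interview answers into a mapping of file path -> file content."""
    files = {"README.md": f"# {answers['project_name']}\n"}
    if answers.get("needs_airflow", "no").lower() == "yes":
        slug = answers["project_name"].lower().replace(" ", "_")
        files[f"dags/{slug}_dag.py"] = DAG_TEMPLATE.substitute(project_slug=slug)
    return files


with_airflow = render_project({"project_name": "Churn Model", "needs_airflow": "yes"})
without_airflow = render_project({"project_name": "Churn Model", "needs_airflow": "no"})
```

The data scientist answers one question, and the project-specific details are filled into the DAG for them.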

Piotr: How did they know whether they needed it or not?

Mikiko Bazeley: That is actually something that was part of the work of optimizing the cookiecutter template. When I first got there, data scientists had to fill out a lot of these questions. Do I need Airflow? Do I need XYZ? And for the most part, a lot of times they would have to ask the enablement engineers “Hey, what should I be doing?”

Sometimes there were projects that needed a little bit more of a design consultation, like “Can we support this model or this system that you’re trying to build with the existing paths that we offer?” And then we would help them figure that out, so that they could go on and set up the project.

It was a pain when they would set up the project and then we’d look at it and go, “No, this is wrong. You actually need to do this thing.” And they would have to rerun the project creation. Something that we did as part of the optimization was to say, “Hey, just pick a pattern and then we’ll fill out all the configurations for you”. Most of them could figure it out pretty easily. For example, “Is this going to be a batch prediction job where I just need to copy values? Is this going to be a live service model?” Those two patterns were pretty easy for them to figure out, so they could go ahead and say, “Hey, this is what I want.” They could just use the image that was designed for that particular job.

The template process would run, and then they could just fill it out: “Oh, this is the project name, yada, yada…” They didn’t have to fill out the Python version. We would automatically set it to the most stable, up-to-date version, but if they needed version 3.2 and Python’s at 3.11, they would specify that. Other than that, ideally, they should be able to do their jobs of writing the features and developing the models.
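The “pick a pattern and we’ll fill out the configurations” approach might look roughly like the sketch below: each pattern carries its own defaults, and the Python version is only specified when a project needs something other than the platform default. The pattern names follow the two mentioned above; the config keys, image names, and `DEFAULT_PYTHON` value are illustrative assumptions.

```python
from typing import Dict, Optional

DEFAULT_PYTHON = "3.11"  # assumed platform default

# Hypothetical presets: one per supported project pattern.
PATTERNS: Dict[str, Dict[str, object]] = {
    "batch_prediction": {
        "base_image": "ml-batch", "needs_airflow": True, "expose_endpoint": False,
    },
    "live_service": {
        "base_image": "ml-serving", "needs_airflow": False, "expose_endpoint": True,
    },
}


def project_config(name: str, pattern: str,
                   python_version: Optional[str] = None) -> Dict[str, object]:
    """Build a full project config from a pattern name and a few answers."""
    if pattern not in PATTERNS:
        raise ValueError(f"unknown pattern {pattern!r}; choose one of {sorted(PATTERNS)}")
    config = dict(PATTERNS[pattern])  # copy so the preset itself is never mutated
    config["project_name"] = name
    config["python_version"] = python_version or DEFAULT_PYTHON
    return config


batch = project_config("send-time-scoring", "batch_prediction")
live = project_config("subject-line-ranker", "live_service", python_version="3.9")
```

Replacing a list of individual questions with one pattern choice removes most of the ways a project could be set up wrong.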

The other cool part was that we had been looking at offering them native Streamlit support. That was a common part of the process as well. Data scientists would create the initial models. And then they would create a Streamlit dashboard. They would show it to the product team and then product would use that to make “yes” or “no” decisions so that the data scientists could proceed with the project.

More importantly, if new product folks wanted to join and they were interested in a model, looking to understand how this model worked or what capabilities models offered, then they could go to that Streamlit library, or the data scientists could send them the link to it, and they could go through and quickly see what a model did.

Aurimas: This sounds like a UAT environment, right? User acceptance tests in pre-production.

Piotr: Maybe more like “tech stack on demand”? Like you specify what your project is and you get the tech stack and configuration, plus an example of how similar projects with the same setup were done.

Mikiko Bazeley: Yeah, I mean, that’s kind of how it should be for data scientists, right?

Piotr: So you were not providing just a one-size-fits-all tech stack for Mailchimp’s ML teams; they had a selection. They were able to have a more personalized tech stack per project.

Size of the ML organization at Mailchimp

Aurimas: How many paths did you support? Because I know that I’ve heard of teams whose only job basically was to bake new template repositories daily to support something like 300 use cases.

Piotr: How big was that team? And how many ML models did you have?

Mikiko Bazeley: The data science team was anywhere from 20 to 25, I think. And in terms of the engineering side of the house, there were six on my team, there might’ve been six on the serving team, and another six on the data integrations and monitoring team. And then we had another team that was the data platform team. So they’re very closely associated with what you would think of as data engineering, right?

They helped maintain, and owned, the copying of data from Mailchimp’s legacy stack over to BigQuery and Spanner. There were a couple of other things that they did, but that was the big one. They also made sure that the data was available for analytics use cases.

And there were people using that data that were not necessarily involved in ML efforts. That team was another six to eight. So in total, we had about 24 engineers for 25 data scientists plus however many product and data analytics folks that were using the data as well.

Aurimas: Do I understand correctly that you had 18 people in the various platform teams for 25 data scientists? You said there were six people on each team.

Mikiko Bazeley: The third team was spread out across several projects – monitoring was the most recent one. They didn’t get involved with the ML platform initiatives until around three months before I left Mailchimp.

Prior to that, they were working on data integrations, which meant they were much more closely aligned with the efforts on the analytics and engineering side – these were totally different from the data science side.

I think that they hired more data scientists recently. They’ve also hired more platform engineering folks. And I think what they’re trying to do is to align Mailchimp more closely with Intuit, Quickbooks in particular. They’re also trying to continuously build out more ML capabilities, which is super important in terms of Mailchimp’s and Intuit’s long-term strategic vision.

Piotr: And Miki, do you remember how many ML models you had in production when you worked there?

Mikiko Bazeley: I think the minimum was 25 to 30. But they were definitely building out a lot more. And some of those models were actually ensemble models, ensemble pipelines. It was a pretty significant amount.

The hardest part that my team was solving for, and that I was working on, was crossing the chasm between experimentation and production. With a lot of stuff that we worked on while I was there, including optimizing the templating project, we were able to significantly cut down the effort to set up projects and the development environment.

I wouldn’t be surprised if they’ve, I don’t wanna say doubled that number, but at least significantly increased the number of models in production.

Piotr: Do you remember how long it typically took to go from an idea to solve a problem using machine learning to having a machine learning model in production? What was the median or average time?

Mikiko Bazeley: I don’t like the idea of measuring from idea, because there are a lot of things that can happen on the product side. But assuming everything went well with the product side and they didn’t change their minds, and assuming the data scientists weren’t super overloaded, it might still take them a few months. Largely this was due to doing things like validating logic – that was a big one – and getting product buy-in.

Piotr: Validating logic? What would that be?

Mikiko Bazeley: For example, validating the data set. By validating, I don’t mean quality. I mean semantic understanding, creating a bunch of different models, creating different features, sharing that model with the product team and with the other data science folks, making sure that we had the right architecture to support it. And then, for example, things like making sure that our Docker images supported GPUs if a model needed that. It would take at least a couple of months.

Piotr: I was about to ask about the key factors. What took the most time?

Mikiko Bazeley: Initially, it was struggling with the end-to-end experience. It was a bit rough to have different teams. That was the feedback that I had collected when I first got there.

Essentially, data scientists would go to the development and training environment team, and then they would go to serving and deployment and would then have to work with a different team. One piece of feedback was: “Hey, we have to jump through all these different hoops and it’s not a super unified experience.”

The other part we struggled with was the strategic roadmap. For example, when I got there, different people were working on completely different projects and sometimes it wasn’t even visible what these projects were. Sometimes, a project was less about “How useful is it for the data scientists?” but more like “Did the engineer on that project want to work on it?” or “Was it their pet project?” There were a bunch of those.

The tech lead there, Emily Curtin, is super awesome, by the way. She’s done some awesome talks about how to enable data scientists with GPUs, and working with her was fantastic. My manager at the time, Nadia Morris, is still there as well. By the time I left, between the three of us and the work of a few other folks, we were able to get better alignment on the roadmap and start steering all the efforts towards providing that more unified experience.

For example, there were other practices too, where some of these engineers who had their pet projects would build something over a period of two or three nights and then ship it to the data scientists without any testing, without any whatever, and they’d be like, “Oh yeah, data scientists, you have to use this.”

Piotr: It is called passion *laughs*

Mikiko Bazeley: It’s like, “Wait, why didn’t you first have us set up a period of internal testing?” And then, you know, now we need to help the data scientists because they’re having all these problems with these pet project tools.

We could have buttoned it up. We could have made sure it was free of bugs. And then, we could have set it up like an actual enablement process where we create some tutorials or write-ups or we host office hours where we show it off.

A lot of times, the data scientists would look at it and they’d be like, “Yeah, we’re not using this, we’re just going to keep doing the thing we’re doing because even if it’s suboptimal, at least it’s not broken.”

Golden paths at Mailchimp

Aurimas: Was there any case where something was created inside of a stream-aligned team that was so good that you decided to pull it into the platform as a capability?

Mikiko Bazeley: That’s a pretty good question. I don’t think so, but a lot of times the data scientists, especially the senior ones who were really good, would go out and try tools, and then they would come back to the team and say, “Hey, this looks really interesting.” I think that’s pretty much what happened when they were looking at WhyLabs, for example.

There were a few others, but for the most part we were building a platform to make everyone’s lives easier. Sometimes that meant sacrificing a little bit of newness, and I think this is where platform teams sometimes get it wrong.

Spotify had a blog post about this, about golden paths, right? They had a golden path, a silver path, and a bronze path or a copper path or something.

The golden path was supported best. “If you have any issues with this, this is what we support, this is what we maintain. If you have any issues with this, we will prioritize that bug, we will fix it.” And it will work for like 85% of use cases, 85 to 90%.

The silver path includes elements of the golden path, but there are some things that are not really or directly supported, but we are consulted and informed on. If we think we can pull it into the golden path, then we will, but there have to be enough use cases for it.

At that point, it becomes a conversation about “where do we spend engineering resources?” Because, for example, there are some projects like Creative Studio, right? It is super innovative. It was also very hard to support. But Mailchimp said, “Hey, we need to offer this, we need to use generative AI to help streamline our product offering for our users.” Then it becomes a conversation of, “Hey, how much of our engineers’ time can we open up or free up to do work on this system?”

And even then, with those sets of projects, there’s not as much difference in terms of the infrastructure support that’s needed as people would think. I think especially with generative AI and LLMs, the biggest infrastructure and operational impact is latency. That’s a huge one. The second part is data privacy, which is a really, really big one. And then the third is the monitoring and evaluation piece. But for a lot of the other stuff… Upstream, it would still line up with, for example, an NLP-based recommendation system. That’s not really going to significantly change as long as you have the right providers meeting the right needs.

So we had a golden path, but you could also have some silver paths. And then you had people that would kind of just go and do their own thing. We definitely had that. We had the cowboys and cowgirls and cow people – they would go offroad.

At that point, you can say, “You can do that, but it’s not going to be one of the official models in production,” right? And you try your best. But I think when you see that, you also have to look at it as a platform team and ask: is it because of this person’s personality that they’re doing that, or is it truly because there’s a friction point in our tooling? And if you only have one or two people out of 25 doing it, it’s like, “Eh, it’s probably the person.” It’s probably not the platform.
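The path tiers and the "one or two out of 25" rule of thumb described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component names, the `classify_project` rules, and the 10% threshold are hypothetical and are not taken from Mailchimp's or Spotify's actual definitions.

```python
from enum import Enum
from typing import Iterable


class SupportTier(Enum):
    GOLDEN = "golden"    # fully supported; bugs are prioritized and fixed
    SILVER = "silver"    # partly supported; platform team consulted and informed
    OFFROAD = "offroad"  # not supported; not an official production model


# Hypothetical set of components the platform team fully supports.
GOLDEN_COMPONENTS = frozenset(
    {"project_template", "training_image", "batch_serving", "live_serving"}
)


def classify_project(components: Iterable[str], platform_consulted: bool) -> SupportTier:
    """Assign a support tier based on which components a project uses."""
    used = set(components)
    if used and used <= GOLDEN_COMPONENTS:
        return SupportTier.GOLDEN
    # Silver: builds on some golden-path pieces and the platform team is in the loop.
    if used & GOLDEN_COMPONENTS and platform_consulted:
        return SupportTier.SILVER
    return SupportTier.OFFROAD


def offroad_suggests_tooling_friction(
    offroad_count: int, team_size: int, threshold: float = 0.1
) -> bool:
    """Return True when enough people go off-road to suspect the platform itself.

    The threshold is an assumed cutoff, not a figure from the interview.
    """
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    return offroad_count / team_size > threshold
```

With the assumed 10% cutoff, two off-road data scientists out of 25 would not flag a tooling problem, while five out of 25 would, which mirrors the judgment call described in the conversation.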

Piotr: And it sounds like a situation where your education comes into the picture!

Closing remarks

Aurimas: We’re actually already 19 minutes past our agreed time. So before closing the episode, maybe you have some thoughts that you want to leave our listeners with? Maybe you want to say where they can find you online.

Mikiko Bazeley: Yeah, sure. So folks can find me on LinkedIn and Twitter. I have a Substack that I’ve been neglecting, but I’m gonna be revitalizing that. So folks can find me on Substack. I also have a YouTube channel that I’m also revitalizing, so people can find me there.

In terms of other last thoughts, I know that there are a lot of people that have a lot of anxiety and excitement about all the new things that have been going on in the last six months. Some people are worried about their jobs.

Piotr: You mean foundation models?

Mikiko Bazeley: Yeah, foundation models, but there’s also a lot going on in the ML space. My advice to people would be, one, that all the boring ML and data infrastructure and knowledge is more important than ever. It’s always great to have a strong skill set in data modeling, in coding, in testing, and in best practices. That will never be devalued.

The second word of advice is that people, regardless of whatever title they have or want to have, should focus on getting their hands on projects, understanding the adjacent areas, and, yeah, learning to speak business.

If I have to be really honest, I’m not the best engineer or data scientist out there. I’m fully aware of my weaknesses and strengths, but the reason I was able to make so many pivots in my career, and the reason I was able to get as far as I did, is largely that I try to understand the domain and the teams I work with, especially the revenue centers, or profit centers, as people call them. That is super important. It’s a skill, a people skill and a body of knowledge that people should pick up.

And people should share their learnings on social media. It’ll get you jobs and sponsorships.

Aurimas: Thank you for your thoughts and thank you for dedicating your time to speak with us. It was really amazing. And thank you to everyone who has listened. See you in the next episode!