
ML from Research to Production – Challenges, Best Practices and Tools [Guide]

11th August, 2023

Taking machine learning or AI into production takes a lot of patience, effort, and resources. AI models are great for predicting all sorts of things, from what movie you’ll like to whether your cat will scratch the furniture. But in most cases, AI models have a hard time making it into production. 

In this article, we’ll discuss why it’s hard to get models to production, how you can take your machine learning experiments from research to production, and things to consider after your model is deployed to production. 

From research to production – core steps | Source: Author

Issues ML models face in production

There’s a large number of different platforms to make your model available for use. Each platform presents unique challenges, from the language it’s written in to the deployment and distribution models it uses. Because of these challenges, a good idea can get dropped at this phase.

AI models work by finding patterns and relationships in data that humans wouldn’t notice. A model can be accurate on average and still be imprecise on individual predictions, which can be dangerous when your business relies on a product with exact measurements or tolerances. For these models to be used in real-world applications, their predictions need to be both accurate and precise.

Training and deploying machine learning models is a major challenge for any enterprise, business, or predictive analytics company. There are several reasons why getting an ML model into production can be difficult, from the type of data available to the integration workarounds required. Let’s explore a few reasons that might impact model performance. 

Poor outlier handling

Outlier handling is the process of detecting extreme values in a dataset and then removing, capping, or transforming them. It can be applied at different scales (to a single feature or to whole records) to yield a more accurate representation of the data. How much it matters depends on the model: linear regression, for example, is very sensitive to outliers. This step should take place before model training.
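As a minimal sketch of one common way to detect outliers, the snippet below uses the interquartile range (IQR) rule. The article doesn’t prescribe a specific method, so the rule, the multiplier k=1.5, and the sample values are all assumptions made for illustration.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the interquartile range rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(values, k=1.5):
    """Split values into (kept, outliers) based on the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    kept = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return kept, outliers

# Hypothetical measurements with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 95]
kept, outliers = split_outliers(data)
print("kept:", kept)
print("outliers:", outliers)
```

Whether a flagged value should be dropped, capped, or kept is a judgment call that depends on the domain; the rule only points at candidates.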

Monitoring in production

Any model in production needs to be monitored regularly. Monitoring and managing machine learning models is a key part of the workflow, as well as keeping records of all the datasets, inputs, and predictions. With time many things change, so it’s important to monitor our models. 

Say your model was performing well, but with time the performance got worse. The incoming data may have changed since training, so the model no longer fits it. Or perhaps there’s a serious issue with the model that needs to be fixed. Whatever the case, you’ll need a way of checking whether or not you should re-optimize. It’s best to monitor and update models regularly.
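One simple way to check whether a model needs attention is to track accuracy over a sliding window of recent predictions. The sketch below is an assumption-laden illustration: the class name `RollingAccuracyMonitor`, the window size, and the 0.8 threshold are hypothetical, and it presumes the true labels eventually become available.

```python
from collections import deque

class RollingAccuracyMonitor:
    """Track accuracy over the most recent predictions and flag drops."""

    def __init__(self, window_size=100, threshold=0.8):
        self.window = deque(maxlen=window_size)  # old results fall off automatically
        self.threshold = threshold

    def record(self, prediction, actual):
        self.window.append(prediction == actual)

    def accuracy(self):
        if not self.window:
            return None
        return sum(self.window) / len(self.window)

    def needs_review(self):
        acc = self.accuracy()
        return acc is not None and acc < self.threshold

monitor = RollingAccuracyMonitor(window_size=5, threshold=0.8)
for pred, actual in [(1, 1), (0, 0), (1, 1), (1, 1), (0, 0)]:
    monitor.record(pred, actual)
healthy = monitor.needs_review()   # all five recent predictions were correct

for pred, actual in [(1, 0), (0, 1), (1, 0)]:
    monitor.record(pred, actual)
degraded = monitor.needs_review()  # the window now holds 2 correct and 3 wrong
print(healthy, degraded)
```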

Read also

A Comprehensive Guide On How to Monitor Your Models in Production

Let’s look at a few of the best tools for ML model monitoring:

Neptune
Neptune is a lightweight experiment management and collaboration tool. It is very flexible and works with many other frameworks.
Pricing: free for individuals, paid for teams.
Key features:
  • Fast and beautiful UI
  • Experiment tracking and visualization capabilities
  • Store and organize your work efficiently

Qualdo
Simple and easy continuous monitoring of ML model performance in Azure, Google & AWS. Bulk monitor machine learning or analytics models in Qualdo.
Pricing: free up to 10 GB, $61/mo after that.
Key features:
  • Data quality tools & metrics for all stakeholders
  • Eliminate high-stakes data issues in minutes

Fiddler
Fiddler is a model monitoring tool that has a user-friendly, clear, and simple interface.
Pricing: free trial available when you get started; contact them for pricing.
Key features:
  • Performance monitoring
  • Tracking outliers
  • Service metrics and alerts

→ To read more about monitoring, check the Complete Guide to Monitoring ML Experiments Live in Neptune.

Bias and variance

Imagine you’re building a predictive model to estimate the annual operating costs of a car. You gather data from automobiles of similar make and model for which this cost is already known. You notice that the values are all over the map: some cars cost $2,000 a year to run while others come in as high as $8,000. To try and capture the variability of the data, you bring in more data from other sources and retrain your model, but the new values are again all over the map. This spread is what makes the estimates unreliable.

The variance of the values in the dataset is so high that you can’t draw any firm statistical conclusions from it. The reason is that you don’t have enough data points from similar vehicles to pin down the actual annual operating costs of these automobiles.

Bias-variance trade-off: when a model has low bias and high variance, chances are it has overfit the training data. Linear algorithms tend to have high bias but low variance, while nonlinear algorithms tend to have low bias but high variance. As you make a model more flexible, variance increases and bias decreases; as you make it simpler, bias increases and variance decreases. The trade-off is the tension between these two sources of error.
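The trade-off can be written down as the standard decomposition of expected squared error. The notation is an assumption introduced here, not the article’s: the target is modeled as y = f(x) + ε with zero-mean noise of variance σ², and the expectation is taken over training sets used to fit the estimator f̂.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why lowering one of them often raises the other.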

Class imbalance

In machine learning, a class imbalance problem is when instances of one class are more common than instances of another. While this is extremely common in many areas (for example, dogs outnumber cats by about 3 to 1), it can be quite problematic. In order to build effective machine learning algorithms, you need models that can perform well on either type of instance.

Class imbalance often shows up when the algorithm you’re training is to be used on a dataset from a new domain. For example, let’s say you’re building a spam detection method and your data comprises thousands of emails, 85% of which are spam and only 15% non-spam.

One way to solve this is to collect more data from the new domain, but that could prove costly. Instead, you could take existing data from a similar domain with less class imbalance and train your algorithm on this smaller set before using it on the original, larger set with imbalanced classes.
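A different and cheaper technique than the one described above is to reweight the classes so that mistakes on the rare class count for more during training. The sketch below computes inverse-frequency weights; the function name `class_weights` and the "ham" label are hypothetical, and the label counts simply mirror the 85% spam example.

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical dataset: 85 spam emails and 15 non-spam ("ham") emails.
labels = ["spam"] * 85 + ["ham"] * 15
weights = class_weights(labels)
print(weights)
```

The rare class ends up with the larger weight, which many training procedures can consume as a per-class or per-sample weight.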




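Research vs. production:

  • Better than simpler models
  • Fast training (research) vs. fast inference (production)
  • Static, clean data (research) vs. constantly shifting data with missing values and labels (production)
  • Good to have

From ML research to production: challenges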

→ Read more about MLOps in Best MLOps Platforms to Manage Machine Learning Lifecycle

Getting ready for production

Let’s look at the steps that take place from research to production. They’re common steps, but they might change depending on your machine learning model or application. We’ll discuss ways to improve and optimize your model. 

  • Research and review 

The first key step is deep research about everything: your product, model, algorithms, tools, and everything else. These things change from time to time with new technology or approaches, so you have to stay up to date. Start by researching different approaches to the problem. Research often involves reading research papers or articles, watching presentations or videos, or playing with tools and code. All of this is part of the research process.

Talk to researchers from the same or different fields. This will give you a good amount of knowledge on how things work and what can go wrong. It’ll also help you learn about resources and tools.

There are thousands of research papers published around the world, but few ever make it before a global audience. Below are a few key research papers in AI:

There are many more great research papers; I added a few more at the end of this article.

  • Data 

Once you’ve completed the research for your experiments, it’s time for the step that will define how your model will work: collecting data from all the sources available. In general, the more good-quality data, the better. There are two types of data: structured and unstructured. Structured data includes dates, numbers, and other values that fit into tables. Unstructured data includes images, text, videos, and similar files, which are often large.

There are many different ways to collect data for machine learning experiments. Consider the problem of image classification. In order to classify images, first, you need to have a lot of images. Collecting them manually is very time-consuming and it would take quite a while to classify thousands of images manually. Machine learning algorithms can be used to quickly process these thousands of images and save time. 

  • Exploratory Data Analysis

When working with machine learning applications, it’s important to understand your data. Exploratory Data Analysis (EDA) is a process of identifying patterns, anomalies, and outliers in data sets. It’s done by exploring large amounts of data, visualizing the patterns to find trends, discovering unexpected results or irregularities, and making sense of them.

Many important decisions in business are made on the basis of exploratory data analysis. The technique helps you form hypotheses and decide what to test them against, which can help prove or disprove a theory, whether it’s marketing research or estimating the cost of production materials. This process involves looking for causes behind effects and understanding relationships between variables based on the groupings they’re assigned to. EDA is most useful when many different variables drive the results and you have no clear idea of what the results or patterns should look like. It can even be applied to a complex logistical problem by simply placing data points on a map along with the attributes that make up those points.

Exploratory Data Analysis types:

EDAs are typically graphical or non-graphical (quantitative). 

  1. Univariate non-graphical – data is analyzed one variable at a time, using summary statistics. It’s the simplest of the EDA types. The aim is to find and describe patterns in that variable.
  2. Univariate graphical – graphical methods give a fuller picture of the data than non-graphical ones. The three common plots are histograms, stem-and-leaf plots, and box plots. Histograms depict the number of cases in each range of values. Stem-and-leaf plots show the shape of the distribution together with the data values. Box plots summarize the minimum, first quartile, median, third quartile, and maximum.
  3. Multivariate non-graphical – this method describes the relation between two or more variables, for example with cross-tabulation or correlation.
  4. Multivariate graphical – this is a graphical representation of the relation between two or more variables. Grouped bar plots and scatter plots are common choices.
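The non-graphical types in the list above can be sketched with the standard library alone. The snippet is only an illustration: the `area` and `price` values are invented, and a five-number summary plus Pearson correlation are just two of many possible statistics.

```python
import statistics

def five_number_summary(values):
    """Univariate non-graphical EDA: min, quartiles, and max of one variable."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}

def pearson(xs, ys):
    """Multivariate non-graphical EDA: linear correlation between two variables."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data: price is exactly twice the area, so correlation is 1.
area = [50, 60, 70, 80, 90]
price = [100, 120, 140, 160, 180]
summary = five_number_summary(area)
corr = pearson(area, price)
print(summary)
print(corr)
```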

Exploratory Data Analysis tools:

Python and R are the most used languages in AI and data science applications. There are other tools you can use if you have less coding experience. Trifacta, Excel, RapidMiner, Rattle GUI, and QlikView are a few good non-programming tools for EDA.

  • Feature engineering and model selection

Feature engineering is a method in machine learning that uses existing data to generate new variables that aren’t present in the raw training set. Once you’ve prepared your data, this process is the next step. It can generate new features for supervised and unsupervised learning, typically with the goal of simplifying and accelerating data transformations and improving model accuracy. Feature engineering is crucial when working with machine learning models: a bad feature will directly affect your model, irrespective of the data or architecture. In essence, it’s the process of transforming raw observations into desired features with the help of statistical or machine learning techniques. Well-chosen features give the model more flexibility, but every added feature can also increase variance, so check each one against validation performance.

Check also

The Best Feature Engineering Tools

Now, let’s take a simple example. Below is the price of houses in a specific area. It shows the area and total price of the house. 

Research to production - example 1

When you’re working with data, chances are that your data will have issues. The data might be coming from the internet or various other sources, and be filled with errors. So, after you’ve collected your data, a useful first step is to create a new column that shows the price per square foot.

Research to production - example 2

Once you’ve created the new column, you can apply domain knowledge. For example, you can consult a real estate agent to confirm the prices per square foot. If the agent says that the price per square foot cannot be below 3500, then any row under that value points to an issue. You can visualize the data to see it better.

Research to production - example 3

When you plot the data, you can see above that one particular price is quite different from the others. Errors like this are easy to spot in a visualization. You can also use math and statistics to examine your data.
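The same check can be done in code. In the sketch below, the listings are invented for illustration; only the 3500 threshold comes from the example above.

```python
# Hypothetical listings; the figures are made up for illustration.
houses = [
    {"area_sqft": 1000, "price": 4_000_000},
    {"area_sqft": 1500, "price": 6_300_000},
    {"area_sqft": 1200, "price": 600_000},   # suspiciously cheap
    {"area_sqft": 2000, "price": 9_000_000},
]

MIN_PRICE_PER_SQFT = 3500  # lower bound suggested by the domain expert

# Feature engineering: derive a new column from two existing ones.
for house in houses:
    house["price_per_sqft"] = house["price"] / house["area_sqft"]

# Domain check: flag rows that fall below the plausible minimum.
suspicious = [h for h in houses if h["price_per_sqft"] < MIN_PRICE_PER_SQFT]
print(suspicious)
```

A flagged row is a candidate for review, not automatically an error; a missing digit in the price is one possible explanation among several.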

Model selection is a method that can be applied across different types of models, or across models of the same type with different hyperparameters. In simple terms, it’s the process of choosing one candidate as the final model for the problem. There are many things you need to consider while selecting models, and many different techniques for model selection.

→ Read more about model evaluation and selection in The Ultimate Guide to Evaluation and Selection of Models in Machine Learning

  • Model development

Model development is often misunderstood because people think this step takes the most time. Usually, though, most of the time is spent on cleaning, preparing, and exploring data. In this process, you work with training, validation, and test sets. Why use three sets instead of just training the model and testing it? Developing a model requires tuning its configuration, and that tuning is guided by feedback from the validation set. In simple terms, this feedback is itself a form of learning, so a separate test set is kept aside to measure performance on truly unseen data. Our main goal is to get accurate outputs on unseen data.
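A three-way split can be sketched as follows. The function name `train_val_test_split`, the 70/15/15 proportions, and the fixed seed are assumptions chosen for illustration, not recommendations from the article.

```python
import random

def train_val_test_split(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the items and cut them into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The validation set guides configuration choices, and the test set is touched only at the end, so its score stays an honest estimate.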

The goal is to give the models as much knowledge as possible about the attributes of the objects in their domain and use that knowledge in order to make accurate predictions. A number of these approaches are used across different fields, including natural language processing, data mining, user-interface design, computer vision, and many more. 

When training models, it’s important to monitor the relation between optimization and generalization. Optimization is the process of calibrating a model to get the best outcome on the training dataset. Generalization describes how well a model performs on unseen data. Your model might perform poorly at some point because of overfitting or underfitting: generalization stops improving and the model becomes less accurate on new data. Adding more data is often the most reliable way to reduce overfitting, while underfitting usually calls for a more flexible model or better features.

A few things to keep in mind while building models for production

Generalization: This process shows how our model performs on new (unseen) data. The ultimate goal is to get the best generalization strength. It’s better to spend more time on preparing a good validation environment than working months on building models and failing.

Performance: There are three things to check when judging a model. The first is a good cross-validation score, which is used for comparing and selecting a model for a specific predictive problem. K-fold cross-validation estimates model performance on unseen data: you split the data into k separate groups, hold one group out as the testing set, train on the remaining groups, and score the model on the held-out group, repeating until every group has served as the testing set once. The second is the production score: when the model is in beta, you can monitor its performance on live data. The third is interpretability, which makes sure the model is not hard to explain.
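To make the k-fold procedure concrete, here is a minimal sketch that cross-validates a deliberately trivial model, one that always predicts the mean of its training targets. The helper names and the data are hypothetical, and the folds are taken in order without shuffling, which real datasets usually need first.

```python
import statistics

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_val_mae(ys, k):
    """Mean absolute error per fold for a model that predicts the training mean."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        prediction = statistics.fmean([ys[i] for i in train_idx])
        errors = [abs(ys[i] - prediction) for i in test_idx]
        scores.append(statistics.fmean(errors))
    return scores

ys = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 6.0, 8.0, 7.0, 9.0]
scores = cross_val_mae(ys, k=5)
print("per-fold MAE:", scores)
print("mean MAE:", statistics.fmean(scores))
```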

Iterative process: Machine learning is a long process that involves collecting, cleaning, preparing, analyzing data, fitting models, getting outputs, monitoring, modifying, and a lot more. So, don’t think it will be a one-shot process of getting models to production. 

Regularization: It’s the process of constraining a model during training so that it doesn’t overfit the training data, which reduces error on unseen data. There are a few regularization methods:

  • One commonly used way is to reduce the model size by reducing the number of parameters in the model. You can try different sets of parameters and compare performance.
  • L1 regularization, known from Lasso (least absolute shrinkage and selection operator) regression. It adds a penalty proportional to the sum of the absolute values of the weight coefficients.
  • L2 regularization, known from Ridge regression. It adds a penalty proportional to the sum of the squared weight coefficients.
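The L1 and L2 penalties from the list above can be written in a few lines. This is a sketch, not a full training loop: `lam` (the regularization strength), the example weights, and the `data_loss` value are hypothetical.

```python
def l1_penalty(weights, lam):
    """Lasso-style penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge-style penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Add the chosen penalty to the loss measured on the training data."""
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return data_loss + penalty

weights = [0.5, -2.0, 0.0, 3.0]
print(l1_penalty(weights, lam=0.1))
print(l2_penalty(weights, lam=0.1))
print(regularized_loss(2.0, weights, lam=0.1, kind="l1"))
```

Because the L2 penalty squares each weight, large weights are punished much more heavily than small ones, whereas the L1 penalty grows linearly and tends to push some weights to exactly zero.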

Benchmark Model: A Benchmark Model is the most easy-to-use, reliable, transparent, and interpretable model that you can compare your model with. It’s best practice to check if your new machine learning model performs better than a known benchmark in test datasets.

A benchmark model is easy to implement and doesn’t take much time. Use any standard algorithm to build a suitable benchmark, and then compare its results with your model’s predictions. If a simple regression on the same features already performs about as well as your model, that is a sign of possible problems with the more complex algorithm.
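A benchmark comparison can be as small as the sketch below, which uses a predict-the-mean baseline. All numbers are invented, and `model_preds` stands in for the output of whatever model you are evaluating.

```python
import statistics

def mean_absolute_error(actual, predicted):
    return statistics.fmean([abs(a - p) for a, p in zip(actual, predicted)])

# Hypothetical targets for the training and test sets.
train_y = [10.0, 12.0, 14.0, 16.0]
test_y = [11.0, 13.0, 15.0]

# Benchmark: always predict the mean of the training targets.
baseline_value = statistics.fmean(train_y)
baseline_preds = [baseline_value] * len(test_y)

# Hypothetical predictions from the model under evaluation.
model_preds = [11.5, 12.5, 15.5]

baseline_mae = mean_absolute_error(test_y, baseline_preds)
model_mae = mean_absolute_error(test_y, model_preds)
beats_baseline = model_mae < baseline_mae
print(baseline_mae, model_mae, beats_baseline)
```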

Hyperparameter tuning: Hyperparameter tuning is about finding good values for hyperparameters, the settings that control the behavior of a learning algorithm and are fixed before training. Examples are the learning rate (alpha) in gradient descent optimization or a complexity parameter (m) of the model.

A common approach is to select values using cross-validation, so that you choose what works best on unseen data. Each candidate setting is evaluated by training the model with it and scoring the result. Often this task is carried out manually by trial and error. There are also plenty of systematic ways to tune hyperparameters, such as grid search, random search, and Bayesian optimization, alongside a simple educated guess.
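Random search is one of the simplest methods to automate. In the sketch below, `validation_score` is a made-up stand-in for "train a model with these hyperparameters and score it on validation data", and the parameter names and ranges are assumptions.

```python
import random

def validation_score(learning_rate, depth):
    """Toy stand-in for a real train-and-validate step.

    It peaks at learning_rate=0.1 and depth=5, where it returns 1.0.
    """
    return 1.0 - (learning_rate - 0.1) ** 2 * 10 - (depth - 5) ** 2 * 0.01

def random_search(n_trials, seed=0):
    """Sample hyperparameters at random and keep the best-scoring set."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "depth": rng.randint(1, 10),
        }
        score = validation_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search(n_trials=50)
print(best_params, best_score)
```

Sampling the learning rate on a log scale is a common choice because useful values often span several orders of magnitude.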

Model Optimization: Once you’ve completed training and got the desired outcome, it’s time to make the model better at predicting. Machine learning optimization is the process of improving a model using mathematical principles: you define an objective, such as the error on your data, and adjust the model to minimize it. There are many algorithms available for various tasks, but the complexity can make ML optimization difficult. It draws on mathematics and statistics as well as relevant domain knowledge. We’ll see a few ways to optimize machine learning models.

Gradient Descent: Gradient descent is a technique for optimizing the parameters of a model by minimizing a loss function. It moves each parameter towards its optimal value in small steps, in the direction in which the loss decreases fastest (the negative of the gradient). It’s useful when the loss is differentiable and you have enough data to measure how good the current parameter values are. Let’s take a simple example, two people having a conversation:

Smit: “Roy, how much did you score in the math test?”
Roy: “Guess!”
Smit: “80%? ”
Roy: “I’m not that good at math.“
Smit: “60%?”
Roy: “No, that’s too low.”
Smit: “Is it around 70%?”
Roy: “Yes, very close!”

This is how gradient descent works: you start with a random guess and move step by step towards the correct value, using feedback on how far off you are. When fitting a line, for example, you can use gradient descent to solve for both the intercept and the slope.

  • It’s effective and stable,
  • It’s quick to set up and run,
  • It might get stuck if there are many local minima.
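Here is a minimal sketch of gradient descent fitting a line’s slope and intercept by minimizing mean squared error. The data, learning rate, and step count are assumptions, and the starting guess is zero rather than random so that the result is repeatable.

```python
def fit_line(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0  # initial guess
    n = len(xs)
    for _ in range(n_steps):
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = 2 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2 / n * sum(residuals)
        # Step against the gradient, scaled by the learning rate.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

If the learning rate is set too high, the updates overshoot and the values diverge instead of settling, which is why it is usually treated as a hyperparameter to tune.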

Genetic algorithms: A genetic algorithm is a search optimization method. The technique mimics the natural selection process that occurs in nature, where better solutions to a given problem are more likely to survive and reproduce than worse solutions. For example, out of many candidate models, you keep the ones with the highest accuracy.

  • You can find a good outcome in a short duration, 
  • It provides a good number of solutions,
  • You can’t be sure if the outcome is optimal or not. 
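A toy genetic algorithm, sketched under several assumptions: each candidate is a single number, the `fitness` function is invented with its peak at x = 3, and the selection, crossover, and mutation rules are among the simplest possible variants.

```python
import random

def fitness(x):
    """Toy objective with a single peak at x = 3."""
    return -(x - 3.0) ** 2

def genetic_search(pop_size=20, generations=50, mutation_scale=0.5, seed=0):
    rng = random.Random(seed)
    population = [rng.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the fitter half of the population survives as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Crossover and mutation: average two parents, then add Gaussian noise.
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append((a + b) / 2 + rng.gauss(0, mutation_scale))
        population = parents + children
    return max(population, key=fitness)

best = genetic_search()
print(best, fitness(best))
```

Because the parents are carried over unchanged, the best candidate never gets worse from one generation to the next, but nothing guarantees that the final answer is the true optimum.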

Exhaustive search: It’s a method to search for optimal hyperparameters by checking every option. For example, if you forgot your email or phone password, you might try every possible option. In machine learning, the same thing is done over a very large set of candidates.

  • This process checks all the possible options,
  • It might make things very slow,
  • It can be more effective when the sets are small. 
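An exhaustive (grid) search can be sketched with `itertools.product`. As before, `validation_score` is a hypothetical stand-in for training and validating a real model, and the grid values are assumptions.

```python
from itertools import product

def validation_score(learning_rate, depth):
    """Toy stand-in for a real train-and-validate step."""
    return 1.0 - (learning_rate - 0.1) ** 2 * 10 - (depth - 5) ** 2 * 0.01

def grid_search(grid, score_fn):
    """Score every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    n_evaluated = 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        n_evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [3, 5, 7]}
best_params, best_score, n_evaluated = grid_search(grid, validation_score)
print(best_params, best_score, n_evaluated)
```

The number of evaluations is the product of the list lengths (3 × 3 = 9 here), which is why this approach slows down quickly as more hyperparameters or values are added.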

 Final step: Now that everything is completed, it’s time for end-to-end testing and final training of the model. You have to test many implementations to make sure everything is going as planned. Once everything is going well, you go to production!

Things to consider after making it to production

Your experiments might be up and running, but it’s not the end of your work. You’ll have to keep working on production architectures. You will fill the gaps in datasets, monitor the workflow, and update the system regularly. 

Program reusability: Working with a repeatable, reusable program for your data preparation and training phases makes things robust and easy to scale. Notebooks are generally not easy to manage, so it might be a good option to move the code into Python files, which tends to improve the quality of the work. The key to developing a repeatable pipeline is to treat your machine learning environment as code. This way, your entire end-to-end pipeline can be executed upon significant events.

Data and monitoring: To keep your ML experiment running continuously, you have to stay focused on the data as well as the environment. If the input data changes, you may run into issues with model accuracy. Monitoring and managing machine learning models is an important part of the workflow, including keeping records of all the datasets, inputs, and predictions. Sometimes the incoming data and the labels attached to it change over time.
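A very rough way to notice that input data has changed is to compare a feature’s live mean against its training-time mean, measured in training standard deviations. This is only a sketch: the function names, the sample values, and the threshold of 2.0 are assumptions, and real drift detection usually looks at more than the mean.

```python
import statistics

def mean_shift_in_std(reference, live):
    """How far the live mean sits from the reference mean, in reference std units."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.fmean(live) - ref_mean) / ref_std

def has_drifted(reference, live, threshold=2.0):
    return mean_shift_in_std(reference, live) > threshold

# Hypothetical feature values seen at training time and in production.
reference = [20.0, 21.0, 19.0, 20.5, 19.5, 20.0, 21.5, 18.5]
stable = [20.2, 19.8, 20.4, 19.9]
shifted = [26.0, 27.5, 25.0, 26.5]

print(has_drifted(reference, stable))
print(has_drifted(reference, shifted))
```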

Automation: An often overlooked part of machine learning is automation. Tasks that would otherwise take hours become simple with the help of an automated script or tool, which saves development time and lets you focus on important steps like model evaluation and feature engineering.

Governance: As you go to production, the number of developers and data scientists working on the product or service will increase, which helps distribute the work and speeds up delivery. Without proper governance in production, many issues can arise. You’ll have to create a hub where every member of the team is connected and has access to what they need. This makes things go smoothly and keeps the project easy to maintain. Managing and organizing your machine learning experiments is a task in itself.

How Neptune can help you bring your projects from research to production

Neptune is a metadata store for MLOps, developed for research and production teams. It gives us a central hub to log, store, display, organize, compare, and query all metadata generated during the machine learning lifecycle. Researchers and engineers use Neptune for experiment tracking and model registry to control their experimentation and model development. 

Metadata store
Metadata store is a connection between research and production stage

For research: 

  • Log and display all metadata types including parameters, model weights, images, HTML, audio, video, etc.,
  • Fast and beautiful UI with a lot of capabilities to organize runs in groups, save custom dashboard views and share them with the team,
  • Compare metrics and parameters in a table that automatically finds what changed between runs and what are the anomalies,
  • Automatically record the code, environment, parameters, model binaries, and evaluation metrics every time you run an experiment,
  • Your team can track experiments that are executed in scripts (Python, R, other), notebooks (local, Google Colab, AWS SageMaker) and do that on any infrastructure (cloud, laptop, cluster),
  • Extensive experiment tracking and visualization capabilities (resource consumption, scrolling through lists of images)

For production: 

  • Track Jupyter notebooks, 
  • Collaborate and supervise projects,
  • Train models in real-time,
  • Provides individuals and teams with notebook checkpointing and model registry to track model version and lineage,
  • Track thousands of runs,
  • Version, store, organize, query models, and much more.



Getting your models to production is difficult, but the more experience you have, the better you’ll get at it. A good amount of patience, resources, and commitment can make the journey to production amazing. I hope you liked this article, thanks for reading!
