ML Engineer vs Data Scientist

Sundeep Teki

11th September, 2023

In 2012, DJ Patil and Thomas Davenport famously proclaimed Data Scientist (DS) to be the “Sexiest Job of the 21st century” [1]. The progress in data science and machine learning over the last decade has been monumental. Data science has empowered global businesses and organizations with predictive intelligence and data-driven decision-making to the extent that it is no longer considered a fringe topic. It is now a mainstream profession, and data science professionals are in high demand across all kinds of organizations, from big tech companies to more traditional businesses.

A decade ago, the focus of data science was more on algorithmic development and modeling to extract robust insights from data. However, as data science has evolved over the decade, it has become clearer that it involves more than just modeling. The machine learning lifecycle, from raw data through to deployment, now relies on specialized experts including data engineers, data scientists, machine learning engineers, and MLOps engineers, along with product and business managers.

But where does the role of a data scientist end and that of a machine learning engineer begin?

The role of a machine learning engineer is gaining prominence across companies as they realize that the value of data science cannot be captured until a model is successfully deployed to production. Whilst tools and technologies such as cloud APIs, AutoML, and a number of Python-based libraries have made the job of a data scientist easier, the MLOps work of putting models into production and monitoring their performance is still quite unstructured.

For a detailed look at the respective skills, responsibilities, and tech stack of various profiles, ranging from a data scientist to a data science manager, refer to my previous article on how to build effective machine learning teams in the industry [2].

There are four core steps in executing a data science project:

Problem formulation – translating a business problem into a data science problem
Data engineering – preparing the data and pipelines to process raw data for modeling
Modeling – designing and experimenting with algorithms and models for the use case
Deployment – productionizing the model after testing and monitoring its performance.

Figure: Lifecycle of a data science project

In large tech companies and startups, there is a more established process for doing data science, and the work is clearly demarcated along these lines. It is therefore common for professionals across the various sub-domains to focus on their respective areas of specialization and collaborate with each other when required. However, in smaller organizations that do not have the luxury of a large data science team, the first few data science hires are expected to work across these distinct functions as “full-stack” data scientists.

Thus, the definition and scope of a data scientist vs. a machine learning engineer is very contextual and depends upon how mature the data science team is. For the remainder of the article, I will expand on the roles of a data scientist and a machine learning engineer as applicable in the context of a large and established data science team.

In this article, I will:

review and compare the evolving roles and responsibilities of a data scientist and a machine learning engineer in the machine learning industry;
discuss the scope of each role, similarities and differences, and how to ensure strong communication and collaboration between these two core profiles without which data science projects are bound to fail.

Differences between Data Scientist and Machine Learning Engineer

In this section, I will discuss the primary differences in skills, responsibilities, day-to-day tasks, and tech stack, among other things.

The chief responsibility of a data scientist is to develop solutions using machine learning or deep learning models for various business problems. It is not always necessary to create novel algorithms or models as these tasks are research-intensive and can take up considerable time. In most cases, it is sufficient to use existing algorithms or pre-trained models, and optimize them in the context of the problem statement. However, in more innovative and R&D-focused teams or companies, scientists may be required to produce novel research and model artifacts.

In contrast, the main goal of ML engineers is to take the models prepared by the data scientists to production. This involves multiple aspects, including optimizing the model to make it compatible with the deployment constraints and building MLOps infrastructure for experimentation, A/B testing, model management, containerization, deployment, and monitoring of model performance once deployed.

These factors translate into the underlying differences in skills, responsibilities, and tech stack for the respective roles as shown in the following tables.

Data Scientist	Machine Learning Engineer
Problem-solving	Programming
Programming	Data structures
Statistics	Data modeling
Data science	Software engineering
Machine learning: supervised & unsupervised	Machine learning frameworks
Data Analytics	Statistics
Data visualization	Conceptual knowledge of ML
Written and verbal communication skills
Presentation skills

Table 1. Skills of Data scientist vs. Machine learning engineer

Data Scientist	Machine Learning Engineer
Identify and validate business problems that can be solved with ML	Deploy ML and DL models to production
Analyze and visualize data at different stages of the ML lifecycle	Optimize models for better performance, latency, memory, and throughput
Develop custom algorithms and models	Test inference on a variety of hardware, including CPU, GPU, and edge devices
Identify additional datasets and generate synthetic data	Monitor model performance, maintenance, debugging
Develop data annotation strategies	Version control of models, experiments, and metadata
Coordinate with cross-functional stakeholders	Develop custom tools to optimize the entire deployment workflow
Develop custom tools to optimize the entire modeling workflow

Table 2. Responsibilities of Data scientist vs. Machine learning engineer

Data Scientist	Machine Learning Engineer
Python / R / SQL	Python / C++ / Scala
Jupyter, SageMaker, Google Colab notebooks	Linux, Bash
Git, GitHub / Bitbucket	Git, GitHub / Bitbucket
Cloud: AWS / Azure / GCP	Cloud: AWS / Azure / GCP
ML: scikit-learn, RAPIDS, fast.ai	DL: PyTorch, TensorFlow, JAX, MXNet
Spark	Docker, Kubernetes
Visualization: Matplotlib, Seaborn, Bokeh	Serving: TensorFlow Serving, TensorRT, TorchServe, ONNX
Metadata storage: neptune.ai, Comet.ml, Weights & Biases	Metadata storage: neptune.ai, Comet.ml, Weights & Biases

Table 3. Tech stack of Data scientist vs. Machine learning engineer

Similarities, interference & handover

Similarities between Data Scientist and ML Engineer

As evident from Tables 1-3, there is a partial overlap between the skills and responsibilities of data scientists and machine learning engineers. The tech stack is also quite similar and whilst data scientists are expected to mostly code in Python, machine learning engineers also need to know C++ for porting the model artifacts into a more efficient and faster format.

What machine learning engineers might lack in subject matter expertise compared to data scientists, they make up for in knowledge of engineering tools and frameworks like Kubernetes that data scientists are less familiar with.

Data scientists usually have a STEM background, often with advanced degrees like a Ph.D., in diverse fields such as biology, economics, physics, and mathematics. On the other hand, machine learning engineers generally have professional experience as software engineers.

Data scientists primarily deal with algorithmic and model development, whereas machine learning engineers focus on the scalable software engineering needed for model deployment and monitoring. The remaining tasks are often common to both profiles.

In a few cases, these tasks might be shared depending on the size and maturity of the data science team, and things might work smoothly. However, more often than not, particularly in larger teams and organizations, this can create considerable conflict and friction, especially when data scientists and machine learning engineers work in different teams and report to different managers.

The handover process

It is possible to draw a clear line between the respective mandates of data scientists and machine learning engineers. Typically, data scientists will develop one or more candidate machine learning models and hand these over to the machine learning engineers following a specific contract.

The contract should specify:

the model accuracy,
latency,
memory,
the number of parameters,
the machine learning or deep learning framework used,
model versions,
the model predictions,
and the ground-truth labels for the validation or test set, among other parameters.

A structured handover contract ensures that the machine learning engineers have all the information they need to work on model optimization, any further experimentation, and the deployment process. After the handover, the data scientists are free to focus on the next machine learning use cases to take to production.
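To make the idea of a handover contract more concrete, here is a minimal Python sketch of how the fields listed above might be captured as a single record. The class name `HandoverContract`, its field names, the units, and the `validate` checks are all hypothetical choices made for illustration; a real team would agree on its own schema and thresholds.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class HandoverContract:
    """Hypothetical record handed from a data scientist to an ML engineer."""

    model_version: str
    framework: str          # e.g. "pytorch" or "tensorflow"
    accuracy: float         # reported on the validation or test set, in [0, 1]
    latency_ms: float       # inference latency per request (assumed unit)
    memory_mb: float        # memory footprint (assumed unit)
    num_parameters: int
    predictions: List[int] = field(default_factory=list)
    ground_truth: List[int] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the contract looks consistent."""
        problems = []
        if not 0.0 <= self.accuracy <= 1.0:
            problems.append("accuracy must be between 0 and 1")
        if self.latency_ms <= 0 or self.memory_mb <= 0:
            problems.append("latency and memory must be positive")
        if len(self.predictions) != len(self.ground_truth):
            problems.append("predictions and ground truth differ in length")
        elif self.predictions:
            # Recompute accuracy from the shipped predictions and labels.
            correct = sum(p == t for p, t in zip(self.predictions, self.ground_truth))
            measured = correct / len(self.predictions)
            if abs(measured - self.accuracy) > 1e-6:
                problems.append("reported accuracy does not match predictions")
        return problems


contract = HandoverContract(
    model_version="1.0.0",
    framework="pytorch",
    accuracy=0.75,
    latency_ms=12.5,
    memory_mb=480.0,
    num_parameters=1_200_000,
    predictions=[1, 0, 1, 1],
    ground_truth=[1, 0, 0, 1],
)
print(contract.validate())
```

Shipping the predictions and labels alongside the reported accuracy lets the receiving engineer re-derive the headline number before starting any optimization work, which is one reason to keep both in the contract.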

The collaboration between data scientists and machine learning engineers continues post-deployment and becomes critical especially when the models break in production. As the data scientists have greater insight into the working of the model, they are better positioned to troubleshoot and fix the models.

At the same time, some model failures are related to cracks in the underlying infrastructure developed by machine learning engineers, which they are in the best position to resolve. Continuous refinement of the model based on live data received by the model via active learning also falls under the domain of data scientists.
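One simple way to notice that a deployed model has started to break is to compare its accuracy on recently labeled live traffic against the accuracy agreed at handover. The sketch below is only an illustration of that idea: the `AccuracyMonitor` class, the window size, and the tolerance are assumptions, and it presumes that ground-truth labels eventually become available for live predictions.

```python
from collections import deque
from typing import Optional


class AccuracyMonitor:
    """Hypothetical rolling-window check of live accuracy against a baseline."""

    def __init__(self, baseline_accuracy: float, tolerance: float = 0.05, window: int = 100):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)

    def record(self, prediction: int, label: int) -> None:
        """Store whether a live prediction matched its eventual label."""
        self.outcomes.append(prediction == label)

    def live_accuracy(self) -> Optional[float]:
        """Accuracy over the current window, or None if nothing was recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self) -> bool:
        """True when live accuracy falls below baseline minus tolerance."""
        accuracy = self.live_accuracy()
        return accuracy is not None and accuracy < self.baseline - self.tolerance


monitor = AccuracyMonitor(baseline_accuracy=0.9, tolerance=0.05, window=4)
for prediction, label in [(1, 1), (0, 0), (1, 1), (0, 0)]:
    monitor.record(prediction, label)
healthy = monitor.needs_attention()

for prediction, label in [(1, 0), (0, 1)]:
    monitor.record(prediction, label)
degraded = monitor.needs_attention()
print(healthy, degraded)
```

A flag from a check like this does not say whether the cause lies in the model or in the infrastructure; it only signals that data scientists and machine learning engineers should start investigating together.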

Communication & collaboration between Data Scientists and ML Engineers

The success of a data science team is contingent on strong collaboration across the varied profiles [2]. Data scientists and machine learning engineers collaborate continuously during model development, deployment, and post-deployment monitoring and refinement. Ideally, these two profiles ought to be part of the same team and report to the same leadership. In such a setting, collaboration becomes easier, and it fosters strong collegiality and mutual learning.

However, when data scientists and machine learning engineers are part of different teams and report to different leadership, the collaboration is not as strong as it should be. In such organizational settings, data scientists and machine learning engineers do not get to interact directly as much and rely on team productivity and project management tools like Slack, Teams, JIRA, Asana, etc.

For a lot of repetitive and common use cases, the use of such collaboration tools is actually a boon and saves the team a lot of time and effort. However, the transactional nature of relying on tools whose atomic units are tickets or tasks does not create a sense of team bonding and collaboration. In data science teams that rely heavily on such tools, this is a common grievance.

For more complex tasks or projects, in-person or video collaboration is a must and should not be ignored by the leadership. It is often in these settings that technical professionals learn of new use cases or clients from the business leaders, and business professionals in turn learn of a technical breakthrough that could solve up-and-coming business use cases. The same holds true for data scientists and machine learning engineers, where each party could learn of a new algorithm, model, or framework that makes data science more effective and productive.

Current industry trends

If a new version of the Harvard Business Review article in [1] were to be published in 2021, it would claim “machine learning engineer” as the sexiest job of the 2020s. While data science and model development remain lucrative across industry and academia, in recent years the focus in the industry has shifted somewhat toward building scalable and reliable infrastructure to serve data science models to millions of customers.

As of today, the machine learning engineer role is in much greater demand than that of a data scientist across the tech industry.

Industry leaders have learned that while it is great to have large, complex machine learning and deep learning models achieve state-of-the-art performance on academic benchmarks or training data, these do not yield any commercial value to the business until deployed and serving customer requests reliably and quickly at a high level of accuracy.
As more enterprises are becoming data-driven companies and establishing data science and machine learning teams or organizations [3], it is imperative for them to measure and achieve the required levels of ROI.
Big tech customer-focused companies that were early to venture into and invest in AI have already built strong teams of scientists and are now looking to enhance the production capabilities and commercialize the R&D artifacts developed by the data and research scientists.
Whilst top data scientists, especially those with advanced degrees like PhDs, will always be in high demand, we are currently witnessing a job market that is seeking skilled machine learning engineers, whose supply is limited compared to that of data scientists.

The transition from Data Scientist to Machine Learning Engineer

There are numerous online courses on learning platforms like Coursera, Udacity, and Udemy, but there is a relative paucity of instructors and content focused on machine learning engineering practices. While data science models can be built in a sandbox environment like Kaggle, where the models are not made to serve real-world predictions, scalable model deployment, monitoring, and related machine learning engineering tasks can only really be learned in a real-world industry setting. As machine learning engineering and MLOps are more applied disciplines, there are fewer experts who have the required skillset to build and maintain robust infrastructure.

At the same time, existing data scientists, lured by the promise of greater potential impact, better compensation, and long-term career prospects, are also seeking to transition into MLE roles.

As illustrated in Tables 1, 2, and 3, there is considerable overlap between the two roles. However, machine learning engineers focus on the “engineering” aspects of taking models to production, while data scientists focus on developing the right set of models for specific business problems. The most relevant skill that data scientists need to learn to become effective machine learning engineers is software engineering. This includes the ability to write optimized code (preferably in C++), test it rigorously, and understand, build, and operate existing or custom tools and platforms for reliable model deployment and management.

It is definitely possible for data scientists to learn C++ and best practices in software engineering and software testing from multiple sources, as well as to pick up new tools and technologies like Docker, Kubernetes, ONNX, and model serving platforms. However, since companies require machine learning engineers to have prior relevant experience, it becomes practically infeasible for data scientists to justify a machine learning engineering profile if they do not have real-world, hands-on experience in industry settings.

Given the chicken-and-egg nature of this problem, the best avenue for existing data scientists to transition to machine learning engineering is with their current employer. If data scientists express interest in machine learning engineering to their managers and are allowed to shadow, assist, or collaborate with machine learning engineers on specific projects, it becomes easier to make an internal transition within the same company. The same problem is a challenge for fresh graduates without any prior industry experience. For them, the recommended pathway is similar: start in a data science or software engineering role and then transition internally to machine learning engineering.

As the industry matures and companies evolve their machine learning systems and associated processes like hiring and upskilling, it will become easier for more candidates to make the transition from data science to machine learning engineering.

Conclusion

AI is a cornerstone of modern enterprise. This AI revolution has accelerated significantly over the last decade and resulted in huge unmet demand for data science professionals. Data science as a discipline has also evolved, creating distinct profiles focused on data, modeling, and engineering, as well as product and customer success management. Of these profiles, machine learning engineers play a critical role: they bring to fruition the models that data scientists develop, using data prepared by data engineers, for use cases identified by product or business managers.

Currently, the demand for machine learning engineers is similar to the demand for data scientists a decade ago. Such changes in the scope and nature of profiles in the AI industry will continue to happen and will present challenging new opportunities for engineers, scientists, and business professionals to get their foot in the door.