Building a successful machine learning project takes a variety of expertise, which makes it a highly collaborative job. To design a problem statement, a company has to ask critical questions such as:
- Why build an ML model?
- How to get the required data?
- How to design the systems infrastructure?
ML has been around for a while, but there’s still some confusion about job titles and the responsibilities that come along with them. In this post, we will discuss the evolution of an ML project, how roles are distributed in an ML team, and how they collaborate.
How does an ML project progress?
Roles in an ML team align closely with how a project is designed. Typically, in product-based companies, it goes like this:
What feature/component (requiring ML) to design?
Product managers, drawing on insights from analysts and other business stakeholders, decide which high-impact features could benefit from machine learning.
How do we design the business problem as a data science problem?
Any business requirement needs to be translated into a data science problem. For example, consider an online fashion website that is seeing a large number of clothing items being returned. The analysts found that this happens because customers tend to choose a size that is too small or too large when ordering. The data science approach would be to train a model on user and product information and use it to recommend the correct size.
However, just because the project makes sense, it doesn’t mean that the company should pursue it straight away. The business team first estimates the savings that shipping an ML model would generate. Then, since the company has limited human resources, it has to decide how feasible this project is at this point in time compared to other projects. These decisions are based on the company’s strategy, vision, and the calculated ROI.
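A back-of-the-envelope version of that savings estimate might look like the sketch below. The function name, its parameters, and every number are hypothetical placeholders; a real estimate would use figures supplied by the business team.

```python
def estimated_monthly_savings(
    returns_per_month: int,
    size_related_share: float,
    expected_reduction: float,
    cost_per_return: float,
) -> float:
    """Rough monthly savings from avoiding size-related returns.

    All inputs are assumptions, not measured values.
    """
    avoidable_returns = returns_per_month * size_related_share * expected_reduction
    return avoidable_returns * cost_per_return


# Hypothetical inputs: 10,000 returns a month, 40% caused by a wrong size,
# the model is assumed to prevent 25% of those, and each return costs 8.0.
savings = estimated_monthly_savings(10_000, 0.40, 0.25, 8.0)
print(f"Estimated monthly savings: {savings:.2f}")
```

Comparing a number like this against the cost of building and maintaining the model is one simple way to rank the project against other candidates.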
Generally, senior and lead data scientists in a company work with business managers to define a clear data science problem.
Do we have the data and the pipelines required for it?
Quite often, framing a problem as an ML problem is not enough. You need to make sure you have the right type, quantity, and quality of data, and the right infrastructure to ingest and store it. This often requires integrating multiple cloud services, databases, and APIs. In the above example, you might want to record the details of the product that was returned, along with user data like age, gender, and past purchases. In a team, data engineers have answers to almost every question about where the data is stored and how it is processed.
Is the data quality good enough?
In the above example, if the product images are of low quality and resolution, using them as model inputs would not be optimal. Similarly, user attributes like height, weight, and preferences can change over time, and users sometimes order for someone else from their own account. Thus, getting relevant data is not as straightforward as it may seem. Analysts and data scientists usually inspect the data to check if it is in the right form.
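As a rough illustration of such an inspection, the sketch below flags missing and implausible values in a few user records. The records, field names, and plausible ranges are all assumptions made up for this example.

```python
from typing import Optional

# Hypothetical user records; None marks a missing value.
users = [
    {"user_id": 1, "height_cm": 172.0, "weight_kg": 68.0},
    {"user_id": 2, "height_cm": None, "weight_kg": 75.0},
    {"user_id": 3, "height_cm": 17.2, "weight_kg": 61.0},  # likely a unit or typing error
]

# Assumed plausible ranges; a real project would agree on these with domain experts.
PLAUSIBLE_RANGES = {"height_cm": (120.0, 230.0), "weight_kg": (30.0, 250.0)}


def find_issues(record: dict) -> list[str]:
    """Return human-readable data quality issues for one record."""
    issues = []
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value: Optional[float] = record.get(field)
        if value is None:
            issues.append(f"{field} is missing")
        elif not low <= value <= high:
            issues.append(f"{field}={value} is outside [{low}, {high}]")
    return issues


report = {u["user_id"]: find_issues(u) for u in users}
for user_id, issues in report.items():
    print(user_id, issues or "ok")
```

Checks like these do not catch every problem (an account used to order for someone else looks perfectly valid), but they make the obvious gaps visible early.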
What kind of annotation and data procurement efforts are required?
Although much progress has been made in unsupervised learning and transfer learning, many industry problems are supervised and need a considerable quantity of annotated data. Annotations can be tricky too. Annotating images of cats and dogs is easier than annotating the sentiment of a tweet, because different annotators bring their own biases to the task. Companies that need frequent annotation either have a dedicated team of annotators or outsource the work to annotation companies.
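One common way to quantify how much two annotators disagree is Cohen’s kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal standard-library implementation; the tweet labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels for the same six tweets.
annotator_1 = ["pos", "neg", "neg", "pos", "neu", "pos"]
annotator_2 = ["pos", "neg", "pos", "pos", "neu", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

A value near 1 means the annotators agree almost perfectly, while a value near 0 means they agree no more often than chance would predict, which is a sign that the labeling guidelines need work.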
Do we have resources (employees, computational, time) to solve it?
As different sub-fields in applied ML (NLP, CV, RL, Speech) grow deeper, it is hard for a single data scientist to be an expert in each of them. This is why large companies have multiple data science teams, each dealing with a smaller part of the product.
How to roll out the model to the users?
Calling .fit() and .predict() works well in notebooks, but in real life, there are a lot of engineering and user experience concerns when deploying a model. For example, generating recommendations when a user opens an app could increase the load time. Instead, one might choose to update the recommendations overnight, when the majority of users are not on the app, and store them on their devices. Usually, ML engineers and developers are behind these efforts.
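The overnight approach boils down to separating a slow batch job from a fast lookup at request time. The sketch below shows the idea under simplifying assumptions: the "model" is a toy rule rather than a trained one, a plain dictionary stands in for on-device or server-side storage, and all names are hypothetical.

```python
from typing import Callable


# Stand-in for a trained model: any callable mapping user features to a size.
# The rule below is purely illustrative and not a real sizing model.
def toy_size_model(features: dict) -> str:
    return "M" if features["height_cm"] < 180 else "L"


def precompute_recommendations(
    users: dict[int, dict], model: Callable[[dict], str]
) -> dict[int, str]:
    """Nightly batch job: score every user once and return a lookup table."""
    return {user_id: model(features) for user_id, features in users.items()}


def get_recommendation(cache: dict[int, str], user_id: int, default: str = "M") -> str:
    """Request-time path: a cheap dictionary lookup instead of a model call."""
    return cache.get(user_id, default)


# Hypothetical user features.
users = {101: {"height_cm": 165}, 102: {"height_cm": 188}}
cache = precompute_recommendations(users, toy_size_model)  # run overnight
print(get_recommendation(cache, 102))  # served from the cache when the app opens
print(get_recommendation(cache, 999))  # unknown user falls back to the default
```

The trade-off is freshness: a cached recommendation can be up to a day old, which is usually acceptable for something as stable as clothing size.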
What metrics do we track to measure the success of the model?
Perhaps the most important thing to decide is which metrics indicate that the model was successful in its objective. In our fashion store example, we would record whether a user ordered the size we recommended. In either case, did they return the item? If they took the recommendation and still had to return the item because of an incorrect size, our model made a mistake. Similarly, if they didn’t take our recommendation and did not return the item, our model might have made a mistake there too. Metrics are decided at the very beginning of a project.
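These outcomes can be tallied from an order log. The sketch below is one possible way to do it; the log entries, field names, and the choice of three summary rates are assumptions for illustration.

```python
from collections import Counter

# Hypothetical order log: did the user follow the recommended size,
# and was the item later returned because of an incorrect size?
orders = [
    {"followed": True, "size_return": False},
    {"followed": True, "size_return": True},
    {"followed": False, "size_return": False},
    {"followed": False, "size_return": True},
    {"followed": True, "size_return": False},
]


def rate(numerator: int, denominator: int) -> float:
    """Safe ratio that returns 0.0 when there is nothing to divide by."""
    return numerator / denominator if denominator else 0.0


outcomes = Counter((o["followed"], o["size_return"]) for o in orders)
followed = outcomes[(True, False)] + outcomes[(True, True)]
ignored = outcomes[(False, False)] + outcomes[(False, True)]

# Share of orders where the user took the recommendation.
adoption_rate = rate(followed, len(orders))
# Size-related returns among users who followed the recommendation (model mistakes).
return_rate_followed = rate(outcomes[(True, True)], followed)
# Size-related returns among users who ignored it, as a comparison point.
return_rate_ignored = rate(outcomes[(False, True)], ignored)

print(f"adoption={adoption_rate:.2f}")
print(f"returns when followed={return_rate_followed:.2f}")
print(f"returns when ignored={return_rate_ignored:.2f}")
```

If the return rate among users who followed the recommendation is not clearly lower than among those who ignored it, the model is probably not helping.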
Roles in an ML team
A single person cannot answer all the above questions. Hence, a mature ML team typically consists of the following:
- Data Analysts
- Data Engineers
- Data Scientists
- Research/Applied Scientists
- ML Engineers
- Developers
We will discuss each of these roles in detail.
Data Analysts – They work closely with product managers and business teams to derive insights from user data, which are then used to drive the product roadmap. Typically, they use tools like SQL, Excel, and a range of data visualization tools like Tableau and Power BI. Their core skills include analyzing data using descriptive and inferential statistics. Their role is divided by the function they operate in. For example, a marketing analyst would analyze the marketing efforts and outcomes of a company, while a risk analyst would analyze fraud patterns in credit cards.
Data Engineers – Data Engineers make sure that the infrastructure to collect, transform, and store data is well-built. They manage how data from the application is ingested and moved across databases and other storage systems, and their key skill set includes tools like Spark or Hadoop for handling large volumes of data. They also work with cloud platforms to build data warehouses. In addition, they are responsible for ETL (Extract, Transform, Load) jobs, which essentially means taking data from a source, processing it, and storing it in a data warehouse. Since they handle computationally intensive tasks, they are often expected to know distributed systems fundamentals, data structures, and algorithms.
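To make the three ETL steps concrete, here is a toy sketch using only the Python standard library. The CSV content and column names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse; production pipelines would typically use tools like the ones mentioned above.

```python
import csv
import io
import sqlite3

# Extract: in practice this would come from an application database or an API.
# The CSV content and column names here are hypothetical.
raw_csv = io.StringIO(
    "order_id,size,returned\n"
    "1, m ,yes\n"
    "2,L,no\n"
    "3,,yes\n"
)
rows = list(csv.DictReader(raw_csv))

# Transform: normalize sizes, convert flags to integers, drop rows without a size.
clean_rows = [
    (int(r["order_id"]), r["size"].strip().upper(), int(r["returned"].strip() == "yes"))
    for r in rows
    if r["size"].strip()
]

# Load: an in-memory SQLite database stands in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, size TEXT, returned INTEGER)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean_rows)
warehouse.commit()

loaded = warehouse.execute(
    "SELECT order_id, size, returned FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```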
Data Scientist – This is everyone’s favorite, mostly because it is so overused. Generally, they are responsible for analyzing, processing, and interpreting data. Data scientists also use advanced statistics to derive insights from it and communicate their findings to business stakeholders. Besides, they build ML models which become a part of the product. Depending on the scope of the job, they are required to know Statistics, SQL, Python/R, and Machine Learning.
Research Scientist – Companies that focus on bleeding-edge technologies often have this role. Generally, research scientists develop new algorithms for various product-related areas. However, they are not necessarily required to translate their work into production models. A research scientist usually has specialized knowledge in NLP, computer vision, speech, or robotics, acquired through a Ph.D. or extensive research experience. Their job requires them to conduct research, publish papers, and solve hard research problems.
Machine Learning Engineer – Their work overlaps with that of data engineers, but they focus largely on ML models and the related infrastructure. Their job is to build tools for updating models and to create prediction interfaces for end-users. They work closely with data scientists and deploy the models they build. ML Engineers leverage cloud services as well as open-source tools like Cortex or FastAPI to create endpoints. Often their work spans multiple deployment scenarios, such as cloud, on-prem, and edge deployments. To handle scalability, they often need to know Docker and orchestration platforms like Kubernetes.
Developers – You have your model, you have your results and you have your infrastructure. Now what? The last piece of the puzzle, arguably one of the most important ones, is to integrate everything with the main application. That’s where you need backend developers. They often design the APIs and format the model prediction into something user-friendly.
ML teams in startups
In a mature company like Google or Amazon, finding these specific roles is common. However, if you are working for a smaller startup, chances are, you might not find a separate research scientist or an ML engineer. In such companies, one or more data scientists are expected to handle almost all components – analysis, model development, and deployment. You may call such people Full Stack Data Scientists.
There are pros and cons to such a setting. On one hand, the company saves costs, as it is cheaper to pay someone a little more than it is to hire multiple people for each part. If you work in such a company, you also get to learn a breadth of technical skills, and communication is easier in smaller teams. On the other hand, having different people specialize in different parts of the project helps ensure high-quality work that follows best practices.
How do ML teams collaborate?
We now have a fair idea of how an ML project progresses. But how do team members communicate and run their workflows? There are tools made especially for collaboration.
Project description – Slite/Notion/Google Docs
Tools like Slite, Notion, and Google Docs help maintain documentation of the project, its scope, requirements, and responsibilities. Working in teams requires access to a common source of truth.
Often, software and ML teams create detailed documentation (sometimes called Business Requirement Documentation or BRD) that outlines the exact outcome and the objective of a certain project. This practice helps in getting everyone on the same page.
For example, if a company wants to start updating an existing recommendation system, it needs to think through the entire project guideline: the status of the current recommendation system, the current metrics, the problems the company wants to focus on in the new release, and who will work on the project. Tools like Slite and Notion allow users to create tables, kanban boards, checklists, and well-formatted headings and paragraphs. Once everyone is clear on the background, roles and responsibilities are assigned and the project timeline is finalized.
Code collaboration – GitHub
If you are a developer, you are probably familiar with GitHub. It hosts millions of open-source projects used by millions of developers around the world, and popular projects can have hundreds of contributors. That gives a sense of how easy GitHub makes it to collaborate on code.
It not only lets you push code and submit PRs but also supports CI/CD workflows and enables you to collaborate on issues and code-related documentation.
MLOps metadata store – Neptune
Unlike developers, data scientists need to track not only their code but also the results of various ML experiments, hyper-parameters, and different versions of datasets. Neptune allows them to track all of it through its client API (using just a few lines of code), which integrates with popular ML libraries and hyper-parameter tuning frameworks.
Moreover, Neptune provides capabilities to version and log your models and datasets, and to create your own CI/CD pipeline for automated deployments.
The platform logs the metrics, CPU/GPU utilization, code, and model files for you so that you can focus more on experimentation.
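To give a feel for what experiment metadata consists of, the sketch below builds a single run record with the Python standard library. This is not Neptune’s client API; every function and field name here is hypothetical and only illustrates the kind of information (parameters, dataset version, metric series) a metadata store keeps for each run.

```python
import hashlib
import json
import time


def dataset_fingerprint(rows: list[dict]) -> str:
    """Short content hash so that a run can be tied to an exact dataset version."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def make_run_record(name: str, params: dict, rows: list[dict]) -> dict:
    """Collect the metadata of one experiment in a JSON-serializable dict."""
    return {
        "name": name,
        "started_at": time.time(),
        "params": params,
        "dataset_version": dataset_fingerprint(rows),
        "metrics": {},
    }


def log_metric(run: dict, key: str, value: float) -> None:
    """Append a value to a metric series, such as the loss per epoch."""
    run["metrics"].setdefault(key, []).append(value)


# Hypothetical experiment: names, parameters, and values are made up.
data = [{"height_cm": 165, "size": "M"}, {"height_cm": 188, "size": "L"}]
run = make_run_record("size-model-baseline", {"learning_rate": 0.01, "epochs": 2}, data)
for loss in (0.9, 0.6):
    log_metric(run, "train/loss", loss)
print(json.dumps(run, indent=2))
```

A dedicated tool adds what this sketch lacks: central storage, a UI for comparing runs, and automatic capture of code and hardware metrics.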
In conclusion, the ML ecosystem has developed enough to identify the key roles and responsibilities involved in an ML project. However, depending on the company type and size, the scope of work each job title entails can vary. If you are setting up a team or looking to improve your team’s output, there are many tools and role structures to choose from that can make workflows more efficient by easing collaboration within and across teams.
How to Improve the Collaboration in the ML/DS Team?
As a run tracking hub, Neptune provides several features for enabling knowledge sharing and collaboration among members of your data science team. With it, team members can:
- have every piece of every run or notebook of every teammate in one place,
- see and compare all the teams’ experiments and models,
- see what everyone on the team is working on,
- share a view on a project or any of its parts, by simply copying and pasting the URL to it,
- collaborate with other team members on the results.
- Seeing all model training metadata in one place
- Comparing model training runs
- Seeing model training runs live
- Being able to reproduce model training runs
- Have a central registry for the models, runs, and notebooks,
- Check how the model was built,
- Find and fetch information they need for putting model in production.