In this article, you will get a compact overview of MLOps and its stages. You will also see when MLOps is an organizational and communication problem, when it is a technical problem, and how to resolve each kind of challenge.
What is MLOps?
MLOps is closely inspired by the concept of DevOps, where the development team (Dev) and the operations team (Ops) collaborate through a systematic and standardized process.
MLOps, the combination of Machine Learning and Operations, is the initiative to combine the development and production counterparts of any data science project. In other words, it seeks to introduce structure and transparency into the end-to-end ML pipeline so that data scientists can work in an organized way and interact smoothly with the data engineers and the technical and non-technical stakeholders on the production end.
Why is there resistance to implementing MLOps?
Data Science teams are relatively new in the corporate world and the process of any data science project still cannot be charted properly. A lot of research and iterations with data are involved which cannot be tracked using a fixed system. Unlike regular engineering teams, the process and methods of an ML team are not absolute and often have a constant question hanging in the air: if this does not work, what will?
That is why data science teams often feel they are restricted when terms, conditions, and timelines are laid upon their methods.
Why is MLOps necessary?
Imagine having two dry pieces of bread with a lot of filling between them. It will get messy and there is a high probability that most of the stuff will fall off. The solution to this is to add something sticky like cheese that holds the two pieces and the fillings together. MLOps is just that: the tool to hold the entire process of a Machine Learning project together.
Just like the two pieces of bread, the modeling and the deployment teams are often out of sync with a lack of understanding of each other’s needs and methods of work. Without MLOps to hold the pieces together, the efforts and the hard work from both ends are bound to fall off before reaching the end consumer.
It is estimated that around 60% of machine learning projects are never implemented. Being a part of the data science community, I have witnessed several projects that could not reach the end customer because of non-scalable and inefficient frameworks. The science of data is extremely important for predictions, but if the predictions stay on the data scientist’s own system, they are of no use to the end customer.
Even though MLOps is still a concept that is being experimented with, it can shorten the timelines of ML projects considerably and ensure that a higher percentage of projects get deployed. Often, ML teams across the globe have the persistent complaint that a project, on average, takes up to 6 months to be developed and deployed.
With MLOps, the time a project takes from development to deployment can be reduced significantly. And since data science teams are only now stepping into the IT world, this is the perfect time to analyze the process and establish the right order.
How is MLOps similar to DevOps? (Benefits and challenges)
The concept of MLOps has been modeled after DevOps and therefore, there are some core similarities between the two:
- CI/CD Framework – Continuous Integration and Continuous Delivery is the core framework of both software and machine learning solutions since improvements are continuously suggested and added to the existing code.
- Agile Methodology – The agile methodology is designed to make the process of continuous integration and improvement less cumbersome by dividing the end-to-end process into multiple sprints. The benefit of using sprints is that the result of every sprint has high transparency and is distributed across all teams and stakeholders involved in the project. This ensures that there are no last-minute surprises during the later stages of solution development. Overall, the agile methodology solves the following problems:
  - Lengthy development-to-deployment cycle – The primary delay occurs when the development team builds an entire solution and hands it over to the deployment team as a whole. There is a wide gap in this transfer since a lot of dependencies are often left unresolved. With the agile methodology, the hand-offs happen sprint by sprint and the dependencies are cleared during the first few stages of development itself.
  - Lack of sync between teams – A data science project does not include just a development and a deployment team. A lot of stakeholders like project managers, customer representatives, and other decision-makers are also involved. Without MLOps, the visibility of an ML solution drops sharply and the stakeholders only get a taste of it towards the later stages of development. This has the potential to derail the entire project, since business-oriented teams often provide expert advice that can turn the whole solution around. Through the agile methodology, their inputs can be taken right from the start and considered across all sprints.
How is MLOps different from DevOps?
Despite the similarity between the core frameworks of MLOps and DevOps, some stark differences make MLOps distinct and more difficult to deal with. The key reason is data. Variations in data add extra hassle to MLOps. For instance, a change in data can call for a different combination of hyperparameters, which can make the modeling and testing modules behave entirely differently. Let’s look at the different stages of MLOps to understand the differences closely.
The stages of MLOps
In ML projects, there are five umbrella stages:
- Use case identification: This is when the development and business teams collaborate with the client to understand and formulate the right use case for the problem at hand. This might involve the preparation of plans and software documentation, and getting them reviewed and approved from the client’s end.
- Data engineering and processing: This is the first technical touchpoint, where data engineers and data scientists look at the various sources from which data can be derived for solution development. Identifying the source is followed by building the pipeline that delivers the data in the correct format and also checks the data sanity in the process. This requires close collaboration between data scientists and data engineers, and also sign-off from other teams, such as the consultancy team, that might be using the same data master.
- ML solution building: This is the most intense component of the MLOps life cycle, where the solution is both developed and deployed in continual cycles with the help of a CI/CD framework. An ML solution involves multiple experiments with multiple versions of the model and data. Every version must therefore be well tracked so that one can easily refer to the optimal experiment and model version at later stages. Development teams often do not track these experiments with a standard tool and end up jumbling the process, which increases the work and time required and, in all probability, lowers the performance of the final model. Every module that is successfully created should be deployed iteratively while leaving ample room for changing core elements such as hyperparameters and model files.
- Production deployment: Once the ML solution is built, tested, and deployed in local environments, it should be connected to and deployed on the production server, which can be on a public cloud, on-premises, or hybrid. This is followed by a short test that checks whether the solution works on the production server, which it usually does since the same solution has already been tested on the local server. However, sometimes a mismatch in configurations makes the test fail, and the developers must promptly pinpoint the reason and find a fix for it.
- Monitoring: After the solution is successfully deployed on the production server, it must be tested on a few batches of new data. Depending on the data type, the crucial time for monitoring can range from a few days to a few weeks. Some specific and pre-decided target metrics are used to identify if the model keeps serving the purpose. Triggers are used for prompt alerts when the performance is below expectations and calls for optimization by going back to the third stage of solution building.
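The trigger described in the monitoring stage can be sketched in a few lines. This is a minimal illustration rather than a prescribed implementation: the function name `check_batches`, the rolling window of three batches, the 0.85 target, and the accuracy values are all assumptions made up for the example.

```python
from statistics import mean


def check_batches(batch_scores, threshold, window=3):
    """Return the index of the first batch at which the rolling mean of the
    last `window` scores falls below `threshold`, or None if it never does."""
    for end in range(window, len(batch_scores) + 1):
        if mean(batch_scores[end - window:end]) < threshold:
            return end - 1
    return None


# Hypothetical target metric (accuracy) measured on successive batches of new data.
scores = [0.91, 0.90, 0.89, 0.84, 0.80, 0.78]

alert_at = check_batches(scores, threshold=0.85)
if alert_at is not None:
    # In a real setup this would notify the team instead of printing.
    print(f"Alert at batch {alert_at}: go back to solution building.")
```

Averaging over a small window instead of reacting to a single batch is a design choice that avoids alerts on one noisy batch; the right window depends on how quickly the data changes.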
When MLOps is an organizational and communication challenge
- Long chain of approvals – For every change that needs to be reflected on the production server, approval must be obtained from the relevant authorities. This takes a long time since the verification process is lengthy, ultimately delaying the development and deployment plans. The problem is not limited to production servers, however. It also affects the provisioning of company resources and the integration of external solutions.
- Provisioning within budget – Sometimes, the development teams cannot use the company’s resources because of budget limitations or because the resource is shared across multiple teams. Sometimes, ML solutions also require new resources, for instance, high-powered computing or huge storage capacity. Such resources, even though vital for scaling ML solutions, fall outside most organizations’ budgeting criteria. In that case, ML teams have to find a workaround, often a suboptimal one, that makes the solution work as well as possible.
- ML stack that differs from the homegrown framework – The software deployment framework that most companies have been working on might be suboptimal or even irrelevant for deploying ML solutions. For instance, a Python-based ML solution might have to be deployed through a Java-based framework just to comply with the company’s existing system. This can create double work for the development and deployment teams, since they have to replicate most of the codebase, which taxes both resources and time.
- Security go-ahead – A machine learning model or module is almost always part of a larger library. For example, a version of the random forest model is coded in the scikit-learn library, and users can simply use a function call to kickstart the model training. Most often, users are unaware of what the code is doing, which sources other than the data it taps into, or what it sends out. Therefore, to ensure security, the code of encapsulated functions needs to be checked by the relevant security teams before it is uploaded to the production server. Furthermore, on-cloud security APIs, the company’s security infrastructure, and licensing factors are time-intensive and need to be taken care of.
- Sync across all teams concerning data – A common data master obtained from a customer is used by multiple teams like sales teams, consultancy teams, and of course, the development teams. To avoid discrepancies in the data, a common set of column definitions and a common data source must be maintained. Even though this sounds doable, in real-world scenarios, due to the involvement of multiple contributors, this often becomes a challenge leading to a lot of back and forth between teams to resolve internally created data issues. An example of such an issue could be that after building the entire solution, the developer realizes that one of the columns is being populated differently by the customer onboarding team.
The solutions to resolve the organizational and communication challenges in MLOps
- Virtual environment for ML stacks – For new companies and startups that have been implementing the agile methodology and have adopted most of the relevant frameworks, an outdated homegrown framework is not a problem. However, companies that are relatively old and run uniformly on a previously built framework might not see the best results from ML teams in terms of resource optimization. This is because the teams will be busy figuring out how to best deploy their solution through the available framework. Once figured out, they have to repeat the suboptimal process for every solution that they want to deploy.
A long-term fix for this is to invest resources and time into creating a separate ML stack that can integrate into the company framework, but also reduce work on the development front. A quick fix is to package ML solutions in isolated environments and deploy them to the end customer that way. Containerization and orchestration tools such as Docker and Kubernetes are extremely useful in such cases.
- Cost-benefit analysis – To reduce the long line of approvals and budget constraints, the Data Science development teams often need to delve into the business side and do a thorough cost-benefit analysis of limiting provisions vs the return on investment from working Data Science solutions that can run on those provisions. The teams might need to collaborate with the business operations team to get accurate feedback and data. The key decision-makers in organizations have either a short-term or a long-term profit-oriented view, and a cost-benefit analysis that promises growth can be the driving factor that opens up some of the bottlenecks.
- Common key reference – To maintain transparency between multiple teams that operate on the master data from the customer, a reference key can be used, since a common data source is not practical in situations where the dev teams have to duplicate and download the data multiple times. A common key for column definitions might not be the ideal way to avoid data discrepancies, but it is an improvement on the existing process, where a data column can be easily misinterpreted, especially when the column names are not self-explanatory.
- Using verified code sources – To shorten the time taken for a security check while uploading or approving machine learning libraries on production servers, the developers can restrict their code references to verified codebases such as TensorFlow and scikit-learn. Contrib libraries, if used, must be thoroughly checked by the developer to verify the input and output points. This is because, if a security concern comes up, the cycle of re-development and security check restarts and can slow down the process considerably.
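To make the common key reference more concrete, here is a minimal sketch of a shared column-definition key and a check that flags rows deviating from it. Everything in it is hypothetical and chosen only for illustration: the column names, their types and meanings, and the helper `find_discrepancies`.

```python
# Hypothetical shared reference key: column name -> (expected type, agreed meaning).
COLUMN_KEY = {
    "cust_id": (int, "Unique customer identifier"),
    "signup_dt": (str, "Signup date in ISO format, YYYY-MM-DD"),
    "mrr": (float, "Monthly recurring revenue in USD"),
}


def find_discrepancies(rows, key=COLUMN_KEY):
    """Compare each row (a dict) against the shared key and list every mismatch."""
    problems = []
    for n, row in enumerate(rows):
        for col in sorted(set(key) - set(row)):
            problems.append(f"row {n}: missing column '{col}'")
        for col in sorted(set(row) - set(key)):
            problems.append(f"row {n}: undocumented column '{col}'")
        for col, (expected_type, _meaning) in key.items():
            if col in row and not isinstance(row[col], expected_type):
                actual = type(row[col]).__name__
                problems.append(
                    f"row {n}: column '{col}' has type {actual}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


# The second row mimics another team populating cust_id as text.
rows = [
    {"cust_id": 1, "signup_dt": "2021-01-05", "mrr": 49.0},
    {"cust_id": "2", "signup_dt": "2021-02-01", "mrr": 10.0},
]
problems = find_discrepancies(rows)
for line in problems:
    print(line)
```

Running a check like this when data is handed over between teams surfaces a differently populated column early, instead of after the entire solution has been built.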
When MLOps is a technical challenge
The primary technical challenge for MLOps appears in the third stage where the solution is being developed and deployed in parallel. Here are a few technical challenges that are specific to MLOps and quite removed from DevOps:
- Hyperparameter versioning – Every ML model has to be tested with multiple hyperparameter combinations before the optimal solution is found. However, that is not the primary challenge. A change in incoming data can degrade the performance of the chosen combination, and the hyperparameters must then be re-tuned. Even though the code and hyperparameters are controlled by the developers, the data is the independent factor that influences the controlled elements. It must be ensured that every version of the data and the hyperparameters is tracked so that the optimal result can be found and reproduced with minimum hassle.
- Multiple experiments – With changing data, a lot of code also has to be modified across data processing, feature engineering, and model optimization. Such iterations cannot be planned and add to the time required in building the ML solution, impacting all the following stages and planned timelines.
- Testing and Validation – The solution must be tested and validated across multiple unseen datasets to understand whether it can work in the production environment with incoming data. The challenge here is that the trained model may be skewed towards the training data set. To avoid this, testing and validation must be done in iterations during model creation itself. Given that the number of modeling iterations is not fixed, the time taken for testing with unseen datasets keeps increasing with the number of iterations.
- Post-deployment monitoring – The primary challenge here is that new data may not be consistent with historical data patterns. For instance, during the pandemic, the stock market curves defied all historical patterns and went down in a way a machine learning solution could not have predicted. Such external factors often come into play and the data must be collected, updated, and processed accordingly to maintain the recency of the solution.
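One simple way to notice that new data no longer follows historical patterns is to compare a feature’s mean in a new batch against its historical distribution. The sketch below is a deliberately basic check and not a method taken from this article: the z-score approach, the limit of 3.0, and the sample values are assumptions, and production systems usually rely on more robust drift tests.

```python
from math import sqrt
from statistics import mean, stdev


def mean_shift_z(historical, new_batch):
    """z-score of the new batch mean relative to the historical values."""
    standard_error = stdev(historical) / sqrt(len(new_batch))
    shift = mean(new_batch) - mean(historical)
    if standard_error == 0:
        # Constant history: any change at all counts as an infinite shift.
        return 0.0 if shift == 0 else float("inf")
    return shift / standard_error


def has_drifted(historical, new_batch, z_limit=3.0):
    """Flag the batch when its mean is more than `z_limit` standard errors away."""
    return abs(mean_shift_z(historical, new_batch)) > z_limit


# Hypothetical values of one numeric feature.
historical = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
stable_batch = [10, 9, 11, 10]
shifted_batch = [15, 16, 14, 15]

print(has_drifted(historical, stable_batch))
print(has_drifted(historical, shifted_batch))
```

A flagged batch is a signal to collect, update, and reprocess the data, and possibly to retrain, rather than proof that the model has failed.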
The one-stop solution for addressing most of the tech challenges is documenting the entire process systematically, so that there is high visibility and ease of access and navigation. Logging, storing, organizing, and comparing the model versions and outputs is key to getting optimal visibility and results from dozens of iterations. The Neptune.ai platform manages all model-building metadata in a single place, and in addition to logging, storing, displaying, organizing, and comparing it, the platform also lets you query all the MLOps metadata.
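To illustrate what tracking every version of the data and hyperparameters can look like, here is a small standard-library sketch. It is not Neptune’s API or any other tool’s: the `RunLog` class, the `fingerprint` helper, and the sample data and metrics are hypothetical, and a dedicated tracking tool would replace this in practice.

```python
import hashlib
import json


def fingerprint(obj):
    """Short, stable hash of any JSON-serializable object (used as a data version)."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class RunLog:
    """In-memory log of experiments: data version, hyperparameters, and metric."""

    def __init__(self):
        self.runs = []

    def log(self, data, hyperparams, metric):
        record = {
            "data_version": fingerprint(data),
            "hyperparams": dict(hyperparams),
            "metric": metric,
        }
        self.runs.append(record)
        return record

    def best(self, data_version=None):
        """Highest-metric run, optionally restricted to one data version."""
        candidates = [
            run for run in self.runs
            if data_version is None or run["data_version"] == data_version
        ]
        return max(candidates, key=lambda run: run["metric"]) if candidates else None


# Hypothetical data snapshots and validation scores.
data_v1 = [[1.0, 2.0], [3.0, 4.0]]
data_v2 = [[1.0, 2.0], [3.0, 4.5]]

log = RunLog()
log.log(data_v1, {"max_depth": 3}, 0.81)
log.log(data_v1, {"max_depth": 5}, 0.86)
log.log(data_v2, {"max_depth": 5}, 0.79)

print(log.best())
print(log.best(data_version=fingerprint(data_v2)))
```

Because the data version is derived from the data itself, a changed dataset automatically shows up as a new version next to the hyperparameters that were tried on it, which makes an optimal run easier to find and reproduce.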
MLOps, as a concept, is still novel, and even though it has barely touched the Data Science and Machine Learning community so far, it is making its mark as the need of the hour, especially as the number of ML implementations keeps growing rapidly. By resolving some of the organizational and communication issues, a lot of excess hassle can be lifted off MLOps to make it friendlier to both developers and operations teams.