What is MLOps?
MLOps, or DevOps for machine learning, is a practice that aims to improve project management, communication, and collaboration between ML teams and operations professionals in the development, deployment, and management of machine learning models. MLOps involves using tools and processes to automate the building, testing, and deployment of machine learning models, as well as the monitoring and management of these models in production, in order to improve productivity, repeatability, reliability, auditability, and data and model quality.
Although MLOps can provide valuable tools to help you scale your business, you might face certain issues as you integrate MLOps into your machine learning workloads. To implement MLOps, organizations may use a variety of tools and techniques, such as version control systems, CI/CD pipelines, build or packaging processes, and infrastructure as code (IaC) for provisioning or configuration.
Is it all about building everything as a process with code?
CI/CD is a continuous workflow that iterates over the development and deployment of a product to improve and upgrade it. Continuous workflows (CX) generalize this concept with the same goal. To manage the life cycles of machine learning products, we have to define new processes that were not a concern before, for example, continuous monitoring of models, evaluation of new model releases, and automatic data collection. It is worth mentioning that you also need to think about creating a platform that maintains these new processes, similar to how Git maintains its workflows and actions, since they also have their own life cycles.
Benefits for IoT edge projects
Overall, implementing MLOps in an IoT edge company is important because IoT systems often involve relatively large volumes of data, high levels of complexity, and real-time decision-making. MLOps can help to ensure that machine learning models are developed and deployed efficiently and that they remain reliable and accurate over time. It can also help to ensure that everyone is working towards the same goals and that any issues or challenges are identified and addressed in a timely manner. This enables the company to leverage the data generated by its IoT edge devices to drive business decisions and gain a competitive advantage.
Designing the MLOps system
It’s important to note that implementing MLOps practices can be challenging and may require significant investment in terms of time, resources, and expertise. With this in mind, cloud providers nowadays offer services and tools that cover many hypothetical scenarios and solutions. Nevertheless, the reality is often far from what you read in blogs and articles.
AWS offers a three-layered machine learning stack to choose from based on your skill set and your team’s requirements for implementing workloads that execute machine learning tasks. If the machine learning tasks required by your use cases can be implemented using AI services, then you don’t need an MLOps solution. On the other hand, if you use either the ML services or the ML frameworks and infrastructure, it is recommended that you implement an MLOps solution whose concept is determined by the details of your use cases. For example, consider how much the environment would differ if your code ran on AWS Greengrass in a container rather than being managed by the operating system, in either case deployed on edge devices.
Perform an assessment of requirements as the first step
To make a long story short, it comes down to you as an MLOps engineer to design an environment for all types of Ops demands in your company and to tailor it to the work and plans within and across teams. You also need a clear vision when designing new processes for the data, machine learning model, and code life cycles, based not only on IT standards but also on their feasibility.
I went through several rounds of interviews and discussions with different teams, mainly the ML team, to hear what they expected to achieve, and shared with them the expectations of the other teams as well. This assessment led to conceptualizing and designing a framework that offers an environment for building, managing, and automating processes or workflows, with which data, model, and code Ops can be realized based on the needs of individuals and teams.
The environment architecture is shown in figure 1 as a whole with all its resources. I break it down here and explain how things are plumbed, how it operates, and how an MLOps engineer can lead the development and deployment of a new process using and within it.
As the very first step, I would recommend creating a new role on AWS and setting up everything under this role as a good practice. A secured environment can be implemented using a custom virtual private cloud (VPC) and by implementing security groups and routing the internet traffic via the custom VPC. Moreover, this role needs to be granted the proper rights and permissions to create, for example, rules for events, Lambdas, layers, and logs. Let’s go through it step by step:
- Twin SageMaker notebooks with required permissions granted by the environment role:
– Development twin is an environment for prototyping and testing the workflow’s script, and it has permission to create/update/delete some resources like events, Lambdas, and logs.
– Deployment twin is strictly and solely for QA’ed processes and workflows that will be assembled together here.
- Redshift as the data source, which can be backend-connected to other sources and databases or replaced by any other type of data source such as S3 buckets or DynamoDB. The twin has permission to read from it and, if needed, write to it. As a use case, imagine you need to access different databases with a complex query and then write a summary to another table.
- Workflow consists of a few elements working together to automate a process:
– EventBridge as an automatic trigger for Lambda functions, based on cron-style schedules or events
– Lambda function for passing the script to the deployment twin for execution
– Backup Lambda function for the case of failure, which can be invoked from the development twin or manually
– CloudWatch for logging issues, tracking reports or setting up business intelligence
- SNS service for automatic pub/sub reporting via email.
- CloudWatch for cost tracking.
- GitLab hosts the repositories that maintain the workflow scripts (and the control plane resources), and it is set up so that deployment proceeds with a push request.
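To make the workflow element of this list more concrete, here is a minimal sketch of wiring an EventBridge schedule to a Lambda function with Boto3. It is an illustration under assumptions, not the exact setup of this environment: the function name `schedule_workflow`, the rule name, and the ARNs are hypothetical, and the clients are passed in as arguments so the logic can be exercised without an AWS account.

```python
def schedule_workflow(events_client, lambda_client, rule_name, cron, function_arn):
    """Create (or update) a scheduled rule and point it at a Lambda function.

    In practice the clients would be boto3.client("events") and
    boto3.client("lambda"); all names and ARNs here are placeholders.
    """
    # 1. Scheduled rule, e.g. "cron(0 6 * * ? *)" for every day at 06:00 UTC.
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=cron,
        State="ENABLED",
    )

    # 2. Allow EventBridge to invoke the function, restricted to this rule.
    #    Re-running with the same StatementId raises a conflict error, so a
    #    real toolbox would handle that case.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )

    # 3. Register the function as the target of the rule.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": f"{rule_name}-target", "Arn": function_arn}],
    )
    return rule["RuleArn"]
```

From the development twin, this could be called with `boto3.client("events")` and `boto3.client("lambda")`, provided the environment role grants the corresponding permissions.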
Walkthrough guide for development
- As a developer, I can log in to the development twin, write the script for a task or workflow, fetch or push data from different tables or buckets, and use the SageMaker SDK to do some orchestrator’s magic.
- After finishing the development, going through QA, and eventually updating the repository with the script, it is time to build the rest of the workflow, that is, to create triggers, Lambda functions, and log groups and streams.
- Figure 2 shows all the resources associated with the development twin’s role policy. I call the development twin the control plane since you can create, update, and delete all the resources needed for a workflow from it. “Boto3”, the AWS SDK for Python, lets you manage all these resources, and believe me, using CloudFormation or Terraform is too much of a hassle.
- I gradually put together my scripts and created a toolbox to interact with these resources. I also specified an S3 bucket for the twins, with versioning enabled, to hold the models or datasets that result from these processes or workflows.
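As an illustration of what one piece of such a toolbox might look like, the sketch below enables versioning on the artifact bucket and stores a model file, returning the version that S3 assigns. The function names, bucket, and key layout are assumptions made for this example; only the Boto3 calls `put_bucket_versioning` and `put_object` are real API calls.

```python
def enable_versioning(s3_client, bucket):
    """Turn on object versioning for the bucket shared by the twins."""
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )


def artifact_key(workflow, filename):
    """Hypothetical key layout: one prefix per workflow."""
    return f"{workflow}/{filename}"


def store_artifact(s3_client, bucket, workflow, filename, payload):
    """Upload a model or dataset and return its S3 version ID.

    On a versioned bucket, put_object responds with a "VersionId";
    on an unversioned bucket the field is absent and None is returned.
    """
    response = s3_client.put_object(
        Bucket=bucket,
        Key=artifact_key(workflow, filename),
        Body=payload,
    )
    return response.get("VersionId")
```

With `s3 = boto3.client("s3")`, calling `store_artifact(s3, "my-mlops-artifacts", "daily-eval", "model.pkl", data)` twice keeps both uploads retrievable under the same key (the bucket name here is a placeholder).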
Walkthrough guide for deployment
- Figure 3 gives you a good picture of all the pieces involved in a deployed workflow. After a workflow is assembled, the Lambda function that is invoked by the event passes the workflows repository to the SageMaker notebook and tells it which script or notebook needs to be executed.
- The Lambda function can communicate with the notebook via WebSocket (“Boto3” can even turn the notebooks on and off). The following piece of Python code shows how to interact with the notebook API. Note that the WebSocket library has to be added as a layer to the Lambda function. While the script executes, any issues are logged for tracking, and the final report is sent to the teams that are subscribed to the topic.
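A minimal sketch of this interaction, under stated assumptions, could look as follows. It relies on the notebook instance already being in service, on a Jupyter terminal being reachable at `/terminals/websocket/1`, and on the `requests` and `websocket-client` packages being supplied as a Lambda layer. The event fields `notebook_instance` and `script` and both helper functions are hypothetical names chosen for this example.

```python
import json
from urllib.parse import urlsplit


def terminal_endpoint(authorized_url):
    """Derive the terminal WebSocket URL and Origin header from a presigned URL."""
    parts = urlsplit(authorized_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    return f"wss://{parts.netloc}/terminals/websocket/1", origin


def stdin_message(command):
    """Jupyter terminals accept JSON lists of the form ["stdin", "<text>"]."""
    return json.dumps(["stdin", command + "\r"])


def lambda_handler(event, context):
    # Deferred imports: boto3 ships with the Lambda Python runtime, while
    # requests and websocket-client are assumed to come from a layer.
    import boto3
    import requests
    import websocket

    name = event["notebook_instance"]  # hypothetical event field
    script = event["script"]           # hypothetical event field

    url = boto3.client("sagemaker").create_presigned_notebook_instance_url(
        NotebookInstanceName=name
    )["AuthorizedUrl"]

    # Opening the presigned URL yields the session cookies Jupyter expects.
    session = requests.Session()
    session.get(url)
    cookies = "; ".join(f"{k}={v}" for k, v in session.cookies.items())

    ws_url, origin = terminal_endpoint(url)
    ws = websocket.create_connection(ws_url, cookie=cookies, origin=origin)
    try:
        # Only the SageMaker folder persists on notebook instances.
        ws.send(stdin_message(f"python /home/ec2-user/SageMaker/{script}"))
    finally:
        ws.close()
    return {"submitted": script}
```

The handler only submits the command and returns; it does not wait for the script to finish, which keeps the Lambda execution short.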
A typical day as an MLOps engineer
Imagine yourself working for a maintenance company that offers a service with machine learning models built within AWS Greengrass, running on hundreds of IoT edge devices to do inference on the sensor recordings. These devices then submit the data streams and results of (pro)active decision-making to the backend servers to be stored on buckets and tables. Based on what I have experienced, this is what your day-to-day work might look like.
- You are in a strategic meeting, and the team leads are discussing plans for establishing new workflows based on querying tables and doing certain analyses with machine learning models and eventually sending the reports via email. Teams are understaffed, and many tasks need to be automated.
- The ML team lead wants you to schedule daily workflows for evaluating the newly deployed machine learning models, comparing different versions, and continuously collecting data to detect model drift based on predefined performance metrics.
- The product team seems to be trying to seal a deal with a big client, and they want you to create and send them monthly analytics on the newly released features that the ML team deployed on the devices.
- The QA team manager has identified issues reported by the live processes that run on staging devices for monitoring the machine learning models. It may be necessary to take a closer look at the systems and processes in place to identify the root cause of the problems. One approach to troubleshooting these issues is to review the logs and metrics generated by the machine learning models by building a monitoring workflow that identifies patterns or abnormalities indicative of a problem and runs the debug routines when one occurs.
So far, this is nothing more than building a few workflows and continuous workflows (CX): you need to write a script for each, create cron jobs and Lambdas, and push some metrics to logs for monthly analytics.
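As a sketch of what the core of such a daily evaluation script might contain, the example below compares a performance metric between two model versions and prepares it for publication as a CloudWatch custom metric. The tolerance, the metric name, the `ModelVersion` dimension, and the namespace are illustrative assumptions, and the metric is assumed to be one where higher is better.

```python
from datetime import datetime, timezone


def drift_report(baseline, current, tolerance=0.05):
    """Flag drift when the metric drops by more than `tolerance`.

    `baseline` and `current` are values of the same metric (e.g. accuracy)
    for the reference and the newly deployed model version.
    """
    drop = baseline - current
    return {"drop": drop, "drifted": drop > tolerance}


def metric_datum(name, value, model_version):
    """Build one entry for CloudWatch's put_metric_data call."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "ModelVersion", "Value": model_version}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": "None",
    }


def publish(cloudwatch_client, namespace, data):
    """Send the metrics; the client would be boto3.client("cloudwatch")."""
    cloudwatch_client.put_metric_data(Namespace=namespace, MetricData=data)
```

A monthly analytics report can then be built by querying the same namespace, and an alarm on the published metric can trigger the debug routines.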
MLOps for IoT edge: challenges and points to consider
Last but not least, it is time to share some challenges that I came across while setting up the environment.
- I would suggest that improving your Terraform skills is a good investment, and do not stop there, since many companies have also adopted Terragrunt to keep their IaC code repositories DRY [3, 4]. What stuck in my mind while working with IaC was granting MLOps cross-environment permissions, for example to “Staging” or “Production”: these need to be defined in the target environment and then granted to the MLOps role policy. You can later reach these resources by specifying the profile name in your “Boto3” session. Here is a sketch of doing it correctly within the .hcl config file.
- I came to realize that many companies adopt bash scripting with serverless in Git for the CD pipeline, even though traditional Git CI is not capable of testing machine learning artifacts. I decided to set up an integrated Apache Airflow for managing and executing the CI testing processes and workflows. It is worth mentioning that I used the development twin for training tasks; nevertheless, training can also be automated as a workflow.
- The Lambda function’s retries option should be set to one and its timeout to around one minute; otherwise, the workflow can be invoked more than once.
- Graph dashboards are very handy for visualizing and presenting numbers and statistics, but keeping them up to date turns out to be a continuous duty.
- Two tips to remember: nothing outside the SageMaker folder on the notebook instances will persist, and set alarms for workflow failures; you will thank yourself later.
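The Lambda settings and the failure alarm from the points above can be applied from the control plane with Boto3. The sketch below is one possible way to do it, not the exact configuration used here: the function names `harden_lambda` and `alarm_on_errors`, the alarm name, and the five-minute evaluation period are assumptions, while the retry count and timeout follow the values suggested above.

```python
def harden_lambda(lambda_client, function_name, retries=1, timeout_seconds=60):
    """Limit retries of asynchronous invocations and set the timeout."""
    # Asynchronous invocations (such as those from EventBridge) are retried
    # by Lambda on failure; MaximumRetryAttempts accepts 0, 1, or 2.
    lambda_client.put_function_event_invoke_config(
        FunctionName=function_name,
        MaximumRetryAttempts=retries,
    )
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout_seconds,
    )


def alarm_on_errors(cloudwatch_client, function_name, topic_arn):
    """Notify an SNS topic whenever the function reports at least one error."""
    cloudwatch_client.put_metric_alarm(
        AlarmName=f"{function_name}-errors",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        Statistic="Sum",
        Period=300,  # seconds; an assumed evaluation window
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[topic_arn],
    )
```

Pointing `topic_arn` at the SNS topic that already delivers the workflow reports means failures reach the same subscribers by email.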
My purpose for writing this article was to take you through a different path by looking at MLOps as an architectural design which is the outcome of assessing the Ops requirements of the product or services that your company offers. It is not an ultimate or universal solution for implementing MLOps, but it gives you a few ideas about how to plumb different pieces together if you have a good sense of IT standards and practices to achieve the purpose.
Speaking of adding features to the platform, I can imagine integrating Apache Airflow into the platform and using it for automating some specific tasks. Moreover, Ansible seems to have useful functionalities for deploying processes or workflows for some tasks “on device”. You could launch an EC2 instance and integrate it into the MLOps environment to host both of them. There are ample options out there, and in the end, it is up to you to make the right call depending on your requirements.