This Is Our MLOps Tool Stack: Continuum Industries
Continuum Industries is a company in the infrastructure industry that wants to automate and optimize the design of linear infrastructure assets like water pipelines, overhead transmission lines, subsea power lines, or telecommunication cables.
Its core product, Optioneer, lets customers input their engineering design assumptions and geospatial data, then uses evolutionary optimization algorithms to find possible ways to connect point A to point B within the given constraints.
“Building something like a power line is a huge project, so you have to get the design right before you start. The more reasonable designs you see, the better decision you can make. Optioneer can get you design assets in minutes at a fraction of the cost of traditional design methods.” Andreas Malekos, Chief Scientist at Continuum Industries
But creating and operating the Optioneer engine is more challenging than it seems:
- The objective function does not represent reality;
- There are a lot of assumptions that civil engineers don’t know in advance;
- Different customers feed it completely different problems, and the algorithm needs to be robust enough to handle those.
Instead of trying to build a single perfect solution, it’s better to present customers with a list of interesting design options so that they can make informed decisions.
As an engine team, we leverage a diverse skillset from mechanical engineering, electrical engineering, computational physics, applied mathematics, and software engineering to pull this off. We also use various tools and frameworks.
Our MLOps tool stack
Right now, our objective is to develop and adopt robust QA processes that will ensure that solutions returned to our end users are:
- Good, meaning that each result is one that a civil engineer can look at and agree with.
- Correct, meaning that all the different engineering quantities that are calculated and returned to the end user are as correct as possible.
Continuous monitoring and repeatable testing of our engine’s performance will sit at the centre of these processes.
The “engine” code is written in Python, and we use the usual suspects there: scipy, numpy, pandas, and matplotlib.
We have a set of “test problems” that the algorithm is run against. Each test problem is defined by a configuration and a relatively large file containing the geospatial data required. The code is versioned using git and the geospatial data is versioned using DVC.
When we try to improve the algorithm, we usually start with one of those test problems and run it with whatever modifications we wish to make. At this stage we track experiments with Neptune, which lets us easily look back at what we have tried so far and plan the next steps.
Once we’ve managed to produce better results on that one test problem, we expand our testing to the full set of test problems. Developers can run the full testing pipeline (see image below) and then compare the results to the latest run on the master branch. This tells us whether the change yields a statistically significant improvement across all our test problems. The pipeline runs on GitHub Actions and uses a tool called CML to launch EC2 instances.
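The text does not say which statistical test is used to compare a branch against master. As a minimal sketch, assuming each test problem yields one scalar score per run (lower is better) and that both branches are run on the same problems, a paired sign-flip permutation test is one way to check for a significant difference. The function name, the scores, and the choice of test are all illustrative assumptions.

```python
import random
from statistics import mean


def paired_permutation_test(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-problem scores.

    Returns (mean_difference, p_value), where the difference is
    candidate - baseline. Hypothetical helper, not the authors' method.
    """
    if len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same test problems")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = mean(diffs)
    rng = random.Random(seed)  # fixed seed keeps the result repeatable
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        resampled = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Made-up scores for eight test problems (lower is better).
master_scores = [10.2, 8.7, 15.1, 9.9, 12.4, 11.0, 7.8, 13.3]
branch_scores = [9.6, 8.1, 14.2, 9.5, 11.5, 10.6, 7.1, 12.5]

diff, p = paired_permutation_test(master_scores, branch_scores)
print(f"mean difference: {diff:.3f}, p-value: {p:.4f}")
```

Pairing the scores by test problem matters here: problems differ widely in difficulty, so comparing per-problem differences removes that variation from the comparison.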
If we’re satisfied with the result, we then get to work “productising” whatever code we wrote: we clean it up, write unit tests, etc. For unit tests, we use pytest and Hypothesis. The latter is a property-based testing library for Python that generates test inputs for us, which often uncovers edge cases that break our code so that we can handle them.
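To illustrate the kind of test Hypothesis enables, here is a minimal sketch. The function `route_length` and both properties are hypothetical stand-ins, not part of the Optioneer codebase; only `given` and the `strategies` module come from Hypothesis itself.

```python
import math

from hypothesis import given, strategies as st


def route_length(points):
    """Total length of a polyline given as a list of (x, y) tuples (hypothetical helper)."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


# Bounded, finite coordinates keep the arithmetic free of NaN and overflow.
coordinate = st.floats(min_value=-1e6, max_value=1e6, allow_nan=False, allow_infinity=False)
point = st.tuples(coordinate, coordinate)


@given(st.lists(point, min_size=2, max_size=50))
def test_route_is_at_least_the_straight_line(points):
    # Property: no route between two endpoints is shorter than the straight line.
    straight = math.dist(points[0], points[-1])
    assert route_length(points) >= straight - 1e-6 * (1.0 + straight)


@given(st.lists(point, min_size=2, max_size=50))
def test_reversing_a_route_preserves_its_length(points):
    # Property: walking the route backwards covers the same distance.
    forward = route_length(points)
    backward = route_length(list(reversed(points)))
    assert math.isclose(forward, backward, rel_tol=1e-9, abs_tol=1e-6)
```

Run under pytest, each test is executed against many generated routes, including awkward ones such as repeated points or routes that double back on themselves.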
For the algorithm itself, we use a heavily modified version of the Platypus library for Python. We’ve made extensive modifications that allow for better parallel computing and that better suit our use case in general.
Finally, we use Ray to parallelise computations. We’ve found it to be much quicker than Python’s multiprocessing module, and it has the potential to scale horizontally across multiple machines with little effort (though we haven’t taken advantage of that functionality just yet).
In the cloud, we also use Kubernetes and Argo Workflows.
What we like about our current setup
- The whole testing pipeline is unit tested with every PR, which means that it runs reliably and doesn’t fail on us.
- All the relevant data is recorded in one place: git version, metrics from multiple runs, configuration data, etc.
- Through Neptune’s CUSTOM_RUN_ID functionality, we can easily process and aggregate data at different stages of our CI/CD pipeline, and then update the relevant run.
- Neptune’s UI is fairly sleek and it makes tracking, tagging and managing runs very easy.
- Since everything is versioned (code with git, and geospatial data through DVC), repeatability is much more achievable.
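As a rough sketch of how one run can be shared across CI stages: Neptune can take a custom run identifier from the `NEPTUNE_CUSTOM_RUN_ID` environment variable, so stages that export the same value log to the same run. How Continuum Industries derives its identifier is not stated. Hashing the commit SHA together with a pipeline name, as below, is an assumption, as are the function name and the example values.

```python
import hashlib
import os


def make_custom_run_id(commit_sha: str, pipeline_name: str) -> str:
    """Derive a stable identifier from values every CI stage already knows.

    Hypothetical helper: the same inputs always give the same ID, so separate
    jobs can compute it independently without passing state between them.
    """
    digest = hashlib.sha256(f"{pipeline_name}:{commit_sha}".encode("utf-8"))
    return digest.hexdigest()[:32]  # truncated for readability; the length is an arbitrary choice


# Example values; in GitHub Actions the commit SHA is available as GITHUB_SHA.
commit = os.environ.get("GITHUB_SHA", "0123456789abcdef0123456789abcdef01234567")
run_id = make_custom_run_id(commit, "full-test-pipeline")

# A Neptune run initialised later in this process is expected to read the ID from here.
os.environ["NEPTUNE_CUSTOM_RUN_ID"] = run_id
print(run_id)
```

Deriving the identifier deterministically means a later stage, such as one that aggregates metrics, can reattach to the run created by an earlier stage without any shared storage.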
What we don’t like about our current setup
- Our runners are spot instances and so they get killed by AWS more often than we’d like, which means we need to restart the pipeline.
- Because the full pipeline takes a couple of hours to run, and because of how the rest of our CI/CD is set up, it doesn’t deploy anything. Rather, it acts as a “sanity check” to see whether we broke anything in the previous week or not. Ideally, we’d want to tie approval of results to deployment of a new version of our algorithm.
- The final step of the process is still manual. A human has to go in and run a script that downloads data from two runs and performs a statistical comparison between them to tell us which one is better. Because it’s manual, it’s not really part of the pipeline, so the outcome of that comparison isn’t stored in Neptune, which is a shame.