How Theta Tech AI Tracks 1000s of Training Jobs Running on AWS With Neptune
Theta Tech AI specializes in building advanced AI-driven healthcare systems. Their technology leverages deep learning to process medical images and signals, aiming to revolutionize how hospitals operate by integrating Artificial Intelligence into daily practices.
The challenge
The ML team at Theta Tech uses Amazon Web Services (AWS) GPU servers to run training workloads and stores their datasets on Amazon S3, so their stack is fully integrated into the AWS ecosystem. They develop ML models using the fastai and PyTorch APIs, and they use Optuna to specify and optimize the model hyperparameters.
The team runs several studies that require keeping track of thousands of experiments across large-scale parallel training workflows. Since the training happens on GPU servers in AWS, the natural choice for monitoring the jobs was AWS CloudWatch Logs. However, the team soon ran into significant constraints when trying to analyze the outputs from CloudWatch Logs.
“The limitation of AWS CloudWatch Logs is that we are logging many things. You don’t only get experiment-relevant metrics, but you also get server logs, and other things CloudWatch thinks are useful. It’s very manual. We could have done more productive things rather than going through all the logs.” – says Silas Bempong, Data Scientist at Theta Tech AI
It just wasn’t adequate for managing experiment logs.
Teams leveraging the public cloud for ML training jobs often encounter this challenge. Cloud-native logging tools are usually not purpose-built to help them productively manage the experimentation process.
Monitoring and Debugging Large-Scale Parallel Training Workflows
Transitioning from AWS CloudWatch, which cluttered experiment tracking with non-essential data, Theta Tech AI adopted Neptune to effectively monitor thousands of parallel training jobs from AWS. No more sifting through irrelevant logs.
On the one hand, Neptune’s robust tracker could scale to track all the jobs operating on different compute clusters. On the other hand, it provided relevant dashboards and a highly customizable user interface, allowing for precise tracking of necessary metrics and metadata.
Not only do they see exactly the information they are most interested in, but they can also rapidly identify and resolve training issues. And thanks to Neptune’s real-time monitoring, they can quickly spot underperforming experiments.
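As an illustration of what this looks like in practice, here is a minimal sketch of how a training job on a GPU server might stream metrics to Neptune. The project name, parameter values, and the training helper are placeholders, not Theta Tech AI’s actual setup:

```python
import neptune


def train_one_epoch(epoch: int) -> tuple[float, float]:
    """Stand-in for a real fastai/PyTorch training step (hypothetical helper)."""
    return 1.0 / (epoch + 1), 1.2 / (epoch + 1)


# Connect this training job to a Neptune project
# (workspace and project names are placeholders).
run = neptune.init_run(
    project="my-workspace/medical-imaging",
    tags=["aws-gpu", "parallel-training"],
)

# Log run-level configuration once.
run["parameters"] = {"lr": 1e-3, "batch_size": 64, "arch": "resnet34"}

# Stream metrics as they are produced — only experiment-relevant values,
# with no server noise mixed in.
for epoch in range(10):
    train_loss, val_loss = train_one_epoch(epoch)
    run["train/loss"].append(train_loss)
    run["val/loss"].append(val_loss)

run.stop()
```

Because each parallel job opens its own run, all jobs land in the same project and can be monitored live from a single view.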
Organizing Hyperparameter Search Experiments More Effectively
Since most of the team’s experiments are Optuna-based, they needed a solution focused on experiment tracking that could interact with Optuna to track hyperparameters. Again, CloudWatch failed.
“Let’s say we are running 15 or 30 experiments simultaneously and Optuna samples for hyperparameters with its sampling strategies. The hyperparameters we sample for each experiment are logged on the servers, and they start running those experiments. The problem is that we can only track how the experiments are running through the CloudWatch Logs. Otherwise, we would have to write them in a log file and get them.” – says Abhijit Ramesh, ML Engineer at Theta Tech AI
Neptune’s integration with Optuna transformed their workflow, enabling systematic tracking and comparison of experiments.
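Conceptually, the integration attaches a callback to the Optuna study so that every trial’s sampled hyperparameters and objective value are logged to a Neptune run. The sketch below assumes the `neptune-optuna` integration package is installed; the objective function and project name are toy placeholders:

```python
import neptune
import neptune.integrations.optuna as npt_utils
import optuna


def objective(trial: optuna.Trial) -> float:
    # Toy objective standing in for a real fastai/PyTorch training run.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    return (lr - 1e-3) ** 2 + dropout  # pretend validation loss


# One Neptune run collects the whole study (placeholder project name).
run = neptune.init_run(project="my-workspace/medical-imaging")
neptune_callback = npt_utils.NeptuneCallback(run)

study = optuna.create_study(direction="minimize")
# The callback logs each trial's hyperparameters and score as it finishes,
# so sampled values no longer live only in log files on the servers.
study.optimize(objective, n_trials=30, callbacks=[neptune_callback])

run.stop()
```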
Grouping Runs for Easier Analysis
Neptune’s grouping and filtering features significantly enhance Theta Tech AI’s ability to analyze experiments by organizing them into meaningful cohorts based on validation datasets and model configurations.
This functionality is particularly useful in their five-fold cross-validation setup. It lets the team efficiently evaluate model performance on diverse patient groups, pinpoint specific generalization issues, and see which validation groups perform well or poorly, addressing the problem of assessing model robustness across varied clinical scenarios.
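One way to make such grouping possible, sketched below, is to record the fold index and validation cohort as fields or tags on each run, so the Neptune UI can group and filter on them. The field names and metric values are illustrative, not the team’s actual schema:

```python
import neptune

# Launch one run per cross-validation fold (placeholder project name).
for fold in range(5):
    run = neptune.init_run(
        project="my-workspace/medical-imaging",
        tags=["5-fold-cv", f"fold-{fold}"],  # tags support filtering in the UI
    )

    # Fields used for grouping runs into cohorts; names are illustrative.
    run["cv/fold"] = fold
    run["cv/validation_group"] = f"patient-group-{fold}"

    # ... train on the other four folds, then log the fold's score ...
    run["val/auc"] = 0.5 + 0.1 * fold  # dummy value standing in for a real metric

    run.stop()
```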
Dashboards and Views for Effective Results Communication
Finally, Theta Tech AI’s business strategy depends on sharing and communicating the findings of ML-based research projects with clients. The team could not convey or visualize experiment results using CloudWatch Logs.
To view the experiment logs, they would have needed third-party visualization tools, such as Grafana or Elasticsearch, which would have increased the complexity of their stack.
With Neptune, they can create custom dashboards presenting insights into the hyperparameter search and the performance of each model. These dashboards are now essential for internal reviews and external communication, giving the team a dynamic way to present findings and progress to stakeholders and enhancing transparency and collaboration.
Results
- Simplified monitoring and debugging processes that support large-scale ML operations.
- Enhanced ability to group and analyze runs, facilitating deeper insights and more informed decision-making.
- Ability to reproduce the best results and provide insights into how they were achieved.
- Improved stakeholder engagement with interactive dashboards that clearly communicate findings and progress.
Thanks to Dr. Robert Toth, Abhijit Ramesh, and Silas Bempong for working with us to create this case study!