
Case Study

How Theta Tech AI Tracks 1000s of Training Jobs Running on AWS With Neptune

We tested multiple platforms by running experiments on them. It was clear that Neptune was the right tool for us. It's an excellent choice for users with large-scale training activities.
Dr. Robert Toth
Founder of Theta Tech AI
Before
    Lost a lot of time sifting through irrelevant data in AWS CloudWatch
After
    Have customizable experiment tracking dashboards that provide relevant insights and visualizations

Theta Tech AI specializes in building advanced AI-driven healthcare systems. Their technology leverages deep learning to process medical images and signals, aiming to revolutionize how hospitals operate by integrating artificial intelligence into daily practice.

The challenge

The ML team at Theta Tech uses Amazon Web Services (AWS) GPU servers to run training workloads and stores its datasets on AWS S3, so their stack is tightly integrated into the AWS ecosystem. They develop ML models using the fastai and PyTorch APIs and use Optuna to specify and optimize model hyperparameters.
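
As a rough illustration of this kind of setup, a minimal Optuna study might look like the sketch below. The search space, the values, and the `train_and_validate` helper are hypothetical placeholders, not Theta Tech AI's actual pipeline.

```python
import optuna

def train_and_validate(lr: float, batch_size: int) -> float:
    # Hypothetical stand-in for a real fastai/PyTorch training loop;
    # returns a dummy validation loss so the sketch runs end to end.
    return abs(lr - 1e-3) + 1.0 / batch_size

def objective(trial: optuna.Trial) -> float:
    # Optuna samples hyperparameters for each trial using its sampling strategies.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    return train_and_validate(lr=lr, batch_size=batch_size)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)
```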

The team runs several studies that require them to keep track of thousands of experiments across large-scale parallel training workflows. Since the training workloads run on GPU servers in AWS, the natural choice for monitoring the jobs was AWS CloudWatch Logs. However, they soon ran into significant constraints when trying to analyze the output of CloudWatch Logs.

“The limitation of AWS CloudWatch Logs is that we are logging many things. You don’t only get experiment-relevant metrics, but you also get server logs, and other things CloudWatch thinks are useful. It’s very manual. We could have done more productive things rather than going through all the logs.” – says Silas Bempong, Data Scientist at Theta Tech AI

It simply wasn’t adequate for managing experiment logs.

Teams leveraging the public cloud for ML training jobs often encounter this challenge. Cloud-native logging tools are usually not purpose-built to help them productively manage the experimentation process. 

Monitoring and Debugging Large-Scale Parallel Training Workflows

Transitioning from AWS CloudWatch, which cluttered experiment tracking with non-essential data, Theta Tech AI adopted Neptune to effectively monitor thousands of parallel training jobs running on AWS. No more sifting through irrelevant logs.

On the one hand, Neptune’s robust tracker could scale to track all the jobs operating on different compute clusters. On the other hand, it provided relevant dashboards and a highly customizable user interface, allowing for precise tracking of necessary metrics and metadata.

Not only can the team see exactly the information they are most interested in, but they can also rapidly identify and resolve training issues. And thanks to Neptune’s real-time monitoring, they can quickly spot underperforming experiments.
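
To give a sense of how this works in practice, here is a minimal sketch of a single training job reporting to Neptune. The project name, the tags, and the `train_one_epoch` helper are hypothetical; the Neptune client calls (`neptune.init_run`, field assignment, `.append()`) are what stream metrics to the UI while the job runs.

```python
import random

import neptune

def train_one_epoch() -> float:
    # Hypothetical stand-in for a real training epoch; returns a dummy loss.
    return random.random()

# Each parallel job opens its own run; tags let the team filter
# thousands of jobs by study or compute cluster in the Neptune UI.
run = neptune.init_run(
    project="theta-tech/medical-imaging",  # hypothetical project name
    tags=["study-42", "gpu-cluster-a"],    # hypothetical tags
)

run["params"] = {"lr": 1e-3, "batch_size": 32}  # logged once per run

for epoch in range(10):
    loss = train_one_epoch()
    run["train/loss"].append(loss)  # streamed to Neptune as the job runs

run.stop()
```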

Neptune is an excellent choice for users with large-scale training activities, as the Neptune workflow is already set up to handle the most common model training scenarios. It’s a great tool to organize deep learning experiments, whereas most people have home-grown dashboards or, worse, no dashboards.
Dr. Robert Toth Founder of Theta Tech AI
Example table with multiple experiments recorded in Neptune | Courtesy of Theta Tech AI

Organizing Hyperparameter Search Experiments More Effectively

Since most of the team’s experiments are Optuna-based, they needed an experiment tracking solution that could integrate with Optuna to track hyperparameters. Once again, CloudWatch fell short.

“Let’s say we are running 15 or 30 experiments simultaneously and Optuna samples for hyperparameters with its sampling strategies. The hyperparameters we sample for each experiment are logged on the servers, and they start running those experiments. The problem is that we can only track how the experiments are running through the CloudWatch Logs. Otherwise, we would have to write them in a log file and get them.” – says Abhijit Ramesh, ML Engineer at Theta Tech AI

Neptune’s integration with Optuna transformed their workflow, enabling systematic tracking and comparison of experiments.
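
In code, the integration reduces to attaching a callback to the study. This is a hedged sketch based on the publicly documented `neptune-optuna` package; the project name and the dummy objective are placeholders, not the team's actual code.

```python
import neptune
import neptune.integrations.optuna as npt_utils  # from the neptune-optuna package
import optuna

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    return abs(lr - 1e-3)  # dummy objective for illustration

run = neptune.init_run(project="theta-tech/medical-imaging")  # hypothetical project
neptune_callback = npt_utils.NeptuneCallback(run)

study = optuna.create_study(direction="minimize")
# Each trial's sampled hyperparameters and objective value are logged
# to the run automatically; no manual log files on the servers.
study.optimize(objective, n_trials=30, callbacks=[neptune_callback])

run.stop()
```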

Neptune and Optuna go hand in hand. This integration allowed for an efficient organization of hyperparameter search experiments, making our model development process both faster and more precise.
Dr. Robert Toth Founder of Theta Tech AI

Grouping Runs for Easier Analysis

Neptune’s grouping and filtering features significantly enhance Theta Tech AI’s ability to analyze experiments by organizing them into meaningful cohorts based on validation datasets and model configurations.

This functionality is particularly useful in their five-fold cross-validation workflow, where it allows the team to efficiently evaluate model performance on diverse patient groups and pinpoint specific generalization issues. By identifying which validation groups perform well or poorly, the team can assess model robustness across varied clinical scenarios.
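
A hedged sketch of the underlying pattern: each fold’s run logs its validation group as a plain field, which Neptune’s table view can then group and filter on. The fold loop, project name, and `evaluate_fold` helper are illustrative assumptions.

```python
import random

import neptune

def evaluate_fold(fold: int) -> float:
    # Hypothetical stand-in for evaluating a model on one patient group.
    return random.random()

for fold in range(5):  # five-fold cross-validation
    run = neptune.init_run(project="theta-tech/medical-imaging")  # hypothetical
    run["validation/fold"] = f"fold-{fold}"        # group-by key in the UI
    run["validation/score"] = evaluate_fold(fold)  # per-group performance
    run.stop()
```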

Grouping by validation set is super important to us, and many other people would benefit from using the grouping features with validation.
Dr. Robert Toth Founder of Theta Tech AI

Dashboards and Views for Effective Results Communication

Finally, Theta Tech AI’s business strategy depends on sharing and communicating with clients the findings of research projects based on machine learning experiments. The team could not convey or visualize the experiment results using CloudWatch Logs.

To visualize the experiment logs, they would have needed to employ third-party tools such as Grafana or Elasticsearch, which would have increased the complexity of their stack.

With Neptune, they can create custom dashboards that present insights into the hyperparameter search along with all the information about each model’s performance. These dashboards are now essential for internal reviews and external communication: they give stakeholders a dynamic view of findings and progress, enhancing transparency and collaboration.

Neptune provided a solution for Optuna’s single-user dashboard view in the form of an Optuna integration, which made it possible to deliver dashboards in a shared and collaborative manner that did not take minutes to render.
Dr. Robert Toth Founder of Theta Tech AI
Custom dashboard created by Theta Tech AI to present hyperparameter search results | Courtesy of Theta Tech AI

Results

  • Simplified monitoring and debugging processes that support large-scale ML operations.
  • Enhanced ability to group and analyze runs, facilitating deeper insights and more informed decision-making.
  • Ability to reproduce the best results and provide insights into how they were achieved.
  • Improved stakeholder engagement with interactive dashboards that clearly communicate findings and progress.

Thanks to Dr. Robert Toth, Abhijit Ramesh, and Silas Bempong for working with us to create this case study!

We had a time when results started to dip, and we used Neptune to replicate our former best results, and this would not have been possible before using Neptune. It’s also an excellent audit log to recall what we changed with each new AI study.
Dr. Robert Toth Founder of Theta Tech AI

Looking for an experiment tracker that will easily integrate with your existing stack?