Theta Tech AI builds customized artificial intelligence algorithms and front-end user interfaces for large-scale healthcare AI systems. Its main objective is to build hospitals in the cloud powered by AI.
Its products are image- and signal-processing tools that detect anomalies indicating health risks.
About the team
The team comprises seven engineers building AI systems for healthcare businesses. The group focuses on developing generalizable medical AI systems representative of the real world. These systems are deployed in hospitals to help healthcare providers increase clinical effectiveness and efficiency.
The team works with medical and biological datasets ranging from 1D ECG signals to 2D X-rays and 3D magnetic resonance imaging (MRI) scans. They offer various analytical services, from data preprocessing, pattern recognition, and classification to model testing and validation.
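To make the preprocessing step concrete, here is a minimal, hypothetical sketch of one common operation on a 1D signal: z-score normalization. The function name and the toy trace values are illustrative assumptions, not the team's actual pipeline, which would also involve steps like resampling and filtering.

```python
from statistics import mean, stdev

def zscore_normalize(signal):
    """Standardize a 1D signal to zero mean and unit variance.

    A common preprocessing step before feeding a signal such as an
    ECG trace to a model (hypothetical helper, not the team's code).
    """
    mu = mean(signal)
    sigma = stdev(signal)
    return [(x - mu) / sigma for x in signal]

# Toy ECG-like trace (made-up values, not real patient data).
trace = [0.1, 0.3, 1.8, 0.2, -0.4, 0.0, 0.1]
normalized = zscore_normalize(trace)
```

The same idea extends to 2D and 3D data, where normalization is typically applied per channel or per volume.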
The team runs training workloads on Amazon Web Services (AWS) GPU servers and stores datasets on AWS S3, having integrated their stack into the AWS ecosystem. They develop machine learning models using the fastai and PyTorch APIs, use Optuna to specify and optimize model hyperparameters, and use Neptune to track their experiments.
The production workload runs on Microsoft Azure. They download the weights of their saved models (usually neural networks) from the Neptune dashboard and then push them to Azure for production.
We prefaced this case study with what Theta Tech AI’s engineering team focuses on: “developing generalizable medical AI systems representative of the real world.” There are several roadblocks to developing generalizable machine learning (ML) models; training low-quality models is chief among them, alongside using poor-quality data.
The team would run several studies that required them to keep track of thousands of experiments for large-scale parallel training workflows. Since the training workload happened on GPU servers in AWS, the natural choice for monitoring the jobs was AWS CloudWatch Logs. The team began to see some significant constraints when they tried to analyze the outputs from CloudWatch Logs.
Realizing that cloud logging services are insufficient for managing experiments at scale
The team found that AWS CloudWatch Logs was inadequate for managing experiment logs. Teams leveraging the public cloud for ML training jobs often encounter this challenge. Cloud-native logging tools are usually not purpose-built to help them productively manage the experimentation process.
The Theta Tech AI team was unable to perform a few tasks that were essential to their workflow:
- Get experiment-relevant metrics from AWS CloudWatch Logs
- Productively debug problems with training jobs and experiments
- Integrate with Optuna for hyperparameter optimization
- Communicate the results of ML models to clients
Since most of the team’s experiments are Optuna-based, they needed a solution focused on experiment tracking that could interact with Optuna to track hyperparameters and offer collaborative features.
Neptune ended up being the ideal choice. Beyond experiment tracking, it offers grouping and filtering functionality that more expensive competitors lacked, along with native integration with Optuna.
Criteria Theta Tech AI considered for an ideal solution
The team outlined four criteria that an ideal experiment tracking solution should meet:
- 1 Integration with open-source tools that are proven to work well and maintained by a community of developers
- 2 Real-time monitoring support
- 3 Easy-to-interpret visualizations
- 4 Easy to develop with
Settling on an ideal solution
Neptune met the criteria the team outlined and provided the following solutions:
- It helps track thousands of training jobs running on AWS at scale
- It offers seamless Neptune-Optuna integration
- It features an interactive real-time dashboard for Optuna
- It provides a grouping and filtering feature valuable for organizing experiments
- It is easy to set up and integrate with the existing stack without provisioning separate infrastructure
Neptune helps Theta Tech AI track thousands of training jobs running on AWS at scale
After using Neptune, the team could finally track and view only the metrics and files necessary for their research projects. They discovered that Neptune could scale to track all the jobs operating on different compute clusters when they ran thousands of training runs at scale on AWS.
Neptune provides relevant dashboards and an interactive user interface for monitoring training jobs and hardware utilization, and it lets the team share those dashboards as reports with colleagues and relevant clients.
Neptune removed the CloudWatch Logs barrier of needing external visualization tools like Grafana and provided secure, collaborative options instead. Additionally, it enforced experiment lineage, making it simple for the team to review earlier experiments, troubleshoot them, and reproduce their results.
Seamless Neptune-Optuna integration to make hyperparameter optimization simple
The team wanted a solution that could effortlessly integrate with Optuna. They sampled hyperparameters using Optuna and needed a simple way to display the results of different hyperparameter groups.
They would train several models in parallel on AWS using the hyperparameters, and they would need to be able to:
- Discover the model’s performance throughout training, testing, and validation
- Learn how well the hyperparameters worked
The Neptune-Optuna integrated dashboard gives them insights into how well the hyperparameters performed and offers all the information about the performance of each model.
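The underlying pattern is to sample a group of hyperparameters per trial, score the resulting model, and keep every trial's parameters and score together so the groups can be compared later. The team does this with Optuna; the sketch below illustrates the same pattern in plain Python with a made-up objective function (the names, search ranges, and the "sweet spot" are all hypothetical).

```python
import random

def run_trial(params):
    """Stand-in for a training run; returns a validation loss.

    Hypothetical objective: a real trial would train a fastai/PyTorch
    model with these hyperparameters and report its validation metric.
    """
    # Pretend the sweet spot is lr ~= 0.01 with dropout ~= 0.2.
    return abs(params["lr"] - 0.01) * 100 + abs(params["dropout"] - 0.2)

random.seed(0)
trials = []
for _ in range(30):
    params = {
        "lr": 10 ** random.uniform(-4, -1),   # log-uniform learning rate
        "dropout": random.uniform(0.0, 0.5),
    }
    # Each record ties a hyperparameter group to its score, which is
    # what makes per-group comparison on a dashboard possible.
    trials.append({"params": params, "loss": run_trial(params)})

best = min(trials, key=lambda t: t["loss"])
```

Optuna automates the sampling strategy (e.g., TPE instead of pure random search), and the tracking layer's job is to record each trial's parameter group alongside its result, exactly as the `trials` list does here.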
Leveraging Neptune’s real-time dashboard over Optuna’s web dashboard
One of the team’s main issues was that Optuna’s default dashboard is designed for a single person viewing small experiments. This wasn’t sufficient for their requirements because they needed to share insights with their internal development team and with clients.
Thanks to the Neptune-Optuna integration, they could watch the Optuna optimization process for their experiments in real time. They could also share dashboards with coworkers and clients participating in the project, which facilitated communication and cooperation because everyone could see the same dashboard at once.
Neptune provides Theta Tech AI experiment grouping and filtering features
Neptune improved how the team utilized compute resources for their training and data processing jobs. For example, running large data processing jobs on distributed clusters is one of the most compute-intensive tasks for the team. Neptune provided them with better insights into how their image data augmentation programs utilized resources to maximize their GPU usage.
By grouping and filtering the results by validation datasets, Neptune helps the team organize their experiments productively. The grouping feature is essential because it allows them to train models on some patients and test and validate them on others after splitting up the dataset.
A system randomly selects which patients go into the validation set for a given experiment, and the team then groups experiments by validation set. Because they use five-fold cross-validation, they have five groups and can very quickly see how the models did on each validation group of patients.
They use Neptune to filter the validation groups and spot which ones do well and which do poorly. When they group the experiments by validation set, they can analyze which patients the studies are not generalizing to.
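The splitting-and-grouping workflow described above can be sketched in a few lines: assign each patient to exactly one of five folds (so no patient leaks between training and validation), then group per-experiment validation scores by fold to see which patient groups the models fail to generalize to. The patient IDs and scores below are hypothetical, and the helper is an illustration, not the team's actual code.

```python
import random
from collections import defaultdict

def assign_folds(patient_ids, n_folds=5, seed=42):
    """Randomly assign each patient to exactly one validation fold,
    so a patient never appears in both training and validation data."""
    ids = list(patient_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    return {pid: i % n_folds for i, pid in enumerate(ids)}

patients = [f"patient_{i:03d}" for i in range(50)]  # hypothetical IDs
fold_of = assign_folds(patients)

# Hypothetical per-patient validation scores from one experiment.
rng = random.Random(0)
scores = {pid: rng.uniform(0.7, 0.95) for pid in patients}

# Group scores by validation fold; a fold with a low mean flags a
# patient group the model is not generalizing to.
by_fold = defaultdict(list)
for pid, score in scores.items():
    by_fold[fold_of[pid]].append(score)
fold_means = {fold: sum(v) / len(v) for fold, v in by_fold.items()}
```

In the team's setup, this grouping happens in the tracking dashboard rather than in ad hoc scripts, but the logic is the same: one fold per patient, one group per fold.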
Neptune is easy to set up and integrate with the existing stack without provisioning a separate infrastructure
Neptune provides a wide range of integrations with open-source tools used across the industry. The Theta Tech AI team found this helpful since it made setup simple for tools like fastai and Optuna: most of the callbacks for these tools were already provided by Neptune, with no additional configuration or code needed to log and track experiment-relevant metrics.
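Those pre-built integrations rely on the callback hooks that libraries like fastai and Optuna expose during training. The plain-Python sketch below shows that pattern with hypothetical names: a training loop fires a hook each epoch, and a logger object records metrics without the loop knowing anything about the tracking backend. This illustrates why a ready-made callback needs no extra configuration.

```python
class MetricLogger:
    """Minimal callback that records metrics each epoch -- the role a
    pre-built tracking callback plays inside fastai or Optuna
    (hypothetical sketch, not Neptune's actual implementation)."""
    def __init__(self):
        self.history = []

    def on_epoch_end(self, epoch, metrics):
        self.history.append({"epoch": epoch, **metrics})

def train(n_epochs, callbacks=()):
    """Toy training loop with a per-epoch callback hook."""
    for epoch in range(n_epochs):
        loss = 1.0 / (epoch + 1)  # made-up loss that decreases
        for cb in callbacks:
            cb.on_epoch_end(epoch, {"loss": loss})

logger = MetricLogger()
train(3, callbacks=[logger])
```

Because the hook interface is fixed by the training library, swapping in a real tracking callback is a one-line change to the `callbacks` list.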
The team also found Neptune easy to set up and get started with compared to tools like MLflow, which would require additional configuration and provisioning on the team’s servers.
Adding Neptune to the Theta Tech AI team’s tool stack proved valuable: it scaled to thousands of training jobs on AWS, integrated natively with Optuna, and made it easy to organize, share, and reproduce experiments.
Thanks to Dr. Robert Toth, Abhijit Ramesh, and Silas Bempong for working with us to create this case study!