Theta Tech AI
Theta Tech AI builds customized artificial intelligence algorithms and front-end user interfaces for large-scale healthcare AI systems. Its main objective is to build hospitals in the cloud powered by AI.
Its products are image and signal-processing tools that detect anomalies indicating health risks.
About the team
The team comprises seven engineers building AI systems for healthcare businesses. The group focuses on developing generalizable medical AI systems representative of the real world. These systems are deployed in hospitals to help healthcare providers increase clinical effectiveness and efficiency.
The team works with 1D ECG signals, 2D X-rays, or 3D magnetic resonance imaging (MRI) medical and biological datasets. They offer various analytical services, from data preprocessing, pattern recognition, and classification to model testing and validation.
- Team size: 7 engineers
- Technology stack:
  - Amazon Web Services (GPU servers for training)
  - Microsoft Azure (production servers)
Neptune use case: experiment tracking for clinical research
The team uses Amazon Web Services (AWS) GPU servers to run training workloads. They store the datasets on AWS S3, since they’ve integrated their stack into the AWS ecosystem. They develop machine learning models using the fastai and PyTorch APIs. They use Optuna to specify and optimize the model hyperparameters and Neptune to track the experiments.
The production workload runs on Microsoft Azure. They download the weights of their saved models (neural networks, most of the time) from the Neptune dashboard for production and then push them to Azure.
We prefaced this case study with what Theta Tech AI’s engineering team focuses on: “developing generalizable medical AI systems representative of the real world.” There are several roadblocks to developing generalizable machine learning (ML) models, and training low-quality models is right up there (alongside using poor-quality data).
The team would run several studies that required them to keep track of thousands of experiments for large-scale parallel training workflows. Since the training workload happened on GPU servers in AWS, the natural choice for monitoring the jobs was AWS CloudWatch Logs. The team began to see some significant constraints when they tried to analyze the outputs from CloudWatch Logs.
Realizing the insufficiencies of Cloud Logging Services to manage experiments at scale
The team found that AWS CloudWatch Logs was inadequate for managing experiment logs. Teams leveraging the public cloud for ML training jobs often encounter this challenge. Cloud-native logging tools are usually not purpose-built to help them productively manage the experimentation process.
The Theta Tech AI team was unable to perform a few tasks that were essential to their workflow:
- Get experiment-relevant metrics from AWS CloudWatch Logs
- Productively debug problems with training jobs and experiments
- Integrate with Optuna for hyperparameter optimization
- Communicate the results of ML models to clients
“The limitation of AWS CloudWatch Logs is that we are logging many things. You don’t only get experiment-relevant metrics, but you also get server logs, and other things CloudWatch thinks are useful. It’s like a manual process where I go through every component.” – Silas Bempong, Data Scientist at Theta Tech AI
The team would use CloudWatch Logs to monitor and troubleshoot their entire stack running on AWS. The problem became apparent when they realized it would take additional time and effort to sort and filter experiment-related metrics manually.
“There was a situation where one of the experiments failed due to dependency conflict, and we had to dig through all the logs to get that, which helped us debug. However, in terms of efficiency, we could have done more productive things rather than go through the logs.” – Silas Bempong, Data Scientist at Theta Tech AI
The team struggled to effectively troubleshoot model training problems since it was challenging to sift and filter experiment-related metrics from CloudWatch Logs. Additionally, they could not keep track of the model training information in real-time and quickly identify underperforming experiments.
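The manual filtering described above can be pictured with a small sketch. The log format here, including the `METRIC` prefix, the field names, and the sample lines, is a hypothetical stand-in and not the team's actual CloudWatch output. It only illustrates that experiment metrics have to be picked out of a stream that also carries server and system messages.

```python
import re
from collections import defaultdict

# Hypothetical mixed log stream: server messages interleaved with metric lines.
# The "METRIC" prefix and key=value layout are assumptions for illustration.
LOG_LINES = [
    "INFO   server: instance health check passed",
    "METRIC run=exp-01 epoch=1 val_loss=0.912",
    "WARN   pip: dependency conflict detected",
    "METRIC run=exp-02 epoch=1 val_loss=0.874",
    "INFO   server: disk usage at 41%",
    "METRIC run=exp-01 epoch=2 val_loss=0.655",
]

METRIC_PATTERN = re.compile(
    r"METRIC run=(?P<run>\S+) epoch=(?P<epoch>\d+) val_loss=(?P<val_loss>[\d.]+)"
)


def extract_metrics(lines):
    """Keep only metric lines and group (epoch, val_loss) pairs by run."""
    metrics = defaultdict(list)
    for line in lines:
        match = METRIC_PATTERN.search(line)
        if match is None:
            continue  # server log or other non-experiment noise
        metrics[match["run"]].append(
            (int(match["epoch"]), float(match["val_loss"]))
        )
    return dict(metrics)


metrics_by_run = extract_metrics(LOG_LINES)
for run, points in sorted(metrics_by_run.items()):
    print(run, points)
```

Every new metric or log format would need another pattern like this one, which is the kind of upkeep a purpose-built experiment tracker avoids.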
“Let’s say we are running 15 or 30 experiments simultaneously and Optuna samples for hyperparameters with its sampling strategies. The hyperparameters we sample for each experiment are logged on the servers, and they start running those experiments.
The problem is that we can only track how the experiments are running through the CloudWatch Logs. Otherwise, we would have to write them in a log file and get them.” – Abhijit Ramesh, ML Engineer at Theta Tech AI
Hyperparameter optimization is crucial to the team’s experimentation workflow and makes it efficient. Without other tools or custom scripts to analyze the logs, they found it challenging to sort through the Optuna-based experiments and interpret the results in CloudWatch Logs.
“In CloudWatch, we don’t have visuals and graphics, which is a big thing for me.” – Dr. Robert Toth, Founder of Theta Tech AI
Theta Tech AI’s business strategy depends on sharing and communicating with clients the findings of research projects based on machine learning experiments. The team could not convey or visualize the experiment results using CloudWatch logs.
They would need to employ third-party visualization tools, such as Grafana or Elasticsearch, to view the experiment logs, increasing their stack’s complexity.
Since most of the team’s experiments are Optuna-based, they needed a solution focused on experiment tracking that could interact with Optuna to track hyperparameters and offer collaborative features.
Neptune ended up being the ideal choice for them to achieve their objective. Beyond experiment tracking, Neptune offers experiment grouping and filtering functionality that more expensive competitors lacked, in addition to native integration with Optuna.
Criteria Theta Tech AI considered for an ideal solution
The team outlined four criteria that an ideal experiment tracking solution should meet:
- Integration with open-source tools proven to work well and maintained by a community of developers
- Real-time support
- Easy-to-interpret visualizations
- Easy to develop
Settling for an ideal solution
Neptune met the criteria the team outlined and provided the following solutions:
- It helps track thousands of training jobs running on AWS at scale
- It offers seamless Neptune-Optuna integration
- It features an interactive real-time dashboard for Optuna
- It provides a grouping and filtering feature valuable for organizing experiments
- Setting up and integrating with the existing stack is easy without provisioning a separate infrastructure
Neptune helps Theta Tech AI track thousands of training jobs running on AWS at scale
After using Neptune, the team could finally track and view only the metrics and files necessary for their research projects. They discovered that Neptune could scale to track all the jobs operating on different compute clusters when they ran thousands of training runs at scale on AWS.
Neptune provides relevant dashboards and an interactive user interface to monitor their training jobs and hardware utilization and share the dashboards as reports amongst colleagues and relevant clients.
Neptune removed the need to pair CloudWatch Logs with external visualization tools like Grafana, and it provided secure and collaborative options. Additionally, it maintained experiment lineage, making it simple for the team to review earlier experiments, troubleshoot them, and reproduce their results.
Seamless Neptune-Optuna integration to make hyperparameter optimization simple
The team wanted a solution that could effortlessly integrate with Optuna. They sampled hyperparameters using Optuna and needed a simple way to display the results of different hyperparameter groups.
They would train several models in parallel on AWS using the hyperparameters, and they would need to be able to:
- Discover the model’s performance throughout training, testing, and validation
- Learn how well the hyperparameters worked
The Neptune-Optuna integrated dashboard gives them insights into how well the hyperparameters performed and offers all the information about the performance of each model.
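As a rough illustration of what "how well the hyperparameters performed" means in data terms, the sketch below records one entry per trial and ranks the trials by validation loss. It is not the Neptune-Optuna integration itself: plain random search from the standard library stands in for Optuna's samplers, a toy formula stands in for a real training job, and the names `sample_hyperparameters`, `toy_validation_loss`, and `run_trials` are hypothetical.

```python
import math
import random


def sample_hyperparameters(rng):
    """Draw one hyperparameter set (random search, standing in for Optuna)."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform sample
        "batch_size": rng.choice([16, 32, 64]),
    }


def toy_validation_loss(params):
    """Toy objective replacing real training: lowest near learning_rate=1e-2."""
    distance = math.log10(params["learning_rate"]) + 2
    return distance ** 2 + 0.001 * params["batch_size"]


def run_trials(n_trials, seed=0):
    """Run n_trials toy experiments and keep a record for each one."""
    rng = random.Random(seed)
    trials = []
    for trial_id in range(n_trials):
        params = sample_hyperparameters(rng)
        trials.append(
            {
                "trial": trial_id,
                "params": params,
                "val_loss": toy_validation_loss(params),
            }
        )
    return trials


trials = run_trials(n_trials=15)
ranked = sorted(trials, key=lambda record: record["val_loss"])
best = ranked[0]

# Show the three best trials with the hyperparameters that produced them.
for record in ranked[:3]:
    print(record["trial"], record["params"], round(record["val_loss"], 4))
```

A tracker with an Optuna integration keeps this kind of per-trial record automatically and plots it, so the ranking does not have to be rebuilt from logs.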
Leveraging Neptune’s real-time dashboard over Optuna’s web dashboard
One of the team’s main issues was that Optuna was designed to have a single person view the default dashboard while running small experiments. This wasn’t sufficient for their requirements because they needed to share insights with their internal development team and clients.
They could see the Optuna optimization process for experiments in real-time, thanks to the Neptune-Optuna integration. They could share dashboards with coworkers and clients participating in the project, facilitating communication and cooperation because everyone could see the dashboard at once.
Neptune provides Theta Tech AI experiment grouping and filtering features
Neptune improved how the team utilized compute resources for their training and data processing jobs. For example, running large data processing jobs on distributed clusters is one of the most compute-intensive tasks for the team. Neptune provided them with better insights into how their image data augmentation programs utilized resources to maximize their GPU usage.
By combining and filtering the results by validation datasets, Neptune helps the team organize the experiments productively. The grouping feature is essential because it allows them to train models on some patients and test and validate them on others after splitting up the dataset.
A system randomly selects which patients will be in the validation set for a given experiment, and the experiments are then grouped by validation set. Because the team uses five-fold cross-validation, this yields five groups, so they can quickly see how the models did on each validation group of patients.
They use Neptune to filter the validation groups and see which ones do well and which do poorly. When they group the experiments, the groups correspond to the validation sets, so they can analyze which patients the models are not generalizing to.
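The patient-level split and the grouping by validation set can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's pipeline: the patient IDs, experiment names, and accuracy scores are made-up placeholders, and `assign_patient_folds` and `group_by_validation_fold` are hypothetical helper names.

```python
import random
from collections import defaultdict
from statistics import mean


def assign_patient_folds(patient_ids, n_folds=5, seed=0):
    """Randomly assign each patient to exactly one validation fold."""
    shuffled = list(patient_ids)
    random.Random(seed).shuffle(shuffled)
    return {patient: index % n_folds for index, patient in enumerate(shuffled)}


def group_by_validation_fold(experiments):
    """Average the validation score of all experiments sharing a held-out fold."""
    groups = defaultdict(list)
    for experiment in experiments:
        groups[experiment["validation_fold"]].append(experiment["val_accuracy"])
    return {fold: mean(scores) for fold, scores in sorted(groups.items())}


patients = [f"patient-{i:02d}" for i in range(20)]
folds = assign_patient_folds(patients)

# Placeholder experiment records: two runs per held-out fold.
experiments = [
    {"name": "exp-a", "validation_fold": 0, "val_accuracy": 0.90},
    {"name": "exp-b", "validation_fold": 0, "val_accuracy": 0.92},
    {"name": "exp-c", "validation_fold": 1, "val_accuracy": 0.88},
    {"name": "exp-d", "validation_fold": 1, "val_accuracy": 0.90},
    {"name": "exp-e", "validation_fold": 2, "val_accuracy": 0.91},
    {"name": "exp-f", "validation_fold": 2, "val_accuracy": 0.93},
    {"name": "exp-g", "validation_fold": 3, "val_accuracy": 0.70},
    {"name": "exp-h", "validation_fold": 3, "val_accuracy": 0.74},
    {"name": "exp-i", "validation_fold": 4, "val_accuracy": 0.89},
    {"name": "exp-j", "validation_fold": 4, "val_accuracy": 0.91},
]

mean_by_fold = group_by_validation_fold(experiments)
weakest_fold = min(mean_by_fold, key=mean_by_fold.get)

# Patients in the weakest fold are the ones the models generalize to least.
hard_patients = sorted(p for p, fold in folds.items() if fold == weakest_fold)
print(mean_by_fold)
print(weakest_fold, hard_patients)
```

Splitting by patient rather than by individual sample keeps all data from one patient on the same side of the train/validation boundary, which is what makes the per-group comparison meaningful.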
Neptune is easy to set up and integrate with the existing stack without provisioning a separate infrastructure
Neptune provides a wide range of integrations and support for open-source tools used in the industry. The Theta Tech AI team found it helpful since it makes it simple to set up and run using tools like fastai and Optuna. Most of the callbacks for these tools were already set up in Neptune with no additional configurations or code to log and track experiment-relevant metrics.
The team also found Neptune easy to set up and get started with compared to tools like MLflow, which would require additional configuration and provisioning on the team’s server.
Adding Neptune to the Theta Tech AI team’s tool stack proved valuable because:
Neptune accelerated the team’s model development workflow and made it more efficient by allowing them to rapidly return to past experiments and see how hyperparameters affected their results over several months.
Through Optuna’s sampling techniques, they could review previous experiments to see which ones were successful and what data versions and hyperparameter combinations produced them.
“Integrating Neptune with Optuna made it easy to get insights into the model performance and how well a particular hyperparameter from Optuna performed. We also looked at the graphs and determined if the loss function graph meant we were running with the correct number of epochs or if we needed more or fewer epochs.” – Dr. Robert Toth, Founder of Theta Tech AI
The team could share results and review experiments among themselves and external stakeholders like relevant clients involved in the research study. The Neptune-Optuna integration plug-in improved the collaboration process.
“Before using the Optuna integration from Neptune, we occasionally spun up the Optuna in-built dashboard, and since it could not handle the load, it kept crashing. Since we started using the Neptune plug-in, this has not been an issue, and the experiment review has become much shorter and seamless.” – Dr. Robert Toth, Founder of Theta Tech AI
Thanks to Dr. Robert Toth, Abhijit Ramesh, and Silas Bempong for working with us to create this case study!