Case Study

Theta Tech AI

Neptune and Optuna go hand in hand. You should start using Neptune as early as possible to save the trouble of having to go through multiple log statements to make sense of how your model did.
Dr. Robert Toth, Founder of Theta Tech AI
Before
    Lost a lot of time checking irrelevant data in AWS CloudWatch Logs
After
    Has a dedicated experiment tracking solution that provides relevant insights and visualizations

Theta Tech AI builds customized artificial intelligence algorithms and front-end user interfaces for large-scale healthcare AI systems. Its main objective is to build AI-powered hospitals in the cloud.

Their products are image and signal-processing tools that detect anomalies indicating health risks.

About the team

The team comprises seven engineers building AI systems for healthcare businesses. The group focuses on developing generalizable medical AI systems representative of the real world. These systems are deployed in hospitals to help healthcare providers increase clinical effectiveness and efficiency.

The team works with medical and biological datasets such as 1D ECG signals, 2D X-rays, and 3D magnetic resonance imaging (MRI) scans. They offer various analytical services, from data preprocessing, pattern recognition, and classification to model testing and validation.

Workflow

The team runs its training workloads on Amazon Web Services (AWS) GPU servers and stores its datasets on AWS S3; the stack is fully integrated into the AWS ecosystem. They develop machine learning models using the fastai and PyTorch APIs, use Optuna to specify and optimize model hyperparameters, and track their experiments with Neptune.

The production workload runs on Microsoft Azure. The team downloads the weights of their saved models (neural networks, most of the time) from the Neptune dashboard and then pushes them to Azure for production.
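
As an illustration of this hand-off, here is a minimal sketch of pulling saved weights from a Neptune run before pushing them to Azure. The project name, run ID, and field path are hypothetical, not taken from the case study:

    import neptune

    # Re-open a finished training run in read-only mode.
    # "theta-tech/medical-ai", "MED-1234", and "model/weights.pt" are
    # hypothetical names used only for illustration.
    run = neptune.init_run(
        project="theta-tech/medical-ai",
        with_id="MED-1234",
        mode="read-only",
    )

    # Download the stored weights locally, ready to be pushed to Azure.
    run["model/weights.pt"].download(destination="artifacts/")
    run.stop()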

Problem

We prefaced this case study with what Theta Tech AI’s engineering team focuses on: “developing generalizable medical AI systems representative of the real world.” There are several roadblocks to developing generalizable machine learning (ML) models, and training low-quality models is right up there (well, alongside using poor-quality data).

The team would run several studies that required them to keep track of thousands of experiments for large-scale parallel training workflows. Since the training workload happened on GPU servers in AWS, the natural choice for monitoring the jobs was AWS CloudWatch Logs. The team began to see some significant constraints when they tried to analyze the outputs from CloudWatch Logs.

Realizing the insufficiency of cloud logging services for managing experiments at scale

The problem was that experiments were running on training servers, and the only way we could track how they were doing was through the CloudWatch Logs. Otherwise, we would have to write them in a log file and retrieve them.
Abhijit Ramesh, ML Engineer at Theta Tech AI

The team found that AWS CloudWatch Logs was inadequate for managing experiment logs. Teams leveraging the public cloud for ML training jobs often encounter this challenge. Cloud-native logging tools are usually not purpose-built to help them productively manage the experimentation process. 

The Theta Tech AI team was unable to perform a few tasks that were essential to their workflow:

  • Get experiment-relevant metrics from AWS CloudWatch Logs
  • Productively debug problems with training jobs and experiments
  • Integrate with Optuna for hyperparameter optimization
  • Communicate the results of ML models to clients

  • “The limitation of AWS CloudWatch Logs is that we are logging many things. You don’t only get experiment-relevant metrics, but you also get server logs, and other things CloudWatch thinks are useful. It’s like a manual process where I go through every component.” – Silas Bempong, Data Scientist at Theta Tech AI

    The team would use CloudWatch Logs to monitor and troubleshoot their entire stack running on AWS. The problem became apparent when they realized it would take additional time and effort to sort and filter experiment-related metrics manually.

  • “There was a situation where one of the experiments failed due to dependency conflict, and we had to dig through all the logs to get that, which helped us debug. However, in terms of efficiency, we could have done more productive things rather than go through the logs.” – Silas Bempong, Data Scientist at Theta Tech AI

    The team struggled to effectively troubleshoot model training problems since it was challenging to sift through and filter experiment-related metrics from CloudWatch Logs. Additionally, they could not keep track of model training information in real time and quickly identify underperforming experiments.

  • “Let’s say we are running 15 or 30 experiments simultaneously and Optuna samples for hyperparameters with its sampling strategies. The hyperparameters we sample for each experiment are logged on the servers, and they start running those experiments.

    The problem is that we can only track how the experiments are running through the CloudWatch Logs. Otherwise, we would have to write them in a log file and get them.” – Abhijit Ramesh, ML Engineer at Theta Tech AI

    Hyperparameter optimization is crucial to the team’s experimentation workflow and its efficiency. Without additional tools or custom scripts to analyze the logs, they found it challenging to sort through the Optuna-based experiments in CloudWatch Logs and make sense of the results.

  • “In CloudWatch, we don’t have visuals and graphics, which is a big thing for me.” – Dr. Robert Toth, Founder of Theta Tech AI

    Theta Tech AI’s business strategy depends on sharing and communicating with clients the findings of research projects based on machine learning experiments. The team could not convey or visualize the experiment results using CloudWatch Logs.

    They would need to employ third-party visualization tools, such as Grafana or Elasticsearch, to view the experiment logs, increasing their stack’s complexity.

Solution

Since most of the team’s experiments are Optuna-based, they needed a solution focused on experiment tracking that could interact with Optuna to track hyperparameters and offer collaborative features.

After conducting initial proof-of-concept research, our team recognized the need to track thousands of experiments for large-scale parallel training.
Dr. Robert Toth, Founder of Theta Tech AI

Neptune ended up being the ideal choice for them to achieve their objective. Beyond experiment tracking, Neptune offers experiment grouping and filtering functionality that more expensive competitors lacked, in addition to native integration with Optuna.

After determining Neptune’s ability to group and filter experiments more effectively, we found that it allowed for better integration with Optuna and Fast.ai. It could easily conduct hyperparameter sweeps, which is something we were looking for as we wanted to leverage the power of Optuna rather than using Weights and Biases’ own Hyperparameter Sweep.
Dr. Robert Toth, Founder of Theta Tech AI

Criteria Theta Tech AI considered for an ideal solution

The team outlined four criteria that an ideal experiment tracking solution for them should have:

  1. Integration with open-source tools that are proven to work well and maintained by a community of developers
  2. Real-time support
  3. Easy-to-interpret visualizations
  4. Easy development and setup

Settling on an ideal solution

We tested multiple platforms by running experiments on them. It was clear that Neptune was the right choice for us.
Dr. Robert Toth, Founder of Theta Tech AI

Neptune met the criteria the team outlined and provided the following solutions: 

  • It helps track thousands of training jobs running on AWS at scale
  • It offers seamless Neptune-Optuna integration
  • It features an interactive real-time dashboard for Optuna
  • It provides a grouping and filtering feature valuable for organizing experiments
  • It is easy to set up and integrate with the existing stack without provisioning separate infrastructure

Neptune helps Theta Tech AI track thousands of training jobs running on AWS at scale

Neptune is an excellent choice for users with large-scale training activities, as the Neptune workflow is already set up to handle the most common model training scenarios. It’s a great tool to organize deep learning experiments, whereas most people have home-grown dashboards or, worse, no dashboards.
Dr. Robert Toth, Founder of Theta Tech AI

After adopting Neptune, the team could finally track and view only the metrics and files necessary for their research projects. When they ran thousands of training runs at scale on AWS, they discovered that Neptune could scale to track all the jobs operating on different compute clusters.

Neptune provides relevant dashboards and an interactive user interface for monitoring training jobs and hardware utilization, and it lets the team share the dashboards as reports with colleagues and relevant clients.

Neptune removed the CloudWatch Logs barrier of needing external visualization tools like Grafana and provided secure, collaborative options. Additionally, it enforced experiment lineage, making it simple for the team to review earlier experiments, troubleshoot them, and reproduce their results.

We are very integrated with AWS and want everything to happen inside of AWS, and when you are training on a large scale, you want multiple training jobs to happen at once, and that is where Neptune comes in.
Abhijit Ramesh, ML Engineer at Theta Tech AI

Seamless Neptune-Optuna integration to make hyperparameter optimization simple

Neptune and Optuna go hand in hand. You should start using Neptune as early as possible to save the trouble of having to go through multiple log statements to make sense of how your model did.
Dr. Robert Toth, Founder of Theta Tech AI

The team wanted a solution that could effortlessly integrate with Optuna. They sampled hyperparameters using Optuna and needed a simple way to display the results of different hyperparameter groups.

They would train several models in parallel on AWS using the sampled hyperparameters, and they needed to be able to:

  • Track each model’s performance across training, testing, and validation
  • See how well the sampled hyperparameters worked

The integrated Neptune-Optuna dashboard gives them insight into how well the hyperparameters performed and offers all the information about each model’s performance.
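
For readers who want to see what this looks like in code, here is a minimal sketch of the Neptune-Optuna integration. The project name and search space are hypothetical, the objective is a stand-in for real training code, and the callback assumes the neptune-optuna package is installed:

    import neptune
    import neptune.integrations.optuna as npt_utils
    import optuna

    # Hypothetical project name, used only for illustration.
    run = neptune.init_run(project="theta-tech/medical-ai")

    def objective(trial):
        # Hypothetical search space; a real objective would train a model
        # with these values and return its validation loss.
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        dropout = trial.suggest_float("dropout", 0.0, 0.5)
        return (lr - 1e-3) ** 2 + dropout  # stand-in for a validation loss

    # The callback logs each trial's parameters and scores to the Neptune
    # run as the study progresses, so results are visible in real time.
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=30, callbacks=[npt_utils.NeptuneCallback(run)])
    run.stop()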

Leveraging Neptune’s real-time dashboard over Optuna’s web dashboard

Neptune provided a solution for Optuna’s single-user dashboard view in the form of an Optuna integration, which made it possible to deliver dashboards in a shared and collaborative manner that did not take minutes to render.
Dr. Robert Toth, Founder of Theta Tech AI

One of the team’s main issues was that Optuna’s default dashboard was designed for a single user running small experiments. This wasn’t sufficient for their requirements because they needed to share insights with their internal development team and clients.

Neptune-Optuna dashboard, courtesy of Theta Tech AI

Thanks to the Neptune-Optuna integration, they could watch the Optuna optimization process for their experiments in real time. They could also share dashboards with coworkers and clients participating in the project, facilitating communication and cooperation because everyone could view the dashboard at once.

Neptune provides Theta Tech AI with experiment grouping and filtering features

Neptune improved how the team utilized compute resources for their training and data processing jobs. For example, running large data processing jobs on distributed clusters is among the team’s most compute-intensive tasks. Neptune gave them better insights into how their image data augmentation programs used resources, helping them maximize GPU usage.

Grouping by validation set is super important to us, and many other people would benefit from using the grouping features with validation.
Dr. Robert Toth, Founder of Theta Tech AI

By grouping and filtering the results by validation dataset, Neptune helps the team organize their experiments productively. The grouping feature is essential because it allows them to split up the dataset, train models on some patients, and test and validate them on others.

A system randomly selects which patients go into the validation set for a given experiment, and the team then groups runs by validation set. Because they use five-fold cross-validation, they get five groups and can very quickly see how the models did on each validation group of patients.

They use Neptune to filter the validation groups and spot which ones do well and which do poorly. When they group the experiments, the groups correspond to the validation sets, so they can analyze which patients the models are not generalizing to.
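
One way to make this kind of grouping possible is to log the fold and its held-out patients as run fields. Below is a minimal sketch; the project name, field names, and patient IDs are hypothetical, not from the case study:

    import neptune

    # Hypothetical 5-fold split: each fold lists the patients held out
    # for validation in that fold.
    patient_folds = {
        0: ["P001", "P006"], 1: ["P002", "P007"], 2: ["P003", "P008"],
        3: ["P004", "P009"], 4: ["P005", "P010"],
    }

    for fold_id, patients in patient_folds.items():
        run = neptune.init_run(project="theta-tech/medical-ai")  # hypothetical
        # Logging the fold as a field lets the runs table be grouped and
        # filtered by validation set in the Neptune UI.
        run["validation/fold"] = fold_id
        run["validation/patients"] = ",".join(patients)
        # ... train on the remaining patients and log metrics here ...
        run.stop()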

Neptune AI experiment tracking dashboard, courtesy of Theta Tech AI

Neptune is easy to set up and integrate with the existing stack without provisioning a separate infrastructure

With the fastai integration, we only required one line of Python code in our stack, and everything got pushed to Neptune. It was easy for us to set up the API key and push everything automatically to Neptune. Our clients also use the Neptune API to search through all of our experiments for the top-performing ones and then push them to production.
Dr. Robert Toth, Founder of Theta Tech AI

Neptune provides a wide range of integrations and support for open-source tools used in the industry. The Theta Tech AI team found this helpful since it makes Neptune simple to set up and run with tools like fastai and Optuna. Most of the callbacks for these tools come ready-made, with no additional configuration or code needed to log and track experiment-relevant metrics.
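
The “one line” in the quote above is the callback the integration provides. Here is a minimal sketch, assuming the neptune-fastai package and a public fastai sample dataset rather than Theta Tech AI’s medical data; the project name is hypothetical:

    import neptune
    from fastai.vision.all import (
        ImageDataLoaders, URLs, resnet18, untar_data, vision_learner
    )
    from neptune.integrations.fastai import NeptuneCallback

    run = neptune.init_run(project="theta-tech/medical-ai")  # hypothetical

    # Placeholder data: a standard fastai sample dataset.
    path = untar_data(URLs.MNIST_SAMPLE)
    dls = ImageDataLoaders.from_folder(path)

    # The single added line: passing NeptuneCallback pushes metrics,
    # hyperparameters, and model files to Neptune automatically.
    learn = vision_learner(dls, resnet18, cbs=[NeptuneCallback(run=run)])
    learn.fit_one_cycle(1)
    run.stop()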

The team also found Neptune easy to set up and get started with compared to tools like MLflow, which would have required additional configuration and provisioning on the team’s servers.

I have also used MLflow, and it was tough to set it up. So even for my projects, I use Neptune.
Abhijit Ramesh, ML Engineer at Theta Tech AI

Results

We had a time when results started to dip, and we used Neptune to replicate our former best results, and this would not have been possible before using Neptune. It’s also an excellent audit log to recall what we changed with each new AI study.
Dr. Robert Toth, Founder of Theta Tech AI

Adding Neptune to the Theta Tech AI team’s tool stack proved valuable because:

  • Neptune accelerated the team’s model development workflow and made it efficient by allowing them to rapidly return to past experiments and see how hyperparameters affected their results over several months (see the sketch after this list).

    They could review previous experiments to see which ones were successful and which data versions and hyperparameter combinations, sampled through Optuna’s techniques, produced them.

    “Integrating Neptune with Optuna made it easy to get insights into the model performance and how well a particular hyperparameter from Optuna performed. We also looked at the graphs and determined if the loss function graph meant we were running with the correct number of epochs or if we needed more or fewer epochs.” – Dr. Robert Toth, Founder of Theta Tech AI

  • The team could share results and review experiments among themselves and with external stakeholders, like clients involved in the research study. The Neptune-Optuna integration plug-in improved the collaboration process.

    “Before using the Optuna integration from Neptune, we occasionally spun up the Optuna in-built dashboard, and since it could not handle the load, it kept crashing. Since we started using the Neptune plug-in, this has not been an issue, and the experiment review has become much shorter and seamless.” – Dr. Robert Toth, Founder of Theta Tech AI
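
As a sketch of what returning to past experiments programmatically can look like with Neptune’s query API, here is a minimal example; the project name and metric fields are hypothetical, not from the case study:

    import neptune

    # Read-only access to the project; the name is hypothetical.
    project = neptune.init_project(project="theta-tech/medical-ai", mode="read-only")

    # Pull the runs table into pandas and rank past experiments by a
    # hypothetical logged validation metric to find the top performers.
    runs_df = project.fetch_runs_table().to_pandas()
    best = runs_df.sort_values("validation/dice", ascending=False).head(5)
    print(best[["sys/id", "validation/fold", "validation/dice"]])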


Thanks to Dr. Robert Toth, Abhijit Ramesh, and Silas Bempong for working with us to create this case study!


Looking for an experiment tracker that will easily integrate with your existing stack?