Setting up a Scalable Research Workflow for Medical ML at AILS Labs [Case Study]

17th August, 2023

ailslab is a biomedical informatics research group on a mission to make humanity healthier by building models that might someday save your heart from illness. In practice, this means applying machine learning to predict cardiovascular disease development based on clinical, imaging, and genetics data.

Four full-time and over five part-time team members. Bioinformaticians, physicians, computer scientists, many on track to get PhDs. Serious business.

Although “business” is probably the wrong term: user-facing applications are not on the roadmap yet, and research is the primary focus. Research so intense that it required a custom infrastructure (which took about a year to build) to extract features from different types of data:

  • Electronic health records (EHR),
  • Diagnosis and treatment information (time-to-event regression methods),
  • Images (convolutional neural networks),
  • Structured data and ECG.

With a fusion of these features, precise machine learning models can solve complex issues. In this case, it’s risk stratification for primary cardiovascular prevention. Essentially, it’s about predicting which patients are most likely to get cardiovascular disease.

ailslab has a thorough research process. For every objective, there are seven stages:

  1. Define the task to be solved (e.g., build a risk model of cardiovascular disease).
  2. Define the task objective (e.g., define expected experiment results).
  3. Prepare the dataset.
  4. Work on the dataset in interactive mode with Jupyter notebooks; quick experimenting, figuring out the best features for both the task and the dataset, coding in R or Python. 
  5. Once the project scales up, use a workflow management system like Snakemake or Prefect to transform the work into a manageable pipeline and make it reproducible. Without that, it would be costly to reproduce the workflow or compare different models.
  6. Create machine learning models using PyTorch Lightning integrated with Neptune, where some initial evaluations are applied. Log experiment data.
  7. Finally, evaluate model performance and inspect the effect of using different sets of features and hyperparameters.

5 problems of scaling up Machine Learning research

ailslab started as a small group of developers and researchers. One person wrote code, and another reviewed it. Not a lot of experimenting. But collaboration became more challenging, and new problems started to appear along with the inflow of new team members:

  1. Data privacy,
  2. Workflow standardization,
  3. Feature and model selection,
  4. Experiment management,
  5. Information logging.

1. Data privacy

Data collection takes a lot of time. For medical machine learning research like this, it’s hard to get enough data, even unlabeled. It’s a critical problem that makes it difficult to build generalized models. The solution involved using private, NDA-protected data from hospitals and healthcare systems. 

You can’t upload such sensitive data to a remote server, so you can only train models locally. It’s a painful limitation when your team runs many experiments and stores information about each experiment. 

For ailslab, it became harder to manage experiments locally, compare different models, and even share results with others.

2. Workflow standardization

It was a much smaller team for a while. With just a couple of developers, it was easy to manage the project, despite working with a lot of custom code. 

“We spent the last year building up our infrastructure. We started to develop time-to-event regression methods and several feature extractors. For now, we build models like CNNs for imaging data and TCNs for ECG data. We have enough custom code. The goal now is to standardize something every month.” – Jakob Steinfeldt, Physician Researcher @ailslab

There was no reason to care much about structuring or standardizing the development flow, and there wasn’t much code inspection going on.

However, new developers kept joining the team, bringing along different styles of coding. The need for standardized code development solutions became evident. It became harder to debug and track code written by each team member because there were no clear procedures.

3. Feature and model selection

One of the core missions in this research is survival prediction using time-to-event regression methods. That involves observational data from large cohorts involving detailed patient records.

The record has different data modalities:

  • Text,
  • Voice,
  • Images,
  • etc.

It creates a significant challenge: multimodal learning. It’s much more challenging to build a model with multiple types of data than a model that works with just one type. 

For example, there may be contextual data in the patient’s EHR (Electronic Health Record) in addition to medical images. The diagnosis requires both. 

Figure: Metadata logged in Neptune

A model like long short-term memory (LSTM) takes care of text, while a convolutional neural network (CNN) handles images. Each model has different properties and findings.

For a more accurate diagnosis, one can combine findings from different types of data. But, as ailslab researchers can attest to, this creates plenty of difficulties:

  • How do you extract representative features from different data modalities and fuse them?
  • What’s the best model type for all data modalities to do a task like time-to-event prediction? 
  • Which hyperparameters best fit the model and the data?

Answering these questions takes a lot of trial and error, like generating different sets of features to train multiple models. But there’s yet another problem: comparing different models trained on different sets of features.

With as many experiments as the ailslab team does, it’s essential to have a foolproof way to track all work and avoid spinning in circles. Otherwise, it’s too difficult to keep track of all models trained on different versions of a dataset and its features. 
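
To make the size of that search space concrete, here is a minimal sketch of enumerating feature-set and hyperparameter combinations and giving each one a stable identifier, so that no configuration is run twice by accident. The modality names, learning rates, and helper functions are hypothetical and chosen for illustration; this is not ailslab's code or Neptune's API.

```python
import hashlib
import itertools
import json

# Hypothetical modality names and hyperparameter values, for illustration only.
MODALITIES = ["ehr", "imaging", "ecg"]
LEARNING_RATES = [1e-3, 1e-4]


def config_id(config: dict) -> str:
    """Return a short, stable identifier derived from the configuration."""
    # Sorting keys makes the ID independent of dictionary insertion order.
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def enumerate_configs(modalities, learning_rates):
    """Yield (id, config) for every non-empty feature subset and learning rate."""
    for size in range(1, len(modalities) + 1):
        for subset in itertools.combinations(modalities, size):
            for lr in learning_rates:
                config = {"features": list(subset), "learning_rate": lr}
                yield config_id(config), config


configs = dict(enumerate_configs(MODALITIES, LEARNING_RATES))
print(f"{len(configs)} configurations to track")
```

Even with three modalities and two learning rates, there are already 14 configurations (7 non-empty feature subsets times 2 learning rates), which is why ad hoc tracking stops working quickly.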

4. Experiment management

The team keeps growing, and researchers keep joining the project. Some of them might be less experienced. How do you make sure that two people don’t work on the same task by mistake?

Consider this scenario:

Twenty researchers, supervised by a technical lead, all working to find a solution to the same problem. Each person devises a solution in their own style and explains it to the technical lead. In turn, the tech lead figures out what’s correct and incorrect based on the results. 

The tech lead should be able to re-run code and reproduce good results. The hyperparameters, features, and metrics should be clear. For ailslab, tasks like manually creating checkpoints, or figuring out how to change one or more hyperparameters to do another experiment are simply a waste of time. 
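
One way to make "change one hyperparameter and run again" cheap and explicit is to keep the configuration in an immutable object and derive variants from it. The sketch below uses only the Python standard library; the field names and default values are hypothetical, not taken from ailslab's setup.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ExperimentConfig:
    # All field names and defaults are hypothetical, for illustration only.
    model: str = "cnn"
    learning_rate: float = 1e-3
    batch_size: int = 32
    seed: int = 0


def diff_configs(a: ExperimentConfig, b: ExperimentConfig) -> dict:
    """Return {field: (old, new)} for every field that differs between a and b."""
    old, new = asdict(a), asdict(b)
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}


baseline = ExperimentConfig()
# Derive a new experiment by changing exactly one hyperparameter.
variant = replace(baseline, learning_rate=1e-4)
print(diff_configs(baseline, variant))
```

Because the dataclass is frozen, the baseline cannot be modified by mistake, and the diff tells a reviewer exactly what changed between two runs.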

Plus, students sometimes join the project for a limited time. It’s necessary to track and store data from experiments made by any project contributor, whether they’re still on the team or not. 

“You can keep track of your work in spreadsheets, but it’s super error-prone. And every experiment that I don’t use and don’t look at afterward is wasted compute: it’s bad for the environment, and it’s bad for me because I wasted my time.” – Thore Bürgel, PhD Student @ailslab

5. Information logging

For each experiment, some information needs to be logged, including:

  • The GPU used for training,
  • Model architecture,
  • Parameters like features, endpoints, dataset dates, paths,
  • Metrics and their performance.

Figure: Monitoring in Neptune
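
As a rough illustration of what such a per-experiment record could look like, here is a minimal sketch using a standard-library dataclass. The field names mirror the list above, but every value is a placeholder and the structure is an assumption, not ailslab's actual logging schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    gpu: str
    architecture: str
    parameters: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the record so it can be stored or shared."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# All values below are placeholders, not real experiment data.
record = ExperimentRecord(
    gpu="example-gpu",
    architecture="cnn",
    parameters={
        "features": ["ehr", "ecg"],
        "endpoint": "example_endpoint",
        "dataset_date": "2020-01-01",
        "dataset_path": "/path/to/dataset",
    },
)
record.metrics["validation_loss"] = 0.5  # placeholder, not a real result
print(record.to_json())
```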

This logged information characterizes each experiment. It comes in handy for analysis and answering questions like:

  • How does the model perform on different groups of records?
  • Which features or feature sets are most predictive for the endpoint?
  • How are features contributing to the prediction?

With a custom logger, it’s not easy to answer these questions or comprehensively compare different experiments. Building a custom logger also comes with the burden of maintaining it long-term and adding new features when necessary. When bugs appear, even more time goes into an internal tool instead of research. It’s something that ailslab wanted to avoid.

Scaling up Machine Learning research is easier with Neptune

The five problems we described resulted from scaling the team and increasing the number of experiments. Neptune caught the ailslab team’s attention in the PyTorch Lightning docs and joined their research toolset soon afterward.

“Neptune works flawlessly, and integrating it with PyTorch Lightning was very smooth.” – Jakob Steinfeldt, Physician Researcher @ailslab

Why Neptune? How did it help solve the five issues that came from scaling the team? Keep reading to find out.

Why ailslab chose Neptune

In short – because it saves time.

If you’re a researcher, you know that managing multiple experiments is challenging. 

Spreadsheets just stop cutting it at some point. If you have hundreds of experiments saved in spreadsheets, your local machine or cloud server is cluttered. You’ll probably never even use all of that experiment information to the fullest extent. 

You waste more time setting up spreadsheets, and then you rarely use those spreadsheets. Instead of doing that, you can have Neptune collect and store all information. 

You see a comprehensive history of all your experiments, compare them, and choose which ones aren’t worth keeping. There’s less overhead preparing the environment, and you save time.

Data privacy

As medical data is particularly sensitive, it was necessary to separate the data workflows from the analysis workflows.

Neptune helps with data protection compliance by removing the need to upload sensitive data. It only receives the logged information that researchers decide to share, and the training itself can happen on a local machine or anywhere else. It means maximum control: the sensitive data stays where it is, which greatly reduces the risk of exposure.

“By allowing us just to train the model locally and logging ML metadata without uploading the actual data, Neptune solved a huge problem for us.” – Jakob Steinfeldt, Physician Researcher @ailslab
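
The underlying idea, sharing only explicitly approved metadata while the data itself stays local, can be sketched in a few lines. This is a simplified illustration under our own assumptions: the allow-list, key names, and `select_metadata` helper are hypothetical and are not part of Neptune's API.

```python
# Hypothetical allow-list of metadata keys that are safe to share.
ALLOWED_KEYS = frozenset({"model", "learning_rate", "epochs", "validation_loss"})


def select_metadata(run_info: dict, allowed=ALLOWED_KEYS) -> dict:
    """Keep only explicitly allowed keys; everything else stays local."""
    return {key: value for key, value in run_info.items() if key in allowed}


run_info = {
    "model": "cnn",
    "learning_rate": 1e-3,
    "validation_loss": 0.5,  # placeholder, not a real result
    "patient_records": ["..."],  # stands in for sensitive data
}

shareable = select_metadata(run_info)
print(sorted(shareable))
```

An allow-list is a deliberately conservative choice here: a new, unreviewed key is dropped by default rather than shared by default.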

Workflow standardization

On a large scale, it’s better to use standardized tools for development. No one can argue that. Neptune standardized research for ailslab because:

  • It enabled using a standard library to build models, which is much easier than writing custom code that’s hard to explain or debug. 
  • With the full PyTorch Lightning integration, Neptune offers a standardized view of logged information. You can log whatever you like, and Neptune handles all information in a simple view. All team members are using the same infrastructure. 
  • Neptune unifies how everyone presents results, so there’s less miscommunication.

“We have many students, new to many aspects of our research and the infrastructure we’re using. This setup creates room for a lot of potential problems. Neptune helps us communicate and avoid those problems. Instead of discussing, we can just look at everything in the hyperparameter space or feature space. It’s very convenient.” – Jakob Steinfeldt, Physician Researcher @ailslab

Feature and model selection

With Neptune, it’s easier to select the best features for models because it has multiple ways to present and compare different experiments:

  • There’s a beautiful and straightforward user interface with elastic-like search capabilities.
  • You can compare model performance in interactive charts and see which model is best for a given set of features. 
  • You can see which features best describe a group of records when you select parameters for comparison, in addition to the evaluation metric.

Model performance (i.e., prediction accuracy) is rarely the only objective. The ailslab team also cares about the resources used. Neptune records hardware metrics, so the team can see how many resources (such as RAM or GPU) were consumed to get the result.
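
To illustrate why resource usage is worth recording next to accuracy, here is a small standard-library sketch that measures wall-clock time and peak Python memory for a function call. It is a stand-in for the idea only: it does not measure GPU usage, and it is not how Neptune collects hardware metrics. The `measure` helper is hypothetical.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn and return (result, usage) with elapsed seconds and peak bytes."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, {"seconds": elapsed, "peak_bytes": peak}


# A toy workload standing in for a training step.
result, usage = measure(lambda n: sum(i * i for i in range(n)), 10_000)
print(result, usage)
```

Note that `tracemalloc` only sees memory allocated through Python's allocator, so the numbers are a lower bound on what a real training run would consume.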

Figure: Metrics’ charts in Neptune

Plus, Neptune easily scales to millions of experiments. It means that even in the case of multimodal learning with many moving parts and a vast amount of experimentation, tracking all of it is still convenient.

Experiment management

Given that different users contribute to the same project, Neptune plays a critical role in managing contributions from different researchers. It makes it much easier to supervise researchers and compare their experiments, and it all happens in one dashboard.

“So I would say the main argument for using Neptune is that you can be sure that nothing gets lost, everything is transparent, and I can always go back in history and compare.” – Thore Bürgel, PhD Student @ailslab

Organizing experiments is no longer an issue, as Neptune does it quite elegantly. Plus, Neptune versions data for better control of experiments. 

Information logging

Neptune can group experiments for comparison. It’s easy to get a link to share the results with another researcher or stakeholder. Even if a researcher or a student leaves the project and is no longer available, all information about their experiment is saved.

Being a logger and metadata store, Neptune has an automated process to log each experiment. It has an API that lets researchers log their results with very little effort. All experiments are visible to all members of the team, which makes the whole project transparent.

Neptune – more time for ML, easier collaboration, full transparency

Two things that stood out the most about Neptune are:

  • Ease of use,
  • Everything organized in one place.

Compared to using a custom logger, Neptune just takes care of everything, and the team has more time to do research tasks. Neptune’s simple UI makes it easier than ever to build and compare experiments.

ailslab researchers now use one platform with their results presented in the same way. It’s easier to supervise their work even if they work short-term or don’t have experience presenting their results to others. Comparing data parameters used by different researchers is not a problem anymore. It leaves less room for mistakes.

Comparing and managing experiments also takes less time. Researchers can go back and forth between the history of experiments, make changes, and see how the changes affect the results. More experiments get done, and work is more productive. Team members can just log in to Neptune and see all necessary data without cluttering their drives or servers with spreadsheets. 

Figure: Runs view in Neptune

Building complex models and exploring how they work became a bit easier. Neptune stores data about the environment setup, the underlying code, and the model architecture. 

Finally, Neptune helps organize things. At ailslab, researchers add experiment URLs from Neptune to cards on their Kanban board in Notion. This easy access to experiment information helps keep everything organized. The whole team has a better idea about things like the effect of hyperparameters on the model.

Machine learning is hard. Building ML models to detect heart disease before it happens adds another very thick layer of difficulty. We’re glad that Neptune takes away the tedious parts of ailslab projects, and we wish them all the best in their research.
