How BGU Research Group Tracks Thousands of Models With Neptune
Omri Azencot is an Assistant Professor in the Computer Science department at Ben-Gurion University of the Negev. He leads a research group of around 20 members, including PhD, master’s, and undergraduate students. His team focuses on developing machine learning models for sequential data, with an emphasis on representation learning and generative modeling.
The challenge
Omri’s group is currently managing about 15 projects, some more active than others. Within each project, they train different models for various problems, which often amounts to thousands of experiments per project.
Each project is a collaborative effort. Whenever a model achieves good results, the entire group reviews the run to understand what worked and what didn’t, comparing it to other experiments.
Managing these complex, multi-project workflows across a large, distributed team presented several challenges:
- Onboarding new team members efficiently
- Tracking and comparing thousands of experiments
- Collaborating effectively among team members, both locally and remotely
- Optimizing limited computational resources
Implementing an experiment tracking solution in an academic research setting
To address these challenges, Omri’s group implemented Neptune.
How is their workflow reflected in Neptune? The team has one workspace divided into multiple projects. While each team member has access to all projects, the team could restrict access if needed (e.g., for data privacy reasons); so far, they haven’t found this necessary.
Using Neptune, the team logs every model along with key metadata, such as parameters, loss functions, and system resource consumption. This makes it easy to compare and analyze both the metadata and the training runs themselves.
Neptune’s custom views of the run table have been instrumental for their work. By filtering runs by dataset or model, the group can track and compare results more effectively.
Team members engaged in a project scan the table to identify interesting runs and then dive into those specific runs for detailed analysis. They appreciate that Neptune supports logging and displaying different metadata types, since debugging training effectively requires inspecting many kinds of artifacts: images, metrics, plots, and sometimes even text or audio.
Working as a distributed team, they frequently rely on Neptune during remote meetings to discuss project progress and even include links to specific Neptune runs in presentations.
The results
With Neptune, Omri’s research group:
- Enabled collaboration within a distributed team.
- Implemented a more organized and efficient process of debugging and analyzing training results.
- Reduced the number of models that need to be trained and, as a consequence, optimized resource usage.