Case Study


We evaluated several commercial and open-source solutions. We looked at the features for tracking experiments, the ability to share, the quality of the documentation, and the willingness to add new features. Neptune was the best choice for our use cases.
James Tu
Research Scientist at Waabi

About Waabi

Waabi, founded by AI pioneer and visionary Raquel Urtasun, is building the next generation of self-driving truck technology. With a world-class team and an innovative, AI-first approach, Waabi is bringing the promise of self-driving closer to commercialization than ever before.

About the team

The goal of the AI teams at Waabi is to develop a solution for self-driving trucks that can be used on a large scale. They do this by using deep learning, probabilistic inference, and complex optimization to create software that is end-to-end trainable, interpretable, and capable of very complex reasoning.

They organize their machine learning teams around different technical pillars. Each pillar is in charge of delivering technology for a different functional area. All teams have a mix of research and engineering projects.

Most of the time, their teams are built to be self-sufficient and able to deliver features and product capabilities from start to finish.


At a high level, AI teams at Waabi have a standard process to benchmark their research progress. They:

  • Establish a fair benchmark before they get too far into the work so that they know if they are making material progress. 
  • Set up baseline models, either from academic work or general knowledge.
  • Go through iterations of testing new ideas, validating them against the benchmark, and comparing them over time. 
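The iterate-and-validate loop above can be sketched in a few lines of Python. This is an illustration only: the benchmark, the mean-absolute-error scoring rule, and both toy models are hypothetical stand-ins, not Waabi's actual setup.

```python
# Illustrative sketch of the benchmark workflow described above:
# fix a benchmark, score a baseline on it, then accept a new idea
# only if it measurably improves on the best result so far.
# (All names and the scoring rule are made up for this example.)

benchmark = [(x, 2 * x) for x in range(10)]  # frozen eval set: (input, target)

def score(model, benchmark):
    """Mean absolute error over the benchmark (lower is better)."""
    return sum(abs(model(x) - y) for x, y in benchmark) / len(benchmark)

def baseline(x):
    # stand-in for a baseline from academic work or general knowledge
    return x

def candidate(x):
    # stand-in for a new idea being validated against the same benchmark
    return 2 * x

best = score(baseline, benchmark)
for idea in (candidate,):
    s = score(idea, benchmark)
    if s < best:  # compare ideas over time against one consistent benchmark
        best = s

print(best)  # 0.0 here, since the toy candidate matches the targets exactly
```

The point of freezing the benchmark before iterating is that every candidate is scored on identical data, so improvements in `best` reflect the model, not the evaluation.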

They have a unified training workflow across all projects and datasets. This lets their teams train locally or on cloud infrastructure, scale training across a distributed cluster, and measure experimental results against end-to-end system performance metrics on a consistent benchmark suite. 

Sometimes they need to gather new data for that, whether from their vehicle fleet or other sources, depending on the problem they are solving. The Waabi team also makes extensive use of the Waabi World simulator to accelerate their development.

Waabi World and its core capabilities: world creation, camera and LiDAR sensor simulation, scenario generation and testing, and learning to drive in simulation | Source


For autonomous driving to work, a self-driving system must achieve a complex understanding of its environment:

  • The system needs to figure out where it is geographically.
  • It needs to “see” and “make sense” of the world around it.
  • It needs to anticipate the behavior of the agents around it so it can decide what action to take.

As you might have guessed, these tasks require many different types of data: maps, plus LiDAR, camera, radar, inertial, and other sensor data.

Large-scale experimentation within a large team creates problems

Most ML-based teams at Waabi have a standard, experiment-focused workflow in which many AI scientists and engineers work together to establish a baseline for a new project. This requires the teams to:

  • Launch many experiments for different tasks.
  • Seek model improvements by iteratively fine-tuning them.
  • Compare results against established benchmarks.
  • Collaborate on the same or different experiments as they look towards building an optimal production model for other teams.
Our ML teams at Waabi continuously run large-scale experiments with ML models. A significant challenge we faced was keeping track of the data they collected from experiments and exporting it in an organized and shareable way.
Neil Isaac, Senior Staff Software Developer at Waabi

When the team began planning their experiments and running large-scale benchmarks whose results they wanted to share and compare, the problem became clear. Depending on the project, they could launch over ten training jobs and experiments per day, and they found that failing to keep track of the resulting data made their development workflow less visible and less consistent.

We consider visibility and consistency to be fundamental to our workflow.
Neil Isaac, Senior Staff Software Developer at Waabi

Building benchmarks is vital to the team because they need to know they are building models of tangibly higher quality, based on real data. Any mistake in the benchmarking process could undermine the entire feature they are building, so they want to ensure they are testing the system against a consistent benchmark over time.


The teams recognized the need to keep track of experiment progress in a central location and compare results against the benchmarks. This would let people from different teams: 

  • Gain visibility into the experiments.
  • Reproduce them if necessary.
  • Collaborate effectively on projects with multiple users.
  • Use the corresponding artifacts for downstream systems.
We identified the lack of tooling as soon as we started planning and building consistent benchmark data sets. We considered our workflow and recognized that sharing benchmark results in a consistent place and format, and retaining data for later comparison after the end of a project, were critical.
Neil Isaac, Senior Staff Software Developer at Waabi

Searching for a solution for experiment tracking

Waabi evaluated open-source solutions and well-known vendor products that could work well as stand-alone solutions and enable collaboration across multiple teams.

They required the following criteria for an experiment tracking solution:

  • Feature-rich experiment tracking (collaborative and shareable workspaces, dashboards and visualization tools, an API client for working programmatically, resource monitoring, etc.).
  • Good documentation quality.
  • Open to feature requests.
  • Reasonable pricing model.

Choosing Neptune for experiment tracking

Ultimately, Waabi chose Neptune because it met their requirements for an experiment tracking solution.


To make their decision, they reviewed Neptune’s feature support comparison table, the API documentation, and tested the tool by running one of their cloud training jobs.

Our team is always eager to adopt new tools that improve our workflows. We asked one member from each team to test the tool and figure out how to adopt it for their workflow, and they quickly became champions for the tool within their teams.
Neil Isaac, Senior Staff Software Developer at Waabi

During the evaluation phase, several contacts at Neptune also worked proactively with Neil, gathering feedback on how the team could leverage the tool for their use case. This gave the team visibility into what the tool could do, and they could see that some of their feature requests were already on the roadmap.

There were several highlights in our evaluation of Neptune. First, we appreciated that the team at Neptune was very open to our feature requests and took the initiative to connect with us and understand our use cases. Furthermore, Neptune’s experiment tracking features were excellent and fulfilled most of our needs immediately, and the team was open to working with us on forward-looking ideas.
James Tu, Research Scientist at Waabi

The most difficult challenge for the team was ensuring that experiment results were consistent and discoverable, and the prospect of improving this experiment workflow is what led them to adopt Neptune.


In particular, Neptune:

  1. Helped share experiments across teams.
  2. Provided custom experiment tracking features.
  3. Helped with good-quality documentation.
  4. Provided a feature for monitoring their computing resources.
  5. Allowed the team to stop and abort runs remotely.
  • “If we’re just comparing it to having no experiment tracking, then Neptune is definitely super useful in terms of just having one place to organize all the results. Another important thing is that we can have a workspace that everyone has access to. So we can easily share experiments.” — James Tu, Research Scientist at Waabi

    A major requirement, and a challenge the team faced, was the lack of visibility into the experiments each person ran, which made it difficult for teammates within and across different teams to collaborate.

    Neptune solved this problem with a collaborative workspace that made it easy and convenient for anyone with the proper permissions to share experiments with other people.


  • “When we launch a large number of experiments, the data is logged in Neptune, which we rarely set up. We set up the experiments first, and Neptune will log whatever metrics we need—tables, figures, and stuff.” — James Tu, Research Scientist at Waabi

    Neptune helped the team keep track of their experiments with custom features that let them log any data they wanted. Afterward, they’d go to Neptune to check the experiments, and in the workspace they’d set up custom dashboards with whatever they needed to visualize the experiment data in real time.

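The “set up once, log whatever we need” pattern described above can be sketched as follows. Note that `ExperimentRun` is a hypothetical stand-in written for this illustration, not the real Neptune client (though it loosely mirrors a namespace-and-append style of logging); the project name and metrics are made up.

```python
from collections import defaultdict

class ExperimentRun:
    """Hypothetical stand-in for an experiment-tracking run: collects
    time-series metrics under slash-separated namespaces."""

    def __init__(self, project):
        self.project = project
        self.series = defaultdict(list)

    def append(self, namespace, value):
        # e.g. append("train/loss", 0.42) on every training step
        self.series[namespace].append(value)

# Set up the run once; afterwards the training loop logs whatever it needs.
run = ExperimentRun(project="waabi/perception-baseline")  # illustrative name
for step in range(3):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    run.append("train/loss", loss)
    run.append("train/step", step)

# The logged series can then be charted in a dashboard in real time.
```

The key property is that the tracking call sits inside the training loop, so every experiment produces the same shape of data regardless of which team launched it.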

  • “When we first started out considering different options, I think the quality of the documentation was definitely a deciding factor for us. I think we’re pretty happy with the Neptune documentation. Getting started is pretty straightforward.“ — James Tu, Research Scientist at Waabi

    The team found Neptune’s documentation to be of really good quality, and it improved over time, making it easy for anyone to troubleshoot issues, follow how-to guides, and generally use the tool with little effort.

  • “Our team leverages simulators quite heavily, and we use a lot of computers. One thing we’re always keeping track of is what the utilization is and how to improve it. Sometimes, we’ll get, for example, out-of-memory errors, and then seeing how the memory increases over time in the experiment is really helpful for debugging as well.” — James Tu, Research Scientist at Waabi

    Some models they run are more data-heavy than others. The frontend of an autonomy system, such as perception, has very large inputs: it uses a lot of different sensors and streams a lot of data in real time into the models deployed on the vehicle.

    Offline perception tasks like data augmentation use large models, which require a lot of training resources: more distributed jobs, more GPUs per worker, and more workers. Simpler automation jobs may be able to train on a single machine or a single GPU.

    The team’s workloads are very scalable: they often start development on a developer’s machine to make sure the code works, then quickly switch to running on the cloud and scale up as the data set grows.

    “In general, it’s always essential for us to optimize training runtime. It directly affects cost as well as, more importantly, productivity. The faster we train models, the sooner we get results. That’s incredibly important to us.” — Neil Isaac, Senior Staff Software Developer at Waabi

    Neptune’s resource monitoring feature benefited the team because they needed to keep an eye on large-scale training jobs across different experiments and teams to reduce cloud costs, improve team productivity, and make the best use of resources.

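The memory-over-time debugging James describes can be illustrated with the standard library alone. In practice a tracking tool’s monitoring agent records these series automatically; this sketch only shows why a steadily growing memory curve is useful when chasing out-of-memory errors.

```python
import tracemalloc

tracemalloc.start()
memory_series = []  # what a resource monitor would chart as "memory over time"

leaked = []
for step in range(5):
    # Simulate training state that is accidentally retained every step.
    leaked.append([0] * 50_000)
    current_bytes, _peak_bytes = tracemalloc.get_traced_memory()
    memory_series.append(current_bytes)

tracemalloc.stop()

# A monotonically growing series like this one is the classic signature
# of the out-of-memory failures mentioned above.
assert all(b > a for a, b in zip(memory_series, memory_series[1:]))
```

Plotted per experiment, a curve like `memory_series` makes it easy to spot whether a job will eventually exhaust its allocation.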

  • “One thing that’s really nice is that on Neptune, there’s a remote stop feature, and that’s really useful because we don’t need to kill a cloud training job, for example, we can just stop it.” — James Tu, Research Scientist at Waabi

    Neptune’s remote stop feature allowed the team to stop training jobs running on their cloud infrastructure directly from the Neptune app, without navigating cloud-infrastructure dashboards, which is very convenient for them.


The product has been very helpful for our experimentation workflows. Almost all the projects in our company are now using Neptune for experiment tracking, and it seems to satisfy all our current needs. It’s also great that all these experiments are available to view for everyone in the organization, making it very easy to reference experimental runs and share results.
James Tu, Research Scientist at Waabi

Adopting Neptune was helpful for the team because it improved the visibility of their workflow and ensured results were reproducible.

  • “Productivity has definitely improved. Neptune has made it easier to keep track of the experiments we are running and reduced the amount of overhead spent on organization. Also, Neptune’s remote stop feature is very useful for stopping experiments running on the cloud.” — James Tu, Research Scientist at Waabi

    Teammates now have a tool that helps them keep track of experiment metadata in a central repository, one that seamlessly integrates with their workflow. No matter where they are in the project lifecycle, Neptune helps them see how benchmarks and experiments are doing on consistent datasets.

    Neptune helped other teams discover insights from an experiment and made sure those results could be reproduced (or at least similar results could be reached) with the help of the logged metadata. 

    “I don’t think we’ve run into a lot of features that we would want but that are not there yet in Neptune.“  — James Tu, Research Scientist at Waabi

  • “Organic adoption by our teams has been a key indicator that the tool has added value to their workflows and that they have been able to use it successfully.” — Neil Isaac, Senior Staff Software Developer at Waabi

    Waabi has several engineers who are experts in MLOps. These engineers help build the clusters and tools that ML developers use daily. But pretty frequently, ML developers contribute to these tools as well, whether that’s adding new metrics or optimizing runtime.

    Different teams have used Neptune on their own to improve how they work and make the best use of their resources.

    “I would definitely recommend the product. The Neptune team made it easy to test and adopt the tool via a self-initiated trial but also took the time to make personal connections with our whole team and understand their needs.” — Neil Isaac, Senior Staff Software Developer at Waabi

Thanks to James Tu, Neil Isaac, and the team at Waabi for working with us to create this case study.
