Waabi
About Waabi
Waabi, founded by AI pioneer and visionary Raquel Urtasun, is building the next generation of self-driving truck technology. With a world-class team and an innovative, AI-first approach, Waabi is bringing the promise of self-driving closer to commercialization than ever before.
About the team
The goal of the AI teams at Waabi is to develop a self-driving truck solution that can be deployed at scale. They do this by combining deep learning, probabilistic inference, and complex optimization to create software that is end-to-end trainable, interpretable, and capable of very complex reasoning.
They organize their machine learning teams around different technical pillars. Each pillar is in charge of delivering technology for a different functional area. All teams have a mix of research and engineering projects.
Most of the time, their teams are built to be self-sufficient and able to deliver features and product capabilities from start to finish.
Workflow
At a high level, AI teams at Waabi have a standard process to benchmark their research progress. They:
- Establish a fair benchmark before they get too far into the work so that they know if they are making material progress.
- Set up baseline models, either from academic work or general knowledge.
- Go through iterations of testing new ideas, validating them against the benchmark, and comparing them over time.
They have a unified training workflow across all projects and datasets. This lets their teams train locally or on cloud infrastructure, scale training across a distributed cluster, and measure experimental results against end-to-end system performance metrics on a consistent benchmark suite.
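As a purely illustrative sketch (the config fields and function names below are hypothetical, not Waabi's actual code), a unified workflow of this kind often amounts to a single entry point that behaves the same on a developer's machine and on a cluster, and always scores results against the same fixed benchmark suite:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Hypothetical fields; a real config would carry model, data,
    # and cluster settings in one place.
    model_name: str
    dataset: str
    backend: str = "local"      # "local" for a dev machine, "cluster" for cloud
    num_workers: int = 1
    gpus_per_worker: int = 1

def train(cfg: TrainConfig) -> str:
    # Stand-in for the real training loop; a real implementation would
    # dispatch to a distributed launcher when cfg.backend == "cluster".
    print(f"training {cfg.model_name} on {cfg.dataset} ({cfg.backend})")
    return f"checkpoints/{cfg.model_name}.pt"

def evaluate(checkpoint: str, benchmark: str) -> dict:
    # Always score against the same benchmark suite, so results stay
    # comparable across experiments and over time.
    print(f"evaluating {checkpoint} on benchmark {benchmark}")
    return {"e2e_score": 0.0}   # placeholder metrics

if __name__ == "__main__":
    baseline = TrainConfig(model_name="baseline", dataset="dataset-v1")
    metrics = evaluate(train(baseline), benchmark="benchmark-v1")
```

The key property is that the same config object drives a quick local run and a scaled-up cloud run, so experiments stay comparable regardless of where they execute.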
Sometimes they need to gather new data, whether from their vehicle fleet or other sources, depending on the problem they are solving. The Waabi team also makes extensive use of the Waabi World simulator to accelerate development.

Problem
To operate autonomously, a self-driving system must build a complex understanding of its environment:
- The system needs to figure out where it is geographically.
- It needs to “see” and “make sense” of the world around it.
- It needs to anticipate the behavior of the agents around it so it can decide what action to take.
As you might have guessed, these tasks require many different types of data: maps, LiDAR, camera, radar, inertial measurements, and other sensor data.
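To make that concrete, here is a purely illustrative sketch (field names and shapes are hypothetical, not Waabi's schema) of what a single multi-sensor training sample might bundle together:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SensorFrame:
    # Illustrative only: field names and shapes are hypothetical.
    lidar_points: np.ndarray          # (N, 4): x, y, z, intensity
    camera_images: list[np.ndarray]   # one HxWx3 array per camera
    radar_returns: np.ndarray         # (M, 3): range, azimuth, velocity
    imu_state: np.ndarray             # inertial measurements
    map_tile: np.ndarray              # rasterized map crop around the vehicle
    timestamp_ns: int                 # capture time in nanoseconds
```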
Large-scale experimentation within a large team creates problems
Most ML teams at Waabi have a standard, experiment-focused workflow in which many AI scientists and engineers work together to establish a baseline for a new project. This requires the teams to:
- Launch many experiments for different tasks.
- Seek improvements by iteratively fine-tuning models.
- Compare results against established benchmarks.
- Collaborate on the same or different experiments while working toward an optimal production model for other teams.

When the team began planning experiments and running large-scale benchmarks whose results they wanted to share and compare, the problem became clear. Depending on the project, they could launch over ten training jobs and experiments per day, and without tracking the data those experiments produced, their development workflow lost visibility and consistency.

Building benchmarks is vital to the team because they need to know they are building models of tangibly higher quality, based on real data. Any mistake in the benchmarking process could derail the entire feature they are building, so they want to ensure they are testing the system against a consistent benchmark over time.
Solution
The teams recognized the need to keep track of experiment progress in a central location and compare results against the benchmarks. This would let people from different teams:
- Gain visibility into the experiments.
- Reproduce them if necessary.
- Collaborate effectively on projects with multiple users.
- Use the corresponding artifacts for downstream systems.

Searching for a solution for experiment tracking
Waabi evaluated open-source solutions and well-known vendor products that could work as stand-alone solutions and enable collaboration across multiple teams.
They required the following criteria for an experiment tracking solution:
- Feature-rich experiment tracking (like collaborative and shareable workspaces, dashboards and visualization tools, API client to work programmatically, resource monitoring, etc.).
- Good documentation quality.
- Openness to feature requests.
- Reasonable pricing model.
Choosing neptune.ai for experiment tracking
Ultimately, Waabi chose neptune.ai because it met their requirements for an experiment tracking solution.

To make their decision, they reviewed Neptune’s feature support comparison table and the API documentation, and tested the tool by running one of their cloud training jobs.
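For a sense of what such an integration test involves, here is a minimal sketch of instrumenting a training job with Neptune’s Python client. It uses the client’s documented `init_run`, field-assignment, and `append` calls (recent client versions); the project name, hyperparameters, and metric values are made up for illustration:

```python
import neptune  # Neptune's Python client (pip install neptune)

# The project name, tags, and metric names below are made up for
# illustration; the API token is read from the NEPTUNE_API_TOKEN
# environment variable.
run = neptune.init_run(
    project="my-workspace/benchmarks",
    tags=["baseline", "cloud"],
)

run["parameters"] = {"lr": 1e-4, "batch_size": 32}  # log hyperparameters

for step in range(100):
    loss = 1.0 / (step + 1)            # placeholder for a real training loop
    run["train/loss"].append(loss)     # stream a metric series

run["eval/benchmark_score"] = 0.87     # final score on the benchmark suite
run.stop()                             # flush and close the run
```

Anything logged this way lands in the shared workspace, which is what enables the cross-team comparison described below.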

Also during the evaluation phase, several contacts at Neptune worked proactively with Neil Isaac, Senior Staff Software Developer at Waabi, to gather feedback on how the team could leverage the tool for their use case. This gave the team visibility into what the tool could do, and they could also see that some of their feature requests were already on the roadmap.

The most difficult challenge for the team was ensuring that experiment results were consistent and discoverable, and the prospect of improving that workflow is what led them to adopt Neptune.
Neptune:
1. Made it easy to share experiments across teams.
2. Provided customizable experiment tracking features.
3. Came with high-quality documentation.
4. Provided a feature for monitoring their computing resources.
5. Allowed the team to stop and abort runs remotely.
-
“If we’re just comparing it to having no experiment tracking, then Neptune is definitely super useful in terms of just having one place to organize all the results. Another important thing is that we can have a workspace that everyone has access to. So we can easily share experiments.” — James Tu, Research Scientist at Waabi
A major challenge the team faced, and one of their requirements, was the lack of visibility into the experiments each person ran, which made it difficult for teammates within and across teams to collaborate.
Neptune solved this problem with a collaborative workspace that made it easy and convenient for anyone with the proper permissions to share experiments with other people.
-
“When we launch a large number of experiments, the data is logged in Neptune, which we rarely set up. We set up the experiments first, and Neptune will log whatever metrics we need—tables, figures, and stuff.” — James Tu, Research Scientist at Waabi
Neptune helped the team keep track of their experiments with customizable features that let them log any data they wanted. Afterward, they would go to neptune.ai to check the experiments and, in the workspace, set up custom dashboards with whatever they needed to visualize the experiment data in real time.
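As an illustration of that logging flexibility, the sketch below uploads a figure and a results table to a run using the client’s documented `File.as_image` and `File.as_html` helpers; the project name and the data are invented:

```python
import matplotlib.pyplot as plt
import pandas as pd
import neptune
from neptune.types import File

run = neptune.init_run(project="my-workspace/benchmarks")  # made-up project

# Upload a figure; it becomes browsable in the run and can be pinned
# to a custom dashboard in the shared workspace.
fig, ax = plt.subplots()
ax.plot([0.9, 0.5, 0.3, 0.2], label="val loss")
ax.legend()
run["eval/loss_curve"].upload(File.as_image(fig))

# Upload a table of per-task results rendered as HTML.
results = pd.DataFrame({"task": ["detection", "prediction"],
                        "score": [0.91, 0.84]})
run["eval/per_task_results"].upload(File.as_html(results))

run.stop()
```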
-
“When we first started out considering different options, I think the quality of the documentation was definitely a deciding factor for us. I think we’re pretty happy with the Neptune documentation. Getting started is pretty straightforward.” — James Tu, Research Scientist at Waabi
The team found Neptune’s documentation to be of high quality, and it improved over time, making it easy for anyone to troubleshoot issues, find how-to guides, and generally use the tool with little effort.
-
“Our team leverages simulators quite heavily, and we use a lot of computers. One thing we’re always keeping track of is what the utilization is and how to improve it. Sometimes, we’ll get, for example, out-of-memory errors, and then seeing how the memory increases over time in the experiment is really helpful for debugging as well.” — James Tu, Research Scientist at Waabi
Some models they run are more data-heavy than others. The front end of an autonomy system, such as perception, has very large inputs: many different sensors stream large volumes of data in real time into the models deployed on the vehicle.
Offline perception tasks like data augmentation use large models that require substantial training resources: more distributed jobs, more GPUs per worker, and more workers. Simpler automation jobs may train on a single machine or even a single GPU.
The team’s workloads also need to scale: development often starts on a developer’s machine to make sure the code works, then quickly moves to the cloud, scaling up as the dataset grows.
“In general, it’s always essential for us to optimize training runtime. It directly affects cost as well as, more importantly, productivity. The faster we train models, the sooner we get results. That’s incredibly important to us.” — Neil Isaac, Senior Staff Software Developer at Waabi
The resource monitoring feature of neptune.ai benefited the team because they needed to keep an eye on large-scale training jobs across different experiments and teams to reduce cloud costs, improve team productivity, and make the best use of resources.
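Neptune captures hardware metrics (CPU, GPU, and memory utilization) alongside each run; in recent client versions this is controlled with flags on `init_run`, as in the illustrative sketch below (project name made up):

```python
import neptune

# Hardware metrics are captured per run and appear under the
# "monitoring" namespace, which is what makes problems like the
# out-of-memory errors mentioned above easy to diagnose after the fact.
run = neptune.init_run(
    project="my-workspace/benchmarks",   # made-up project name
    capture_hardware_metrics=True,       # on by default; shown for clarity
    capture_stdout=True,                 # also mirror the job's console logs
    capture_stderr=True,
)

# ... training happens here ...
run.stop()
```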
-
“One thing that’s really nice is that on Neptune, there’s a remote stop feature, and that’s really useful because we don’t need to kill a cloud training job, for example, we can just stop it.” — James Tu, Research Scientist at Waabi
Neptune’s remote stop feature allowed the team to stop training jobs running on their cloud infrastructure directly from the neptune.ai UI, without navigating cloud-infrastructure dashboards, which is very convenient for them.
Results

Adopting neptune.ai was helpful for the team because it improved the visibility of their workflow and made sure results were reproducible.
-
“Productivity has definitely improved. Neptune has made it easier to keep track of the experiments we are running and reduced the amount of overhead spent on organization. Also, Neptune’s remote stop feature is very useful for stopping experiments running on the cloud.” — James Tu, Research Scientist at Waabi
Teammates now have a tool that helps them keep track of experiment metadata in a central repository, one that seamlessly integrates with their workflow. No matter where they are in the project lifecycle, neptune.ai helps them see how benchmarks and experiments are doing on consistent datasets.
Neptune helped other teams discover insights from an experiment and made sure those results could be reproduced (or at least similar results could be reached) with the help of the logged metadata.
“I don’t think we’ve run into a lot of features that we would want but that are not there yet in Neptune.” — James Tu, Research Scientist at Waabi
-
“Organic adoption by our teams has been a key indicator that the tool has added value to their workflows and that they have been able to use it successfully.” — Neil Isaac, Senior Staff Software Developer at Waabi
Waabi has several engineers who are experts in MLOps. These engineers help build the clusters and tools that ML developers use daily. But pretty frequently, ML developers contribute to these tools as well, whether that’s adding new metrics or optimizing runtime.
Different teams have used neptune.ai on their own to improve how they work and make the best use of their resources.
“I would definitely recommend the product. The Neptune team made it easy to test and adopt the tool via a self-initiated trial but also took the time to make personal connections with our whole team and understand their needs.” — Neil Isaac, Senior Staff Software Developer at Waabi
Thanks to James Tu, Neil Isaac, and the team at Waabi for working with us to create this case study.