Case Study

How Veo Eliminated Work Loss With Neptune

Working with Neptune has brought in more structured management and enhanced security compared to our earlier approach with MLflow.
Philip Pries Henningsen
Senior ML Researcher at Veo Technologies
Before
    Sometimes lost up to 1 week's work from tool crashes
    Struggled to visualize and compare evaluation metrics
After
    Eliminated downtime and data loss from server disruptions
    Can effortlessly analyze 50+ metrics

Veo builds sports cameras: follow-cameras with dual 4K lenses that record a panoramic view of an entire sports field, in this case a soccer field. The camera is embedded with computer vision and other AI models that power analytics, player tracking, ball tracking, and relevant statistics, and coaches and teams can use it to live-stream or analyze matches.

“Every product we develop originates from a machine learning component. It was our initial focus, shaping everything that followed. Without machine learning, Veo as it stands today wouldn’t exist,” says Philip Pries Henningsen, Senior ML Researcher at Veo Technologies.

Source: Veo

The challenge

Due to the ML-first nature of their product, Veo’s ML team trains many models. Because of the sheer size of the data they process (often 4K video streams), the training jobs can run for a day or up to a few weeks.

Before switching to Neptune, Veo used MLflow to track experiments across all projects. The team deployed MLflow on a small EC2 instance and managed it themselves.

Due to this setup, MLflow was very unreliable, especially under heavy computational loads, leading to connectivity issues and crashes. This disrupted ongoing ML tasks and experiments. In the worst-case scenarios, it resulted in losing up to a week’s work, depending on the model and project.

Looking for a stable and reliable tracking solution

As Veo’s operations grew, the ML team spent more and more time maintaining MLflow, rerunning experiments, and tackling problems caused by MLflow crashes. So they decided to look for an alternative.

The criteria for a new solution were: 

  • A reliable, stable tool with minimal downtime;
  • Easy integration with their existing technology stack.

After evaluating their needs, Veo opted for Neptune, primarily due to its managed and stable environment.

Neptune distinguishes itself with its asynchronous mode of operation. The team’s data is saved locally in this setup and then periodically synchronized to the Neptune server. This approach ensures network disruptions or server issues don’t disrupt ongoing experiments. It’s a significant improvement over MLflow’s synchronous model.
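The pattern described above (write locally first, synchronize in the background) can be illustrated with a minimal standard-library sketch. This is not Neptune's implementation or API: the class name BufferedLogger, the send callable, and the JSON-lines file layout are all assumptions made for illustration only.

```python
import json
import os
import tempfile
import threading


class BufferedLogger:
    """Hypothetical sketch: append metrics to a local file, sync them in the background."""

    def __init__(self, path, send, interval=1.0):
        self.path = path
        self.send = send          # assumed upload callable; may raise ConnectionError
        self.interval = interval  # seconds between background sync attempts
        self.synced = 0           # number of records already delivered
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def log(self, name, value, step):
        # The training loop only writes to local disk, so it never waits on the network.
        record = json.dumps({"name": name, "value": value, "step": step})
        with self._lock:
            with open(self.path, "a") as f:
                f.write(record + "\n")

    def _sync_once(self):
        with self._lock:
            if not os.path.exists(self.path):
                return
            with open(self.path) as f:
                lines = f.readlines()
        for line in lines[self.synced:]:
            try:
                self.send(json.loads(line))
            except ConnectionError:
                return  # the record stays on disk and is retried on the next attempt
            self.synced += 1

    def _loop(self):
        # Wake up every `interval` seconds until close() is called.
        while not self._stop.wait(self.interval):
            self._sync_once()

    def close(self):
        self._stop.set()
        self._thread.join()
        self._sync_once()  # one last attempt to flush whatever is still pending


if __name__ == "__main__":
    delivered = []
    log_path = os.path.join(tempfile.mkdtemp(), "run.jsonl")
    logger = BufferedLogger(log_path, delivered.append, interval=0.1)
    for step in range(3):
        logger.log("loss", 1.0 / (step + 1), step)
    logger.close()
    print(len(delivered), "records delivered")
```

In this sketch, a failed upload leaves the record on disk and the training loop carries on, which is the property the case study credits for avoiding lost work. A synchronous logger would instead block or fail inside log() whenever the server was unreachable.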

Since adopting Neptune, Veo’s fear of data loss or prolonged downtime due to server issues has become a thing of the past. 

Not only has this resulted in a more reliable training process, but it has also translated into tangible time savings. Previously, unexpected crashes could set back their work by days, sometimes up to a week. Now, network or server disruptions no longer cost them work, which improves the cost-effectiveness of their operations. This shift substantially increased the overall efficiency and productivity of Veo’s ML workflow.

We needed a stable, reliable solution that keeps data safe even in unforeseen events. With Neptune, you feel that you can depend on it. You’re not scared that if the training job crashes overnight, you will lose days of your work. It’s crucial for our operations.
Philip Pries Henningsen, Senior ML Researcher at Veo Technologies

Organizing and effectively comparing 50+ metrics

Because a given soccer sequence contains many events (a goal, an offside, a pass, and so on), Veo’s projects often involve tracking many metrics across training epochs.

The team often wants to record 50+ metrics in the experiment tracking tool. MLflow’s UI would become slow or unresponsive at that scale, making it hard to inspect the metrics.

On the other hand, they may only need to analyze a few of those metrics across runs to decide what goes to production. At the time, MLflow didn’t provide Veo with the functionality for filtering and comparing experiments. 

To address this challenge, the team wanted an alternative tool that could:

  • Scale to handle many metrics without impacting the user experience.
  • Provide quick comparisons between checkpoints within one or more training runs.

Neptune easily manages thousands of metrics from Veo’s image- and video-based experiments without compromising performance. The team can track and visualize their models’ progress in real-time, down to the epoch level. 

They easily search and filter through their tracked metadata to only compare the most relevant metrics. And they save those views in custom dashboards for every project. 
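The idea of narrowing many logged metrics down to the few that drive a decision can be sketched in plain Python. This is a generic illustration, not Neptune's API: the function compare_runs, the run IDs, and the metric names (such as "goal/f1") are hypothetical.

```python
def compare_runs(runs, metrics, sort_by, higher_is_better=True):
    """Return (run_id, {metric: latest value}) rows for the selected metrics, best run first.

    `runs` maps a run ID to a dict of metric name -> list of logged values.
    """
    if sort_by not in metrics:
        raise ValueError("sort_by must be one of the selected metrics")
    rows = []
    for run_id, logged in runs.items():
        # Skip runs that never logged the metric used for ranking.
        if not logged.get(sort_by):
            continue
        latest = {name: logged[name][-1] for name in metrics if logged.get(name)}
        rows.append((run_id, latest))
    rows.sort(key=lambda row: row[1][sort_by], reverse=higher_is_better)
    return rows


if __name__ == "__main__":
    # Hypothetical runs; a real project would log many more metrics per run.
    example_runs = {
        "run-a": {"goal/f1": [0.5, 0.7], "pass/recall": [0.6, 0.8], "loss": [1.0, 0.4]},
        "run-b": {"goal/f1": [0.6, 0.9], "pass/recall": [0.5, 0.7], "loss": [1.1, 0.3]},
    }
    for run_id, values in compare_runs(example_runs, ["goal/f1", "pass/recall"], "goal/f1"):
        print(run_id, values)
```

Each run may log dozens of series, but the comparison keeps only the selected ones and ranks runs by a single chosen metric.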

Veo’s ML team now focuses on analyzing metrics instead of refreshing interfaces and dealing with UI instability. 

Visualizing multiple charts is more thought over in Neptune than in MLflow. We’re definitely saving time while inspecting performance because the UI is faster and smoother.
Philip Pries Henningsen, Senior ML Researcher at Veo Technologies

Enhanced image and video data visualization

Veo’s primary data source is video footage from soccer matches. The team works with high-resolution video streams, likely in 4K or even higher resolutions—high-volume visual data. An evaluation of the setup would involve two real-time processing pipelines with 4K video streams.

Visualizing images and videos in MLflow wasn’t ideal for this volume. While they could log images as artifacts and view them in the interface, the visualization was relatively small, which made inspecting and analyzing images tedious. Also, MLflow doesn’t support logging videos, which is essential to the team’s experimentation process.

They needed a tool with good visualization functionality that supported visual data types. Neptune met those criteria.

The team can log images and videos and display them in the UI for visual training progress inspection without compromising the frames per second or video quality. The interface and visualizations work irrespective of the number of images.

Image is more of a first-class citizen in Neptune. It’s easier to visualize them and see progress across different epochs.
Philip Pries Henningsen, Senior ML Researcher at Veo Technologies
Image gallery in the Neptune app

The results

  • Eliminated downtime and data loss from server disruptions.
  • Enhanced UI performance, handling 50+ metrics simultaneously.
  • Enabled real-time, high-resolution image and video visualizations.
  • Streamlined experiment comparison and dashboard customization.

Thanks to Philip Pries Henningsen and the team at Veo for collaborating with us to create this case study.

As our company has grown from a startup to a sizeable organization of 200 people, robust security and effective user management have become increasingly evident and vital.
Philip Pries Henningsen, Senior ML Researcher at Veo Technologies