“The more I used Neptune, the more I felt that I would rather pay for a hosted solution than have to maintain the infrastructure myself.”
Zoined offers Retail and Hospitality Analytics as a cloud-based service for roles ranging from top management to store managers. The service collects sales data from stores and venues, including inventories, time and attendance, visitor tracking systems, and webstores. The data is analyzed and presented in an accessible, visual format so business owners get real-time, actionable insights and can select the time frames they want to report on. The product also lets businesses easily filter and group their data, create custom views, and quickly grasp trends with charts and graphs.
With Zoined®, businesses get a fully managed, off-the-shelf solution with ready-made dashboards and analytics for retail and wholesale, tailored especially to the needs of fashion, food retail, coffee shops, and restaurants.
Running lots of experiments, especially in a start-up with few scientists and engineers, can get daunting. Tracking the experiments, versioning the datasets as they inevitably grow, and putting procedures in place to get reproducible results can be tricky to navigate. This was the problem Kha faced when he first joined Zoined.
“When I joined this company, we were doing quite many different experiments and it’s really hard to keep track of them all so I needed something to just view the result or sometimes or also it’s intermediate results of some experiments like what [does] the data frame look like? What [does] the CSV look like? Is it reasonable? Is there something that went wrong between the process that resulted in an undesirable result? So we were doing it manually first but just writing… some log value to some log server like a Splunk.”
In addition, he was the only one responsible for the forecasting pipeline at Zoined, which made manual experiment tracking all the more tedious.
Kha was also working with large data frames of forecasts (predictions) that needed to be logged alongside their experiments, and he needed a way to visualize both final and intermediate results so he could be more efficient during experimentation.
Problems with Splunk for experiment tracking
The first solution the team tried was manually logging experiment values to Splunk. For one, such a tool can be intimidating to get started with.
Another problem is that visualizing logged (experiment) values is quite difficult to do and might require some expert help to set up.
Finally, Splunk can get expensive pretty fast — especially for a company that runs a lot of experiments and will need to send a large volume of data to the log server.
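To make the "manual" approach concrete: before adopting a tracking tool, logging an experiment typically means serializing its parameters and results as structured log events that a log server such as Splunk can ingest. The sketch below is a hypothetical, stdlib-only illustration of that pattern; the function and field names are our own, not anything Zoined or Splunk provides.

```python
import json
import logging

# Hypothetical sketch of manual experiment logging: each run is emitted
# as one JSON line, the kind of structured event a log server like
# Splunk could index. All names here are illustrative assumptions.

logger = logging.getLogger("experiments")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

def log_experiment(run_id, params, metrics):
    """Serialize one experiment run as a single JSON log line."""
    event = {"run_id": run_id, "params": params, "metrics": metrics}
    line = json.dumps(event, sort_keys=True)
    logger.info(line)
    return line  # returned so callers can inspect what was emitted

log_experiment("forecast-001", {"horizon": 30}, {"mape": 0.12})
```

This works, but everything downstream — querying runs, comparing metrics, visualizing intermediate results — has to be built by hand on top of the raw log stream, which is exactly the overhead described above.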
Problems with maintaining MLflow
Reliability and speed of MLflow
The next solution Kha tried was MLflow. One of the issues he had was the hosting options available: as he mentioned, the only hosted MLflow option is Databricks. He started with a self-hosted MLflow setup, but it quickly became difficult for one person to manage.
“… And then I came across ML flow. We started using it but I think that the only way to have a hosted MLflow solution is by using Databricks. Otherwise, we were hosting MLflow ourselves and I’m the only person responsible for the entire forecasting pipeline here. So it’s really a hassle for me to experiment and maintain MLflow at the same time that we have to prepare the database for it, S3 for it, and then some server to run it on.”
As he found out, MLflow can also be compute-intensive, consuming a lot of RAM and running quite slowly.
“Sometimes, MLflow is not reliable because I think it’s not really optimized as it consumes quite a lot of RAM and runs really slow too.”
For Kha, hosting MLflow on his own server also raised the problem of autoscaling. In most cases, MLflow couldn't handle a large stream of logs: it would either crash or its UI would stop responding, slowing down his experimentation workflow. As he mentions:
“The real headache came when we ran like 100 experiments, 100 forecasts at the same time and all of that started streaming data into MLflow. That’s when we see some error flow is not responding, not available.”
To get MLflow to handle the stream of logs, he had to scale up the number of instances, which became complex operations work to maintain.
“… So I had to increase the number of instances but it doesn’t really make sense to have them scale up all the time. What I could do was to set up some kind of elastic scale for instance. This required more infrastructure maintenance because I felt like the Terraform infrastructure was growing quite big and I thought this is not a good direction to go because it may grow even bigger. That was when I felt like if I could share this with some other people, it would be more efficient.”
MLflow would, in fact, have been a great tool for Kha to manage hundreds of experiments, if only it didn't have the issues listed above.
Problems with collaboration on MLflow
Collaboration was another problem with the self-hosted MLflow setup: sharing experiments was difficult because Kha had to create URL aliases for the logs before he could share them with collaborators. As he mentioned:
“There’s also the issue of creating an URL alias for it (MLflow). So I feel like why do I have to do all this manually?”
Kha needed a solution like MLflow but without the hassle that the self-hosted MLflow solution brought about. He needed a solution that:
- Was completely managed,
- Didn’t take too long to set up and get started with,
- Could elastically scale to large volumes of experiment logs and forecast datasets,
- Was also completely automated and fast,
- Could be customized and integrated with existing technologies.
Kha decided to do some digging and came across Neptune, which met all the requirements he needed.
“I felt like “why do I have to do all these manually?” And then I came across Neptune and it seems like, okay, this is the host, the solution and it seems to be equivalent to MLflow.”
Kha decided to choose Neptune as Zoined’s solution for logging experiment metadata because:
1. It is fully managed, fast, and scalable
2. It offers a better price-to-value ratio and is more accessible
3. It has better charts and visualizations of his experiments
4. It can visualize all types of data regardless of size and structure
5. It automatically logs hardware performance metrics
“I started using Neptune, and then the more I used it, the more I felt like “okay, I would rather pay than maintain hold of this infrastructure myself.” – Kha Nguyen, Senior Data Scientist at Zoined
As Kha learned while using MLflow, fully managed infrastructure was his best bet for improving the experimentation process: it frees him from infrastructure and operations workloads (which are not his core strengths) so he can focus on improving his experiments.
Compared to MLflow, Neptune automatically scales to handle the artifacts and metadata logged for the hundreds of experiments he runs for each of the company's clients. MLflow, by contrast, would often crash whenever he tried to log a CSV file with more than 10,000 rows, halting his work and hurting his productivity.
As he explained:
“In MLflow, when I log a CSV file that’s about 10,000 rows, MLflow just stops working. I click on the CSV file, it may take maybe three minutes before it shows up, and even when it starts, it doesn’t work smoothly anymore. It’s totally unusable but that’s not a problem with Neptune.” – Kha Nguyen, Senior Data Scientist at Zoined
“The more I used Neptune, the more I felt that I would rather pay for a hosted solution than have to maintain the infrastructure myself… I talked to Salsa (Zoined’s CEO) who asked about the pricing, and I said 50 dollars per month and that’s how we got in.” – Kha Nguyen, Senior Data Scientist at Zoined
As he found out, Neptune is a great alternative to his previous solutions. For individuals, it costs nothing to use Neptune for work, research, and personal projects. For teams, pricing starts at $49 per month for the entire team, with extra charges only when they exceed the free usage quota. Getting up and running with Neptune wasn't a complicated process for Kha.
“It (Neptune) has much nicer visualizations or charts because sometimes when I wanna log some kind of chart or graph, MLflow can do that, but it will become really slow to open a chart.” – Kha Nguyen, Senior Data Scientist at Zoined
One of Neptune's well-known features is the ability to customize charts and use automated visualization features that save users a lot of time. For Kha, Neptune's visualizations of experiments and other metrics are much nicer and more responsive than MLflow's.
With Neptune, Kha found that he could visualize Pandas data frames just as he would in his own workspace, and he could log large volumes of streamed experiment data while everything still worked smoothly.
Neptune scales to handle large volumes of data and is fully managed, so Kha only has to worry about his experiments, not the underlying logging server. He also found Neptune's ability to log data frames directly to the platform very useful.
One feature Neptune provides that MLflow lacks is the option to log hardware metrics, giving an insightful look into how experiments are doing and how many resources they consume. Kha finds this feature particularly useful because he can use the insights to optimize his experiments' resource usage.
As Kha explains:
“You also have automatic computing resource monitoring where you can start monitoring CPU and memory usage out-of-the-box. I think that’s cool so that we can gauge how much resources we need. When I look at it, I can see that we are using too much RAM, or do we not have enough? Do we need to use more CPU, for example?” – Kha Nguyen, Senior Data Scientist at Zoined
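Neptune collects these hardware metrics automatically. As a rough, stdlib-only illustration of what such monitoring measures, the hypothetical sketch below samples this process's peak resident memory around a workload using the Unix-only `resource` module; none of these names come from Neptune's API.

```python
import resource
import sys

# Hypothetical sketch of the kind of memory monitoring a tracking tool
# reports out of the box: compare peak resident set size (RSS) before
# and after a workload to judge whether a job needs more or less RAM.

def peak_rss_kb():
    """Peak resident set size of this process so far, in kilobytes.

    On Linux ru_maxrss is reported in kilobytes; on macOS it is in
    bytes, hence the platform check.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss // 1024 if sys.platform == "darwin" else rss

before = peak_rss_kb()
data = [list(range(1000)) for _ in range(1000)]  # stand-in workload
after = peak_rss_kb()
print(f"peak RSS grew by ~{after - before} kB during the workload")
```

A real tracker samples CPU and memory continuously in a background thread and plots the series per run; this one-shot comparison only conveys the idea of measuring a job's footprint to right-size its resources.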
After a few months of using Neptune, how has it improved Kha’s experimentation workflow?
Overall, Neptune was able to meet the requirements of Kha, the sole data scientist on his team. It proved to be a useful solution because:
Having struggled earlier with self-hosted MLflow, Kha found that a fully managed solution let him focus on improving his experiments rather than configuring and maintaining logging infrastructure, regardless of scale.
“I can pretty much log everything in Neptune and more…” – Kha Nguyen, Senior Data Scientist at Zoined
Neptune gives Kha the option to customize what he logs and also includes out-of-the-box options for metadata. Being able to log large volumes of data further improves his experimentation workflow, making it easy to keep all his experiment optimization tools in one central place.
“I didn’t think about logging something like CPU metrics or memory metrics and it turned out to be pretty important when debugging something running in parallel with big data, for example. I didn’t think about that when I was using MLflow, so this is something that I find extremely helpful.” – Kha Nguyen, Senior Data Scientist at Zoined
Neptune's hardware performance monitoring helped Kha estimate the memory usage of his experiments and optimize accordingly, saving money on the jobs he runs on Amazon Web Services.
“The more I used Neptune, the more I felt that I would rather pay for a hosted solution than have to maintain the infrastructure myself.” – Kha Nguyen, Senior Data Scientist at Zoined
Kha found that Neptune is a more economical option than his previous solutions: not only did it cost less than the time he spent maintaining MLflow, but the fully managed service also reduced Zoined's bill for hosting logging software on its own infrastructure.
For Kha, Neptune proved to be a better alternative to MLflow not just economically but also in terms of his productivity in running numerous experiments.
“For now, I’m not using MLflow anymore ever since I switched to Neptune because I feel like Neptune is a superset of what MLflow has to offer.” – Kha Nguyen, Senior Data Scientist at Zoined
Thanks to Kha Nguyen for his help in creating this case study!
Want your team to focus on experiments instead of maintaining the infrastructure?
- Industry: Analytics
- Team size: 13
- Frameworks: Python, Typer, Docker, AWS Batch
- Neptune use cases: experiment tracking, debugging