Case Study

Zoined

The more I used Neptune, the more I felt that I would rather pay for a hosted solution than have to maintain the infrastructure myself.
Kha Nguyen
Senior Data Scientist at Zoined
Before
    Struggled to maintain an open source tracking solution
    Lacked scalability
After
    A scalable, managed tracking tool that doesn't distract the team from building models

Zoined offers Retail and Hospitality Analytics as a cloud-based service for different roles from top management to manager level. The service collects sales data from stores and venues including inventories, time and attendance, and visitor tracking systems as well as webstores. The data is analyzed and presented in a very accessible, visual format for business owners so they can get real-time, actionable insights for their business and select the preferred time frames they want to report on. The product also allows businesses to filter and group their data easily, create custom views, and grasp trends quickly with charts and graphs.

Zoined dashboard | Source: Zoined

With Zoined®, businesses have access to a fully managed, off-the-shelf solution with ready-made dashboards and analytics for retail and wholesale, especially for the needs of fashion, food retail, coffee shops, and restaurants.

Problem

Running lots of experiments, especially in a start-up with few scientists and engineers solving problems, can get daunting. Tracking the experiments, versioning the datasets as they inevitably get larger, and generally following procedures that yield reproducible results can be tricky. This was the problem Kha faced when he first joined Zoined.

When I joined this company, we were doing quite many different experiments and it’s really hard to keep track of them all so I needed something to just view the result or sometimes or also it’s intermediate results of some experiments like what [does] the data frame look like? What [does] the CSV look like? Is it reasonable? Is there something that went wrong between the process that resulted in an undesirable result? So we were doing it manually first but just writing… some log value to some log server like a Splunk.
Kha Nguyen Senior Data Scientist at Zoined

In addition, he was the only person responsible for the forecasting pipeline at Zoined, which made manual experiment tracking even more tedious.

Kha was also working with large data frames of forecasts (predictions) that needed to be logged alongside their experiments. He also needed a way to visualize results for both completed and intermediate experiments so he could work more efficiently during experimentation.

Problems with Splunk for experiment tracking

The first solution the team tried was to manually log experiment values to Splunk. For one, such a tool can be intimidating to get started with.

Another problem is that visualizing logged experiment values is quite difficult and might require expert help to set up.

Finally, Splunk can get expensive pretty fast, especially for a company that runs a lot of experiments and needs to send a large volume of data to the log server.

Problems with maintaining MLflow

Reliability and speed of MLflow

The next solution Kha tried was MLflow. One issue was the limited hosting options: as far as he could tell, the only hosted MLflow offering was through Databricks. He started with a self-hosted MLflow instance, but it quickly became difficult for one person to manage.

… And then I came across MLflow. We started using it but I think that the only way to have a hosted MLflow solution is by using Databricks. Otherwise, we were hosting MLflow ourselves and I’m the only person responsible for the entire forecasting pipeline here. So it’s really a hassle for me to experiment and maintain MLflow at the same time that we have to prepare the database for it, S3 for it, and then some server to run it on.
Kha Nguyen Senior Data Scientist at Zoined

As he found out, using MLflow can be compute-intensive, consuming a lot of RAM and running quite slowly. 

Sometimes, MLflow is not reliable because I think it’s not really optimized as it consumes quite a lot of RAM and runs really slow too.
Kha Nguyen Senior Data Scientist at Zoined

Hosting MLflow on a local server also left Kha to handle scaling himself. MLflow often couldn’t cope with a large stream of logs: it either crashed or its UI stopped responding, which slowed down his experimentation workflow. As he mentions:

The real headache came when we ran like 100 experiments, 100 forecasts at the same time and all of that started streaming data into MLflow. That’s when we see MLflow is not responding, not available.
Kha Nguyen Senior Data Scientist at Zoined

To get MLflow to cope with the stream of logs, he had to scale up the number of instances, which turned into complex operations work.

… So I had to increase the number of instances but it doesn’t really make sense to have them scale up all the time. What I could do was to set up some kind of elastic scale for instance. This required more infrastructure maintenance because I felt like the Terraform infrastructure was growing quite big and I thought this is not a good direction to go because it may grow even bigger. That was when I felt like if I could share this with some other people, it would be more efficient.
Kha Nguyen Senior Data Scientist at Zoined

MLflow would, in fact, have been a great tool for Kha to manage hundreds of experiments had it not been for the issues listed above.

Problems with collaboration on MLflow

Collaboration was also a problem with self-hosted MLflow: to share experiments with collaborators, Kha had to create URL aliases for the logs manually. As he mentioned:

There’s also the issue of creating an URL alias for it (MLflow). So I feel like why do I have to do all this manually?
Kha Nguyen Senior Data Scientist at Zoined

Solution

Kha needed a solution like MLflow but without the hassle that the self-hosted MLflow solution brought about. He needed a solution that:

  • Was completely managed,
  • Didn’t take too long to set up and get started with,
  • Could elastically scale to large volumes of experiment logs and forecast datasets,
  • Was also completely automated and fast,
  • Could be customized and integrated with existing technologies.

Kha decided to do some digging and came across Neptune, which met all of his requirements.

I felt like “why do I have to do all these manually?” And then I came across Neptune and it seems like, okay, this is the host, the solution and it seems to be equivalent to MLflow.
Kha Nguyen Senior Data Scientist at Zoined

Kha decided to choose Neptune as Zoined’s solution for logging experiment metadata because:

  1. It is fully managed, fast, and scalable
  2. It offers a better price-to-value ratio and is accessible
  3. It has better charts and visualizations of his experiments
  4. It can visualize all types of data regardless of the size and structure
  5. It has automated logging of hardware performance metrics
  • “I started using Neptune, and then the more I used it, the more I felt like ‘okay, I would rather pay than maintain hold of this infrastructure myself.’” – Kha Nguyen, Senior Data Scientist at Zoined

    As Kha learned while using MLflow, a fully managed infrastructure was the best way to improve his experimentation process: it frees him from infrastructure and operations work (which is not technically his core strength) so he can focus on improving his experiments.

    Unlike MLflow, Neptune automatically scales to handle the artifacts and metadata logged for the hundreds of experiments he runs for each of the company’s clients. MLflow, by contrast, would often become unusable whenever he logged a CSV file of about 10,000 rows, halting his work and hurting his productivity.

    As he explained:

    “In MLflow, when I log a CSV file that’s about 10,000 rows, MLflow just stops working. I click on the CSV file, it may take maybe three minutes before it shows up, and even when it starts, it doesn’t work smoothly anymore. It’s totally unusable but that’s not a problem with Neptune.” – Kha Nguyen, Senior Data Scientist at Zoined

  • “The more I used Neptune, the more I felt that I would rather pay for a hosted solution than have to maintain the infrastructure myself… I talked to Salsa (Zoined’s CEO) who asked about the pricing, and I said 50 dollars per month and that’s how we got in.” – Kha Nguyen, Senior Data Scientist at Zoined

    As he found out, Neptune is a great alternative to previous solutions. For individuals, it costs nothing to use Neptune for work, research, and personal projects. For teams, pricing starts from $49 per month for the entire team, and they only pay extra when they go over the free usage quota. Getting up and running with Neptune wasn’t a complicated process for Kha.

  • “It (Neptune) has much nicer visualizations or charts because sometimes when I wanna log some kind of chart or graph, MLflow can do that, but it will become really slow to open a chart.” – Kha Nguyen, Senior Data Scientist at Zoined

    One of the well-known features of Neptune is the ability to customize charts and use automated visualization features that can save users a lot of time. For Kha, Neptune has nicer and more responsive visualization features for experiments and other metrics than MLflow.

  • With Neptune, Kha found that he could visualize data with Pandas data frames as he normally would in his workspace. He could also log large volumes of data streamed from his experiments, and everything still worked smoothly.

    Neptune is fully managed and scales to handle large volumes of data, so Kha only has to worry about his experiments and not the underlying logging server. He also found Neptune’s ability to log data frames directly to the platform very useful.

  • One feature that sets Neptune apart from MLflow is the option for users to log hardware metrics, which give an insightful look into how their experiments are doing and how many resources they are taking up. Kha finds this feature particularly useful because he can use the insights to improve the resource usage of his experiments.

    As Kha explains:

    “You also have automatic computing resource monitoring where you can start monitoring CPU and memory usage out-of-the-box. I think that’s cool so that we can gauge how much risk resources that we need to do. When I look at it, I can see that we are using too much RAM, or do we not have enough? Do we need to use more CPU, for example?” – Kha Nguyen, Senior Data Scientist at Zoined
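
To make the idea of resource monitoring concrete, here is a minimal sketch of measuring the CPU time and peak memory of a single job using only Python's standard library. It is not how Neptune implements its monitoring, which works out of the box without code like this. The names `measure_resources` and `toy_forecast` are hypothetical and introduced purely for illustration, and `tracemalloc` only reports memory allocated through the Python interpreter, so the numbers should be read as a rough gauge.

```python
import time
import tracemalloc


def measure_resources(job, *args, **kwargs):
    """Run `job` and return (result, stats).

    stats holds the CPU seconds used by this process while the job ran
    and the peak memory traced by tracemalloc, in megabytes.
    """
    tracemalloc.start()
    cpu_start = time.process_time()
    try:
        result = job(*args, **kwargs)
        # get_traced_memory() returns (current_bytes, peak_bytes).
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    cpu_seconds = time.process_time() - cpu_start
    stats = {
        "cpu_seconds": cpu_seconds,
        "peak_memory_mb": peak_bytes / 1024**2,
    }
    return result, stats


def toy_forecast(n_rows):
    # Hypothetical stand-in for a forecasting job: builds a list of
    # floats in memory and returns their mean.
    history = [float(i) for i in range(n_rows)]
    return sum(history) / len(history)


if __name__ == "__main__":
    forecast, stats = measure_resources(toy_forecast, 10_000)
    print(f"forecast={forecast}")
    print(f"cpu_seconds={stats['cpu_seconds']:.4f}")
    print(f"peak_memory_mb={stats['peak_memory_mb']:.2f}")
```

Figures like these, recorded next to each run, are the kind of signal that helps answer Kha's questions about whether a job is using too much RAM or needs more CPU.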

Results

After a few months of using Neptune, how has it improved Kha’s experimentation workflow?

Overall, Neptune met the requirements of Kha, who works as a solo data scientist on the forecasting pipeline. It proved to be a useful solution because:

  • After his earlier struggles with self-hosted MLflow, a fully managed solution allowed Kha to focus on improving his experiments rather than configuring and maintaining logging infrastructure, regardless of scale.

  • “I can pretty much log everything in Neptune and more…” – Kha Nguyen, Senior Data Scientist at Zoined

    Neptune lets Kha customize what he logs and also includes out-of-the-box options for common metadata. Being able to log large volumes of data also improves his workflow, since it keeps everything he needs to optimize his experiments in one central place.

  • “I didn’t think about logging something like CPU metrics or memory metrics and it turned out to be pretty important when debugging something running in parallel with big data, for example. I didn’t think about that when I was using MLflow, so this is something that I find extremely helpful.” – Kha Nguyen, Senior Data Scientist at Zoined

    Neptune’s hardware performance monitoring feature helped Kha to estimate the memory usage for his experiments and optimize accordingly, saving him money on the jobs he runs on Amazon Web Services.

  • “The more I used Neptune, the more I felt that I would rather pay for a hosted solution than have to maintain the infrastructure myself.” – Kha Nguyen, Senior Data Scientist at Zoined

    Kha found Neptune to be the more economical option: it cost less than the time he had spent maintaining MLflow, and as a fully managed solution it also cut the bill for hosting logging software on Zoined’s infrastructure.

    For Kha, Neptune proved to be a better alternative to MLflow not just economically but also in terms of his productivity in running numerous experiments.

    “For now, I’m not using MLflow anymore ever since I switched to Neptune because I feel like Neptune is a superset of what MLflow has to offer.” – Kha Nguyen, Senior Data Scientist at Zoined


Thanks to Kha Nguyen for his help in creating this case study!
