Case Study

Brainly

"Neptune’s UI and the front-end work great, and you don’t feel that you ‘fight’ with it. So instead of ‘fighting’ the tool, the tool itself is helping."

Hubert Bryłkowski

Senior Machine Learning Engineer at Brainly

Brainly is the leading learning platform worldwide, with the most extensive Knowledge Base for all school subjects and grades. Each month over 350 million students, parents and educators rely on Brainly as the proven platform to accelerate understanding and learning.

One of their core products and key entry points is Snap to Solve.

How Snap to Solve works

Snap to Solve is a machine learning-powered product that lets users take and upload a photo; Snap to Solve then detects the question or problem in that photo and provides solutions. 

Snap to Solve offers such solutions by matching users with other Brainly product features such as Community Q&A (a Knowledge Base of questions and answers) or Math Solver (providing step-by-step solutions to math problems).

How Snap to Solve works | Source

About the team

Brainly has an AI Services Department where it invests in producing ML as a Service in different areas such as content, user, curriculum, and visual search.

This case study shows how the Visual Search team integrated Neptune.ai with Amazon SageMaker Pipelines to track everything in the development phase of the Visual Content Extraction (VICE) system for Brainly’s Snap to Solve product.

Team details

  • 1 Lead Data Scientist
  • 2 Data Scientists
  • 2 Machine Learning Engineers
  • 1 MLOps (Machine Learning Operations) Engineer
  • 1 Data Analyst
  • 1 Data Labeling Lead
  • 1 Delivery Manager

Workflow

The team uses Amazon SageMaker to run their computing workloads and serve their models. They train many computer vision models in both TensorFlow and PyTorch, choosing the framework depending on the use case. To speed up data transformation with GPUs, they moved some of their data augmentation jobs to NVIDIA DALI.

The team works in two-week sprints and uses time-boxing to keep their research efforts focused and manage experimentation. They also keep their work processes flexible because they frequently adapt to the experiment results.

Problem

The Visual Search team at Brainly encountered several challenges while developing the vision models for Snap to Solve, including:

  • Working on a variety of models at the same time
  • Separate engineers and non-technical users working on one or more projects simultaneously
  • Creating a single place to keep track of all the running experiments

The team would run training jobs in their pipelines to solve different problems connected to the same product. For instance, one job could be training the main models that detect objects in a scene; other jobs could be training auxiliary models for matching features, detecting edges, or cropping objects. 

With multiple engineers working on the same or separate models, the team needed to develop a concrete plan to work together on the projects. In addition, the strategy had to accommodate sharing outcomes with non-technical users (such as product managers).

On the technical side, without such a plan, the experimentation process would be challenging to manage and might become impossible to reproduce as the number of training runs grew.

When the number of training runs on the team’s large compute architectures increased, they realized that their logs from Amazon SageMaker needed to be trackable and manageable, or they would cause bottlenecks in their workflow.

“We ran a lot of experiments, and we needed a single place to track them.” – Hubert Bryłkowski, Senior Machine Learning Engineer at Brainly

Although they tried SageMaker Experiments to track experiments, they needed a purpose-built tool that could monitor them at scale and integrate well with their technology stack.

Solution

While the engineers on the Visual Search team were planning their technology stack, they saw the need to include an experiment management tool:

“In the first month, we discussed what our ideal environment for machine learning (ML) development would look like, and experiment tracking was a key part of it.”

Hubert Bryłkowski

Senior Machine Learning Engineer at Brainly

The team concluded that they needed a tool that they could rely on to help them manage their experiments and integrate with SageMaker Pipelines, so they started searching for such solutions.

Searching for a tool to manage experiments at scale

In evaluating tools they could use, the team searched for one that fit the following criteria:

  • Easy and intuitive to use
  • Reliable
  • Has a reasonable pricing model
  • Just works for them: the tool should help them solve problems in their projects

The team did not consider SageMaker for experiment management when researching solutions for their stack. Instead, they explored solutions including Weights & Biases and MLflow but did not settle on either option after using them for a trial period.

Settling on a solution that worked

While searching for other options, they decided to try out Neptune for their experiment management needs, and they ultimately adopted it. So why did they choose Neptune for managing their experiments?

“The GUI is intuitive and has a modern feel, and the Python client has all our required features. In addition, the flexible payment plans were a plus for us.”

Hubert Bryłkowski

Senior Machine Learning Engineer at Brainly

Neptune met the criteria they outlined while they were exploring different solutions. It provided:

  • Custom integration with SageMaker to log training jobs at scale
  • A reliable and intuitive tool that “just works”
  • Sharing of experiment results and charts with anyone at no additional cost
  • Flexible pricing options
  • Help with debugging and optimizing computational resource consumption

Custom integration with SageMaker to log training jobs at scale

“We are running our training jobs through SageMaker Pipelines, and to make it reproducible, we need to log each parameter when we launch the training job with SageMaker Pipeline. A useful feature here is the `NEPTUNE_CUSTOM_RUN_ID` environment variable.”

Mateusz Opala

Senior Machine Learning Engineer at Brainly

The team developed a custom template to integrate Neptune with Amazon SageMaker Pipelines using the `NEPTUNE_CUSTOM_RUN_ID` feature. First, they set the `NEPTUNE_CUSTOM_RUN_ID` environment variable to a unique identifier. Then, whenever they launch a new training job (through a Python script) from their local machine or in AWS, the variable tells Neptune that all jobs started with the same `NEPTUNE_CUSTOM_RUN_ID` value should be treated as a single run.

This way, the team can run multiple pipeline jobs (such as training jobs and data pre-processing jobs), and all these different jobs will log their metadata to a single run in Neptune. 

As a result, the team can log and retrieve experiment metadata and usage metrics for the whole computational pipeline (for example, the data pre-processing and training jobs) in a single run. This helps them organize their work efficiently and makes it easier to reproduce their experiments from SageMaker Pipelines.
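
As a rough illustration of this pattern, the sketch below shows one way a launcher script could generate a shared identifier and hand it to every job in a pipeline execution. It is not the team's actual template: the helper names (`new_pipeline_run_id`, `step_environment`, `resolve_run_id`) and the `STEP_NAME` variable are hypothetical, and only the `NEPTUNE_CUSTOM_RUN_ID` variable name comes from the description above. How the environment reaches each SageMaker job, and how the Neptune client is initialized inside it, depends on the SDK versions in use and is left out here.

```python
import os
import uuid


def new_pipeline_run_id():
    """Return a fresh identifier shared by all jobs of one pipeline execution."""
    return uuid.uuid4().hex


def step_environment(run_id, extra=None):
    """Build the environment variables for one pipeline step (hypothetical helper).

    Every step receives the same NEPTUNE_CUSTOM_RUN_ID, so jobs that log to
    Neptune with this value are treated as one run, as described above.
    """
    env = dict(extra or {})
    env["NEPTUNE_CUSTOM_RUN_ID"] = run_id
    return env


def resolve_run_id(environ=os.environ):
    """Inside a job: read the shared ID and fail loudly if the launcher forgot it."""
    try:
        return environ["NEPTUNE_CUSTOM_RUN_ID"]
    except KeyError:
        raise RuntimeError("NEPTUNE_CUSTOM_RUN_ID is not set for this job")


# Launcher side: one ID per pipeline execution, reused for every step.
run_id = new_pipeline_run_id()
preprocessing_env = step_environment(run_id, {"STEP_NAME": "preprocessing"})
training_env = step_environment(run_id, {"STEP_NAME": "training"})

# Job side (simulated here with the dict instead of the real process environment).
print(resolve_run_id(preprocessing_env) == resolve_run_id(training_env))
```

Checking for the variable at the start of each job is a deliberate choice in this sketch: a job that starts without the shared identifier would otherwise log to a separate run and silently break the single-run view of the pipeline.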

Machine learning pipeline | Source: Author

Reliable and intuitive tool that “just works”

“Neptune’s UI and the front-end work great, and you don’t feel that you ‘fight’ with it. So instead of ‘fighting’ the tool, the tool itself is helping.”

Hubert Bryłkowski

Senior Machine Learning Engineer at Brainly

When Hubert Bryłkowski, a Senior Machine Learning Engineer on Brainly’s Visual Search team, started using Neptune, he found the user interface intuitive. In his experience, the tool is reliable: it works and solves the problem it is meant to solve.

In addition, using Neptune did not require him to read many guides or watch tutorials before he could complete tasks such as creating charts and dashboards — he appreciated the ease of use.

“When it comes to concrete features, I like that you can create views with the UI. It’s fairly clean and intuitive. It looks like someone thought out when and how to put stuff, so you do not need to read or watch a lot of guides because you can launch it and simply work with it. And this is the kind of thing that results in this rich user experience.”

Hubert Bryłkowski

Senior Machine Learning Engineer at Brainly

Enable teams to share experiment results and charts with anyone in the organization at no extra cost

“An important detail that we considered when we decided to choose Neptune is that we can invite everybody on Neptune, even non-technical people like product managers — there is no limitation on the users. This is great because, on AWS, you’d need to get an additional AWS account, and for other experiment tracking tools, you may need to acquire a per-user license.”

Gianmario Spacagna

Director of AI at Brainly

Neptune provides the team with collaboration features such as sharing experiments and plots with anyone through persistent links. Moreover, because of its ease of use, even non-technical users can navigate the interface.

Also, compared to other tools, where users would have to create separate accounts or be included in another payment plan to view results and collaborate with others, Neptune allows the team to add unlimited users at no extra cost.

Helping the team debug and optimize computational resource consumption 

“The most surprising feature that Neptune gives us excellent insight on is simple data processing jobs and not just training because then we can, for example, monitor the usage of resources—whether we are using all cores of the machines. In a few lines of code, we have much better visibility.”

Hubert Bryłkowski

Senior Machine Learning Engineer at Brainly

Neptune improved how the team utilized compute resources for their training and data processing jobs. For example, running large data processing jobs on distributed clusters is one of the most compute-intensive tasks for the team. Neptune provided them with better insights into how their image data augmentation programs utilized resources to maximize their GPU usage. 

In one case, Hubert tested whether multithreading would be better than multiprocessing for their processing jobs. After a few trials and monitoring the jobs, he reached an informed conclusion, which he could also share with his colleagues.

Brainly’s Visual Search team uses Neptune to monitor the resource consumption of their training and data processing jobs | Source

The team’s insights from monitoring the resource utilization of jobs help them manage the costs of the cloud resources they run on. In most cases, they have also found these insights to be more detailed than what a cloud vendor would have provided:

“I would say Neptune makes it easier for us to get deeper insights into resource utilization than would have been provided by a Cloud vendor or wherever our jobs are running.”

Hubert Bryłkowski

Senior Machine Learning Engineer at Brainly

Brainly’s Visual Search team uses Neptune to monitor how many training and data processing jobs utilize resources | Source

Results

Here’s a look at how Neptune integrates with other tools in the team’s continuous delivery (CD) pipeline:

Continuous delivery pipeline | Source

Adding Neptune to the Visual Search team’s tool stack proved valuable as it:

Provided a single source of truth to track all the experiments for their computer vision (CV) model building workflow

“Using Neptune is like having a contact book but for our experiments.” – Mateusz Opala, Senior Machine Learning Engineer at Brainly

When the team set out to adopt an experiment management solution, their most significant need was a tool that could provide a single location for all their experiments and scale regardless of experiment volume. Neptune provided them with a single source of truth to conveniently access their experiment data through an interactive and customizable interface.

“Neptune covers a good portion of what a research logbook is supposed to be.” – Gianmario Spacagna, Director of AI at Brainly

Seamless integration with Amazon SageMaker for logging and tracking

Neptune integrated with the team’s core stack running on AWS, including Amazon SageMaker Pipelines, to track data processing and training jobs in the continuous delivery pipeline. This enabled the team to log everything while tracking and visualizing the most critical things in a single place, without Neptune feeling like a separate tool in the stack.

Provided a better debugging experience for their model development workflow

“Neptune takes care of many things that we did not think we would need. And we realized that we could take a look at something and dig deeper than we already have.” – Hubert Bryłkowski, Senior Machine Learning Engineer at Brainly

For example, since they log all available metrics, they can home in on one particular metric and see how it changed or behaved in previous experiments.

“By logging the most crucial parameters, we can go back in time and see how it worked in the past for us, and this, I would say, is precious knowledge. On the other hand, if we track the experiments manually, that will never be the case, as we would only know what we thought we saw.” – Hubert Bryłkowski, Senior Machine Learning Engineer at Brainly

Helped the team optimize their data augmentation process on distributed clusters

“The huge improvement in our batch data transformation process was decompressing JPEGs on GPUs with NVIDIA DALI. You can monitor such improvement with Neptune’s resource usage logs which are very helpful.” – Mateusz Opala, Senior Machine Learning Engineer at Brainly

The team often runs distributed training jobs on multiple GPUs and needs to understand and manage their resource usage. For example, in one case, when they trained a model on multiple GPUs, they lost a lot of time copying images from the CPU or running data augmentation on the CPU.

So they decided to optimize their data augmentation jobs by decompressing JPEGs on GPUs and moving from plain TensorFlow and Keras to NVIDIA DALI, which improved data processing speed.

Neptune helped them monitor the processes through a UI, allowing them to see how much of the CPU and GPU their operations were utilizing, which led to better use of the resources. In addition, the insights helped them understand which processes were using the resources adequately and which were not, for example when choosing multiprocessing over multithreading for processing jobs.

Made their data processing workflows reproducible

“We track every processing job we do, even batches for labeling — we track that in Neptune. So, whatever we do with data, we try to make it reproducible and trackable in Neptune.” – Mateusz Opala, Senior Machine Learning Engineer at Brainly

Using Neptune also helped the team track and ensure their data processing workflows were reproducible.


Thanks to Hubert Bryłkowski, Mateusz Opala, and Gianmario Spacagna for working with us to create this case study.

  • Industry: Ed-tech
  • Location: Poland, the US, Spain, India
  • Team size: 400+
  • Frameworks: MMDetection, PyTorch, TensorFlow, NVIDIA DALI, cuML from RAPIDS.AI
  • Neptune use cases: Experiment tracking in Amazon SageMaker pipelines