Brainly is the leading learning platform worldwide, with the most extensive Knowledge Base for all school subjects and grades. Each month over 350 million students, parents and educators rely on Brainly as the proven platform to accelerate understanding and learning.
One of their core products and key entry points is Snap to Solve.
About the team
Brainly has an AI Services Department where it invests in producing ML as a Service in different areas such as content, user, curriculum, and visual search.
This case study shows how the Visual Search team integrated Neptune.ai with Amazon SageMaker Pipelines to track everything in the development phase of the Visual Content Extraction (VICE) system for Brainly’s Snap to Solve product.
Team composition:
- 1 Lead Data Scientist
- 2 Data Scientists
- 2 Machine Learning Engineers
- 1 MLOps (Machine Learning Operations) Engineer
- 1 Data Analyst
- 1 Data Labeling Lead
- 1 Delivery Manager
The team uses Amazon SageMaker to run their computing workloads and serve their models. They have adopted both TensorFlow and PyTorch to train many computer vision models, choosing the framework that suits each use case. Finally, to speed up data transformation with GPUs, they moved some of their data augmentation jobs to NVIDIA DALI.
The team works in two-week sprints and uses time-boxing to keep their research efforts focused and manage experimentation. They also keep their work processes flexible because they frequently adapt to the experiment results.
While developing the vision models for Snap to Solve, the Visual Search team at Brainly encountered several challenges:
- Working on a variety of models at the same time
- Separate engineers and non-technical users working on one or more projects simultaneously
- Creating a single place to keep track of all the running experiments
The team would run training jobs in their pipelines to solve different problems connected to the same product. For instance, one job could be training the main models that detect objects in a scene; other jobs could be training auxiliary models for matching features, detecting edges, or cropping objects.
With multiple engineers working on the same or separate models, the team needed to develop a concrete plan to work together on the projects. In addition, the strategy had to accommodate sharing outcomes with non-technical users (such as product managers).
On the technical side, without a plan, the experimentation process would become challenging to manage and possibly impossible to reproduce as the number of training runs grew.
As the number of training runs on the team’s large compute architectures increased, they realized that their logs from Amazon SageMaker needed to be trackable and manageable; otherwise, they would cause bottlenecks in the workflow.
“We ran a lot of experiments, and we needed a single place to track them.” – Hubert Bryłkowski, Senior Machine Learning Engineer at Brainly
Although they tried SageMaker Experiments to track experiments, they needed a purpose-built tool that could monitor experiments at scale and integrate well with their technology stack.
While the engineers on the Visual Search team were planning their technology stack, they saw the need to include an experiment management tool.
The team concluded that they needed a tool that they could rely on to help them manage their experiments and integrate with SageMaker Pipelines, so they started searching for such solutions.
Searching for a tool to manage experiments at scale
In evaluating tools they could use, the team searched for one that fit the following criteria:
- Easy and intuitive to use
- Reliable
- Reasonably priced
- Works for them: the tool should help them solve problems in their projects
The team did not consider SageMaker for experiment management when researching solutions for their stack. Instead, they explored solutions including Weights & Biases and MLflow but could not settle on either option after a trial period.
Settling for a solution that worked
While searching for other options, they tried Neptune for their experiment management needs and ultimately adopted it. So why did they choose Neptune for managing their experiments?
Neptune met the criteria they outlined while they were exploring different solutions. It provided:
- Custom integration with SageMaker to log training jobs at scale
- A reliable and intuitive tool that “just works”
- The ability to share experiment results and charts with anyone at no additional cost
- Flexible pricing options
- Help with debugging and optimizing computational resource consumption
Custom integration with SageMaker to log training jobs at scale
The team developed a custom template to integrate Neptune with Amazon SageMaker Pipelines using the `NEPTUNE_CUSTOM_RUN_ID` feature. First, they set the environment variable `NEPTUNE_CUSTOM_RUN_ID` to a unique identifier. Then, whenever they launch a new training job (through a Python script) from their local machine or in AWS, Neptune treats all jobs started with the same `NEPTUNE_CUSTOM_RUN_ID` value as a single run.
This way, the team can run multiple pipeline jobs (such as training jobs and data pre-processing jobs), and all these different jobs will log their metadata to a single run in Neptune.
The value this provides the team is that they can log and retrieve experiment metadata and usage metrics for the whole computational pipeline (for example, the data pre-processing and training jobs) in a single run. It helps them organize their work efficiently and ensures they can easily reproduce their experiments from SageMaker Pipelines.
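The idea of sharing one identifier across pipeline jobs can be sketched as follows. This is a minimal illustration, not the team’s actual template: the helper names `make_custom_run_id` and `job_environment` and the `vice` prefix are hypothetical, and the commented-out Neptune calls assume the Neptune Python client is installed and that credentials and the project name are supplied through environment variables.

```python
import os
import uuid


def make_custom_run_id(prefix: str = "vice") -> str:
    """Return a unique identifier to share across all jobs of one pipeline execution."""
    # The prefix is a hypothetical naming choice, not taken from the case study.
    return f"{prefix}-{uuid.uuid4().hex}"


def job_environment(custom_run_id: str) -> dict:
    """Environment variables to hand to every job in the same pipeline execution."""
    return {"NEPTUNE_CUSTOM_RUN_ID": custom_run_id}


# One identifier per pipeline execution, reused by every step.
run_id = make_custom_run_id()
preprocessing_env = job_environment(run_id)
training_env = job_environment(run_id)

# Inside a job's entry-point script the variable would already be set by the
# launcher; it is set here by hand only to keep the sketch self-contained.
os.environ.update(training_env)

# Assumption: with the variable set, creating a run in each job attaches to the
# same Neptune run instead of starting a new one.
# import neptune
# run = neptune.init_run()
# run["training/params"] = {"learning_rate": 1e-3}
```

How the dictionary reaches each job depends on the launcher; with the SageMaker Python SDK, one option is to pass it when constructing the estimator or processor for each pipeline step.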
Reliable and intuitive tool that “just works”
When Hubert Bryłkowski, a Senior Machine Learning Engineer on Brainly’s Visual Search team, started using Neptune, he found the user interface intuitive. In his experience, Neptune consistently works and solves the problem it is meant to solve; in other words, it is reliable.
In addition, he did not need to read many guides or watch tutorials before completing tasks such as creating charts and dashboards, and he appreciated that ease of use.
Enable teams to share experiment results and charts with anyone in the organization at no extra cost
Neptune provides the team with collaboration-rich features such as sharing experiments and plots with anyone through persistent links. Moreover, because of its ease of use, even non-technical users can navigate the interface.
Also, compared to other tools, where users would have to create separate accounts or be included in another payment plan to view results and collaborate with others, Neptune allows the team to add unlimited users at no extra cost.
Helping the team debug and optimize computational resource consumption
Neptune improved how the team utilized compute resources for their training and data processing jobs. For example, running large data processing jobs on distributed clusters is one of the most compute-intensive tasks for the team. Neptune provided them with better insights into how their image data augmentation programs utilized resources to maximize their GPU usage.
In one case, Hubert tried to test whether multithreading would be better than multiprocessing for their processing jobs. After a few trials and monitoring the jobs, he came to an informed conclusion which he could also share with his colleagues.
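The case study does not describe how the comparison was run, so the following is only a generic sketch of such a trial using Python’s standard library. The workload `fake_augment` is a hypothetical CPU-bound stand-in for an augmentation step, and the worker count and item count are arbitrary assumptions.

```python
import time
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, List, Tuple, Type


def fake_augment(seed: int) -> int:
    """CPU-bound placeholder for an image augmentation step (hypothetical workload)."""
    value = seed
    for _ in range(20_000):
        value = (value * 1103515245 + 12345) % 2_147_483_648
    return value


def time_executor(
    executor_cls: Type[Executor],
    fn: Callable[[int], int],
    items: List[int],
    workers: int = 4,
) -> Tuple[float, List[int]]:
    """Run fn over items with the given executor; return (elapsed seconds, results)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(fn, items))
    return time.perf_counter() - start, results


def compare_executors(n_items: int = 64) -> None:
    """Print wall-clock time for the same workload on threads and on processes."""
    items = list(range(n_items))
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        elapsed, _ = time_executor(executor_cls, fake_augment, items)
        print(f"{executor_cls.__name__}: {elapsed:.3f} s")
```

Calling `compare_executors()` from a script’s `if __name__ == "__main__":` block prints both timings. Wall-clock time alone does not show where the time goes, which is why per-job CPU and GPU utilization charts are a useful complement to a trial like this.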
The team’s insights from monitoring the resource utilization of jobs help them manage cloud consumption costs where their resources run. In most cases, they have found these insights to be more detailed than what any cloud vendor would have provided.
Here’s a look at how Neptune integrates with other tools in the team’s continuous delivery (CD) pipeline:
Adding Neptune to the Visual Search team’s tool stack proved valuable in several ways.
“Using Neptune is like having a contact book but for our experiments.” – Mateusz Opala, Senior Machine Learning Engineer at Brainly
When the team set out to adopt an experiment management solution, their most significant need was a tool that could provide a single location for all their experiments and scale regardless of experiment volume. Neptune provided them with a single source of truth, giving them convenient access to their experiment data through an interactive and customizable interface.
“Neptune covers a good portion of what a research logbook is supposed to be.” – Gianmario Spacagna, Director of AI at Brainly
Neptune integrated with the team’s core stack running on AWS, including Amazon SageMaker Pipelines, to track data processing and training jobs in the continuous delivery pipeline. This integration let the team log everything while tracking and visualizing the most critical things in a single place, and Neptune did not feel like a separate tool in the stack.
“Neptune takes care of many things that we did not think we would need. And we realized that we could take a look at something and dig deeper than we already have.” – Hubert Bryłkowski, Senior Machine Learning Engineer at Brainly
For example, since they log all the available metrics, they can home in on one particular metric and see how it changed or behaved in previous experiments.
“By logging the most crucial parameters, we can go back in time and see how it worked in the past for us, and this, I would say, is precious knowledge. On the other hand, if we track the experiments manually, that will never be the case, as we would only know what we thought we saw.” – Hubert Bryłkowski, Senior Machine Learning Engineer at Brainly
“The huge improvement in our batch data transformation process was decompressing JPEGs on GPUs with NVIDIA DALI. You can monitor such improvement with Neptune’s resource usage logs which are very helpful.” – Mateusz Opala, Senior Machine Learning Engineer at Brainly
The team often ran distributed training jobs on multiple GPUs and needed to understand and manage their resource usage. In one case, when training a model on multiple GPUs, they lost a lot of time copying images from the CPU and running data augmentation on the CPU.
So they decided to optimize their data augmentation jobs by decompressing JPEGs on GPUs and moving from plain TensorFlow and Keras to NVIDIA DALI, which improved data processing speed.
Neptune helped them monitor the processes through a UI, allowing them to see how much of the CPU and GPU their operations were utilizing, leading to optimal usage of the resources. In addition, the insights helped them understand what processes were adequately using the resources and those that were not — for example, in the case of choosing multiprocessing over multithreading for processing jobs.
“We track every processing job we do, even batches for labeling — we track that in Neptune. So, whatever we do with data, we try to make it reproducible and trackable in Neptune.” – Mateusz Opala, Senior Machine Learning Engineer at Brainly
Using Neptune also helped the team track and ensure their data processing workflows were reproducible.
Thanks to Hubert Bryłkowski, Mateusz Opala, and Gianmario Spacagna for working with us to create this case study!