Code and data are the foundations of an AI system. Both components play an important role in developing a robust model, but which one should you focus on more? In this article, we’ll compare the data-centric and model-centric approaches, see which one is better, and talk about how to adopt a data-centric infrastructure.
The model-centric approach means doing experimental research to improve the ML model’s performance. This involves selecting the best model architecture and training process from a wide range of possibilities.
- In this approach you keep the data the same, and improve the code or model architecture.
- Working on code is the central objective of this approach.
Model-centric trends in the AI world
Currently, the majority of AI applications are model-centric. One possible reason is that the AI sector pays careful attention to academic research on models. According to Andrew Ng, more than 90% of research papers in this domain are model-centric, because it is difficult to create large datasets that can become generally recognized standards. As a result, the AI community believes that model-centric machine learning is more promising. While the focus is on the code, data is frequently overlooked, and data collection is viewed as a one-time event.
In an age where data is at the core of every decision-making process, a data-centric company can better align its strategy with the interests of its stakeholders by using information generated from its operations. This way the result can be more accurate, organized, and transparent which can help an organization run more smoothly.
- This approach involves systematically altering/improving datasets in order to increase the accuracy of your ML applications.
- Working on data is the central objective of this approach.
The data-driven/data-centric conundrum
People often confuse the data-centric approach with the data-driven approach. A data-driven approach is a methodology for gathering, analyzing, and extracting insights from your data. It’s sometimes referred to as “analytics.” The data-centric approach, on the other hand, focuses on using data to define what you should create in the first place.
- Data-centric architecture refers to a system in which data is the primary and permanent asset, whereas applications change.
- Data-driven architecture means creating technologies, skills, and an environment by ingesting a large amount of data.
Let’s now talk about how a data-centric approach differs from a model-centric approach and the need for it in the first place.
Data-centric approach vs model-centric approach
To data scientists and machine learning engineers, the model-centric approach may seem more appealing. This is understandable, since practitioners can use their knowledge to tackle a specific problem. On the other hand, no one wants to spend the entire day labeling data, because it is considered a one-time job.
However, in today’s machine learning, data is crucial, yet it’s often overlooked and mishandled in AI initiatives. As a result, hundreds of hours are wasted fine-tuning a model based on faulty data. That could very well be the fundamental cause for your model’s lower accuracy, and it has nothing to do with model optimization.
| Model-centric approach | Data-centric approach |
| --- | --- |
| Working on code is the central objective | Working on data is the central objective |
| The model is optimized so it can deal with the noise in the data | Rather than gathering more data, more investment is made in data quality tools to work on noisy data |
| Inconsistent data labels | Data consistency is key |
| Data is fixed after standard preprocessing | Code/algorithms are fixed |
| The model is improved iteratively | Data quality is improved iteratively |
You don’t have to become completely data-centric; sometimes it’s important to focus on model and code. It’s great to do research and improve models, but data is also important. We tend to overlook the importance of data while focusing on the model. The best way is to adopt a hybrid approach that focuses on both data and model. Depending on your application, you can focus more on data and less on model, but both should be taken into account.
The need for a data-centric infrastructure
Model-centric ML refers to machine learning systems that are primarily concerned with optimizing model architectures and their parameters.
The model-centric workflow depicted in the graphic above is suitable for a few industries, such as media and advertising. Industries such as healthcare or manufacturing, however, may face challenges such as:
1. High-level customization is required
Unlike media and advertising industries, a manufacturing business with several goods cannot use a single machine learning system to detect production faults across all of its products. Instead, each manufactured product would require a distinctly trained ML model.
While media companies can afford to have an entire ML department working on each and every little optimization problem, a manufacturing business that requires several ML solutions cannot follow such a template in terms of size.
2. Importance of large datasets
In most cases, companies do not have a large number of data points to work with. Instead, they are often forced to deal with tiny datasets, which are prone to disappointing outcomes if their approach is model-centric.
In his AI talk, Andrew Ng explains why he believes data-centric ML is more rewarding and advocates for a shift in the community toward data-centrism. He gives an example of a steel defect detection problem in which the model-centric approach fails to improve the model’s accuracy, while the data-centric approach boosts the accuracy by 16%.
Data is extremely important in AI research, and adopting a strategy that prioritizes obtaining high-quality data is critical. After all, relevant data is not just rare and noisy, but also extremely expensive to get. The idea is that data should be treated the way we would treat the best materials when building a house: it should be evaluated at every stage rather than as a one-time event.
Adopting a data-centric infrastructure
Treat data as a fundamental asset that will outlast applications and infrastructure when implementing a data-centric architecture. This approach does not need a single database or data repository, but rather a shared understanding of the data with a uniform description. Data-centric ML makes data sharing and movement simple.
So, what exactly does data-centric machine learning involve? What essential factors should you consider while implementing a data-centric approach?
1. Data label quality
Data labeling is the process of assigning one or more labels to data. Labels are associated with specific values that are applied to the data. When a significant number of images are incorrectly labeled, the model’s results are worse than when fewer but accurately labeled images are used.
The labels provide detailed information about the content and structure of a dataset, which may include components such as the data types, measurement units, and time periods represented in the dataset. The best way to improve label quality is to find the inconsistencies in the labels and work on the labeling instructions. Later in this article, we’ll learn more about the importance of data quality.
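One way to start looking for label inconsistencies is to collect every label an item has received and flag the items whose annotators disagree. The snippet below is only a minimal sketch of that idea: the `annotations` data, the annotator names, and the `find_inconsistent_items` helper are all hypothetical and made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical annotations: (item_id, annotator, label). Illustrative data only.
annotations = [
    ("img_001", "annotator_a", "pineapple"),
    ("img_001", "annotator_b", "pineapple"),
    ("img_002", "annotator_a", "pineapple"),
    ("img_002", "annotator_b", "fruit_bowl"),
    ("img_003", "annotator_a", "apple"),
    ("img_003", "annotator_b", "apple"),
]


def find_inconsistent_items(rows):
    """Return {item_id: Counter of labels} for items whose annotators disagree."""
    labels_by_item = defaultdict(Counter)
    for item_id, _annotator, label in rows:
        labels_by_item[item_id][label] += 1
    # An item is inconsistent when it received more than one distinct label.
    return {
        item_id: counts
        for item_id, counts in labels_by_item.items()
        if len(counts) > 1
    }


inconsistent = find_inconsistent_items(annotations)
for item_id, counts in inconsistent.items():
    print(item_id, dict(counts))
```

The flagged items are good candidates for a second look, and recurring disagreements usually point to a gap in the labeling instructions.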
2. Data augmentation
Data augmentation is a data analysis task involving the creation of data points through interpolation, extrapolation, or other means. It can be used to introduce more training data for machine learning, or to make synthetic images or video frames with varying degrees of realism. It helps to increase the number of relevant data points, such as examples of faulty production components, by creating data that your model hasn’t seen yet during training.
However, adding data isn’t always the best option. Getting rid of the noisy observations that cause the high variance improves the model’s capacity to generalize to new data.
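To make the idea of augmentation concrete, here is a small sketch that creates label-preserving variants of a single image. It assumes a tiny grayscale image stored as a list of pixel rows; the helper names and the brightness offset of 20 are arbitrary choices for illustration, and in a real project you would more likely reach for an image library.

```python
def flip_horizontal(image):
    """Mirror a 2D grayscale image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def shift_brightness(image, delta):
    """Add delta to every pixel, clamping values to the 0-255 range."""
    return [[max(0, min(255, px + delta)) for px in row] for row in image]


def augment(image, label):
    """Yield the original sample plus three variants that keep the same label."""
    yield image, label
    yield flip_horizontal(image), label
    yield shift_brightness(image, 20), label
    yield shift_brightness(image, -20), label


# Hypothetical 2x3 image of a faulty component.
defect_image = [
    [0, 10, 250],
    [5, 128, 240],
]
augmented = list(augment(defect_image, "defect"))
print(len(augmented), "samples from one labeled image")
```

Whether a transformation really preserves the label depends on the problem, so each one is worth checking with someone who knows the data.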
3. Feature engineering
Feature engineering is the process of adding features to a model by altering input data, prior knowledge, or algorithms. It is used in machine learning to help increase the accuracy of a predictive model.
Improving data quality involves improving both the input data and the target/labels. Feature engineering is crucial for adding features that may not exist in their raw form but can make a significant difference.
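As a rough illustration of features that don’t exist in raw form, the sketch below derives three of them from a single record. The record layout (`timestamp`, `defects`, `units_produced`) and the chosen features are assumptions made for this example, not a recommendation for any particular dataset.

```python
from datetime import datetime


def engineer_features(record):
    """Derive features that are not present in the raw record."""
    ts = datetime.fromisoformat(record["timestamp"])
    units = record["units_produced"]
    return {
        # Hour of day can capture shift-related patterns.
        "hour": ts.hour,
        # weekday() returns 5 for Saturday and 6 for Sunday.
        "is_weekend": ts.weekday() >= 5,
        # A ratio is often more informative than two raw counts.
        "defects_per_unit": record["defects"] / units if units else 0.0,
    }


# Hypothetical raw record from a production line.
raw = {"timestamp": "2024-03-09T14:30:00", "defects": 3, "units_produced": 120}
features = engineer_features(raw)
print(features)
```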
4. Data versioning
In any software application, data versioning plays an important role. As a developer, you may need to track down a bug by comparing two versions to spot what no longer makes sense, or fix it by deploying an earlier version again. Managing dataset access, as well as the many versions of each dataset over time, is difficult and error-prone. Data versioning is one of the most integral steps in maintaining your data: it’s what helps you keep track of changes (both additions and deletions) to your dataset. Versioning makes it easy to collaborate on code and manage datasets.
Versioning also makes it easy to manage the ML pipeline from proof of concept to production, and this is where MLOps tools come to the rescue. You might be wondering why MLOps tools are discussed in the context of “Data Versioning”. It’s because managing data pipelines is one of the most difficult tasks in developing machine learning applications. Versioning ensures reproducibility and reliability. Here are a few of the best platforms for data versioning:
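The core idea behind tracking a dataset version can be sketched with nothing more than a content hash: if the data changes, the fingerprint changes. The snippet below is a simplified illustration, not how any particular versioning tool works internally; `dataset_fingerprint` and the two tiny datasets are hypothetical.

```python
import hashlib


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest that identifies the exact dataset contents.

    Rows are sorted first so that a different row order alone does not
    produce a different version.
    """
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


# Two hypothetical versions of the same dataset: one label was corrected in v2.
v1 = [("img_001", "defect"), ("img_002", "ok")]
v2 = [("img_001", "defect"), ("img_002", "defect")]

print("v1:", dataset_fingerprint(v1)[:12])
print("v2:", dataset_fingerprint(v2)[:12])
```

Storing a fingerprint like this next to each training run makes it possible to check later whether two runs were trained on the same data.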
Neptune is a metadata store for MLOps, developed for research and production teams. It gives you a central hub to log, store, display, organize, compare, and query all metadata generated during the machine learning lifecycle. In the context of data versioning, with Neptune you can:
- Keep track of a dataset version in your model training runs with artifacts.
- Query the dataset version from previous runs to make sure you are training on the same dataset version.
- Organize dataset version metadata in the Neptune UI.
Weights & Biases (WandB) is a platform that provides machine learning tools for researchers and deep learning teams. WandB helps you with experiment tracking, dataset versioning, and model management. With WandB you can:
- Use Artifacts for dataset versioning, model versioning, and tracking dependencies and results across machine learning pipelines.
- Store complete datasets in artifacts directly, or use artifact references to point to data in other systems such as S3, GCP, or your local machine.
DVC is an open-source platform for machine learning projects. DVC helps data scientists and developers with data versioning, workflow management, and experiment management. DVC lets you:
- Capture the versions of your data and models in Git commits, while storing them on-premises or in cloud storage.
- Switch between different data contents.
- Produce metafiles describing which datasets and ML artifacts to track.
5. Domain knowledge
Domain knowledge is extremely valuable in a data-centric approach. Subject matter experts can often detect small discrepancies that ML engineers, data scientists, and labelers cannot. Yet the involvement of domain experts is still often missing from ML systems, which might perform better if additional domain knowledge were available.
Benefits of data-centric approach
The advantages of becoming more data-centric are numerous, ranging from improved reporting speed and accuracy to better-informed decision-making. Data-centric infrastructure comes with lots of benefits:
- Improves accuracy: using data as a strategic asset leads to more precise estimates, observations, and decisions.
- Eliminates complex data transformations.
- Reduces data errors and inconsistencies.
- Provides valuable insights into internal and external trends that help you make better decisions.
- Reduces expenses.
- Makes data more accessible to key stakeholders.
- Reduces data redundancy.
- Improves data quality and reliability.
Which one to prioritize: data quantity or data quality?
Before going any further, I want to emphasize that more data does not automatically equal better data. Sure, a neural network can’t be trained with a few images, but the emphasis is now on quality rather than quantity.
Data quantity refers to the amount of data available. The main goal is to gather as much data as possible and then train a neural network to learn the mappings.
As seen in the above graphic, the majority of Kaggle datasets aren’t that large. In a data-centric approach, the size of the dataset isn’t that important and a lot could be done with a small quality dataset.
Data quality, as the name suggests, is all about quality. It makes no difference if you don’t have millions of samples; what matters is that they are of high quality and properly labeled.
You can see different ways to label the data in the above graphic; there’s nothing wrong with labeling the objects independently or combined. However, if data scientist 1 labels the pineapples separately but data scientist 2 labels them combined, the data will be inconsistent, which confuses the learning algorithm. The main goal is to maintain consistency in labels; if you’re labeling objects independently, make sure all of them are labeled the same way.
Data annotation consistency is critical since any discrepancy might derail the model and make your evaluation inaccurate. As a result, you’ll need to carefully define annotation guidelines to ensure that ML engineers and data scientists label data consistently. According to research, roughly 3.4 percent of samples in frequently used datasets were mislabeled, with larger models being the most affected.
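If you want a number for how consistently two people label the same samples, one common statistic is Cohen’s kappa, which corrects the raw agreement rate for agreement expected by chance. This statistic is brought in here as an illustration rather than taken from the sources above, and the two label lists below are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for the same four images.
annotator_1 = ["cat", "cat", "dog", "dog"]
annotator_2 = ["cat", "dog", "dog", "dog"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A value close to 1 indicates strong agreement, while a value near 0 means the annotators agree about as often as chance would predict, which is a sign that the annotation guidelines need work.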
In the above image, Andrew Ng explains the importance of consistency in small datasets. The graph above illustrates the relation between the voltage and speed of a drone. You can confidently fit the curve and get higher accuracy if you have small datasets but consistent labels.
With low-quality data, flaws and inaccuracies can go undetected indefinitely. The accuracy of your models depends on the quality of your data; if you want to make good decisions, you need accurate information. Poor-quality data is at risk of containing errors and anomalies, which can be very costly when using predictive analytics and modeling techniques.
When it comes to data, how much is too much?
The amount of data you have is critical; you must have enough data to solve your problem. Deep networks are low-bias, high-variance models, and more data is generally seen as the solution to the variance problem. But how much data is enough? That’s a more difficult question to answer than you would expect. The YOLOv5 training tips suggest:
- Minimum 1.5k images per class
- Minimum 10k instances (labeled objects) per class total
Having a large amount of data is a benefit, not a must.
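A quick way to see where a dataset stands against guidelines like these is to count images and labeled objects per class. The sketch below assumes annotations are available as `(image_id, class_name)` pairs, one per labeled object; the `coverage_report` helper is hypothetical, and its default thresholds simply mirror the two numbers above.

```python
from collections import Counter, defaultdict


def coverage_report(annotations, min_images=1500, min_instances=10000):
    """Summarize images and labeled instances per class against the thresholds."""
    instances = Counter()
    images = defaultdict(set)
    for image_id, class_name in annotations:
        instances[class_name] += 1          # every pair is one labeled object
        images[class_name].add(image_id)    # the set ignores repeated images
    return {
        class_name: {
            "images": len(images[class_name]),
            "instances": instances[class_name],
            "meets_guideline": (
                len(images[class_name]) >= min_images
                and instances[class_name] >= min_instances
            ),
        }
        for class_name in instances
    }


# Tiny hypothetical dataset, checked against deliberately small thresholds.
sample = [("a", "scratch"), ("a", "scratch"), ("b", "scratch"), ("b", "dent")]
report = coverage_report(sample, min_images=2, min_instances=3)
for class_name, stats in report.items():
    print(class_name, stats)
```

Classes that fall short are where collecting or augmenting more data is likely to pay off first.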
Best practices for data-centric approach
Keep these things in mind if you’re adopting a data-centric approach:
- Ensure high-quality data consistency across the ML project lifecycle.
- Make the labels consistent.
- Use production data to get timely feedback.
- Use error analysis to focus on a subset of data.
- Eliminate the noisy samples; as discussed above more data is not always better.
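Error analysis can start very simply: group your evaluation examples into slices and look at the error rate of each slice to decide where data work is needed most. The following is a minimal sketch under that assumption; the slice names, labels, and the `error_rate_by_slice` helper are made up for illustration.

```python
from collections import defaultdict


def error_rate_by_slice(examples):
    """Return (slice_name, error_rate) pairs sorted from worst to best.

    examples: iterable of (slice_name, true_label, predicted_label).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for slice_name, y_true, y_pred in examples:
        totals[slice_name] += 1
        if y_true != y_pred:
            errors[slice_name] += 1
    rates = {name: errors[name] / totals[name] for name in totals}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical evaluation results, sliced by lighting condition.
results = [
    ("low_light", "defect", "ok"),
    ("low_light", "defect", "defect"),
    ("daylight", "ok", "ok"),
    ("daylight", "defect", "defect"),
]
ranked = error_rate_by_slice(results)
for name, rate in ranked:
    print(f"{name}: {rate:.0%} errors")
```

The slice at the top of the list is the natural place to review labels, remove noisy samples, or gather more examples.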
Where to find good datasets?
Obtaining high-quality datasets is an important task. So, here are a few sites where you can get such datasets for free.
The first one is well known to the data science community. Inside Kaggle you’ll find all the code and data you need to do your data science work. It has over 50,000 public datasets and 400,000 public notebooks, allowing you to quickly complete any analysis.
Datahub is a dataset platform mostly focused on business and finance. Many datasets, such as lists of nations, populations, and geographic borders, are currently available on DataHub, with many more being developed.
Graviti is a new data platform, providing high-quality datasets mostly for computer vision. Individual developers or organizations can easily access, share, and better manage large amounts of open data.
In this article, we learned how a data-centric approach differs from a model-centric approach, and how to make your machine learning application more data-centric. We don’t have to limit ourselves to a single direction; code and data both play an important role in the AI journey. There is no hard and fast rule for choosing between model-centric and data-centric approaches, but the robustness of the dataset shouldn’t be overlooked.
Data quality must be maintained and improved at every stage of AI development, each of which will, by definition, require various frameworks and tools. If you would like to delve deeper into this, I have shared a few references below to get you started.
Hope you liked the article, keep experimenting!
References and recommended reading
- A Chat with Andrew on MLOps: From Model-centric to Data-centric AI
- Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
- Data-centric Machine Learning: Making customized ML solutions production-ready
- The Significance of Data-centric AI
- Tips for Best Training Results
- Data-centric AI: Real World Approaches