Artificial intelligence is all the rage in 2020, but many aspiring technologists are running into a problem: training data.
Having a large, curated dataset is necessary for most artificial intelligence/machine learning applications. Acquiring that data is often a challenge.
Not only do you have to collect data from the real world, you must annotate and prepare it for your model. For students, small research teams, and early-stage startups, training data is a significant hurdle to overcome.
That’s where synthetic training data comes in handy. Synthetic data is fake data that mimics real data.
For certain ML applications, it’s easier to create synthetic data than to collect and annotate real data.
There are three major reasons for this:
- you can generate as much synthetic data as you need,
- you can generate data that may be dangerous to collect in reality,
- synthetic data is automatically annotated.
Let’s get into the details.
What is synthetic data?
One of the fundamental laws of machine learning is that you need a lot of data. The amount of data you need can range from ten thousand examples to billions of data points.
For complex applications such as autonomous vehicles, collecting a huge amount of high-quality training data is a challenge. Luckily, synthetic data works best for large datasets.
The most important thing to understand about real training data is that the effort to collect it grows linearly with the size of the dataset.
In most cases, each additional training example takes approximately the same amount of time to collect as the previous example. That’s not the case with synthetic data.
One of the things that make synthetic data special is that it can be generated in massive quantities. Ten thousand training examples? No problem. A million examples? No problem. A billion? Well, you might need a more powerful GPU, but it’s doable.
In comparison, a billion real training examples might simply be impossible.
Why use synthetic data (synthetic vs real data)
Real data can be dangerous to collect. For example, autonomous vehicle AI cannot rely entirely on real data. Companies working on this technology, such as Alphabet’s Waymo, must run simulations.
Think about it: to train an AI to avoid a car crash, you need training data on crashes. But it is simply too expensive and risky to gather large datasets of real car crashes—so you simulate crashes instead.
Real data can be rare
The same logic applies to data that is safe to collect but occurs very rarely.
For example, if your AI algorithm is looking for a ‘needle in a haystack,’ a simulation can generate rare events in sufficient quantity to train the model accurately.
Consider this: some of the most beneficial uses of AI focus on rare events, and by their nature, data on rare events is hard to collect.
Going back to the automotive example, car crashes don’t happen often, so you rarely have a chance to collect this data. With synthetic data, you choose how many crashes you want to simulate.
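As a minimal sketch of that control, the snippet below generates toy driving scenarios with a crash rate you pick. The function name `generate_scenarios`, the `crash_rate` parameter, and the scenario fields are hypothetical stand-ins for a real simulator, not part of any particular tool.

```python
import random

def generate_scenarios(n, crash_rate, seed=0):
    """Return n toy scenarios; each is a crash with probability crash_rate."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    scenarios = []
    for i in range(n):
        is_crash = rng.random() < crash_rate
        scenarios.append({
            "id": i,
            "speed_kmh": round(rng.uniform(20, 130), 1),  # hypothetical scene parameter
            "crash": is_crash,  # label is known because we chose it
        })
    return scenarios

# Real driving logs contain very few crashes; here we simply ask for half.
dataset = generate_scenarios(n=1000, crash_rate=0.5)
crashes = sum(s["crash"] for s in dataset)
print(f"{crashes} crashes out of {len(dataset)} scenarios")
```

Changing `crash_rate` is the whole point: the frequency of the rare event becomes a parameter instead of something you wait for.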
Synthetic data is fully user-controlled
Everything in a synthetic data simulation can be controlled. It’s a blessing and a curse.
It can be a curse because there are cases where synthetic data misses edge cases that can be captured in real datasets.
For these applications, you might want to mix some real data into your synthetic datasets, or use transfer learning: pre-train the model on synthetic data, then fine-tune it on a smaller set of real data.
But this is also a blessing: event frequency, object distribution, and much more are up to you.
Synthetic data is perfectly annotated
Another advantage of synthetic data is perfect annotation. You never need to label data by hand again.
A variety of annotations can be automatically generated for each object in a scene. This might not sound like a big deal, but it’s one of the big reasons why synthetic data is so cheap compared to real data.
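To illustrate why the labels come for free, here is a toy sketch, assuming a scene made of nothing more than rectangles drawn into a small grid. The `render_scene` function and its parameters are hypothetical; a real pipeline would use a 3D engine, but the principle is the same: the generator placed each object, so it already knows the bounding box.

```python
import random

def render_scene(width, height, n_objects, seed=0):
    """Draw rectangles into a grid and return the image with its labels."""
    rng = random.Random(seed)
    image = [[0] * width for _ in range(height)]
    annotations = []
    for obj_id in range(1, n_objects + 1):
        w = rng.randint(2, width // 4)
        h = rng.randint(2, height // 4)
        x = rng.randint(0, width - w)
        y = rng.randint(0, height - h)
        for row in range(y, y + h):
            for col in range(x, x + w):
                image[row][col] = obj_id  # later objects overwrite earlier ones
        # The box describes the full extent of the object, even if a later
        # object partly covers it. No human labeling step is involved.
        annotations.append({"id": obj_id, "bbox": (x, y, w, h)})
    return image, annotations

image, annotations = render_scene(width=64, height=48, n_objects=3)
for ann in annotations:
    print(ann)
```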
You don’t pay for data labeling. Instead, the main cost of synthetic data is an upfront investment in building the simulation. After that, each additional example costs far less to generate than a real example costs to collect and label.
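To make that cost structure concrete, here is a rough break-even sketch. Every number in it is a made-up assumption for illustration, not a measured price: real data is modeled as a flat cost per collected and labeled example, and synthetic data as a one-time simulation build plus a small compute cost per example.

```python
import math

def real_cost(n, per_example=0.50):
    """Linear cost: every example is collected and labeled (hypothetical price)."""
    return n * per_example

def synthetic_cost(n, upfront=20_000.0, per_example=0.001):
    """One-time simulation build plus a small per-example compute cost (hypothetical)."""
    return upfront + n * per_example

def break_even(upfront=20_000.0, real_per_example=0.50, synth_per_example=0.001):
    """Smallest dataset size at which synthetic data costs no more than real data."""
    return math.ceil(upfront / (real_per_example - synth_per_example))

n_star = break_even()
for n in (10_000, n_star, 1_000_000):
    print(n, real_cost(n), synthetic_cost(n))
```

Below the break-even size, real data is cheaper under these assumptions; above it, the upfront investment pays off and the gap keeps widening.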
Synthetic data can be multispectral
Autonomous vehicle companies have realized that annotating non-visible data is challenging. That’s why they have been some of the biggest proponents of synthetic data.
Companies like Alphabet’s Waymo and General Motors’ Cruise use simulations to generate synthetic LiDAR data. Because this data is synthetic, the ground truth is known and the data is labeled automatically.
Similarly, synthetic data works nicely for infrared or radar computer vision applications, where humans can’t fully interpret the imagery.
Where can you apply synthetic data?
Synthetic data has many uses. At the moment, there are two major fields of synthetic data: computer vision and tabular data.
In computer vision, an AI algorithm detects objects and patterns in images. Cameras are increasingly used in many industries, from automotive to drones to medicine.
Combined with more advanced AI, synthetic data means that computer vision technology is just getting started.
Another use of synthetic data is in tabular data. Tabular synthetic data gets a lot of attention from researchers. MIT researchers recently released the Synthetic Data Vault, a collection of open source tools for spreadsheet-based synthetic data.
Health and privacy data are particularly ripe for a synthetic approach. These fields are highly restricted by privacy laws. Synthetic data can help researchers get the data they need without violating people’s privacy.
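A deliberately naive sketch of tabular synthesis follows. It is not the Synthetic Data Vault API; the `fit` and `sample` helpers and the toy columns are hypothetical. It models each numeric column as an independent Gaussian, so it ignores correlations between columns and offers no formal privacy guarantee, both of which real tools have to address.

```python
import random
import statistics

def fit(rows):
    """Estimate a mean and standard deviation for each numeric column."""
    model = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        model[col] = (statistics.mean(values), statistics.stdev(values))
    return model

def sample(model, n, seed=0):
    """Draw n synthetic rows, sampling each column independently."""
    rng = random.Random(seed)
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in model.items()}
        for _ in range(n)
    ]

# Toy stand-in for a restricted dataset (values invented for this example).
real_rows = [
    {"age": 34, "resting_hr": 62},
    {"age": 51, "resting_hr": 71},
    {"age": 47, "resting_hr": 68},
    {"age": 29, "resting_hr": 58},
]
model = fit(real_rows)
synthetic_rows = sample(model, n=1000)
print(synthetic_rows[0])
```

The synthetic rows follow the per-column statistics of the original table without copying any single record, which is the basic idea behind sharing synthetic versions of sensitive data.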
As new tools and tutorials are released, synthetic data will be able to play a larger and larger role in the development of AI.
Unlimited data volume, safe generation of data that is dangerous to collect, and perfect annotation are the three big reasons to use synthetic data.
If you want to check out a real product, my partners and I released a free plugin for Unreal Engine to make it easier to generate synthetic data.
There are many other tools for generating synthetic data. Whichever you choose, synthetic data can be a great way to get training data, and it will likely be a major driving force for the next generation of AI.