TL;DR
Synthetic data is widely used to train foundation models when data is scarce, sensitive, or costly to collect.
This data enables progress in domains like medical imaging, tabular data, and code by expanding datasets while protecting privacy.
Depending on the domain, different techniques, such as Bayesian networks, GANs, diffusion models, and LLMs, are used to generate synthetic data.
Training foundation models at scale is constrained by data. Whether working with text, code, images, or multimodal inputs, public datasets are saturated and private datasets are restricted. Collecting or curating new data is slow and expensive, while the demand for larger, more diverse corpora continues to grow.
Synthetic data, artificially generated information that mimics real-world data, offers a practical solution. By generating synthetic samples, practitioners can avoid costly data acquisition and circumvent privacy concerns. Blending synthetic data with collected datasets improves robustness, scalability, and compliance in foundation model training.
When is synthetic data (un)suitable?
Synthetic data helps expand limited datasets and protects privacy when real data is sensitive, rare, or difficult to access. It also makes it easier to test models safely before deployment and to explore new scenarios without collecting costly or restricted real-world samples.
However, synthetic data is not a perfect replacement. Its success depends on how well it captures the patterns, distribution, and complexity of the real data, which varies from one domain to another.
Vision and healthcare
Computer vision and healthcare often intersect through medical imaging, one of the most data-intensive and regulated areas of AI research. Training diagnostic models for tasks like tumor detection, organ segmentation, or disease classification requires a large number of high-quality, labelled scans (X-rays, MRIs, or CT scans).
Collecting and labelling these images is expensive, time-consuming, and restricted by privacy laws or data sharing agreements. By generating artificial images and labels, researchers can expand datasets, balance rare disease categories, and test models without accessing real patient data. Synthetic medical images and patient records preserve the statistical properties of the real data while protecting privacy, enabling applications ranging from diagnostic imaging and drug discovery to clinical trial simulations.
Financial tabular data
Sharing data in the business sector is heavily constrained, making it difficult to gain insights even within an organization. Synthetic data makes it easier to study trends while maintaining the privacy and security of both customers and companies, and it makes data more accessible.
For instance, financial data is highly sensitive and protected by strict regulations, and synthetic data mimics the real data distribution without revealing customer information. This enables institutions to analyse data while complying with privacy laws. Moreover, synthetic data allows testing and validation of financial algorithms under different market scenarios, including rare or extreme events that may not be present in historical data. It also supports more accurate risk assessment, fraud detection, and anomaly detection.
Software code
In software development, synthetic code generation has become an important tool for training and testing. By simulating different coding scenarios, bug patterns, and software behaviours, researchers can create large datasets beyond what exists on open repositories. These synthetic examples support the development of personalized coding assistants and improve models for tasks like code completion and error detection.
Text
Text is where the limits of synthetic data are most visible. Large language models can produce vast amounts of synthetic text, but evaluating text quality is subjective and highly context-dependent.
As there is no clear metric for what makes a text “good”, synthetically generated text is often generic, shallow, or irrelevant, especially on open-ended tasks. This is why techniques like reinforcement learning from human feedback (RLHF) and instruction tuning are needed to align models towards useful, human-like responses. While synthetic text can enrich training corpora, it remains a supplement rather than a replacement for human-written data.
Where does foundation model training data come from, and what role does it play?
A foundation model requires a certain number of data samples to learn a concept or relationship. What matters is not the total number or size of the data samples but the number of pertinent samples the dataset contains.
This becomes a problem for signals that occur rarely and are therefore underrepresented in collected data. To include a sufficient number of samples that contain the signal, the dataset has to become very large, even though most of the additionally collected samples are redundant.
Oversampling rare signals risks overfitting on the samples rather than learning robust representations of the signal. A more useful approach is to create data samples that contain the rare signal artificially.
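To make the scale issue concrete, here is a back-of-the-envelope calculation with assumed numbers (a signal present in 0.1% of collected samples and a target of 1,000 positive examples):

```python
# Assumed numbers for illustration: a signal occurring in 0.1% of collected samples.
prevalence = 0.001
needed_positives = 1_000

total_to_collect = needed_positives / prevalence
redundant_fraction = 1 - prevalence

print(f"Samples to collect: {total_to_collect:,.0f}")                 # 1,000,000
print(f"Share redundant for this signal: {redundant_fraction:.1%}")   # 99.9%
```

Generating the rare signal synthetically sidesteps most of this collection overhead.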
Many foundation model teams utilize synthetic data and treat its generation as an inherent part of their foundation model efforts. They develop their own approaches, building on established methods and recent progress in the field.
Read more about how leading foundation model teams curate their training data and other topics in the State of Foundation Model Training Report 2025.
How is synthetic data generated?
Choosing the right synthetic data generation technique depends on the type of data and its complexity. Different domains rely on different techniques, each with its strengths and limitations. Here, we will focus on three domains where synthetic data is most actively used: medical imaging, tabular data, and code.
| Category | Techniques | Domains | Strengths and Limitations |
| --- | --- | --- | --- |
| Statistical | Probability distributions, Bayesian networks | Tabular data, healthcare records | Captures dependencies; privacy-friendly; struggles with rare/outlier events |
| Generative AI | GANs, VAEs, diffusion models, LLMs | Images, code, tabular data | Fast generation; prone to hallucination; limited by the diversity of the real data |
Medical imaging
Medical imaging, from MRIs and CT scans to ultrasounds, is at the core of modern healthcare for diagnosis, treatment planning, and disease monitoring. Yet, this data is often scarce, costly to annotate, or restricted due to privacy concerns, making it difficult to train large foundation models. Synthetic medical images help address these challenges. Common methods for generating synthetic medical imaging data include GANs and diffusion models.
GANs
Generative adversarial networks (GANs) consist of two neural networks: 1) a generator that produces synthetic images and 2) a discriminator that distinguishes real images from generated ones. Both networks are trained simultaneously, with the generator adjusting its parameters based on feedback from the discriminator until the generated images are indistinguishable from real ones. Once trained, GANs can generate synthetic images from random noise.
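As a rough illustration of this adversarial setup, the sketch below trains small fully connected networks in PyTorch. Real medical-imaging GANs use convolutional architectures; all dimensions and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; real medical-imaging GANs use convolutional networks.
latent_dim, data_dim = 64, 784

generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch = real_batch.size(0)

    # 1) Train the discriminator to separate real from generated samples.
    noise = torch.randn(batch, latent_dim)
    fake = generator(noise).detach()
    d_loss = bce(discriminator(real_batch), torch.ones(batch, 1)) + \
             bce(discriminator(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the generator to fool the discriminator.
    noise = torch.randn(batch, latent_dim)
    g_loss = bce(discriminator(generator(noise)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```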
In medical imaging, GANs are widely used for image reconstruction across modalities such as MRIs, CT scans, X-rays, ultrasound, and tomography. Most of these modalities suffer from noisy, low-resolution, or blurry images, which hinder accurate diagnostics. GAN-based approaches, such as CycleGAN, CFGAN, and SRGAN, help improve resolution, reduce noise, and enhance image quality.
Despite these advancements, GANs face limitations in generalizability, require high computational resources, and still lack sufficient clinical validation.

Diffusion models
Diffusion models are generative models that learn the training data distribution and generate new images from it. In the forward process, a diffusion model gradually adds noise to the training data; in the reverse process, it learns to recover the original image by removing the noise step by step. Once trained, the model generates images by sampling random noise and passing it through the denoising process.
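The sketch below illustrates the idea with a simplified DDPM-style training objective in PyTorch. The noise schedule, image shape, and the `model(noisy, t)` noise-prediction network are assumptions for illustration, not the exact formulation of any specific paper.

```python
import torch

# Linear noise schedule over T steps (illustrative values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def forward_noise(x0, t):
    """Forward process: add noise to a clean image batch x0 at timesteps t."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    s = (1 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + s * noise, noise

def training_loss(model, x0):
    """The network learns the reverse process by predicting the added noise."""
    t = torch.randint(0, T, (x0.size(0),))
    noisy, noise = forward_noise(x0, t)
    return torch.nn.functional.mse_loss(model(noisy, t), noise)
```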
The bottleneck of diffusion models is that generating an image from noise takes time. One solution is to encode the image into a latent space, perform the diffusion process there, and then decode the latent representation back into an image, a technique known as latent diffusion and popularized by Stable Diffusion. This improves speed, stability, and robustness while reducing the cost of image generation. To gain more control over the generation process, ControlNet adds spatial conditioning so the output can be customized for a specific task.
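For reference, a latent-diffusion pipeline can be run in a few lines with the Hugging Face diffusers library. The checkpoint id and prompt below are illustrative, and generating clinically plausible images would require a checkpoint fine-tuned on medical data.

```python
# pip install diffusers transformers accelerate torch
import torch
from diffusers import StableDiffusionPipeline

# Checkpoint id is illustrative; any latent-diffusion checkpoint works similarly.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("axial chest CT slice, high resolution", num_inference_steps=30).images[0]
image.save("synthetic_ct.png")
```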

Medical Diffusion enables generating realistic three-dimensional (3D) data, such as MRIs and CT scans. A VQ-GAN is used to create a latent representation from 3D data, and then a diffusion process is applied in this latent space. Similarly, MAISI, an Nvidia AI foundation model, is trained to generate high-resolution 3D CT scans and corresponding segmentation masks for 127 anatomic structures, including bones, organs, and tumors.

Med-Art is designed to generate medical images even when the training data is limited. It uses a diffusion transformer (DiT) to generate images from text prompts. By incorporating LLaVA-NeXT as a visual language model (VLM) to create detailed descriptions of the medical images through prompts and fine-tuning with LoRA, the model captures medical semantics more effectively. This allows Med-Art to generate high-quality medical images despite limited training data.

Despite their strengths, diffusion models face several limitations, including high computational demands, limited clinical validation, and limited generalizability. Moreover, most of the existing works fail to capture the demographic diversity (such as age, ethnicity, and gender), which may introduce biases in the downstream tasks.
Tabular data
Tabular data is one of the most important data formats in many domains, such as healthcare, finance, education, transportation, and psychology, but its availability is restricted due to data privacy regulations. Moreover, challenges like missing values and class imbalances limit its usefulness for machine learning models.
Synthetic tabular data generation is a promising direction to overcome these challenges by learning the distribution of the tabular data. We will discuss in detail the main categories for tabular data generation (GANs, diffusion, and LLM-based methods) and their limitations.

GANs
As discussed above, generative adversarial networks (GANs) pair a generator that produces synthetic samples with a discriminator that distinguishes real data from fake data. Both networks are trained simultaneously until the generated data is indistinguishable from the real data, after which the generator can synthesize new samples from random noise.
In the case of tabular data generation, the architecture is modified to accommodate categorical features. For instance, TabFairGan uses a two-stage training process: first, generating synthetic data similar to the reference dataset, and then enforcing a fairness constraint to ensure the generated data is both accurate and fair. Conditional GANs like CTGAN allow conditional generation of tabular data based on feature constraints, such as generating health records for male patients. To ensure differential privacy during training, calibrated noise is added to the gradients, as done in DPGAN. This mechanism ensures that individual records cannot be inferred from the model.
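As a sketch of how such a model is used in practice, the snippet below fits CTGAN via the open-source ctgan package; the file name and column names are hypothetical.

```python
# pip install ctgan
import pandas as pd
from ctgan import CTGAN

real = pd.read_csv("transactions.csv")                # hypothetical financial table
discrete_columns = ["merchant_category", "is_fraud"]  # categorical features must be declared

model = CTGAN(epochs=300)
model.fit(real, discrete_columns)

synthetic = model.sample(1_000)                       # 1,000 synthetic rows
synthetic.to_csv("synthetic_transactions.csv", index=False)
```

Declaring the categorical columns explicitly is what lets the model handle mixed data types alongside continuous features.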
Despite the progress in synthetic tabular data generation, these methods still face limitations. GAN-based methods often suffer from training instability, model collapse, and poor representation of multimodal distributions, leading to synthetic datasets that fail to reflect real-world complexity.
Diffusion models
Diffusion models generate synthetic data in two stages: a forward process that gradually adds noise to the data and a reverse (denoising) process that reconstructs the data step by step from the noise. Recent works have adapted this approach for tabular data. TabDDPM modifies the diffusion process to accommodate the structural characteristics of tabular data and outperforms GAN-based models. AutoDiff combines autoencoders with diffusion, encoding tabular data into a latent space before applying the diffusion process. This method effectively handles heterogeneous features, mixed data types, and complex inter-column dependencies, resulting in more accurate and structured synthetic tabular data.

Domain-specific adaptations have also emerged. For example, TabDDPM-EHR applies TabDDPM to generate high-quality electronic health records (EHRs) while preserving the statistical properties of original datasets. Similarly, FinDiff is designed for the financial domain, producing high-fidelity synthetic financial tabular data suitable for various downstream tasks, such as economic scenario modelling, stress tests, and fraud detection.
However, generating high-quality, realistic tabular data in specialized domains such as healthcare and finance requires domain expertise. For example, synthesizing medical results for patients with heart disease requires knowledge that the probability of having heart disease increases with age. Most existing generative models learn only the statistical distribution of the raw data without incorporating domain rules. As a result, the synthetic data may match the overall distribution but violate logical and domain constraints.
LLM-based Models
Recently, large language models (LLMs) have been explored for generating synthetic tabular data. One common approach is in-context learning (ICL), which enables language models to perform tasks based on input-output examples without parameter updates or fine-tuning. This capability allows models to generalize to new tasks by embedding examples directly in the input prompt. By converting the tabular dataset into text-like formats and carefully designing the generation prompts, LLMs can generate synthetic tabular data.
For instance, EPIC improves class balance by providing LLMs with balanced and consistently formatted samples. However, directly prompting LLMs for synthetic tabular data generation may lead to inaccurate or misleading samples that deviate from user instructions.
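A minimal in-context-learning sketch is shown below. The `complete` callable stands in for any LLM API and is a hypothetical placeholder, and the CSV-style prompt format is just one reasonable choice.

```python
import pandas as pd

def build_prompt(df: pd.DataFrame, n_examples: int = 8, n_new: int = 5) -> str:
    """Embed a handful of real rows in the prompt and ask for new rows in the same format."""
    header = ", ".join(df.columns)
    rows = [", ".join(map(str, r)) for r in df.sample(n_examples).itertuples(index=False)]
    return (
        "You generate synthetic tabular data.\n"
        f"Columns: {header}\n"
        "Examples:\n" + "\n".join(rows) + "\n"
        f"Generate {n_new} new rows in the same CSV format, one per line:"
    )

def synthesize(df: pd.DataFrame, complete, n_new: int = 5) -> pd.DataFrame:
    text = complete(build_prompt(df, n_new=n_new))   # LLM call (hypothetical helper)
    rows = [line.split(",") for line in text.strip().splitlines()]
    return pd.DataFrame(rows, columns=df.columns)
```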

To overcome this limitation, recent works propose fine-tuning LLMs on tabular data, enabling them to better understand the structural constraints and relationships within tabular datasets. Fine-tuning ensures that the output aligns with real-data distributions and domain-specific knowledge. For example, TAPTAP pre-trains on a large amount of real-world tabular data and can generate high-quality tabular data for various applications, including privacy protection and handling missing values, limited data, and imbalanced classes. HARMONIC reduces privacy risks by fine-tuning LLMs to capture data structure and inter-row relationships using an instruction-tuning dataset inspired by k-nearest neighbors. AIGT leverages metadata such as table descriptions as prompts, paired with a long-token partitioning algorithm, enabling the generation of large-scale tabular datasets.
Despite these advancements, LLM-based methods face several challenges. Prompted outputs are prone to hallucination, producing synthetic tabular data that include flawed examples, incorrect labels, or logically inconsistent values. In some cases, LLMs may even generate unrealistic or toxic instances, limiting their reliability.
Post-processing
Because the distribution of tabular data is highly complex, generating synthetic tabular data is challenging for both non-LLM and LLM-based methods. To address this, many post-processing techniques have been proposed.
Sample enhancement post-processing methods try to improve the quality of the synthetically generated tabular data by modifying feature values or filtering unreasonable samples. Label enhancement post-processing methods try to correct potential annotation errors in the synthetically generated data by manually re-annotating mislabeled samples. However, manual re-labeling is costly and impractical for large-scale data. To address this, many approaches rely on a proxy model, an automated model trained on real data, that can correct the labels in the synthetic dataset more efficiently.
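A simple version of such a label-enhancement step, assuming scikit-learn, NumPy arrays, and a random-forest proxy (any classifier would do), might look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def correct_labels(X_real, y_real, X_syn, y_syn, threshold=0.9):
    """Relabel synthetic rows where a proxy model trained on real data confidently disagrees."""
    proxy = RandomForestClassifier(n_estimators=200, random_state=0)
    proxy.fit(X_real, y_real)

    proba = proxy.predict_proba(X_syn)
    pred = proxy.classes_[proba.argmax(axis=1)]
    confident = proba.max(axis=1) >= threshold

    # Overwrite synthetic labels only where the proxy is both confident and disagrees.
    y_corrected = np.asarray(y_syn).copy()
    mask = confident & (pred != y_corrected)
    y_corrected[mask] = pred[mask]
    return y_corrected
```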

Meta-learning
TabPFN is a leading example of a tabular foundation model trained entirely on synthetic data. The model is pretrained on millions of synthetic tabular datasets generated with structural causal models, learning to predict masked targets from synthetic context. TabPFN adopts a transformer architecture, but not in the language-model sense. Instead of generating data like diffusion models or predicting the next token as LLMs do, it learns to model conditional distributions across many small supervised learning tasks, effectively learning how to learn from tabular data.
Although TabPFN performs well on small to medium-sized datasets, it is not yet optimized for large-scale datasets. Its performance depends on the quality and diversity of synthetic pretraining data, and generalization can drop when real data differs from the simulated distributions. In such cases, gradient boosting and ensemble methods like XGBoost, CatBoost, or AutoGluon outperform TabPFN, making it best suited for data-limited or prototyping scenarios.
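Usage follows the scikit-learn conventions of the tabpfn package; the dataset below is a stand-in chosen only because it is small.

```python
# pip install tabpfn scikit-learn
from tabpfn import TabPFNClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()    # pretrained on synthetic tasks; no gradient updates on your data
clf.fit(X_train, y_train)   # "fitting" stores the context used for in-context prediction
print(clf.score(X_test, y_test))
```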

Code generation
Code is one of the most used data formats across domains such as software engineering, education, cybersecurity, and data science. However, the availability of large-scale, high-quality code datasets is limited. Synthetic code generation is a promising solution to expand training datasets and improve code diversity.
Large language models (LLMs) have demonstrated remarkable capabilities in code generation. Coding assistants such as GitHub Copilot, Claude Code, and Cursor can generate functions, complete scripts, or even entire applications from prompts.
Code Llama is an open-weight code-specialized LLM that generates code from both code and natural language prompts. It can also be used for code completion and debugging. It supports many programming languages (Python, Java, PHP, Bash) and offers instruction-tuned variants, allowing it to follow developers’ prompts and style requirements.
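A hedged sketch of prompting Code Llama through Hugging Face transformers follows; the checkpoint id and generation settings are illustrative, and the 7B model still needs a GPU with sufficient memory.

```python
# pip install transformers accelerate torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"   # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```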
A recent example, Case2Code, leverages synthetic input-output transformations to train LLMs for inductive reasoning on code generation. The framework combines an LLM with a code interpreter to construct large-scale training samples. By focusing on functional correctness, it improves the ability of models to generalize.

Despite these advancements, synthetic code generation still faces limitations. LLMs often hallucinate, inventing functions or libraries that do not exist, so the generated code fails to run. However, code has a key advantage over other data types: it is possible to automatically check whether generated code compiles and passes unit tests, which enables an iterative feedback loop that improves quality over time. This self-correcting setup makes code generation one of the most practical areas for large-scale synthetic data creation and refinement.
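A sketch of such a verification loop is shown below; `generate_code` is a hypothetical placeholder for any code LLM call, and the check assumes pytest is installed.

```python
import pathlib
import subprocess
import tempfile

def passes_checks(candidate: str, test_code: str) -> bool:
    """Return True if the candidate code parses and its unit tests pass."""
    try:
        compile(candidate, "<candidate>", "exec")          # 1) does it even parse?
    except SyntaxError:
        return False
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp)
        (path / "solution.py").write_text(candidate)
        (path / "test_solution.py").write_text(test_code)
        result = subprocess.run(                            # 2) do the tests pass?
            ["python", "-m", "pytest", "-q", tmp],
            capture_output=True, timeout=60,
        )
    return result.returncode == 0

def generate_verified(prompt, tests, generate_code, max_tries=5):
    """Regenerate until a candidate passes the automatic checks (or give up)."""
    for _ in range(max_tries):
        candidate = generate_code(prompt)                   # LLM call (hypothetical)
        if passes_checks(candidate, tests):
            return candidate
    return None
```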
What’s next for synthetic data
Synthetic data is not perfect, but it has become very valuable in domains where access to real-world data is limited, constrained, or insufficient to train foundation models. When used with an awareness of its limitations, synthetic data can be a powerful complement to real datasets, enabling advancements in many different domains.