Synthetic Data Generation (SDG) often appears adequate by distributional tests yet fails under domain shift, privacy audits, or safety analyses because measurement physics and causal structure are absent. This survey introduces Realistic Artificial Data (RAD), also known as Realistic Synthetic Data (RSD), a causally grounded subset of SDG that encodes data-generating processes and measurement systems to support evaluation of fidelity, causal recoverability, stability, and sim-to-real transfer. We synthesize recurring SDG failures (mode collapse, privacy leakage, correlation distortion, sensor mismatch) and emerging solutions: physics-calibrated rendering, causal simulators and digital twins, constraint-aware tabular generators, and LLM-guided synthesis with domain constraints. We consolidate evaluation frameworks around fidelity-utility-causality-stability reporting and identify promising RAD applications in safety-critical domains (autonomy, healthcare) alongside critical gaps in standardization and governance.
Synthetic data is used to protect privacy, balance rare classes, and scale training when collection is expensive or risky. Classical SDG often optimizes for marginal or joint resemblance to a reference corpus. That can suffice for some tasks, but practitioners and regulators increasingly ask whether synthetic records behave like real data under interventions, constraints, and distribution shift. This survey argues for placing realism (plausible mechanisms and consistency) and causality (structure that supports counterfactual and interventional reasoning) at the center of how we design and evaluate generators.
The figure below uses a traffic-intersection metaphor to contrast healthy modes and failure modes of synthetic data:

Figure 1: Contrasting real (baseline) and synthetic agents across representativeness, realism, novelty, and temporal coherence—visual shorthand for evaluation criteria discussed in the survey.
Realistic Artificial Data (RAD) designates generation that is causally aware, physically consistent, and privacy-safe, often via simulation, domain models, or hybrid pipelines that couple generative networks to explicit structure. The survey maps how the field is moving from statistical mimicry and deep generative models toward simulator-in-the-loop, physics-informed, and structural causal approaches whose assumptions are explicit enough to audit.
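To make the contrast with statistical mimicry concrete, the idea of a structural causal approach can be sketched as a tiny generator. This is an illustrative toy, not a method from the survey: the variables, coefficients, and logistic treatment rule are all hypothetical. Each variable is a function of its parents plus exogenous noise, so do-interventions are well defined and causal effects are recoverable by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, do_treatment=None):
    """Draw n synthetic records; do_treatment=0/1 forces do(treatment := value)."""
    age = rng.normal(50.0, 10.0, n)                    # exogenous cause
    if do_treatment is None:
        # Observational regime: treatment probability depends on age (a confounder).
        p = 1.0 / (1.0 + np.exp(-(age - 50.0) / 5.0))
        treatment = (rng.random(n) < p).astype(float)
    else:
        # Interventional regime: sever the age -> treatment edge.
        treatment = np.full(n, float(do_treatment))
    outcome = 0.05 * age + 1.5 * treatment + rng.normal(0.0, 1.0, n)
    return {"age": age, "treatment": treatment, "outcome": outcome}

obs = sample(10_000)                                   # observational sample
# Interventional contrast: recovers the true effect (1.5) by construction.
ate = (sample(10_000, do_treatment=1)["outcome"].mean()
       - sample(10_000, do_treatment=0)["outcome"].mean())
```

Because the mechanism is explicit, the interventional contrast recovers the true effect, while a naive treated-minus-untreated difference on the observational sample is inflated by the age confounder. A generator fit only to match the distribution of `obs` would reproduce that bias without exposing it, which is the auditability gap the structural approaches address.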
Methodologically, the paper threads together GANs, VAEs, diffusion and flow models, tabular and sequence synthesizers, LLM-driven generation, and agent- or rule-based simulators, with attention to identifiability and verification. On evaluation, it stresses multi-criteria assessment—fidelity, utility, fairness and coverage, robustness, and leakage risk—rather than a single score. Deployment discussions connect to privacy accounting, governance, and domain standards (health, mobility, finance, and beyond).
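Multi-criteria assessment can be sketched as a report with one number per axis rather than a single aggregate. The concrete metrics below (a per-feature quantile gap for fidelity, a nearest-neighbor match rate for leakage) are illustrative stand-ins chosen for brevity, not the survey's prescribed choices.

```python
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, (500, 3))
synth = real + rng.normal(0.0, 0.3, real.shape)   # toy "generator" output

def fidelity_gap(a, b):
    """Mean per-feature 1-Wasserstein distance via sorted-sample matching
    (assumes equal sample sizes)."""
    return float(np.mean(np.abs(np.sort(a, axis=0) - np.sort(b, axis=0))))

def leakage_rate(real, synth, eps=1e-6):
    """Fraction of synthetic rows that (near-)duplicate a real row."""
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=-1).min(axis=1)
    return float((d < eps).mean())

report = {"fidelity_gap": fidelity_gap(real, synth),
          "leakage_rate": leakage_rate(real, synth)}
```

The point of reporting axes separately is visible even in this toy: a generator that memorizes real rows can score perfectly on fidelity while failing the leakage check, so averaging the criteria into one score would hide exactly the failure mode it should surface.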
The full survey is available on SSRN (DOI 10.2139/ssrn.5679762). For a living bibliography of tools, datasets, papers, and implementations aligned with SDG and RAD, see the curated awesome-sdg-rad repository.