Synthetic Data Generation (SDG) often appears adequate by distributional tests yet fails under domain shift, privacy audits, or safety analyses because measurement physics and causal structure are absent. This survey introduces Realistic Artificial Data (RAD), also known as Realistic Synthetic Data (RSD), a causally grounded subset of SDG that encodes data-generating processes and measurement systems to support evaluation of fidelity, causal recoverability, stability, and sim-to-real transfer. We synthesize recurring SDG failures (mode collapse, privacy leakage, correlation distortion, sensor mismatch) and emerging solutions: physics-calibrated rendering, causal simulators and digital twins, constraint-aware tabular generators, and LLM-guided synthesis with domain constraints. We consolidate evaluation frameworks around fidelity-utility-causality-stability reporting and identify promising RAD applications in safety-critical domains (autonomy, healthcare) alongside critical gaps in standardization and governance.
Synthetic data is used to protect privacy, balance rare classes, and scale training when collection is expensive or risky. Classical SDG often optimizes for marginal or joint resemblance to a reference corpus. That can suffice for some tasks, but practitioners and regulators increasingly ask whether synthetic records behave like real data under interventions, constraints, and distribution shift. This survey argues for placing realism (plausible mechanisms and consistency) and causality (structure that supports counterfactual and interventional reasoning) at the center of how we design and evaluate generators.
The figure below uses a traffic-intersection metaphor to contrast healthy modes and failure modes of synthetic data:

Figure 1: Contrasting real (baseline) and synthetic agents across representativeness, realism, novelty, and temporal coherence—visual shorthand for evaluation criteria discussed in the survey.
Realistic Artificial Data (RAD) designates generation that is causally aware, physically consistent, and privacy-safe, often via simulation, domain models, or hybrid pipelines that couple generative networks to explicit structure. The survey maps how the field is moving from statistical mimicry and deep generative models toward simulator-in-the-loop, physics-informed, and structural causal approaches whose assumptions are explicit enough to audit.
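To make the contrast with statistical mimicry concrete, the idea of a structural causal approach can be sketched as a tiny generator. This is an illustrative toy, not a method from the survey: the variables, coefficients, and logistic treatment rule are all hypothetical. Each variable is a function of its parents plus exogenous noise, so do-interventions are well defined and causal effects are recoverable by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, do_treatment=None):
    """Draw n synthetic records; do_treatment=0/1 forces do(treatment := value)."""
    age = rng.normal(50.0, 10.0, n)                    # exogenous cause
    if do_treatment is None:
        # Observational regime: treatment probability depends on age (a confounder).
        p = 1.0 / (1.0 + np.exp(-(age - 50.0) / 5.0))
        treatment = (rng.random(n) < p).astype(float)
    else:
        # Interventional regime: sever the age -> treatment edge.
        treatment = np.full(n, float(do_treatment))
    outcome = 0.05 * age + 1.5 * treatment + rng.normal(0.0, 1.0, n)
    return {"age": age, "treatment": treatment, "outcome": outcome}

obs = sample(10_000)                                   # observational sample
# Interventional contrast: recovers the true effect (1.5) by construction.
ate = (sample(10_000, do_treatment=1)["outcome"].mean()
       - sample(10_000, do_treatment=0)["outcome"].mean())
```

Because the mechanism is explicit, the interventional contrast recovers the true effect, while a naive treated-minus-untreated difference on the observational sample is inflated by the age confounder. A generator fit only to match the distribution of `obs` would reproduce that bias without exposing it, which is the auditability gap the structural approaches address.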
Methodologically, the paper threads together GANs, VAEs, diffusion and flow models, tabular and sequence synthesizers, LLM-driven generation, and agent- or rule-based simulators, with attention to identifiability and verification. On evaluation, it stresses multi-criteria assessment—fidelity, utility, fairness and coverage, robustness, and leakage risk—rather than a single score. Deployment discussions connect to privacy accounting, governance, and domain standards (health, mobility, finance, and beyond).
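Multi-criteria assessment can be sketched as a report with one number per axis rather than a single aggregate. The concrete metrics below (a per-feature quantile gap for fidelity, a nearest-neighbor match rate for leakage) are illustrative stand-ins chosen for brevity, not the survey's prescribed choices.

```python
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, (500, 3))
synth = real + rng.normal(0.0, 0.3, real.shape)   # toy "generator" output

def fidelity_gap(a, b):
    """Mean per-feature 1-Wasserstein distance via sorted-sample matching
    (assumes equal sample sizes)."""
    return float(np.mean(np.abs(np.sort(a, axis=0) - np.sort(b, axis=0))))

def leakage_rate(real, synth, eps=1e-6):
    """Fraction of synthetic rows that (near-)duplicate a real row."""
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=-1).min(axis=1)
    return float((d < eps).mean())

report = {"fidelity_gap": fidelity_gap(real, synth),
          "leakage_rate": leakage_rate(real, synth)}
```

The point of reporting axes separately is visible even in this toy: a generator that memorizes real rows can score perfectly on fidelity while failing the leakage check, so averaging the criteria into one score would hide exactly the failure mode it should surface.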
The full survey is available on SSRN (DOI 10.2139/ssrn.5679762). For a living bibliography of tools, datasets, papers, and implementations aligned with SDG and RAD, see the curated awesome-sdg-rad repository.