Joint Embedding Predictive Architectures (JEPAs) and World Models have emerged as foundational paradigms advancing autonomous artificial intelligence (AI). This survey provides an integrated account of their theoretical foundations, methodological innovations, and empirical evidence. We synthesize how JEPAs leverage self-supervised latent prediction to learn semantic, transferable representations across modalities, while World Models construct internal simulations for planning and control via action-conditioned dynamics. By mapping architectural convergences, benchmarking performance, and analyzing deployment trends, we highlight the growing alignment between representation learning for perception and predictive modeling for sequential decision-making. The discussion spans JEPA variants and modern world models, surfaces open challenges in stability, uncertainty, and scalability, and identifies hybrid designs that unify perception and control. Overall, this work offers a comprehensive framework and critical outlook on integrating JEPAs and World Models as complementary substrates for general, robust, and efficient AI.
Joint-Embedding Predictive Architectures (JEPAs) learn by predicting representations of masked or future content from representations of visible context. That design sidesteps full generative reconstruction of high-dimensional signals while still encouraging models to internalize structure, dynamics, and semantics—a natural fit for world-model narratives in which an agent tracks how states evolve.
The figure below sketches how the field has branched by modality and objective after the early conceptual and image-focused work.

Figure 1: Chronological overview from the predictive world-model framing and hierarchical JEPA ideas through image (I-JEPA), stacked and audio variants, video and 3D instantiations, multimodal text–image JEPA, large-scale video world models, LLM-oriented JEPA, and robotics-flavored point-cloud applications.
The survey maps architectures (encoders, predictors, targets, stop-gradient and EMA teachers where used), modalities (image, video, audio spectrograms and waveforms, points, text-image, language), and objectives (pure latent prediction vs. combinations with variance–invariance–covariance style regularizers, masked targets, distillation, or discriminative side tasks). A recurring theme is deciding what should be predicted in embedding space so that downstream planning, control, or retrieval benefits without paying the cost of pixel- or token-level decoding at scale.
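To make the variance–invariance–covariance style of regularizer concrete, here is a hedged NumPy sketch of the three terms as they are commonly described (function name, weights, and constants are illustrative, not a faithful reproduction of any specific paper): an invariance term pulling two branches together, a hinge on per-dimension standard deviation to prevent collapse, and an off-diagonal covariance penalty to decorrelate features.

```python
import numpy as np

def vic_terms(z_a, z_b, gamma=1.0, eps=1e-4):
    """Return (invariance, variance, covariance) penalties for two batches
    of embeddings z_a, z_b with shape (n, d). Sketch only."""
    n, d = z_a.shape
    # Invariance: matched embeddings from the two branches should agree.
    inv = np.mean((z_a - z_b) ** 2)
    # Variance: hinge keeps each dimension's std above gamma (anti-collapse).
    std_a = np.sqrt(z_a.var(axis=0) + eps)
    std_b = np.sqrt(z_b.var(axis=0) + eps)
    var = (np.mean(np.maximum(0.0, gamma - std_a))
           + np.mean(np.maximum(0.0, gamma - std_b)))
    # Covariance: penalize off-diagonal entries of each branch's covariance.
    def cov_pen(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return np.sum(off ** 2) / d
    cov = cov_pen(z_a) + cov_pen(z_b)
    return inv, var, cov

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 8))
inv, var, cov = vic_terms(z, z.copy())  # identical branches: zero invariance
```

A collapsed representation (all embeddings equal) drives the variance hinge toward its maximum, which is exactly why such regularizers pair naturally with pure latent-prediction objectives.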
Implementation choices—masking strategies, multi-scale hierarchy, dataset scale, and evaluation protocols—strongly affect whether representations transfer to segmentation, action recognition, robotics, or clinical pipelines. Open problems include tighter theory for what latent prediction implies about causality and identifiability, best practices for multimodal JEPA, and fair benchmarking against contrastive and masked-autoencoder baselines.
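Masking strategy is one of the implementation choices above that most visibly shapes what gets learned. As an illustration only (the block-scale range and grid size here are invented for the sketch, not any paper's recipe), a block-masking scheme on a patch grid samples a contiguous rectangular target region, with the visible context being its complement:

```python
import numpy as np

def sample_block_mask(grid=8, scale=(0.15, 0.2), rng=None):
    """Sample one rectangular target block on a grid x grid patch map.

    Returns a boolean array where True marks target patches to predict.
    The scale range (fraction of patches masked) is illustrative.
    """
    rng = rng or np.random.default_rng()
    area = rng.uniform(*scale) * grid * grid
    h = max(1, int(round(np.sqrt(area))))   # near-square block
    w = max(1, int(round(area / h)))
    h, w = min(h, grid), min(w, grid)
    top = rng.integers(0, grid - h + 1)
    left = rng.integers(0, grid - w + 1)
    mask = np.zeros((grid, grid), dtype=bool)
    mask[top:top + h, left:left + w] = True
    return mask

mask = sample_block_mask(rng=np.random.default_rng(0))
context = ~mask  # visible patches are the complement of the target block
```

Contiguous blocks force the predictor to infer semantics over spatial extent rather than interpolate from immediate neighbors, which is one reason masking design interacts so strongly with downstream transfer.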
The paper is on SSRN (DOI 10.2139/ssrn.5772122). For a curated, updated list of papers, code, and tutorials, see awesome-jepa.