Joint Embedding Predictive Architectures (JEPAs) and World Models have emerged as foundational paradigms advancing autonomous artificial intelligence (AI). This survey provides an integrated account of their theoretical foundations, methodological innovations, and empirical evidence. We synthesize how JEPAs leverage self-supervised latent prediction to learn semantic, transferable representations across modalities, while World Models construct internal simulations for planning and control via action-conditioned dynamics. By mapping architectural convergences, benchmarking performance, and analyzing deployment trends, we highlight the growing alignment between representation learning for perception and predictive modeling for sequential decision-making. The discussion spans JEPA variants and modern world models, surfaces open challenges in stability, uncertainty, and scalability, and identifies hybrid designs that unify perception and control. Overall, this work offers a comprehensive framework and critical outlook on integrating JEPAs and World Models as complementary substrates for general, robust, and efficient AI.
Joint-Embedding Predictive Architectures (JEPAs) learn by predicting representations of masked or future content from representations of visible context. That design sidesteps full generative reconstruction of high-dimensional signals while still encouraging models to internalize structure, dynamics, and semantics—a natural fit for world-model narratives in which an agent tracks how states evolve.
The figure below sketches how the field has branched by modality and objective after the early conceptual and image-focused work.

Figure 1: Chronological overview from the predictive world-model framing and hierarchical JEPA ideas through image (I-JEPA), stacked and audio variants, video and 3D instantiations, multimodal text–image JEPA, large-scale video world models, LLM-oriented JEPA, and robotics-flavored point-cloud applications.
The survey maps architectures (encoders, predictors, targets, stop-gradient and EMA teachers where used), modalities (image, video, audio spectrograms and waveforms, points, text-image, language), and objectives (pure latent prediction vs. combinations with variance–invariance–covariance style regularizers, masked targets, distillation, or discriminative side tasks). A recurring theme is deciding what should be predicted in embedding space so that downstream planning, control, or retrieval benefits without paying the cost of pixel- or token-level decoding at scale.
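To make the variance–invariance–covariance style of regularizer concrete, here is a hedged NumPy sketch of the three terms as they are commonly described (function name, weights, and constants are illustrative, not a faithful reproduction of any specific paper): an invariance term pulling two branches together, a hinge on per-dimension standard deviation to prevent collapse, and an off-diagonal covariance penalty to decorrelate features.

```python
import numpy as np

def vic_terms(z_a, z_b, gamma=1.0, eps=1e-4):
    """Return (invariance, variance, covariance) penalties for two batches
    of embeddings z_a, z_b with shape (n, d). Sketch only."""
    n, d = z_a.shape
    # Invariance: matched embeddings from the two branches should agree.
    inv = np.mean((z_a - z_b) ** 2)
    # Variance: hinge keeps each dimension's std above gamma (anti-collapse).
    std_a = np.sqrt(z_a.var(axis=0) + eps)
    std_b = np.sqrt(z_b.var(axis=0) + eps)
    var = (np.mean(np.maximum(0.0, gamma - std_a))
           + np.mean(np.maximum(0.0, gamma - std_b)))
    # Covariance: penalize off-diagonal entries of each branch's covariance.
    def cov_pen(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return np.sum(off ** 2) / d
    cov = cov_pen(z_a) + cov_pen(z_b)
    return inv, var, cov

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 8))
inv, var, cov = vic_terms(z, z.copy())  # identical branches: zero invariance
```

A collapsed representation (all embeddings equal) drives the variance hinge toward its maximum, which is exactly why such regularizers pair naturally with pure latent-prediction objectives.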
Implementation choices—masking strategies, multi-scale hierarchy, dataset scale, and evaluation protocols—strongly affect whether representations transfer to segmentation, action recognition, robotics, or clinical pipelines. Open problems include tighter theory for what latent prediction implies about causality and identifiability, best practices for multimodal JEPA, and fair benchmarking against contrastive and masked-autoencoder baselines.
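Masking strategy is one of the implementation choices above that most visibly shapes what gets learned. As an illustration only (the block-scale range and grid size here are invented for the sketch, not any paper's recipe), a block-masking scheme on a patch grid samples a contiguous rectangular target region, with the visible context being its complement:

```python
import numpy as np

def sample_block_mask(grid=8, scale=(0.15, 0.2), rng=None):
    """Sample one rectangular target block on a grid x grid patch map.

    Returns a boolean array where True marks target patches to predict.
    The scale range (fraction of patches masked) is illustrative.
    """
    rng = rng or np.random.default_rng()
    area = rng.uniform(*scale) * grid * grid
    h = max(1, int(round(np.sqrt(area))))   # near-square block
    w = max(1, int(round(area / h)))
    h, w = min(h, grid), min(w, grid)
    top = rng.integers(0, grid - h + 1)
    left = rng.integers(0, grid - w + 1)
    mask = np.zeros((grid, grid), dtype=bool)
    mask[top:top + h, left:left + w] = True
    return mask

mask = sample_block_mask(rng=np.random.default_rng(0))
context = ~mask  # visible patches are the complement of the target block
```

Contiguous blocks force the predictor to infer semantics over spatial extent rather than interpolate from immediate neighbors, which is one reason masking design interacts so strongly with downstream transfer.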
The paper is on SSRN (DOI 10.2139/ssrn.5772122). For a curated, updated list of papers, code, and tutorials, see awesome-jepa.