WonderWorld

Interactive 3D Scene Generation from a Single Image

¹Stanford University
²MIT
*Contributed Equally

CVPR 2025 (Highlight)

Interactive Scene Generation

WonderWorld supports real-time rendering and fast scene generation, so a user can navigate existing scenes and specify both where to generate a new scene and what it should contain. Here are examples where a user specifies scene contents (via text) and locations (via camera movement) to create a virtual world. The videos are sped up.

First-Person View

Fixed Bird's-Eye View

Minecraft

Holy Spirit Cathedral

Ho Chi Minh City Hall

Explore Generated Worlds Yourself

Keyboard: Move by "W/A/S/D", look around by "I/J/K/L".
Touch Screen: Move by one-finger drag, look around by two-finger drag.
Note: Click an image to load the corresponding virtual world example. After loading, click on the canvas to activate the controls. Rendering is done on your device in real time. Loading an example (~100MB) may take a while.

Rendering Generated Worlds

Here are some examples of generated worlds with different world layouts: rotational, curvy, and straight.

University Pathway

Minecraft

Venice

Approach

WonderWorld takes a single image as input and generates connected, diverse 3D scenes that together form a virtual world. Users specify the contents and style of each new scene via text, and choose where to generate it via camera movement, which is possible because our system renders in real time. Our system generates a single 3D scene in less than 10 seconds thanks to our Fast LAyered Gaussian Surfels (FLAGS) representation, which has two key designs. First, our layered design requires only a single image to generate a scene, whereas existing scene generation methods require progressively generating multiple views. Second, our surfel design enables a geometry-based initialization, so that optimization is conceptually a "fine-tuning" that is much faster than with other representations (e.g., NeRF and Gaussian Splatting) that need to optimize geometry from scratch.
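To make the idea of a geometry-based initialization more concrete, here is a minimal sketch of how one surfel per pixel could be initialized from a depth map. It is an illustration under stated assumptions, not the WonderWorld implementation: the function name `init_surfels_from_depth`, the pinhole intrinsics `fx, fy, cx, cy`, the finite-difference normals, and the pixel-footprint scale rule are all hypothetical choices made for this example.

```python
import numpy as np


def init_surfels_from_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map into one surfel per pixel (camera frame).

    Hypothetical helper for illustration only. Assumes a pinhole camera
    looking down +z and a depth map of shape (H, W) with H, W >= 2.
    Returns positions (N, 3), unit normals (N, 3), and scales (N, 2).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))

    # Pinhole back-projection: pixel (u, v) at depth d maps to a 3D point.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    positions = np.stack([x, y, depth], axis=-1)  # (H, W, 3)

    # Assumption: estimate normals from finite-difference surface tangents.
    tangent_u = np.gradient(positions, axis=1)
    tangent_v = np.gradient(positions, axis=0)
    normals = np.cross(tangent_u, tangent_v)
    lengths = np.linalg.norm(normals, axis=-1, keepdims=True)
    normals = normals / np.clip(lengths, 1e-12, None)

    # Flip normals so that they face the camera at the origin.
    facing_away = np.sum(normals * positions, axis=-1, keepdims=True) > 0
    normals = np.where(facing_away, -normals, normals)

    # Assumption: each surfel covers roughly one pixel footprint at its depth.
    scales = np.stack([depth / fx, depth / fy], axis=-1)

    return positions.reshape(-1, 3), normals.reshape(-1, 3), scales.reshape(-1, 2)
```

Starting from values like these, a subsequent optimization only has to refine geometry that is already roughly correct, which is the sense in which the optimization is conceptually a "fine-tuning".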

Abstract

We present WonderWorld, a novel framework for interactive 3D scene generation that enables users to interactively specify scene contents and layout and see the created scenes with low latency. The major challenge lies in achieving fast generation of 3D scenes. Existing scene generation approaches fall short on speed, as they often require (1) progressively generating many views and depth maps, and (2) time-consuming optimization of the scene representations. We introduce Fast LAyered Gaussian Surfels (FLAGS) as our scene representation, together with an algorithm to generate it from a single view. Our approach does not need multiple views, and it leverages a geometry-based initialization that significantly reduces optimization time. Another challenge is generating coherent geometry that allows all scenes to be connected. We introduce guided depth diffusion, which allows partial conditioning of depth estimation. WonderWorld generates connected and diverse 3D scenes in less than 10 seconds on a single A6000 GPU, enabling real-time user interaction and exploration. We demonstrate the potential of WonderWorld for user-driven content creation and exploration in virtual environments.