WonderWorld: Interactive 3D Scene Generation from a Single Image

Interactive Scene Generation

WonderWorld supports real-time rendering and fast scene generation, so a user can navigate existing content and specify where and what to generate next. The examples below show a user specifying scene contents (via text) and locations (via camera movement).

First-Person View

Fixed Bird's-Eye View

Generated Virtual World

Here are some examples of generated scenes with different camera path styles: rotational, curvy, and straight.

Interactive Viewing

Keyboard: Move by "W/A/S/D", look around by "I/J/K/L".
Touch Screen: Move by one-finger drag, look around by two-finger drag.
Note: After loading, please click on the canvas to activate the controls. Rendering is done on your device in real time. Loading a scene (~100MB) may take a while.

Approach

WonderWorld takes a single image as input and generates connected, diverse 3D scenes that form a virtual world. Users specify new scene contents and styles via text, and specify where to generate new scenes via camera movement, which is possible because our system renders in real time. Our system generates a single 3D scene in less than 10 seconds thanks to our Fast LAyered Gaussian Surfels (FLAGS) representation, which has two key designs. First, our layered design requires only a single image to generate a scene, whereas existing scene generation methods require progressively generating multiple views. Second, our surfel design enables a geometry-based initialization, so that optimization is conceptually a "fine-tuning" that is much faster than for other representations (e.g., NeRF and Gaussian Splatting) that need to optimize geometry from scratch.
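To make the idea of a geometry-based initialization more concrete, here is a minimal sketch of how per-pixel surfel parameters could be seeded from an estimated depth map. It is an illustration under stated assumptions, not the WonderWorld implementation: the function name `init_surfels_from_depth`, the pinhole camera model, the finite-difference normal estimate, and the one-pixel-footprint scale rule are all hypothetical choices made for this example.

```python
import numpy as np

def init_surfels_from_depth(depth, focal, cx=None, cy=None):
    """Back-project a depth map into per-pixel surfel parameters.

    Hypothetical helper for illustration only.
    depth: (H, W) array of positive depths along the camera z-axis.
    focal: focal length in pixels (assumed equal for x and y).
    Returns positions (H*W, 3), unit normals (H*W, 3), and scales (H*W,).
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    # Assume the principal point is at the image center unless given.
    cx = (w - 1) / 2.0 if cx is None else cx
    cy = (h - 1) / 2.0 if cy is None else cy

    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Pinhole back-projection: X = (u - cx) * d / f, Y = (v - cy) * d / f, Z = d.
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    points = np.stack([x, y, depth], axis=-1)  # (H, W, 3)

    # Assumed normal estimate: cross product of finite differences
    # of the back-projected point map along image columns and rows.
    du = np.gradient(points, axis=1)
    dv = np.gradient(points, axis=0)
    normals = np.cross(du, dv)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-12

    # Orient normals toward the camera, which sits at the origin looking down +z.
    flip = normals[..., 2] > 0
    normals[flip] *= -1.0

    # Assumed scale rule: the footprint of one pixel at that depth.
    scales = depth / focal

    return points.reshape(-1, 3), normals.reshape(-1, 3), scales.reshape(-1)


# Usage: a flat wall 2 units in front of the camera.
positions, normals, scales = init_surfels_from_depth(np.full((4, 4), 2.0), focal=2.0)
```

Because positions, orientations, and sizes start from estimated geometry, a subsequent optimization only has to refine them, which is the sense in which the text above calls it a "fine-tuning".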

Abstract

We present WonderWorld, a novel framework for interactive 3D scene generation that enables users to interactively specify scene contents and layout and see the created scenes with low latency. The major challenge lies in achieving fast generation of 3D scenes. Existing scene generation approaches fall short on speed as they often require (1) progressively generating many views and depth maps, and (2) time-consuming optimization of the scene representations. We introduce Fast LAyered Gaussian Surfels (FLAGS) as our scene representation, along with an algorithm to generate it from a single view. Our approach does not need multiple views, and it leverages a geometry-based initialization that significantly reduces optimization time. Another challenge is generating coherent geometry that allows all scenes to be connected. We introduce guided depth diffusion, which allows partial conditioning of depth estimation. WonderWorld generates connected and diverse 3D scenes in less than 10 seconds on a single A6000 GPU, enabling real-time user interaction and exploration. We demonstrate the potential of WonderWorld for user-driven content creation and exploration in virtual environments.

BibTeX

@article{yu2024wonderworld,
  title={WonderWorld: Interactive 3D Scene Generation from a Single Image},
  author={Hong-Xing Yu and Haoyi Duan and Charles Herrmann and William T. Freeman and Jiajun Wu},
  journal={arXiv:2406.09394},
  year={2024}
}