|
Dreamland: Hybrid World Creation with Simulator and Generative Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Dreamland, a simulation-aware driving-scene generation model whose core design is a layered world abstraction (LWA) representation that bridges grounded simulations and the real world.
First, it constructs a grounded scene in the simulator and derives its layered abstraction in the visual (camera) perspective. This layered representation, containing a traffic-participant layer, a map-layout layer, and a background layer, is then processed by a trained instructional editing model that hallucinates details into the background layer, bringing it closer to real-world semantics. Finally, the refined layered representation serves as the condition for a fine-tuned photorealistic generation model.
The authors also include a dataset for training and evaluation, derived from another public driving-scene dataset. Experiments and further extensions to other simulators and tasks are also discussed.
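For concreteness, my reading of the pipeline is roughly the following (a hypothetical sketch; the names, shapes, and interfaces are my own assumptions, not the authors' code):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LayeredWorldAbstraction:
    traffic_layer: np.ndarray      # traffic participants rendered in the camera view
    map_layout_layer: np.ndarray   # road / lane layout rendered in the camera view
    background_layer: np.ndarray   # coarse simulator background, to be refined

def dreamland_generate(simulator, editor, generator, scene_spec, instruction):
    # Stage 1: build a grounded scene in the simulator and project it into layers.
    lwa = simulator.render_layers(scene_spec)  # -> LayeredWorldAbstraction
    # Stage 2: instructional editing enriches only the background layer so its
    # semantics move closer to the real-world distribution.
    lwa.background_layer = editor.edit(lwa.background_layer, instruction=instruction)
    # Stage 3: the refined layered representation conditions a fine-tuned
    # photorealistic generative model.
    return generator.generate(condition=lwa)
```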
**Interesting abstraction design.**
Although the layered scene concept can be traced back to Kimera [1], this layered world abstraction (LWA) can be seen as new in the autonomous driving literature. The way this paper uses the LWA as a conditioning signal for controllability is interesting.
**Clear paper writing and structuring.**
The paper is written in an easy-to-understand manner, and the procedural pipeline is clearly laid out with mathematical expressions. Details on how each model is trained and how the data is curated are also clear enough for reproducibility.
**Wide applicability for other domains.**
The core method and main experiments/applications remain autonomous-driving-centric, but the authors also provide a proof-of-concept verification on other domains such as robotic manipulation in Sec. 5.2 and Supp. D.3. This shows the LWA representation is suitable not only for outdoor scenes but also for indoor scenes, and the object-layout-background decomposition appears to be somewhat transferable.
```
[1] Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs.
```
**Limited real-world verification.**
In Sec. 5.4 and Supp. D.5, the authors fine-tune a VLM for real-image VQA, trying to show improved learning from enriched perception. Unfortunately, in my view, this has little to do with *embodied agents*: VQA tasks involve neither embodiments nor agents, which makes this framing inappropriate. The relevance to embodied-agent training remains speculative.
**Limited conceptual novelty.**
While the concrete LWA representation as a form of conditioning is new, the protocol of using a condition to bridge privileged simulators and visual perception is well discussed in recent research [1,2,3,4]. The core contribution is therefore limited to the design of the LWA condition itself, which serves mainly as a formal interface.
**Restricted validation domain.**
All main experiments center on autonomous-driving imagery, and the cross-domain demonstrations (robotics, indoor scenes) are qualitative only. The generalization claim of *universal hybrid world creation* lacks quantitative support.
**Evaluation focus on visual fidelity and controllability.**
Experiments report image-quality and controllability metrics, but no metrics for physical consistency, temporal stability, or task-level realism are provided. The metrics evaluated do not imply benefits for real embodied-agent training.
```
[1] Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models.
[2] LucidSim: Learning Visual Parkour from Generated Images.
[3] SimGen: Simulator-conditioned Driving Scene Generation.
[4] NVIDIA Omniverse Blueprint: Synthetic Manipulation Motion Generation for Robotics.
```
1. What is the exact digital form of the refined background layer? Is it arranged as pixels, each carrying a semantic label from a predefined vocabulary? Or is it a high-dimensional embedding, and if so, how is it supervised from the dataset perspective? (The two alternatives I have in mind are sketched after these questions.) Could you briefly explain the I/O data structure of each stage in the pipeline?
2. For the extension to robotic manipulation, how is the LWA defined for indoor scenes? I suppose it still follows an object-layout-background design, but how is each layer defined specifically? Could you show an example of how each item in Fig. 12 (b) corresponds to a layer?
3. Is there a particular reason why this pipeline cannot be verified with a real-world embodiment, other than time limitations before the paper submission?
4. What is the philosophy behind the LWA? Could you explain why the condition needs to be designed in this way? If it were not structured as layers but in some other way, I suppose the whole pipeline would still work? I suppose the background could still be changed with instructional editing models? |
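To make question 1 concrete, the two forms I have in mind are roughly the following (a hypothetical sketch with made-up shapes and vocabulary size, purely for illustration):

```python
import numpy as np

H, W, D = 512, 1024, 256   # hypothetical resolution and embedding dimension
NUM_CLASSES = 30           # hypothetical size of a predefined semantic vocabulary

# Option A: the refined background layer as a per-pixel semantic-label map.
background_labels = np.random.randint(0, NUM_CLASSES, size=(H, W), dtype=np.int64)

# Option B: the refined background layer as per-pixel high-dimensional embeddings,
# in which case it is unclear to me how the dataset would supervise these values.
background_embeddings = np.random.randn(H, W, D).astype(np.float32)
```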
Fully human-written |
|
Dreamland: Hybrid World Creation with Simulator and Generative Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a pipeline that bridges simulators and generative models through a layered world abstraction, which can be used to augment the training of embodied agents. The full pipeline uses editing models and conditional generative models for generation. The authors use this pipeline to construct the D3Sim dataset for training and benchmarking Sim2Real transfer between simulators and generative models.
- This paper proposes a layered world abstraction to bridge the gap between semantic maps generated by simulators and those generated by generative models.
- This paper constructs the D3Sim dataset to facilitate Sim2Real transfer.
- This paper conducts evaluations on downstream tasks to assess the effectiveness of Dreamland.
- The proposed pipeline relies heavily on models introduced in previous works, with only minor modifications. We expect the authors to clarify the innovations of this paper in more detail.
- Except for SimGen, this paper compares quantitatively against baseline methods only on image quality, which is incomplete. We expect more results on the controllability of other baseline methods, such as Panacea and MagicDrive.
see above. |
Fully human-written |
|
Dreamland: Hybrid World Creation with Simulator and Generative Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Dreamland, a hybrid world generation framework that integrates physics-based simulators with large-scale generative models for controllable world generation, especially for driving scenarios. Its core contribution is the Layered World Abstraction (LWA), which bridges the simulator domain and the real-world domain to achieve both physical controllability and photorealistic fidelity.
Additionally, the paper proposes D3Sim, a curated large-scale dataset containing paired simulated and real-world driving scenarios for training and evaluation. Experiments on image quality, controllability, and downstream tasks demonstrate that Dreamland achieves significant improvements over previous baselines such as SimGen, with 52.3% lower FID and 17.9% better controllability. Dreamland also generalizes to new simulators and video generation with minimal retraining.
1. Strong quantitative gains: significant improvements in FID and controllability, outperforming state-of-the-art models.
2. Novel hybrid formulation: clear, elegant bridge between simulator and generative domains via LWA.
3. Scalability: the pipeline can plug in newer pretrained models with minimal cost.
1. Compute and efficiency not detailed: training and inference costs for different model stages are not fully reported.
2. The paper could analyze the stability or convergence of LWA-Sim2Real adaptation more deeply.
3. While the framework is claimed to be general, experiments are mostly in driving scenes. A short cross-domain demo would strengthen the claim of universality.
1. How is temporal consistency handled in Dreamland beyond per-frame LWA conditioning?
2. Are there any failure cases where the pretrained generative model's world knowledge conflicts with the simulator conditions? |
Fully AI-generated |
|
Dreamland: Hybrid World Creation with Simulator and Generative Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes Dreamland, a hybrid world generation framework that combines physics-based simulators with large-scale pretrained generative models for controllable scene creation. The key contribution is a Layered World Abstraction (LWA) that serves as an intermediate representation to bridge simulators and generative models. The approach involves three stages: (1) scene construction with a physics-based simulator, (2) Sim2Real transfer via instructional editing to align with real-world distributions, and (3) scene rendering with a large-scale pretrained generative model. The proposed pipeline is generalizable and scalable, unlocking applications such as video generation and scene editing. Additionally, the paper constructs a large-scale dataset called D3Sim for training and benchmarking hybrid generation pipelines that combine simulators and generative models.
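For stage (3), my mental model is ControlNet-style layout conditioning of a pretrained diffusion model. The snippet below is only an off-the-shelf analogy using the standard diffusers API with a hypothetical condition image; whether Dreamland's rendering stage uses this exact mechanism is my assumption, not something stated here.

```python
# Illustration of layout-conditioned generation with a public ControlNet checkpoint;
# this is NOT the authors' stage-3 model, only an analogy for the conditioning idea.
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-seg")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
)

layout = Image.open("lwa_condition.png")  # hypothetical rendering of the layered condition
frame = pipe("a photorealistic driving scene", image=layout).images[0]
frame.save("rendered_scene.png")
```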
1. The proposed Layered World Abstraction (LWA) is a reasonable design enabling flexible control through preserved and editable regions, supporting diverse applications (scene editing, video generation, multiple simulators).
2. The proposed method achieves significant improvements compared to baseline (SimGen), 52.3% lower FID and 17.9% better si-RMSE. The experiments are comprehensive with quantitative metrics, user studies, ablations, and downstream task evaluations. The results in the demo video look fine.
3. D3Sim provides 60K paired samples with pixel-level alignment between simulation and real-world conditions for future research.
1. The technical novelty is limited. The pipeline primarily combines existing techniques (ACE++ for editing, standard diffusion adaptation). Innovation is mainly in integration and representation design rather than new methods.
2. Missing inference time comparisons with baselines. There are three stages, which might introduce additional processing time for generation.
3. Expensive Sim-Real paired data requirement. Stage-2 training requires costly pixel-aligned paired data. Unclear how to generalize to new simulators or domains without such data. Could we scale up the proposed method for more general usage, such as indoor scene generation?
4. Failure case analysis is absent. Where is the capability boundary of the proposed method? How could we achieve the best performance, and when will we fail?
5. For the multi-stage pipeline, error accumulation and robustness are not analyzed.
6. The proposed method first leverages a simulator to generate 3D scenes as conditions for generative rendering, which is similar to previous 3D-conditioned generation methods such as Shape-for-Motion [SIGGRAPH Asia 2025] and Image Sculpting [CVPR 2024]. The idea of a layered world representation is also widely used for scene generation, e.g., LayerPano3D [SIGGRAPH 2025], HunyuanWorld 1.0, and Scene4U [CVPR 2025].
Refer to the weaknesses above. |
Fully human-written |