|
Delta-Triplane Transformers as Occupancy World Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Delta-Triplane Transformers (DTT), a 4D Occupancy World Model (OWM) for autonomous driving, aiming to address three key limitations of existing OWMs: loss of vertical spatial information, long-term prediction error accumulation, and excessive model complexity. The core designs of DTT include: 1) A pretrained triplane autoencoder that compactly preserves spatial information of 3D occupancy; 2) Multi-scale Transformers that predict "triplane deltas" (instead of full occupancy states) to leverage the sparsity of changes for reduced modeling difficulty; 3) A sparse query-based motion planning module designed using these deltas to simplify the decision-making process. Experiments on the nuScenes and Occ3D datasets validate DTT’s superiority: compared to the state-of-the-art (SOTA) method DOME, DTT improves mean IoU (mIoU) from 27.10% to 30.85% and IoU from 36.36% to 74.58%, while achieving real-time inference on an RTX 4090 GPU.
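For readers comparing the two headline numbers, the following is a minimal sketch of how IoU and mIoU are commonly computed on semantic voxel grids. It is not taken from the paper: the `FREE` label index, the function names, and the choice to skip classes absent from both grids are assumptions made for illustration.

```python
import numpy as np

FREE = 0  # hypothetical label index for empty voxels (assumption)


def binary_iou(pred, target, free_label=FREE):
    """Geometric IoU: occupied vs. free, ignoring semantic classes."""
    pred_occ = pred != free_label
    target_occ = target != free_label
    union = np.logical_or(pred_occ, target_occ).sum()
    if union == 0:
        return 1.0  # both grids are entirely free
    return np.logical_and(pred_occ, target_occ).sum() / union


def mean_iou(pred, target, num_classes, free_label=FREE):
    """Semantic mIoU: per-class IoU averaged over non-free classes."""
    ious = []
    for cls in range(num_classes):
        if cls == free_label:
            continue
        pred_cls = pred == cls
        target_cls = target == cls
        union = np.logical_or(pred_cls, target_cls).sum()
        if union == 0:
            continue  # class absent from both grids; skipped (assumption)
        ious.append(np.logical_and(pred_cls, target_cls).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


# Four voxels: geometry is predicted perfectly, one semantic label is wrong.
target = np.array([0, 1, 1, 2])
pred = np.array([0, 1, 2, 2])
iou = binary_iou(pred, target)
miou = mean_iou(pred, target, num_classes=3)
```

In this toy case the geometric IoU is perfect while the mIoU is only one half, which shows how a model can score far higher on IoU than on mIoU.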
1. **Clarity of Writing:** The paper is very clearly structured and written, allowing readers to easily follow the authors' reasoning about the DTT method and its components.
2. **Clear Motivation:** The motivation for DTT is well-justified—by adopting incremental modeling and leveraging sparsity (via delta prediction), the method significantly reduces computational burden compared to full-state prediction approaches.
3. **Relevance to Research Needs:** The research addresses a topic of high interest in autonomous driving. The achievement of real-time inference on an RTX 4090 makes DTT more relevant to real-world deployment than slower baselines.
4. **Compelling Experimental Results:** The experiments clearly demonstrate that DTT improves both computational efficiency and core performance metrics, avoiding the common trade-off between speed and accuracy.
1. **Insufficient Theoretical Analysis of Methodology:** The paper relies primarily on experimental results to validate its design choices, for example pretraining the triplane autoencoder first and then freezing the encoder while training DTT and the decoder. While Table 4 shows that omitting pretraining degrades performance, the authors offer no intuitive or theoretical explanation of why this is the case. For instance, why does end-to-end training (without separating pretraining and fine-tuning) not yield better results? Or is end-to-end training too challenging (e.g., due to high dimensionality or unstable gradients) and prone to converging to poor local optima?
2. **Incomplete Analysis of Cross-Input Performance:** From Table 1, DTT achieves significant performance gains when the input is 3D occupancy ground truth (3D-Occ). However, when the input is replaced with camera-derived data (Camera), the improvement margin shrinks substantially—for example, DOME-F outperforms DTT-F significantly at 0s and 1s. The authors need to provide a detailed explanation for this discrepancy (e.g., whether camera-based 3D occupancy predictions introduce noise that disproportionately affects DTT’s delta-based modeling).
3. **Cumulative Error Risks:** DTT adopts an autoregressive framework that predicts the next frame using historical information. This raises a critical question: Does DTT still suffer from cumulative error over time? If the model is tasked with predicting longer occupancy sequences (e.g., beyond 3 seconds), will severe prediction drift occur? The current experiments only evaluate up to 3 seconds, and no analysis of long-horizon stability is provided.
4. **Motion Planning Performance Trends:** In terms of collision rate (Table 2), DTT-O performs significantly worse than OccLLaMA-O at 1s and 2s but outperforms it at 3s. The authors need to explain this counterintuitive trend. For example, is DTT less effective at short-term (near-future) frame prediction, and if so, what causes this delay in performance improvement?
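To make the concern in point 3 concrete, here is a toy sketch of an autoregressive delta rollout. It is not DTT: the latent is a flat vector, and the constant per-step `bias` is a hypothetical stand-in for a small systematic prediction error. The point is only that when each state is obtained as the previous state plus a predicted delta, per-step errors add up with the horizon.

```python
import numpy as np


def rollout(initial_state, deltas):
    """Accumulate deltas autoregressively: s_{t+1} = s_t + delta_t."""
    states = []
    state = initial_state.copy()
    for delta in deltas:
        state = state + delta
        states.append(state)
    return np.stack(states)


rng = np.random.default_rng(0)
horizon, dim = 12, 64  # hypothetical rollout length and latent size
initial_state = rng.normal(size=dim)
true_deltas = rng.normal(scale=0.1, size=(horizon, dim))

# Hypothetical systematic error added to every predicted delta.
bias = np.full(dim, 0.01)
predicted_deltas = true_deltas + bias

true_states = rollout(initial_state, true_deltas)
pred_states = rollout(initial_state, predicted_deltas)

# Distance between predicted and true state at each step of the rollout.
drift = np.linalg.norm(pred_states - true_states, axis=1)
```

Here `drift` grows linearly with the step index, since the same bias is added at every step. A learned predictor that conditions on its own previous outputs could in principle damp or amplify this, which is why an evaluation beyond 3 seconds would be informative.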
Please refer to the "Weaknesses" section for detailed questions and suggestions. |
Lightly AI-edited |
|
Delta-Triplane Transformers as Occupancy World Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces Delta-Triplane Transformers (DTT), a novel occupancy world model designed for autonomous driving. The core idea is to represent the 3D environment using triplane latent features and to model occupancy changes (deltas) over time rather than full occupancy states. By leveraging triplane representations, the method preserves vertical (z-axis) structural information that conventional BEV-based approaches tend to lose, avoiding the need for fine-grained semantic cues or large model capacity. Moreover, instead of predicting the entire occupancy state, DTT focuses on modeling occupancy changes (deltas). Since these deltas are sparser and more concentrated around zero, they exhibit lower variance and simplify the learning process. Experiments on the Occ3D and nuScenes datasets demonstrate that DTT achieves SOTA results in both occupancy forecasting and motion planning tasks.
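The sparsity argument can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's data: each frame is a hypothetical 32x32 feature plane, and only 20 randomly chosen cells change between consecutive frames.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width = 32, 32   # hypothetical plane resolution
num_frames = 8           # hypothetical sequence length
changed_cells = 20       # hypothetical number of cells that change per step

# Build a sequence in which most of the scene is static.
frames = [rng.normal(size=(height, width))]
for _ in range(num_frames - 1):
    change = np.zeros(height * width)
    idx = rng.choice(height * width, size=changed_cells, replace=False)
    change[idx] = rng.normal(size=changed_cells)
    frames.append(frames[-1] + change.reshape(height, width))
frames = np.stack(frames)

# Frame-to-frame deltas, i.e. what a delta predictor would regress.
deltas = np.diff(frames, axis=0)

state_nonzero_fraction = np.count_nonzero(frames) / frames.size
delta_nonzero_fraction = np.count_nonzero(deltas) / deltas.size
state_variance = frames.var()
delta_variance = deltas.var()
```

In this construction nearly every entry of the full state is nonzero, while at most 20 of 1024 entries of each delta are, and the delta variance is much smaller than the state variance. Whether the same holds for learned triplane latents on real driving data is an empirical question.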
1. Comprehensive methodology and clear architecture design.
The paper provides a detailed explanation of the encoder, decoder, delta-based occupancy predictor, and motion planner, as well as how these modules interact within the overall framework. The description of the temporal triplane prediction module, in particular, is well-articulated and technically sound.
2. Novel representation design.
The adoption of triplane as an intermediate latent representation is novel and effective. It addresses key limitations of existing BEV-based occupancy world models, which often lose vertical geometric information and rely on large feature maps for compensation.
3. Impressive results with lightweight architecture.
DTT achieves impressive performance on both Occ3D and nuScenes benchmarks, outperforming current occupancy-based methods while maintaining a smaller model size. This demonstrates strong potential for real-time and practical deployment.
1. Limited novelty in delta prediction.
Predicting changes (deltas) over time is not a new concept in temporal forecasting. The paper should clarify what makes delta prediction within the triplane representation particularly advantageous compared to delta prediction in other representations.
A suggested ablation study could compare variants such as OccLLaMA or OccWorld, where these methods also predict deltas instead of full states, to isolate the contribution of the triplane-delta combination.
2. Limited discussion on limitations and robustness.
The paper argues that occupancy deltas are sparse and thus easier to learn. However, this assumption might not hold under adverse conditions such as rain, snow, or dense sensor noise, where occupancy changes are no longer sparse. A brief discussion or empirical evidence regarding DTT’s robustness in such noisy or highly dynamic scenarios would strengthen the paper.
3. Inconsistency in figure section titles.
In Section 4.2, the heading “Visual comparisons with SOTA” is inconsistent with the following subsection title “Visual comparisons of motion prediction.” It would be clearer to use a consistent title such as “Visual comparisons of occupancy prediction.” In addition, the visualizations (e.g., Figure 6) could be enlarged to make the differences between methods clearer.
See above. |
Fully AI-generated |
|
Delta-Triplane Transformers as Occupancy World Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Delta-Triplane Transformers (DTT), a new 4D occupancy world model (OWM) for autonomous driving. Unlike previous works (e.g., DOME, OccWorld), DTT does not predict the full occupancy state but instead models changes (deltas) in a compact triplane representation (xy/xz/yz). By leveraging separate multi-scale Transformers per plane, DTT predicts sparse occupancy changes and fuses them to reconstruct future scenes and ego trajectories. The method achieves state-of-the-art results in motion planning and 3D-Occ-based 4D occupancy prediction and runs in real time (26 FPS).
1. Triplane representation preserves vertical 3D information and yields a compact latent space, helping reduce drift in long-term prediction.
2. Modeling occupancy deltas is intuitively efficient and effective, as there is no need to "copy" the existing state into the prediction.
3. State-of-the-art results: consistent improvement over DOME, OccWorld, and OccLLaMA, both in accuracy (mainly in 3D-Occ-based 4D occupancy prediction) and efficiency.
4. The experiments and supplementary materials are rich. The writing and figures are clear overall.
1. Lack of analysis of why learning the "changes" is easier. First, sparse does not equal easier (line 085). Second, the full state is not merely hard to learn: for works like DOME, error accumulation may additionally come from exposure bias. The argument is intuitive, but if we compare the results in Table 1 with the "w/o triplane changes" row of Table 3, the mIoU is inferior to that of DOME. So perhaps the slower error accumulation relative to DOME in Table 1 comes mainly from the triplane representation rather than from delta estimation?
2. In lines 455-456, the authors claim that predicting everything in parallel hinders autoregressive error correction. This is the first time I have seen autoregression associated with error "correction" rather than "accumulation". More discussion of this point would be welcome.
3. Lack of analysis of the camera-based 4D occupancy forecasting results in Table 1. Why is DTT clearly inferior to DOME?
1. Why compare only against OccWorld in the qualitative experiments and the supplementary materials? The reviewer considers DOME a better choice, as it is the current state of the art. |
Fully human-written |
|
Delta-Triplane Transformers as Occupancy World Models |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper proposes Delta-Triplane Transformers (DTT), a 4D occupancy world model that represents scenes via a compact temporal triplane latent and predicts future states by modeling sparse deltas instead of full occupancy, which are then decoded for occupancy forecasting and used as sparse queries for motion planning. DTT pretrains an autoencoder to obtain a triplane latent, applies plane-specific multi-scale Transformers to predict future deltas autoregressively, and couples these with a planning module that attends to change features and history to output trajectories.
1. Modeling occupancy deltas in a triplane latent exploits sparsity, reduces variance, and enables lighter sequence models while preserving vertical structure versus BEV-only latents.
2. The performance is better than that of previous occupancy world models.
1. The paper claims improvements in motion planning but neglects much of the related work. It does not situate contributions against recent intention-aware or end-to-end planning approaches such as World4Drive, BEV-Planner, GenAD, and PPAD.
2. nuScenes is a small dataset. Would it be possible to evaluate the approach on a larger dataset, such as Waymo [4], or perform the motion planning experiments on NAVSIM?
3. Qualitative results are limited. More diverse and challenging scenarios and explicit failure-case analyses would make the qualitative story more convincing.
4. In Figure 6’s first example, the 2.0s–2.5s frames for the middle cars appear to exhibit stretching/residual artifacts, and in places, OccWorld looks cleaner.
1. How does DTT perform under closed-loop evaluation and distribution shift (e.g., NAVSIM or nuPlan), and does the delta modeling reduce compounding error in closed-loop rollouts relative to full-state predictors?
2. Figure 4 caption contains a “traiplane” typo.
3. Why is the reconstruction performance of the camera input setting in the last row in Table 1 lower? |
Lightly AI-edited |