STDR: Spatio-Temporal Decoupling for Real-Time Dynamic Scene Rendering
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper presents a plug-and-play module designed to enhance dynamic scene reconstruction in 3D Gaussian Splatting. It addresses the issue of spatio-temporal entanglement by learning separate spatial and temporal probability distributions for each Gaussian. Through a spatio-temporal mask, a decoupled deformation field, and consistency regularization, STDR effectively disentangles motion and structure.
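To make sure I understood the core mechanism, here is a minimal sketch of how such a per-Gaussian temporal probability distribution could modulate opacity. This is entirely my own illustration; the names, shapes, and softmax parameterization are assumptions rather than the paper's exact design:

```python
import torch

# Hypothetical sketch: each of N Gaussians carries K logits over discrete time bins.
# A softmax turns them into a probability distribution over timestamps; static
# Gaussians should converge toward near-uniform, dynamic ones toward sharp peaks.
N, K = 10_000, 32                                   # Gaussians, time bins (assumed)
time_logits = torch.randn(N, K, requires_grad=True)

def temporal_opacity(base_opacity: torch.Tensor, t_idx: int) -> torch.Tensor:
    """Modulate each Gaussian's opacity by its probability of being active at time t_idx."""
    p_t = torch.softmax(time_logits, dim=-1)[:, t_idx]  # (N,)
    return base_opacity * p_t

base_opacity = torch.sigmoid(torch.randn(N))        # per-Gaussian opacity in (0, 1)
opacity_at_t = temporal_opacity(base_opacity, t_idx=5)
```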
1. Decoupling spatio-temporal modeling makes sense, but it lacks novelty: extensive prior approaches already address this issue (e.g., [A]), and many methods for decoupled spatio-temporal modeling are easy to find.
2. The proposed approach is plug-and-play and can be applied to Deformable3DGS, SC-GS, and SP-GS on separate benchmarks.
[A] SDD-4DGS: Static-Dynamic Aware Decoupling in Gaussian Splatting for 4D Scene Reconstruction
1. My first concern is that the paper evaluates the plug-and-play property on only a few approaches, and those baselines are not representative enough of the different lines of work. I suspect this module is not applicable to all previous approaches; could the authors explain which kinds of methods are suitable for STDR?
2. Some work in the 4DGS line also demonstrates a plug-and-play property, e.g., [C]. Could the authors discuss the differences?
3. It is confusing that the proposed technique is not applied to existing spatio-temporal decoupling approaches, e.g., [D].
[C] TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction
[D] HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting
Overall, my major concerns are the novelty and the evaluation of effectiveness.
Fully human-written

STDR: Spatio-Temporal Decoupling for Real-Time Dynamic Scene Rendering
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes STDR, a method to resolve the spatio-temporal incoherence that arises during canonical Gaussian initialization, when multi-frame observations are aggregated without temporal distinction. Specifically, under the per-scene optimization framework, STDR first assigns each Gaussian a vector of temporal activation probabilities. The deformation field is then factorized into spatial and temporal components for disentangled motion learning. Spatio-temporal consistency regularization is added on top to improve spatial and temporal smoothness. The paper conducts extensive experiments on synthetic (D-NeRF) and real-world (NeRF-DS, HyperNeRF) datasets, demonstrating consistent improvements in PSNR, SSIM, and LPIPS. The method is found to reduce ghosting artifacts and enhance temporal alignment.
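For concreteness, this is roughly how I picture the factorized deformation field described above. This is my own sketch; the layer sizes, feature dimensions, and additive fusion are assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a separated (factorized) deformation field: one branch for
# spatial structure, one for temporal dynamics, fused before a shared offset head.
class SeparatedDeformationField(nn.Module):
    def __init__(self, spatial_dim: int = 63, temporal_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.spatial_branch = nn.Sequential(
            nn.Linear(spatial_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.temporal_branch = nn.Sequential(
            nn.Linear(temporal_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.head = nn.Linear(hidden, 3)  # predicts a 3D position offset per Gaussian

    def forward(self, spatial_feat: torch.Tensor, temporal_feat: torch.Tensor) -> torch.Tensor:
        h = self.spatial_branch(spatial_feat) + self.temporal_branch(temporal_feat)
        return self.head(torch.relu(h))

field = SeparatedDeformationField()
dxyz = field(torch.randn(10_000, 63), torch.randn(10_000, 32))  # per-Gaussian offsets
```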
- The paper is well-written and the proposed method is clearly explained.
- The proposed method is a plug-and-play module that can be easily integrated. When applied to different baseline methods (such as Deformable3DGS, SC-GS, and SP-GS), it shows compatibility and effectiveness, consistently improving performance.
- The novelty of the proposed spatio-temporal probability distribution is somewhat limited. In related works such as 4DGS, the opacity of the Gaussians is also modulated along the temporal dimension, with the standard deviation reflecting the persistence of the Gaussian. The proposed method seems to be a discretized version, in which a vector whose length is proportional to the number of frames must be introduced for every single Gaussian.
- Rendering efficiency is not reported. Introducing the spatio-temporal probability distribution incurs additional overhead, since the Gaussian opacities must be recomputed for each new timestamp. It could also lead to a potentially higher number of Gaussians, since a Gaussian could simply be turned opaque at the timestamp where it is observed, instead of being deformed from a canonical space. It would be interesting to visualize the 4D tracks of subsets of Gaussians.
- There are several typos in the paper. For example, in Eq. (1), the definition of the 3D Gaussian should use $X-\mathcal{X}$ instead of just $\mathcal{X}$ (the standard form is written out after this list). Another example is on L231: "sptio" -> "spatio".
- It would be interesting to demonstrate the canonical space built by the model. If part of the geometry is not visible from one specific timestamp, will that part be visible in the final reconstruction by warping from other frames?
- What is the reconstruction and final rendering efficiency in comparison to other baseline methods?
- Will the method be able to reconstruct proper geometry on the real scene datasets? This could be better illustrated by visualizing the depth maps reconstructed from the NeRF-DS dataset or in an interactive viewer.
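For reference, this is the standard 3DGS Gaussian that Eq. (1) should match, as flagged in the list above, with $\mathcal{X}$ the Gaussian center and $\Sigma$ its covariance (following the original 3DGS formulation):

$$G(X) = \exp\!\left(-\frac{1}{2}\,(X - \mathcal{X})^{\top} \Sigma^{-1} (X - \mathcal{X})\right)$$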
Fully human-written

STDR: Spatio-Temporal Decoupling for Real-Time Dynamic Scene Rendering
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper tackles the `spatio-temporal incoherence` that arises when dynamic scenes are initialized as if all frames share one timestamp, yielding ghosted, temporally mixed Gaussians that confuse downstream deformation learning. The authors propose **STDR**, a plug-and-play spatio-temporal decoupling module built for 3D Gaussian pipelines: (i) a learnable spatio-temporal mask that evolves into a probability distribution over timestamps, (ii) a Separated Deformation Field that factorizes spatial structure and temporal dynamics using those distributions, and (iii) a warm-start consistency regularization to stabilize training and encourage smooth, coherent motion. Experiments on dynamic-scene benchmarks report consistent reconstruction gains (with reasonable FPS) and demonstrate that the learned masks are interpretable (uniform for static regions, sharp activations for dynamic ones).
- This paper is well-written and easy to understand.
- Introducing the Spatio-Temporal Mask is particularly elegant. It simultaneously characterizes the Gaussians’ persistence and captures spatial smoothness, a very sensible design choice.
- By exploiting the mask’s spatial and temporal cues, the Separated Deformation Field offers a neat decoupling that, in theory, enables the network to learn a cleaner flow.
- The empirical results are compelling. Owing to its plug-and-play design, the method is compatible with essentially all deformation-field–based frameworks. Quantitative evaluations exhibit consistent improvements, and the visualizations align with these improvements.
- Missing references for some important dynamic reconstruction works:
- [CVPR 2024] Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis, by Zhan Li et al.
- [CVPR 2025] FreeTimeGS: Free Gaussian Primitives at Anytime and Anywhere for Dynamic Scene Reconstruction, by Yifan Wang et al.
- Missing comparison on FPS and primitive counts. It would help to include a direct comparison of the number of Gaussian primitives and the corresponding FPS to clarify the runtime trade-offs.
- Given the central role of the spatio-temporal mask, I strongly recommend adding visualizations to demonstrate effectiveness. Specifically, to show that the learned mask aligns with static vs. dynamic regions and reflects spatial smoothness.
- **Fixed temporal resolution.** The method relies on a $K$-dimensional time distribution per Gaussian, which may over-smooth or underfit scenes with highly non-uniform motion or missing frames. The authors could try to explore adaptive/continuous-time parameterizations (a sketch of one such alternative is given after this list) and report sensitivity to $K$.
- The paper would benefit from a direct comparison between the proposed mask mechanism and FreeTimeGS's opacity-persistence scheme. It is an instructive contrast in my view, with the mask offering a notably more elegant formulation.
- Minor typos:
- L144-145: `is the color and the blending weight ` -> `are the color and the blending weight`
- L191-192: `Gaussianss` -> `Gaussians`
- L231-232: `sptio-temporal mask` -> `spatio-temporal mask`
- Figure 2: `Orginial Deformation Field` -> `Original Deformation Field`
- L321-315: `spation-temporal mask` -> `spatio-temporal mask`
- L136: `render the influence of xxx` -> `renders the influence of xxx`
- L498-499: `Section 4and Appendix A.` -> Add space between `4` and `and`
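For the two comparison points above, the continuous-time alternative used by Spacetime Gaussians / FreeTimeGS-style methods can be summarized as a temporal radial-basis opacity (a sketch of the general form, not those papers' exact equations):

$$\sigma_i(t) = \sigma_i \,\exp\!\left(-\frac{(t - \mu_i^{\tau})^2}{2\,(s_i^{\tau})^2}\right),$$

where $\mu_i^{\tau}$ is Gaussian $i$'s temporal center and $s_i^{\tau}$ its persistence; the temporal resolution then no longer depends on a fixed $K$.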
1. What exact values do you use for $M$ and $K$? Are they fixed across all datasets/scenes, or tuned per dataset? It would be much better if the authors could further provide a **sensitivity analysis** for $M$ and $K$.
2. Why is **SP-GS** used as the baseline on NeRF-DS instead of **Deformable-GS** and **SC-GS**, which appear elsewhere in the paper? I guess the choice is motivated by performance/training-time/FPS considerations?
Lightly AI-edited

STDR: Spatio-Temporal Decoupling for Real-Time Dynamic Scene Rendering
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper extends deformable 3D Gaussian Splatting (3DGS) by introducing an additional temporal opacity mechanism to improve temporal modeling in dynamic scene reconstruction. The proposed STDR module combines a spatio-temporal mask, separated deformation field, and consistency regularization to explicitly decouple spatial and temporal components during training. The method achieves better quantitative and qualitative results than several baselines, including SP-GS, SC-GS, and DeformGS.
1. Performance Gains: The method improves reconstruction quality across both synthetic and real-world datasets, outperforming reported baselines such as SP-GS and DeformGS.
2. Integration: The module is **plug-and-play** and compatible with multiple existing 3DGS pipelines.
1. Inaccurate or Overgeneralized Claims
* The statement *“existing 3DGS-based methods typically adopt a two-stage pipeline”* is not universally true. Several recent dynamic Gaussian methods (e.g., 4DGS) employ different initialization strategies.
* The claim that the proposed masks *“reflect the true dynamics of the scene”* is too strong: if the temporal sampling rate of the training data is limited, this convergence cannot guarantee ground-truth dynamic fidelity.
* The assertion of being the "first to introduce a spatio-temporal mask" may be overstated. The proposed mask effectively functions as a temporal modulation of Gaussian opacity, conceptually similar to prior opacity weighting or temporal blending in 4DGS variants.
2. Dataset Limitations and Overinterpretation:
* The synthetic scenes used (e.g., D-NeRF) consist of simple deformations or rigid motions defined in Blender, which are easily modeled by MLPs. Improvements on these datasets do not necessarily validate claims of “true dynamics modeling.”
1. Real-world Extrapolation Results:
Are the real-world evaluations performed on interpolated camera trajectories? If so, please include extrapolation tests to validate that the model generalizes beyond the temporal training window, given the claim of capturing true scene dynamics.
2. Runtime & Scalability Comparisons:
Report the training and inference time compared to SP-GS and DeformGS.
Clarify whether scaling up existing baselines (e.g., higher-capacity deformation fields) could achieve similar quality without additional temporal masking.
3. Real-time Demonstration:
If available, demonstrate integration into a GUI or visualization tool, rather than offline inference.
Fully AI-generated |