ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper addresses the self-supervised depth completion task by introducing an auxiliary objective: completing occluded regions of the scene. This auxiliary task serves as a strong inductive bias to guide the learning process for depth completion. Experimental results on the VOID1500 and NYUv2 datasets demonstrate that the proposed approach achieves superior performance compared to previous methods.
- The paper is clearly written and well organized.
- The proposed method is sound.
- The proposed method demonstrates superior performance compared to existing approaches on indoor datasets.
1. The main text includes comparisons only on two indoor datasets. Although KITTI Depth Completion results are reported in the supplementary material, the comparison involves only a limited number of competing methods. Moreover, the performance on the KITTI DC dataset appears inferior to several previous approaches, such as DesNet. It is recommended to provide a more detailed analysis of the results on outdoor datasets to better demonstrate the effectiveness and robustness of the proposed method.
1. How is the relative camera pose obtained? Is it predicted by a network or derived from ground-truth camera poses?
2. Besides occluded regions, there are areas that do not overlap between two frames. Would these non-overlapping regions affect the depth completion learning process?
3. The difficulty of scene completion is related to the time interval between frames, as a larger interval typically results in more occluded regions. How do you determine an appropriate frame interval to best assist depth completion learning?
4. Could you provide an analysis of how the number of planes used for the MPI representation affects overall performance? (A sketch of the discretization we have in mind is given below.)
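For reference, a minimal sketch of the MPI-style depth discretization we have in mind when asking about the plane count; the inverse-depth spacing and the depth bounds are our assumptions, not taken from the paper:

```python
import torch

def mpi_depth_planes(num_planes: int, d_min: float = 0.1, d_max: float = 10.0) -> torch.Tensor:
    """Place num_planes fronto-parallel planes uniformly in inverse depth
    (disparity), allocating more planes to nearby structure. The bounds
    d_min/d_max are illustrative indoor-scene values, not the paper's."""
    inv_depths = torch.linspace(1.0 / d_min, 1.0 / d_max, num_planes)
    return 1.0 / inv_depths  # depths sorted from near to far

# An ablation could simply sweep the plane count:
for n in (8, 16, 32, 64):
    print(n, mpi_depth_planes(n)[:3])
```

Sweeping this single parameter against MAE/RMSE on VOID1500 would answer the question directly.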
Lightly AI-edited
---
ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes an unsupervised depth completion method that augments depth estimation from the target view by introducing additional supervision from the source view. The main assumption is that by using features from the target view to estimate the depth of the source view, the target features can learn to model unseen structures, thereby regularizing the shape of visible structures. Experiments demonstrate notable improvements over previous methods on the VOID1500 and NYUv2 datasets.
1. The paper is mostly clearly written and includes proper illustrations.
2. Although reconstructing occluded 3D geometry is not a new concept, applying this idea to unsupervised depth completion is novel and interesting.
3. The proposed model achieves favorable improvements over existing methods.
1. Rationale of unseen geometry learning:
The rationale for using improved occluded geometry to enhance visible geometry is not clearly validated. While the authors claim effectiveness in Lines 60–64, the argument remains conceptual without concrete evidence. Since the model itself does not explicitly learn a “3D shape” of objects (L61), it is unclear whether it truly reduces reliance on input point density. Moreover, although the method claims that learning unseen geometry helps improve visible geometry, there are no quantitative or qualitative comparisons in the unseen regions.
2. Lack of ablation experiments:
- (a) Depth vs. feature supervision: It is unclear why the authors use feature-based supervision instead of depth-based supervision for adjacent views. Depth supervision would be a more direct and intuitive approach and would enable quantitative comparisons in occluded regions to justify the design choice. If depth supervision performs poorly, an explanation should be provided.
- (b) Computational analysis: The authors mention in L403 that the base model is KBNet with a transformer block. The added transformer head appears to contribute a significant performance gain (MAE: 39.8 → 35.3). This raises the question of whether the improvement is partly due to increased model capacity. A comprehensive comparison of #parameters, GFLOPs, and GPU memory usage among the proposed method, KBNet, and AugUndo is necessary.
- (c) ConteXt module ablation: Although the authors ablate 2D and 3D representations, they do not ablate the ConteXt module under the 3D representation, nor do they analyze the effect of its hyperparameters $(k_u, k_v, k_w)$. Since this module essentially performs feature pooling, it is important to evaluate how much it contributes to the final performance (a minimal pooling sketch is given after the reference list below).
3. Unclear writing and typos:
- (a) In L241, the authors introduce $\bar{d}$ but do not explain how $\bar{X}$ is derived from $\bar{d}$.
- (b) In L409–L411, the sentence “(Row 5) This is worse than the proposed 3D warping without ORCaS loss (Row 3)” is inconsistent with the reported results, as Row 5 actually performs better than Row 3.
4. Missing related work on occluded scene reconstruction:
The following works should be cited and discussed for completeness:
[1] *Peeking Behind Objects: Layered Depth Prediction from a Single Image*
[2] *Layer-Structured 3D Scene Inference via View Synthesis*
[3] *Behind the Scenes: Density Fields for Single-View Reconstruction*
[4] *Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning*
[5] *Directed Ray Distance Functions (DRDF) for 3D Scene Reconstruction*
[6] *X-Ray: A Sequential 3D Representation for Generation*
[7] *LaRI: Layered Ray Intersections for Single-View 3D Geometric Reasoning*
[8] *RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion*
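To make point 2(c) above concrete, here is a minimal sketch of the kind of masked 3D neighborhood pooling we understand the ConteXt module to perform; the pooling operator, tensor layout, and emptiness mask are our assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def context_pool(feat: torch.Tensor, empty: torch.Tensor,
                 k: tuple = (3, 3, 5)) -> torch.Tensor:
    """feat: (B, C, U, V, W) voxel features; empty: (B, 1, U, V, W) binary
    mask of unfilled voxels. Fills empty voxels with a (k_u, k_v, k_w)
    masked neighborhood average of observed features; filled voxels pass
    through unchanged."""
    pad = tuple(s // 2 for s in k)
    obs = 1.0 - empty
    # Masked averaging: sum of observed features / count of observed voxels.
    num = F.avg_pool3d(feat * obs, k, stride=1, padding=pad)
    den = F.avg_pool3d(obs, k, stride=1, padding=pad).clamp(min=1e-6)
    pooled = num / den
    return torch.where(empty.bool(), pooled, feat)
```

Ablating this against a variant that leaves empty voxels untouched, and sweeping $(k_u, k_v, k_w)$, would quantify the module's contribution.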
The following experiments and analyses are recommended for the revised version:
1. Replace feature-based supervision with depth-based supervision for adjacent views in the loss function, and analyze whether the predicted depths beyond the visible regions of the target view improve in the source view (a sketch of what we mean is given after this list).
2. Under the 3D representation, ablate the ConteXt module and its hyperparameters $(k_u, k_v, k_w)$.
3. Provide a computational comparison (including #parameters, GFLOPs, and GPU memory) among the proposed method, KBNet, and AugUndo.
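Regarding point 1, the depth-based supervision we have in mind could look like the following; the geometry is standard inverse warping, but the function name and the omission of occlusion handling are ours, not the authors':

```python
import torch
import torch.nn.functional as F

def depth_consistency_loss(depth_t, depth_s, K, T_ts):
    """depth_t, depth_s: (B, 1, H, W) predicted target/source depths.
    K: (B, 3, 3) intrinsics; T_ts: (B, 4, 4) target-to-source pose.
    Back-projects target pixels, moves them into the source frame, and
    penalizes disagreement between the transformed z and the source
    prediction sampled at the reprojected pixel. Occlusion handling
    (e.g. a z-buffer test) is deliberately omitted for brevity."""
    B, _, H, W = depth_t.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], 0).float()      # (3, H, W)
    pix = pix.reshape(1, 3, -1).expand(B, -1, -1)                 # (B, 3, HW)
    cam = torch.linalg.inv(K) @ pix * depth_t.reshape(B, 1, -1)   # (B, 3, HW)
    cam = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)    # (B, 4, HW)
    src = (T_ts @ cam)[:, :3]                                     # (B, 3, HW)
    z = src[:, 2:3].clamp(min=1e-3)
    proj = K @ (src / z)
    gx = proj[:, 0] / (W - 1) * 2 - 1                             # normalize to [-1, 1]
    gy = proj[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).reshape(B, H, W, 2)
    sampled = F.grid_sample(depth_s, grid, align_corners=True)
    valid = ((gx.abs() <= 1) & (gy.abs() <= 1)).float()
    return ((sampled.reshape(B, -1) - z.reshape(B, -1)).abs() * valid).mean()
```

Evaluating this term (or the resulting MAE) separately on pixels visible only in the source view would directly test whether occluded-region predictions improve.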
Lightly AI-edited
---
ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces a novel unsupervised framework that learns dense depth estimation from an RGB image and a sparse point cloud by explicitly reasoning about occluded 3D regions. Rather than relying solely on photometric reconstruction of co-visible areas, the paper proposes to learn an inductive 3D bias through the auxiliary task of occluded region completion. ORCaS, the proposed method, first encodes RGB and sparse depth into 2D features, broadcasts them into a discretized 3D volume, and rigidly warps this volume to an adjacent view using the relative pose. The ConteXt block then fills in the empty voxels corresponding to occluded regions using nearby 3D context and positional embeddings, while a new ORCaS loss enforces consistency between predicted and real adjacent-view features. This occlusion-aware training significantly improves depth predictions in an unsupervised setting. Extensive experiments on VOID1500, NYUv2, and ScanNet show that ORCaS achieves state-of-the-art performance, outperforming previous unsupervised methods by up to 8.9% on average, while maintaining real-time inference speed and demonstrating strong robustness to domain shifts, calibration noise, and extremely sparse depth inputs.
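As we understand the rigid-warping step described above, it amounts to something like the following minimal sketch; the axis-aligned grid, its bounds, and the trilinear resampling convention are our assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def warp_volume(vol, T_rel, grid_xyz):
    """vol: (B, C, U, V, W) feature volume in the target frame.
    T_rel: (B, 4, 4) rigid target-to-source pose.
    grid_xyz: (B, U, V, W, 3) metric coordinates of the voxel centers,
    assumed axis-aligned and consistent with grid_sample's (x, y, z) order.
    Returns the volume resampled in the source frame; voxels that land
    outside the grid come back zero, i.e. the 'empty' regions ORCaS fills."""
    B, C, U, V, W = vol.shape
    ones = torch.ones_like(grid_xyz[..., :1])
    homo = torch.cat([grid_xyz, ones], dim=-1).reshape(B, -1, 4)   # (B, UVW, 4)
    # Inverse warp: pull each source-frame voxel center back into the target frame.
    back = (torch.linalg.inv(T_rel) @ homo.transpose(1, 2)).transpose(1, 2)[..., :3]
    lo = grid_xyz.amin(dim=(1, 2, 3), keepdim=True)                # grid bounds
    hi = grid_xyz.amax(dim=(1, 2, 3), keepdim=True)
    norm = (back.reshape(B, U, V, W, 3) - lo) / (hi - lo) * 2 - 1  # to [-1, 1]
    return F.grid_sample(vol, norm, align_corners=True)            # trilinear
```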
ORCaS introduces a simple yet novel idea, using occluded region completion as an auxiliary supervision signal for unsupervised depth completion.
This reframes depth completion from a purely visible-surface interpolation problem into a 3D reasoning task that requires understanding unseen geometry.
By leveraging occlusion as supervision, the method naturally learns a strong inductive bias that encourages consistent 3D representations.
This conceptual clarity and originality make the paper both theoretically appealing and practically impactful.
Across multiple benchmarks (VOID1500, NYUv2, ScanNet), the method consistently achieves state-of-the-art performance, outperforming previous unsupervised methods by up to 8.9% on average.
The proposed method learns latent features that encode the 3D shape regularities of indoor scenes, independent of texture or lighting. Even though the model is not directly trained for domain transfer, this implicit shape prior helps it perform well in zero-shot transfer and sparse-input settings.
The authors demonstrate strong robustness to variations in calibration, scene dynamics, and input sparsity.
The method maintains stable performance even under ±30% synthetic calibration noise and when trained under a static-scene assumption in dynamic environments.
Major weaknesses are as follows:
Most experiments focus on indoor or small-scale environments (VOID1500, NYUv2, ScanNet).
The KITTI Depth Completion results are included only in the appendix, where the improvement over prior work is relatively small (≈3%). This suggests that the learned occlusion-based bias may generalize less effectively to outdoor, long-range, or high-depth-variance settings.
A broader evaluation would be necessary to confirm the scalability of the approach beyond indoor domains.
Although the method is built around the idea of learning from occluded-region completion, the qualitative results do not visually emphasize or analyze regions where occlusion is likely to occur. Figures 2 and 3 mainly show overall depth predictions for relatively frontal or fully visible areas, rather than viewpoints where depth discontinuities, inter-object occlusions, or self-occlusions are pronounced. Without explicitly highlighting or comparing these regions, it is difficult to tell whether the proposed occlusion reasoning truly contributes to the improved depth quality.
Minor comments are as follows:
In the ablation section, the text around Table 2 incorrectly describes the relative performance between Row 3 and Row 5: the numbers show that Row 5 performs better, but the text argues the opposite.
The paper mentions that training is performed “in an alternating fashion” in L91-92, but provides no further explanation or details about what this process entails. There is no description of how the alternation is implemented, what modules are updated in each phase, or why this strategy is necessary.
A comparison with the baseline model, KBNet, is not presented in Table 6. Moreover, on the KITTI benchmark, the performance gap between the proposed method and KBNet is very marginal.
Could you elaborate on why the proposed occlusion-completion supervision may generalize less effectively to outdoor environments?
Have you tested the method on any additional large-scale or high-depth-variance datasets to evaluate scalability beyond indoor domains?
Please explain how the alternating training process is scheduled (per batch, per epoch, or per iteration), which parameters are frozen in each phase, and why this two-step optimization was preferred over joint training (a hypothetical schedule is sketched after these questions).
Could you provide visualizations or case studies focusing specifically on occluded or partially visible areas? How can we confirm that the learned ConteXt block completes unseen regions rather than merely smoothing co-visible surfaces?
The current setup uses only two adjacent frames for occlusion-aware supervision. Have you explored extending ORCaS to longer temporal windows or multiple adjacent views? Incorporating multi-frame context might improve occlusion stability and reduce dependence on single-pose accuracy. Do you expect the current ConteXt block or ORCaS loss to generalize naturally to that setting?
It would be interesting to know whether ORCaS could serve as a pretraining stage for other 3D perception tasks such as monocular depth estimation or scene flow. Do you believe the learned occlusion-aware features transfer effectively to other geometry-related tasks?
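On the alternating-training question above, one hypothetical schedule, with module names, losses, and per-iteration granularity all our own guesses rather than anything stated in the paper:

```python
import torch

def set_requires_grad(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad_(flag)

# Stand-in modules and losses, purely to make the schedule concrete.
depth_net, orcas_head = torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(
    [*depth_net.parameters(), *orcas_head.parameters()], lr=1e-4)

for step in range(4):                        # per-iteration alternation (a guess)
    x = torch.randn(2, 8)                    # stands in for a training batch
    if step % 2 == 0:
        # Phase A: update the depth branch; freeze the completion head.
        set_requires_grad(depth_net, True); set_requires_grad(orcas_head, False)
        loss = depth_net(x).pow(2).mean()    # stands in for photometric + sparse losses
    else:
        # Phase B: update the completion head; freeze the depth branch.
        set_requires_grad(depth_net, False); set_requires_grad(orcas_head, True)
        loss = orcas_head(x).pow(2).mean()   # stands in for the ORCaS loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Clarifying which of these choices the paper actually makes would help reproducibility.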
Fully AI-generated
---
ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper presents ORCaS, a new unsupervised depth completion method. The core idea is to treat occluded regions as a source of self-supervision to learn a stronger, 3D-aware inductive bias for reconstructing dense depth maps from sparse depth inputs and RGB images. Concretely, ORCaS broadcasts 2D features into a 3D voxel grid, performs rigid 3D warping using relative poses, predicts “empty” regions in adjacent views, employs a ConteXt block to extrapolate local contextual features, and introduces a new ORCaS loss that learns inductive priors from these occluded regions. Experiments on VOID1500, NYUv2, and KITTI demonstrate state-of-the-art performance.
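As we read the pipeline, the broadcasting step amounts to something like the following minimal sketch; the grid size and the plain replication rule are our assumptions, not verified against the authors' code:

```python
import torch

def broadcast_to_volume(feat2d: torch.Tensor, num_depth_bins: int) -> torch.Tensor:
    """feat2d: (B, C, H, W) image-plane features. Replicates each pixel's
    feature along a discretized depth axis, yielding a (B, C, D, H, W)
    volume on which rigid 3D warping can operate. Whether ORCaS additionally
    modulates features per depth bin is not something we verified."""
    B, C, H, W = feat2d.shape
    return feat2d.unsqueeze(2).expand(B, C, num_depth_bins, H, W).contiguous()
```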
1. Clear motivation. ORCaS introduces the novel concept of occluded region completion as a supervision signal for unsupervised depth learning. By explicitly predicting unseen regions, the method enforces the model to learn a 3D-structure-aware inductive bias that goes beyond traditional visible-region reconstruction.
2. Well-structured design. The architecture is modular, interpretable, and easily integrable with existing unsupervised depth completion frameworks. It can serve as a plug-and-play component for similar tasks.
3. Comprehensive validation. Extensive experiments across VOID1500, NYUv2, and KITTI datasets demonstrate consistent and significant performance gains. Ablation studies and transfer experiments further support the effectiveness of each design choice.
4. Strong generalization. By learning to predict occluded regions, ORCaS acquires a geometry-aware prior that improves cross-dataset transfer and remains robust even with extremely sparse depth.
1. Limited theoretical explanation of ORCaS loss. While the paper empirically demonstrates the effectiveness of occlusion-based supervision, it lacks a deeper theoretical analysis explaining why predicting unobserved regions improves the learned representation for visible depth estimation.
2. Dependency on accurate camera calibration. The method relies on precise camera intrinsics and relative poses. Although this limitation is acknowledged, the paper does not include ablation or robustness studies to quantify sensitivity to calibration noise (a perturbation sketch is given after the reference list below).
3. Outdated related work. The literature review mainly covers works up to 2023. It is recommended to expand this section to include up-to-date publications, such as:
[1] Distilling Monocular Foundation Model for Fine-grained Depth Completion. CVPR 2025.
[2] Completion as Enhancement: A Degradation-Aware Selective Image Guided Network for Depth Completion. CVPR 2025.
[3] OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration. ICCV 2025.
[4] PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency. ICCV 2025.
[5] Tri-Perspective View Decomposition for Geometry-Aware Depth Completion. CVPR 2024.
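To make weakness 2 actionable, a minimal sketch of the intrinsics perturbation we would like to see an ablation over; the multiplicative noise model and sweep values are our suggestion, not the paper's protocol:

```python
import torch

def perturb_intrinsics(K: torch.Tensor, max_rel_noise: float = 0.3) -> torch.Tensor:
    """K: (B, 3, 3) pinhole intrinsics. Scales fx, fy, cx, cy by independent
    uniform factors in [1 - r, 1 + r]; sweeping r over e.g.
    {0.05, 0.1, 0.2, 0.3} and reporting MAE/RMSE would quantify robustness."""
    K = K.clone()
    noise = 1.0 + (torch.rand(K.shape[0], 4, device=K.device) * 2 - 1) * max_rel_noise
    K[:, 0, 0] *= noise[:, 0]  # fx
    K[:, 1, 1] *= noise[:, 1]  # fy
    K[:, 0, 2] *= noise[:, 2]  # cx
    K[:, 1, 2] *= noise[:, 3]  # cy
    return K
```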
I am willing to increase the rating if these weaknesses are addressed in the rebuttal stage, thanks.
Fully AI-generated |