HOIGS: Human-Object Interaction Gaussian Splatting from Monocular Videos
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
The paper proposes HOIGS, a 3D Gaussian Splatting framework that explicitly models human–object interactions from monocular video. The contributions are twofold: (i) dual deformation baselines—HexPlane + LBS for humans and Cubic Hermite Splines (CHS) for moving objects; and (ii) an HOI module with mutual cross-attention and a distance mask that fuses human/object motion features to produce interaction-aware refinements (∆SMPL-X and object-Gaussian offsets). For evaluation, the method is compared against human-centric and 4DGS baselines on three datasets—HOSNeRF, BEHAVE (single-view adaptation), and ARCTIC (selected full-body cases)—and ablations demonstrate gains from CHS and the HOI module. The reported training time is ~5 hours per scene on a single H100.
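For readers less familiar with the object motion model, a minimal sketch of keyframe-based cubic Hermite interpolation of Gaussian centers is given below. The tensor shapes, the learnable velocity tangents, and the normalized-time convention are illustrative assumptions and need not match the paper's exact parameterization.

```python
import torch

def hermite_interpolate(p0, p1, v0, v1, t):
    """Cubic Hermite interpolation of object Gaussian centers between two keyframes.

    p0, p1: (N, 3) Gaussian centers at the surrounding keyframes
    v0, v1: (N, 3) velocity tangents at those keyframes (learnable)
    t:      scalar in [0, 1], normalized time within the keyframe interval
    """
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1   # Hermite basis for p0
    h10 = t3 - 2 * t2 + t       # basis for v0
    h01 = -2 * t3 + 3 * t2      # basis for p1
    h11 = t3 - t2               # basis for v1
    return h00 * p0 + h10 * v0 + h01 * p1 + h11 * v1

# Usage: centers of N object Gaussians 40% of the way between keyframes k and k+1.
N = 10_000
keyframe_pos = torch.nn.Parameter(torch.randn(2, N, 3))
keyframe_vel = torch.nn.Parameter(torch.zeros(2, N, 3))
centers = hermite_interpolate(keyframe_pos[0], keyframe_pos[1],
                              keyframe_vel[0], keyframe_vel[1], t=0.4)
```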
Strengths:
1. Explicit HOI modeling. The mutual-attention HOI module with a distance mask is a principled way to encode interaction-driven deformations and improves stability near contact/manipulation.
2. Clear architecture-to-loss mapping. The paper specifies losses for human/scene/object (with weights γ, β, σ) and includes depth supervision to constrain human-geometry scale.
3. Consistent empirical gains. Stronger PSNR/LPIPS than human-centric and 4DGS baselines on HOSNeRF, with similar improvements on BEHAVE and ARCTIC; ablations support the contributions of CHS, time-varying human features, and the HOI module.
Weaknesses:
1. Diffusion-prior ablation. The paper lacks a quantitative study of how the object diffusion prior affects reconstruction quality; adding one would strengthen the claims.
2. Metric breadth (interaction & geometry). Evaluation is limited to PSNR/LPIPS. There are no metrics for contact quality, penetration, temporal consistency, or pose–object alignment, which are central to HOI; geometry-oriented metrics are also absent. Consider adding penetration/contact measures, temporal SSIM, keypoint/object-distance errors, and basic geometry metrics (e.g., silhouette IoU, depth error, Chamfer/F-score when available); a minimal sketch of such contact statistics is given after this list.
3. Use of segmentation masks at evaluation time? The paper states that pre-trained human/object segmentation masks are used to form training losses (scene/object). Are masks also used at test/evaluation (e.g., masked PSNR/LPIPS)? If yes, are they dataset-provided or predicted (please specify the model/training data/thresholds and release code)? If not, please clarify that evaluation is full-image without masks.
4. Use of SMPL-X at test time. From the text it seems per-frame SMPL-X is initialized by a regressor and then optimized during training (ExAvatar-style), and inference reuses these optimized poses for rendering. Could you confirm no additional SMPL-X regression is run at evaluation time, and specify which regressor/versions are used for the initial fits?
5. Camera-pose fragility (acknowledged). The paper notes failures under minimal camera motion due to COLMAP instability; joint pose optimization is suggested but not attempted. Why not evaluate on dense-view HOI datasets (e.g., NeuralDome)? A single-view-train, multi-view-eval protocol on a dense dataset (train on one camera, evaluate on held-out views) would (i) keep the monocular-training assumption, (ii) stress-test view generalization and geometry, and (iii) mitigate the COLMAP fragility you acknowledge. If infeasible, please clarify the constraints (e.g., licensing, preprocessing, pipeline mismatch).
6. Empty/expired anonymous website link. The provided link loads an empty page. For a rendering- and video-based task, it’s difficult to assess result quality without videos—please restore the link and include representative clips.
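To make the metric suggestion in point 2 concrete, here is a minimal sketch of per-frame contact statistics computed from point sets. Tensor names are placeholders, and a true penetration measure would require signed (inside/outside) distances against a watertight mesh rather than this nearest-neighbor proxy.

```python
import torch

def contact_stats(human_verts, object_pts, contact_thresh=0.01):
    """Crude interaction metrics per frame.

    human_verts: (T, V, 3) posed human mesh vertices
    object_pts:  (T, N, 3) object Gaussian centers or sampled surface points
    Returns the per-frame minimum human-object distance and the fraction of
    object points closer than `contact_thresh` (a rough contact proxy).
    """
    dists = torch.cdist(object_pts, human_verts)                     # (T, N, V)
    nearest = dists.min(dim=-1).values                               # (T, N)
    min_dist = nearest.min(dim=-1).values                            # (T,)
    contact_frac = (nearest < contact_thresh).float().mean(dim=-1)   # (T,)
    return min_dist, contact_frac
```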
Questions:
1. Major questions: Please see the items listed under Weaknesses / Concerns above.
2. Minor questions:
a. Line 188: What exactly is the SDS loss here (objective, weighting, implementation details)? The standard formulation is reproduced after this list for reference.
b. Line 194: What is the dimensionality of G_k?
c. Line 238: Please specify the dimensionality of \theta. More generally, several formulas/variables are missing dimension annotations—could you add them for completeness?
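For reference (question 2a above), the standard DreamFusion-style score-distillation gradient, which the submission may or may not follow exactly, is

$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\, \frac{\partial x}{\partial \theta} \right], \qquad x = g(\theta),$$

where $g(\theta)$ is the rendered image, $\hat{\epsilon}_\phi$ is the diffusion model's noise prediction conditioned on $y$, and $w(t)$ is a timestep weighting. Stating the guidance scale, timestep range, and weighting used would answer the question.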
Lightly AI-edited

HOIGS: Human-Object Interaction Gaussian Splatting from Monocular Videos
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
The paper introduces HOIGS for monocular human–object interaction. Humans are modeled with hexplane + LBS, while object motion follows CHS, coupled with a cross-attention HOI module. Objects are initialized from a representative frame using a diffusion prior with SDS, then warped to keyframes. The human branch uses SMPL-X with a COLMAP-scaled depth term. The HOI module employs a distance mask B to exchange cues between humans and objects.
Strengths:
1. Experiments are extensive across several datasets, with solid qualitative results.
2. Using an explicit human model with SMPL-X and an interaction module is intuitive.
Weaknesses:
1. Figure 2 is under-explained. The role of segmentation and the process by which diffusion + SDS yields 3D (and “warped”) Gaussians need clear, step-by-step exposition.
2. The method relies heavily on priors (diffusion model, depth estimation, segmentation). This may affect the fairness of comparisons to baselines if these priors are not controlled or ablated.
3. Training time is reported (H100, ~5 hours/scene), but there is no runtime or training-time comparison against baselines.
Questions:
See the weaknesses above.
Lightly AI-edited

HOIGS: Human-Object Interaction Gaussian Splatting from Monocular Videos
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
This paper addresses the challenging and highly relevant problem of reconstructing dynamic scenes involving complex human-object interactions (HOI) from monocular video. The authors correctly identify a critical gap in the existing literature, which is largely bifurcated into two distinct approaches: (1) human-centric 3D Gaussian Splatting (3DGS) methods that achieve high-fidelity human avatar reconstruction but largely ignore or fail to model dynamic objects, and (2) general-purpose 4DGS methods that attempt to model all moving entities with a single, unified motion field, often resulting in visual artifacts and an inability to capture the nuanced dynamics of physical contact and manipulation.

To bridge this gap, the paper introduces Human-Object Interaction Gaussian Splatting (HOIGS), a framework designed to explicitly model the interplay between humans and objects. The core of the proposed method is a decomposition of the problem: it employs heterogeneous deformation models tailored to the distinct characteristics of each entity. Human motion is represented using a hexplane-based canonical avatar, which is deformed into the posed world space via Linear Blend Skinning (LBS) guided by pre-estimated SMPL-X parameters. In contrast, object motion is modeled using an explicit, trajectory-based Cubic Hermite Spline (CHS) that interpolates the positions and velocities of keyframe Gaussians over time.

The central claimed innovation of HOIGS is a dedicated HOI module that reconciles these two independent motion models. This module leverages a mutual cross-attention mechanism to capture the bidirectional dependencies between human and object features. By processing time-varying features extracted from 16 distinct human body parts and the object's velocity embeddings, the module predicts fine-grained corrective offsets for both the human pose (ΔSMPL-X) and the object's Gaussian positions (ΔG_object). This explicit modeling of interaction allows the framework to enforce motion consistency and physical plausibility in regions of close contact.

The efficacy of HOIGS is demonstrated through extensive experiments on three benchmarks: HOSNeRF, BEHAVE, and ARCTIC. The quantitative results show that the proposed method outperforms a wide array of state-of-the-art baselines, including both specialized human-centric models and general 4D scene reconstruction techniques, across standard image-based metrics (PSNR, LPIPS). The numbers are supported by qualitative comparisons and an ablation study.
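To make the interaction module concrete for other readers, the following is a reviewer-side sketch of what a distance-masked mutual cross-attention block of this kind could look like. The token dimensions, the contact radius, the mask construction, and the output heads are illustrative assumptions, not the authors' implementation (which is specified by the paper's own equations).

```python
import torch
import torch.nn as nn

class MutualHOIAttention(nn.Module):
    """Illustrative mutual cross-attention between 16 human part tokens and
    per-Gaussian object tokens, gated by a distance-derived attention mask."""

    def __init__(self, dim=32, n_heads=4, pose_dim=63):
        super().__init__()
        self.n_heads = n_heads
        self.obj_to_human = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.human_to_obj = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.delta_pose = nn.Linear(16 * dim, pose_dim)  # e.g. 21 body joints x 3 axis-angle
        self.delta_xyz = nn.Linear(dim, 3)               # per-Gaussian position offset

    def forward(self, human_tok, obj_tok, part_centers, gaussian_xyz, radius=0.1):
        # human_tok: (B, 16, dim), obj_tok: (B, N, dim)
        # part_centers: (B, 16, 3) body-part centroids, gaussian_xyz: (B, N, 3)
        dist = torch.cdist(part_centers, gaussian_xyz)   # (B, 16, N)
        mask = dist > radius  # True = blocked; restrict attention to near-contact pairs
        # NOTE: rows with no unmasked entry (no Gaussian within `radius`) would need
        # special handling in practice to avoid NaN attention weights.
        h, _ = self.obj_to_human(
            human_tok, obj_tok, obj_tok,
            attn_mask=mask.repeat_interleave(self.n_heads, dim=0))
        o, _ = self.human_to_obj(
            obj_tok, human_tok, human_tok,
            attn_mask=mask.transpose(1, 2).repeat_interleave(self.n_heads, dim=0))
        delta_smplx = self.delta_pose(h.flatten(1))      # (B, pose_dim) pose correction
        delta_gauss = self.delta_xyz(o)                  # (B, N, 3) object Gaussian offsets
        return delta_smplx, delta_gauss
```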
Strengths:
The entity-aware cross-attention HOI module that exchanges information between human and object streams is a clear conceptual step beyond (i) human-only reconstructions and (ii) "single motion field" 4DGS approaches. Using distinct deformation models (hexplane for humans, CHS with learned tangents for objects) is a thoughtful design that recognizes different motion statistics and priors. (Sec. 3; Fig. 2 p.4).
Quality. The technical pipeline is well specified, with explicit formulas for CHS interpolation, attention, and the integrated training objective (Eqs. (1)–(9), (13)–(20)). Object velocities as learnable tangents and depth-guided supervision for human scale refine geometry and motion fidelity (Sec. 3.1–3.3). Ablations show each component matters: replacing CHS with an MLP hurts PSNR by ~0.5, and removing HOI drops PSNR by ~0.65 (Table 4).
Quantitative breadth. Comparisons cover NeRF‑based baselines, human‑centric 3DGS models (ExAvatar), and 4DGS variants; results are strong on three public datasets with scene‑wise breakdown (Tables 1–3; pp. 6–9).
Clarity. Clear diagrams for the full pipeline, human/object feature construction, and HOI block (Figs. 2, 6–8). The appendix explains the 16‑part feature tokens and the attention masking with gains at contact and manipulation regions where many methods struggle (qualitative examples in Figs. 3–5). The architecture is compatible with established human priors (SMPL‑X) and 3DGS, making it a promising drop‑in upgrade for interaction‑heavy scenes.
Weaknesses:
Limited Technical Novelty: The main idea seems to be to reconstruct the human and the object separately and then to combine them with some clever tricks.
Missing Baselines: There are no comparisons with baselines that use a 2D map + CNN formulation (e.g., AnimatableGaussians), for which source code is available.
GASPACHO (Mir et al.) is not referenced even though it addresses a similar problem.
Ambiguity around the diffusion prior and fairness of comparisons.
The object initialization uses a diffusion prior with SDS from a “representative frame,” but the paper does not specify the exact model, guidance setup, or whether the baselines benefit from comparable priors (Sec. 3.1). Since this prior can inject strong shape cues, fairness would improve by (a) standardizing priors across methods or (b) reporting results without the diffusion prior.
Evaluation scope and metrics at interaction regions.
While PSNR/LPIPS are reported, there is no metric that focuses on contact fidelity (e.g., penetration/float, hand–object distance statistics) or human pose accuracy (MPJPE/PVE) on BEHAVE or ARCTIC. Given the paper's motivation, region-specific metrics would substantiate the claimed contact consistency (Sec. 4.3–4.4; Figs. 3–5); a minimal MPJPE/PVE sketch is given after this list.
Heavy reliance on external modules without sensitivity analysis.
Results depend on (i) SMPL‑X regressors, (ii) segmentation for humans/objects (used in losses), and (iii) metric depth estimation scaled by COLMAP (Eq. (5), Scene/Object loss details pp. 14–15). The paper does not quantify sensitivity to errors in these modules or align choices across baselines.
Computational and memory cost of the HOI attention.
The object tokens are per‑Gaussian features (Appendix 6.1–6.2). Even with 32‑dim embeddings, cross‑attention between 16 human part tokens and object Gaussians can be heavy; the paper gives training time but not inference FPS or memory footprints vs 4DGS/ExAvatar. A complexity analysis and timing table are missing (Sec. 4.1).
Design choices need deeper ablations.
The distance mask is described qualitatively (relative distances) but its exact form, scaling, sparsification, and effect are not ablated.
The choice of 16 body parts and key‑frame interval = 4 (Sec. 4.1) seems fixed; there is no study of granularity vs. quality/speed.
Only one CHS parameterization is tested; alternatives such as per‑object SE(3) + per‑Gaussian residuals or adaptive knot placement could be competitive.
Limitations around low‑baseline videos remain open.
The paper notes failure modes when camera motion is minimal (COLMAP degradation) and suggests joint pose optimization as future work (Conclusion, p. 9). It would help to include at least a small experiment demonstrating how much performance drops and whether the HOI module mitigates it.
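For concreteness, the pose metrics requested above could be computed roughly as follows; variable names and shapes are assumptions, and errors are usually reported in millimetres after converting units.

```python
import torch

def mpjpe(pred_joints, gt_joints, root_idx=0):
    """Mean per-joint position error after root (pelvis) alignment.

    pred_joints, gt_joints: (T, J, 3) predicted / ground-truth joints per frame.
    """
    pred = pred_joints - pred_joints[:, root_idx:root_idx + 1]
    gt = gt_joints - gt_joints[:, root_idx:root_idx + 1]
    return (pred - gt).norm(dim=-1).mean()

def pve(pred_verts, gt_verts):
    """Per-vertex error between predicted and ground-truth meshes of shape (T, V, 3)."""
    return (pred_verts - gt_verts).norm(dim=-1).mean()
```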
Questions:
My main concerns are about limited technical novelty and missing 2D map + CNN comparisons; there are clear blurry artifacts in the final results.
Could the authors explain the decision to use a feature-based representation for 3D human and object reconstruction rather than a 2D map formulation? The 2D map formulation has been shown to clearly outperform a feature-based formulation, and I find this design decision baffling.
It seems that the authors chose to start from HUGS and ExAvatar as their baseline and develop their method from there. It would have made more sense, in my opinion, to start from a baseline that uses a 2D map formulation and iterate from there.
As such, I am inclined towards rejection.
Fully AI-generated

HOIGS: Human-Object Interaction Gaussian Splatting from Monocular Videos
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
This paper proposes a method for reconstructing human-object interaction from monocular videos based on the popular 3D Gaussian Splatting. Existing 3DGS methods fail to accurately reconstruct the interaction area, which is a critical component of human-object interaction reconstruction. The method uses a hexplane for the human and a Cubic Hermite Spline for the object, and proposes an HOI module based on cross-attention to model the human-object interaction. This leads to better performance on human-object interaction datasets such as BEHAVE.
Strengths:
1. The paper addresses the critical problem that existing (1) general 3DGS methods and (2) human-oriented 3DGS methods cannot accurately reconstruct human-object interaction. The qualitative results demonstrate that the approach solves this problem meaningfully.
2. The methodology section is detailed enough for researchers that are familiar with 3DGS or 3D human reconstruction to implement most of the components of the paper based solely on the paper.
3. The performance gap between existing SOTA methods and HOIGS is significant in highly interacting datasets like BEHAVE.
Weaknesses:
1. The qualitative results in the paper only show the aggregate 3D reconstruction of the human, object, and scene. However, it seems highly likely that the 3D human or object components may not be well reconstructed in occluded regions. Qualitative visualizations focusing solely on the 3D human and object, or even only the 3D human (as in Figure 8 of ExAvatar), would be critical to properly assess the performance of HOIGS.
2. The paper describes that the 3D object is first reconstructed using a diffusion prior based on SDS loss (similar to Zero-123), and this initial model is used as the initialization for 3D Gaussians. This pipeline appears rather complex and may not be the most straightforward approach for 3D object reconstruction. Could the authors clarify why this particular design was chosen? It would strengthen the paper to include comparisons with alternative approaches, such as SDF-based methods (e.g., Vid2Avatar) or point-based methods (e.g., Dust3r), specifically for the object reconstruction task.
3. Section 3.1 introduces a human deformation module. What would happen if this component were replaced with ExAvatar? Is there a reason the authors did not directly adopt ExAvatar for human deformation?
4. It would be interesting to see whether the contact-based masking strategy from CONTHO (CVPR 2024, Nam et al.) could improve performance compared to the current cross-attention mechanism in HOIGS. In particular, the CRFormer module in CONTHO might serve as a drop-in replacement for the HOI module.
5. The paper mentions a “distance mask B” in the HOI module. Could the authors elaborate on how this mask is generated and what its specific purpose is in the cross-attention computation?
6. A quantitative comparison with existing 3D human–object reconstruction methods such as PHOSA (ECCV 2020, Zhang et al.), CHORE (ECCV 2022, Xie et al.), and CONTHO (CVPR 2024, Nam et al.) would be beneficial to analyze the 3D reconstruction quality of results by HOIGS.
7. How are human-based 3D Gaussian reconstruction methods like ExAvatar fairly compared against HOIGS, given that they focus solely on 3D textured human reconstruction? Clarifying the evaluation scope and fairness of comparison would be helpful.
8. Section 3.4 describes the reconstruction of the background, while human and object reconstructions are modeled separately. How are these separately reconstructed components aligned or placed in a common 3D coordinate space?
9. In Figure 4, the qualitative results for 4DGS and E-D3DGS appear to correspond to different time frames, as the wooden box held by the human is in a different position. Is this discrepancy due to using different frames, or is it an inherent limitation of those methods?
10. The paper mentions the concept of “contact” multiple times, yet no contact-specific methodology is presented. Clarifying whether contact information is explicitly modeled, inferred, or simply discussed conceptually would improve the technical completeness of the paper.
Questions:
Listed as part of the weaknesses above.

Fully human-written