SCOUT: Spatial-Aware Continual Scene Understanding and Switch Policy for Embodied Mobile Manipulation
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper presents SCOUT, which addresses key challenges for autonomous robots performing navigation and manipulation in complex environments and achieves state-of-the-art performance on the ALFRED benchmark. SCOUT introduces two main components: Spatial-Aware Continual Scene Understanding with a Scene Modelling Module, and a Switch Policy; together they coordinate navigation and manipulation. The experiments demonstrate SCOUT's effectiveness in navigating and manipulating objects in complex, long-horizon tasks with varying degrees of guidance.
SCOUT combines Spatial-Aware Continual Scene Understanding with an adaptive Switch Policy, which allows real-time switching between long-term planning and immediate task handling. This flexibility improves both task success and efficiency. The experimental results on the ALFRED benchmark demonstrate SCOUT's superiority, surpassing previous methods such as DISCO. The experimental design is comprehensive, thoroughly evaluating SCOUT's overall performance and the effectiveness of each component.
1. There are many powerful vision foundation models (GroundingDINO, DINOv1-v3, SAM, Embodied-SAM, etc.) that could achieve functionality similar to the scene understanding and mask query modules. Although the experiments demonstrate its effectiveness, the motivation for training a dedicated semantic segmentation model is unclear.
2. The navigation functionality is too simple: the environment used in the experiments lacks obstacles (it is almost a clean floor), so the agent can move around easily without any real path planning.
3. The ALFRED benchmark does not reflect the latest developments in embodied AI in terms of visual realism, task complexity, and interaction diversity. The authors should evaluate on a more recent, challenging benchmark/simulator.
Please refer to the weaknesses above.
Fully human-written
SCOUT: Spatial-Aware Continual Scene Understanding and Switch Policy for Embodied Mobile Manipulation
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The manuscript pursues improvements in 3D spatial awareness for mobile manipulation tasks in the ALFRED simulated environments. It also claims novelty for a proposed dual-planning approach, implemented as what is referred to as a ‘Switch Policy’, which enables a short-term planner to interrupt the task execution of a long-term planner when a more immediate goal becomes available.
- The paper is well-written and well-organized.
- The paper provides a good number of experiments, enabling discussion and insights to be drawn.
- The paper considers a compelling task in embodied AI for mobile manipulation.
- L16-17: The manuscript states, “Spatial-Aware Continual Scene Understanding with a Scene Modeling Module for effective scene modeling…”. This statement is circular and ambiguous without context; it does not really add much. Please rephrase.
- Section 3.2: I have some concern that the proposed approach — in particular, the Switch Policy — is tailored to the ALFRED environment. I would feel much more confident if the benefits of the proposed approach were also demonstrated in another simulator/dataset or in the real world.
- Section 3.2: I want to explore why the Switch Policy is necessary. Alternative planner designs are emerging in which a reasoning agent leverages an adaptive contextual representation (map, local scene graph, keyframe history) and a balanced (re-)planning frequency; together, these may provide sufficiently adaptive behavior in a single planner, without the two-planner design.
- L119-124: ‘Neural policy’ is not the most informative dimension of comparison between the proposed method and the related work; perhaps a more defining feature of the proposed method could be emphasized relative to the limitations of the prior art?
- Table 1: Why do DISCO results change much less dramatically when step-by-step instructions are no longer available, compared to the proposed method?
Fully human-written
SCOUT: Spatial-Aware Continual Scene Understanding and Switch Policy for Embodied Mobile Manipulation
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes SCOUT, a unified framework for navigation–manipulation coordination. SCOUT addresses loss of historical context, inconsistent scene representation, and rigid control strategies through (1) a Spatial-Aware Continual Scene Understanding module that builds a temporally consistent and semantically rich 3D scene representation using cross-attention between current and historical observations, coupled with a Mask Query Module for precise interaction mask generation without relying on depth estimation, and (2) a Switch Policy that dynamically alternates between long-term navigation planning and short-term reactive manipulation when interaction opportunities arise. The proposed method is evaluated on the ALFRED benchmark and achieves state-of-the-art success rates.
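To make the depth-free lifting concrete, the following is a minimal sketch of the core idea as this reviewer understands it; all shapes, helper names, and the bilinear-sampling step are assumptions, not the paper's actual implementation. Each 3D map cell is projected into the current frame and pulls the 2D feature at that pixel, so no depth prediction is required:

```python
import torch
import torch.nn.functional as F

def pull_features_for_map(feat_2d, cells_world, K, T_world_to_cam):
    """Sample 2D image features for 3D map cells by projecting the cells
    into the current view (no depth estimation involved).

    feat_2d:        (C, H, W) image feature map from a 2D backbone
    cells_world:    (N, 3) world-frame centers of the scene-map cells
    K:              (3, 3) camera intrinsics
    T_world_to_cam: (4, 4) world-to-camera extrinsics
    Returns (N, C) sampled features and an (N,) visibility mask; cells
    that are not visible would keep their previous (temporal) features.
    """
    C, H, W = feat_2d.shape
    ones = torch.ones(cells_world.shape[0], 1, dtype=cells_world.dtype)
    pts_cam = (T_world_to_cam @ torch.cat([cells_world, ones], dim=1).T).T[:, :3]
    visible = pts_cam[:, 2] > 0.1  # only keep cells in front of the camera
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)  # perspective divide
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1)
    sampled = F.grid_sample(feat_2d[None], grid[None, :, None, :],
                            align_corners=True)  # (1, C, N, 1)
    return sampled[0, :, :, 0].T, visible
```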
- The proposed method achieves strong performance over prior work with large margins.
- Updating the semantic spatial map without depth estimation is intriguing and sensible.
- The proposed semantic spatial map can be learned end-to-end, implying its applicability to other learning-based modules.
The spatial-aware continual scene understanding module assumes perfect actuation, i.e., that the robot can move to the adjacent cell with no error. How sensitive is the scene understanding module to actuation errors, and does the proposed method still work under imperfect actuation?
- The authors argue that previous depth-object-mask co-projection (L050) causes error accumulation from inaccurate depth-semantic lifting, but this is not supported by any evidence. Relevant quantitative analyses are missing.
The Switch Policy is inspired by a specific failure mode in a downstream task, which raises concerns about its generalizability. What if the field of view were simply made larger? Would the Switch Policy still be needed in that case?
In the Switch Policy, the short-term planner predicts a binary indicator of whether the current state is manipulable given an egocentric observation. Why not just use the semantic segmentation masks? If the agent can manipulate, some manipulable objects should be in its view and their masks should be detected accordingly.
As mentioned in Sec. 4.3, SCOUT is sensitive to the choice of grid size, which is specifically tuned to the downstream task. It is nontrivial to determine this hyperparameter for new downstream tasks.
The proposed approach is validated on a single benchmark, which raises generalizability concerns. Can the proposed method be used for other types of embodied mobile manipulation, such as HomeRobot, TEACh, etc.?
- Can the proposed approach be extended to other datasets with unknown camera parameters?
How much computational cost does a high-resolution semantic map require, given that its space complexity is $\Theta(HW)$?
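As a rough illustration of this scaling question (the 256-channel float32 feature map is an assumed configuration, not a figure from the paper), the footprint grows linearly in the cell count and thus quadratically in the resolution:

```python
def map_memory_mb(h, w, channels=256, bytes_per_val=4):
    """Memory of a dense (h, w, channels) float32 scene feature, in MB."""
    return h * w * channels * bytes_per_val / 1e6

print(map_memory_mb(100, 100))  # 10.24 MB at the 100 x 100 grid used in the paper
print(map_memory_mb(200, 200))  # 40.96 MB: doubling resolution quadruples memory
```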
In Table 3, what is "Image Semantic Seg."? Is it from a pretrained segmentation model? Its description is unclear. In addition, are both the "Map" and "Image" modules learned on the same training dataset?
Fully human-written
SCOUT: Spatial-Aware Continual Scene Understanding and Switch Policy for Embodied Mobile Manipulation
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper addresses embodied mobile manipulation on the ALFRED benchmark. The authors argue that prior agents suffer from three coupled issues: (1) historical information is lost when policies act only from the current egocentric view, (2) scene representations are inconsistent because 2D predictions are lifted to 3D through noisy depth, and (3) execution is rigid because the agent cannot interrupt a long navigation plan when a nearer manipulable target appears. The proposed method, SCOUT, has two main components. First, a Spatial-Aware Continual Scene Understanding module maintains a BEV-like 3D scene feature over time using spatial cross attention (projecting 3D points into the current image to fetch the right 2D features, thus avoiding depth estimation) and temporal cross attention (fusing the previous scene feature to preserve memory). It also has a mask-query module that pools object-specific 3D features and aligns them with 2D image features to produce pixel-level interaction masks. Second, a Switch Policy combines a rule-based long-term planner (BFS over the semantic scene map) with a learned short-term planner that can interrupt navigation whenever an object is both semantically correct and spatially reachable. On ALFRED, this yields higher success rates than prior work, including DISCO, in both seen and unseen settings and improves path-length–weighted metrics.
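A minimal sketch of this dual-planner decision logic may help fix ideas; the function and parameter names are a paraphrase rather than the paper's code, and the real short-term policy is a learned classifier, not a callable stub:

```python
from typing import Callable, Optional, Sequence, Tuple

def switch_policy_step(
    obs: object,                                    # current egocentric observation
    goal: str,                                      # target object class
    is_manipulable: Callable[[object, str], bool],  # learned short-term classifier
    query_mask: Callable[[object, str], object],    # mask-query module
    plan_long_term: Callable[[str], Sequence[str]], # BFS over the semantic scene map
) -> Tuple[str, Optional[object]]:
    """One decision step: the short-term policy may preempt the long-term
    BFS plan whenever the current view already contains a reachable,
    semantically correct target (e.g., a closer instance that appears
    after the agent turns)."""
    if is_manipulable(obs, goal):
        # Interrupt navigation and interact immediately, saving the
        # remaining steps of the long navigation plan.
        return "manipulate", query_mask(obs, goal)
    # Otherwise keep following the long-term navigation plan.
    return plan_long_term(goal)[0], None
```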
1. The paper is well motivated: it spells out three concrete failure modes in existing embodied agents (history loss, inconsistent 3D grounding, non-adaptive execution) and maps each to a specific component of the method, so the design is coherent.
2. The scene-understanding part is a sensible adaptation of BEV-style and deformable-attention ideas to single-view, temporally accumulated embodied data: instead of lifting 2D to 3D with predicted depth (which causes error accumulation), it pulls 2D features from projected 3D points, and then keeps temporal consistency with a dedicated temporal cross attention.
3. The Switch Policy directly targets a real ALFRED failure case: once the agent turns, a closer instance of the target may appear, and executing the long plan is wasteful. The proposed dual planner (long-term BFS + learned short-term classifier; a minimal BFS sketch is given after this list) is a simple but effective way to cut extra steps, which is supported by higher path-length–weighted metrics and the qualitative example.
4. Ablation studies are thorough. Removing temporal cross attention causes large drops; removing the switch policy reduces both success and efficiency; using ground-truth masks gives only small gains, which means the proposed mask-query module is already strong. This makes the main result credible.
5. The method achieves clearly better numbers than strong baselines under both step-by-step and goal-only instruction settings on the official test split, which is nontrivial for ALFRED.
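As referenced in point 3 above, the long-term component amounts to shortest-path search on a grid. A self-contained sketch of such a planner follows; the 4-connected occupancy-grid formulation is an assumption about the details:

```python
from collections import deque

def bfs_plan(occupancy, start, goal):
    """Shortest path on a 4-connected grid via breadth-first search.

    occupancy: 2D list of bools, True = blocked cell
    start, goal: (row, col) tuples
    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(occupancy), len(occupancy[0])
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and not occupancy[nxt[0]][nxt[1]] and nxt not in parent):
                parent[nxt] = cell
                queue.append(nxt)
    return None

grid = [[False, True], [False, False]]
print(bfs_plan(grid, (0, 0), (1, 1)))  # [(0, 0), (1, 0), (1, 1)]
```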
1. On the perception side, the contribution is mostly integrative. The method reuses established ingredients (BEV-style scene feature, deformable attention, 3D→2D projection, 2D–3D feature alignment) and repackages them for ALFRED. The novelty is more in the way these are combined and supervised than in a fundamentally new learning component.
2. The switch policy is only partly learned. The long-horizon component is still a hand-designed BFS over the semantic map, and only the short-horizon “is this manipulable now?” part is trained. This makes the contribution feel somewhat engineered and raises questions about portability to other simulators or to real robots where the semantic map is noisy.
3. The evaluation is confined to ALFRED. Because the method is tuned to ALFRED’s discretization (25 m × 25 m, 25 cm grid, 100 × 100 best) and to its task structure, the generality of the approach is not fully demonstrated. Even a small experiment on a second embodied benchmark would strengthen the claim.
4. The model is relatively heavy (100 × 100 grid, spatial and temporal deformable attention, two segmentation losses), and training uses 8 GPUs for a day, but there is no careful runtime/latency comparison to prior work. For practical embodied deployment, this information would be useful.
5. The paper itself admits that it cannot handle open-vocabulary objects or reason about objects hidden/contained inside others, which are active directions in current embodied AI.
1. Can the switching decision itself be learned end-to-end (for example, via RL over the two planners) rather than partially hand-coded?
2. How sensitive is performance to the 100 × 100 grid if room sizes or step sizes change? You show that 80 and 120 are worse, but would a different dataset require re-tuning this resolution?
Fully AI-generated |