ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (50%) | 6.00 | 4.00 | 2044 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (25%) | 6.00 | 3.00 | 2119 |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 1 (25%) | 4.00 | 3.00 | 2435 |
| Total | 4 (100%) | 5.50 | 3.50 | 2161 |
Review 1

Title: SeMa3D: Lifting Vision-Language Models for Unsupervised 3D Semantic Correspondence
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes SeMa3D for unsupervised dense semantic correspondence between 3D shapes, targeting non-isometric deformations and inter-class matching where functional-map pipelines struggle. The method lifts multi-view visual features from VFMs, enforces view-consistent colorization, augments descriptors with language embeddings for part names (see the descriptor-fusion sketch after this review), and optimizes a functional-map objective coupled with a region-aware contrastive loss that encodes part-to-part relations. Experiments show lower geodesic error than prior art on inter-class SNIS and on non-isometric SMAL and TOPKIDS, while matching the state of the art on near-isometric FAUST, SCAPE, and SHREC19.

Strengths:
- Originality: The paper lifts both visual and linguistic cues from VLMs to 3D surfaces and injects semantic priors through a region-aware contrastive loss, which is a clear step beyond purely geometric or visual-only descriptors.
- Quality: The method is carefully constructed with view-consistent colorization, multi-view back-projection, semantic region proposals, a functional-map objective, and an explicit semantic contrastive loss. Ablations isolate the contributions of colorization, SigLIP language embeddings, and the contrastive loss.
- Significance: SeMa3D substantially improves inter-class matching on SNIS and offers consistent gains on non-isometric datasets while retaining top performance on classical near-isometric benchmarks, suggesting practical benefits for cross-category and deformable scenarios.

Weaknesses:
- Performance relies on high-quality texture synthesis and zero-shot part segmentation. Failure modes from colorization artifacts or segmentation errors are only partially explored.
- The approach assumes a fixed part proposal set. It is unclear how sensitive results are to vocabulary choices, cross-category ambiguities, or missing parts.

Questions:
- How robust is SeMa3D to segmentation noise and to different part vocabularies? Please quantify sensitivity to incorrect or missing regions and report failure cases.
- Do results hold on real scanned meshes with holes and partiality, or in downstream tasks like cross-category part transfer or manipulation? Any small-scale study would be helpful.

EditLens Prediction: Fully AI-generated
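To make the descriptor fusion discussed in this review concrete, here is a minimal sketch of how multi-view visual features might be back-projected onto mesh vertices and concatenated with per-region language embeddings. The function name, array shapes, and view-averaging scheme are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def lift_and_fuse(view_feats, vis_vertex_ids, num_vertices, region_labels, text_embeds):
    """Back-project multi-view image features to mesh vertices and
    concatenate a per-region text embedding (illustrative sketch only).

    view_feats:     list of (P_i, Dv) arrays, one per rendered view,
                    holding a visual feature per visible pixel/point.
    vis_vertex_ids: list of (P_i,) arrays mapping each feature to the
                    vertex it projects onto in that view.
    region_labels:  (V,) int array of zero-shot part labels per vertex.
    text_embeds:    (R, Dt) array, one language embedding per part name.
    """
    Dv = view_feats[0].shape[1]
    acc = np.zeros((num_vertices, Dv))
    cnt = np.zeros(num_vertices)
    for feats, vids in zip(view_feats, vis_vertex_ids):
        np.add.at(acc, vids, feats)    # accumulate features per vertex
        np.add.at(cnt, vids, 1.0)      # count the views seeing each vertex
    visual = acc / np.maximum(cnt, 1.0)[:, None]  # average over views
    language = text_embeds[region_labels]         # (V, Dt) per-vertex text feature
    return np.concatenate([visual, language], axis=1)  # (V, Dv + Dt)
```

Averaging over views is only one plausible aggregation; visibility-weighted blending or max-pooling would be drop-in alternatives with different occlusion behavior.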
Review 2

Title: SeMa3D: Lifting Vision-Language Models for Unsupervised 3D Semantic Correspondence
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes SeMa3D, an unsupervised framework for dense 3D semantic correspondence that tackles non-isometric and inter-class matching settings where classic functional-map pipelines struggle. The method synthesizes view-consistent textures for untextured meshes, extracts multi-view semantic features with SD-DINO, augments them with SigLIP text embeddings keyed to zero-shot part proposals, and couples these descriptors with a functional-map objective plus a region-aware contrastive loss to inject part-level priors. Across SNIS (inter-class), SMAL/TOPKIDS (non-isometric), and FAUST/SCAPE/SHREC19 (near-isometric), SeMa3D matches or outperforms recent baselines, with especially notable gains on inter-class SNIS. Ablations indicate that view-consistent colorization, adding language features, and the region-aware contrastive term each contribute measurable improvements.

Strengths:
1. Concatenating SD-DINO image features with SigLIP text embeddings tied to zero-shot part labels yields descriptors that disambiguate semantically similar but geometrically dissimilar regions, which is key for inter-class cases.
2. The dynamic-margin formulation encodes distances over a semantic-region graph and complements the functional-map objective, improving robustness to segmentation noise (a sketch of this idea follows this review).

Weaknesses:
1. Success depends on a chain of external components (view-consistent texturing, multi-view rendering, zero-shot region proposals, SD-DINO, and SigLIP), so errors can cascade; real-world scans with incomplete geometry or texture, or unusual categories, might degrade performance. The paper could include an analysis of this error accumulation.
2. While SeMa3D shows improvements, part of the edge over DenseMatcher could stem from using SeMa3D's own zero-shot region proposals to run that baseline. DenseMatcher projects multi-view 2D foundation features to meshes, refines them with a 3D network, and solves correspondences via functional maps, evaluated on a colored-mesh dataset (DenseCorr3D) for robotic manipulation.

Questions:
What are the biggest differences between SeMa3D and DenseMatcher? What is the performance of SeMa3D on the DenseCorr3D benchmark?

EditLens Prediction: Moderately AI-edited
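The dynamic-margin formulation this review highlights admits a simple reading: a triplet-style loss whose margin grows with the hop distance between the anchor's and the negative's semantic regions, so that confusing far-apart parts is penalized more than confusing adjacent ones. The sketch below illustrates that reading only; the function name, the sampling of triplets, and the linear margin schedule are assumptions, not the paper's exact loss.

```python
import torch

def region_aware_contrastive(desc, anchor, pos, neg, region, region_dist,
                             base_margin=0.2, scale=0.1):
    """Triplet loss with a margin that scales with the semantic-region
    graph distance between anchor and negative (illustrative sketch).

    desc:            (V, D) vertex descriptors.
    anchor/pos/neg:  index tensors of sampled triplets.
    region:          (V,) part label per vertex.
    region_dist:     (R, R) hop distances on the semantic-region graph.
    """
    d_pos = (desc[anchor] - desc[pos]).pow(2).sum(-1)
    d_neg = (desc[anchor] - desc[neg]).pow(2).sum(-1)
    # Parts farther apart on the region graph demand a larger margin,
    # so cross-part confusions cost more than within-part ones.
    margin = base_margin + scale * region_dist[region[anchor], region[neg]].float()
    return torch.relu(d_pos - d_neg + margin).mean()
```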
Review 3

Title: SeMa3D: Lifting Vision-Language Models for Unsupervised 3D Semantic Correspondence
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper tackles dense 3D semantic correspondence under severe non-isometry and inter-class matching scenarios. The authors propose SeMa3D with the following components:
1. synthesizes multi-view-consistent, natural textures from generative models;
2. lifts multi-view visual features from SD-DINO and fuses them with language embeddings from SigLIP to compose vertex features;
3. optimizes a functional-map correspondence with a region-aware contrastive loss guided by part proposals from a pre-trained model.
On a wide range of inter-class, intra-class, and non-isometric benchmarks, SeMa3D produces superior performance compared to various baseline methods (a sketch of the geodesic-error metric used to score these benchmarks follows this review).

Strengths:
1. The overall paper has a clear problem focus and motivation: in inter-class and non-isometric cases, simple geometric cues alone usually fail, so explicitly integrating semantic features from a wide range of pre-trained models should help.
2. The combination of view-consistent colorization and the fusion of SD-DINO visual features with SigLIP text embeddings is straightforward and works well.
3. The authors conduct extensive experiments comparing the proposed method against a wide range of baselines, along with extensive ablation experiments validating each proposed component.

Weaknesses:
The work relies heavily on directly composing off-the-shelf V(L)FMs and a texturing model. While the performance gain from these components is naturally expected, the authors do not fully analyze their potential failure modes, e.g., bad textures, occlusion bias, or prompt sensitivity. At least some qualitative results and analysis on inconsistent textures or noisy prompts would be helpful.
Similarly, part segmentation outputs can be very noisy; while the loss uses margins to tolerate this, the paper lacks a qualitative robustness analysis versus segmentation quality.
The adopted multi-stage pipeline may be heavy. The current paper does not give thorough runtime and resource details, and it would be good to compare these costs against the other baselines.

Questions:
Mostly see the weaknesses above; some other questions:
1. What are the end-to-end runtimes per pair, and what are the memory requirements? Is there any caching, or are there downstream speedups once features are precomputed? What is the cost for training and testing, respectively?
2. How sensitive is the overall performance to the specific part list or to SigLIP's language prompts across datasets?

EditLens Prediction: Fully human-written
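All of the benchmarks cited in these reviews are scored by geodesic error, so a brief sketch of the standard metric may help calibrate the comparisons: the mean geodesic distance on the target mesh between each predicted match and its ground-truth match, normalized by the square root of the target's surface area. Passing a precomputed all-pairs distance matrix is an assumption made for brevity; in practice these distances are typically obtained with a fast-marching or heat-method geodesic solver.

```python
import numpy as np

def mean_geodesic_error(pred_corr, gt_corr, geodesic_dists, surface_area):
    """Average geodesic distance between predicted and ground-truth target
    vertices, normalized by sqrt(area) for cross-shape comparability.

    pred_corr, gt_corr: (V,) arrays of target-vertex indices, one entry
                        per source vertex.
    geodesic_dists:     (Vt, Vt) precomputed geodesic distance matrix
                        on the target mesh.
    """
    errs = geodesic_dists[pred_corr, gt_corr]   # per-vertex geodesic error
    return errs.mean() / np.sqrt(surface_area)
```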
Review 4

Title: SeMa3D: Lifting Vision-Language Models for Unsupervised 3D Semantic Correspondence
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes SeMa3D, a new framework for unsupervised dense semantic correspondence between 3D shapes, especially targeting non-isometric deformations and inter-class matching (e.g., human-to-horse). The core idea is to leverage vision-language models (VLMs) to extract semantic features from both the visual and linguistic domains and integrate them into a functional-map framework (sketched in its textbook form after this review) to establish robust and semantically meaningful 3D correspondences without any manual annotations.

Strengths:
- The method introduces vision-language models for 3D correspondence, using SD-DINO (visual) and SigLIP (linguistic) models to extract semantic features from multi-view renderings, and combines visual and text embeddings to form rich, semantic-aware vertex descriptors.
- The method applies SyncMVD for high-quality, cross-view-consistent texture synthesis on raw 3D shapes, which ensures stable and consistent multi-view feature lifting.
- The method achieves zero-shot semantic region proposal by using SATR for zero-shot 3D part segmentation (e.g., head, leg, torso) without manual annotations, enabling semantic region-aware matching.

Weaknesses:
- The method uses several pretrained models, so its performance may depend strongly on theirs. Are there any correction schemes if the pretrained models are inaccurate?
- The method is designed for 3D meshes with consistent topology; it is limited to mesh-based shapes and cannot be applied directly to point clouds or unstructured 3D data.
- The pipeline involves multiple stages (texture synthesis, multi-view rendering, VLM feature extraction, and functional-map optimization), so it may be slower than end-to-end or purely geometric methods.
- There is no explicit handling of severe topological changes. While robust to topological noise (e.g., TOPKIDS), the method may struggle with extreme topological variations not seen during training.

Questions:
Please answer and discuss the problems raised in the Weaknesses section.

EditLens Prediction: Fully AI-generated
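For context on the functional-map framework referenced throughout these reviews, the sketch below shows the textbook least-squares pipeline in the style of Ovsjanikov et al.: project descriptor functions into each shape's Laplace-Beltrami eigenbasis, solve for the spectral map C, and recover point correspondences by nearest neighbor in the aligned basis. SeMa3D's actual objective adds the semantic and contrastive terms the reviews describe; this is the unregularized baseline only.

```python
import numpy as np
from scipy.spatial import cKDTree

def functional_map(evecs_src, evecs_tgt, feat_src, feat_tgt):
    """Textbook functional-map estimation from descriptors (no regularizers).

    evecs_*: (V, k) Laplace-Beltrami eigenvectors of each shape.
    feat_*:  (V, D) matching descriptor functions on each shape.
    Returns the (k, k) spectral map C and a point-to-point correspondence.
    """
    # Project descriptors into each spectral basis: A, B are (k, D).
    A = np.linalg.lstsq(evecs_src, feat_src, rcond=None)[0]
    B = np.linalg.lstsq(evecs_tgt, feat_tgt, rcond=None)[0]
    # Solve C A ~= B in least squares, written as A^T C^T ~= B^T.
    C = np.linalg.lstsq(A.T, B.T, rcond=None)[0].T
    # Recover point correspondences: nearest neighbor in the aligned basis.
    tree = cKDTree(evecs_src @ C.T)
    _, corr = tree.query(evecs_tgt)  # for each target vertex, a source vertex
    return C, corr
```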