Salient Object Ranking via Cyclical Perception-Viewing Interaction Modeling
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes a method for salient object ranking (SOR). A story prediction module predicts a caption for the image, and a guided ranking module predicts the saliency rankings. A cyclical interaction module aligns and refines the caption and the ranking iteratively. The experimental results appear to show that the proposed method outperforms the previous SOTA.
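To make the review self-contained, here is how I read the cyclical interaction; a minimal sketch, assuming DETR-style learnable object queries and standard cross-attention (the module names, dimensions, and round count are my assumptions, not the authors' implementation):

```python
import torch.nn as nn

class CyclicalInteraction(nn.Module):
    """Hypothetical sketch (my paraphrase, not the authors' code) of the
    cyclical perception-viewing loop: the Story Prediction (SP) side refines
    caption tokens against object queries, the Guided Ranking (GR) side
    refines object queries under caption guidance, iterated for K rounds."""

    def __init__(self, dim=256, num_queries=100, num_rounds=3):
        super().__init__()
        self.num_rounds = num_rounds
        self.obj_queries = nn.Embedding(num_queries, dim)  # learnable object queries
        self.img_attn = nn.MultiheadAttention(dim, 8, batch_first=True)  # queries <- image
        self.sp_attn = nn.MultiheadAttention(dim, 8, batch_first=True)   # caption <- queries
        self.gr_attn = nn.MultiheadAttention(dim, 8, batch_first=True)   # queries <- caption
        self.rank_head = nn.Linear(dim, 1)  # per-query saliency rank score

    def forward(self, img_feats, cap_feats):
        # img_feats: (B, N, dim) visual tokens; cap_feats: (B, T, dim) caption tokens
        q = self.obj_queries.weight.unsqueeze(0).expand(img_feats.size(0), -1, -1)
        for _ in range(self.num_rounds):
            q, _ = self.img_attn(q, img_feats, img_feats)    # ground queries in the image
            cap_feats, _ = self.sp_attn(cap_feats, q, q)     # SP: refine the "story"
            q, _ = self.gr_attn(q, cap_feats, cap_feats)     # GR: caption-guided ranking
        return self.rank_head(q).squeeze(-1)  # (B, num_queries) rank scores
```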
- The cyclical interaction uses the caption to guide salient object ranking.
- The ablations show the effectiveness of SITA and CMQC in the proposed method.
- The design of the segmentation head is unclear. The performance increase could potentially be due to using a strong pretrained segmentation model.
- The retrained QAGNet has lower scores across metrics than those reported in the original paper. This is critical, since the results of the proposed method do not outperform the reported results of QAGNet.
- Is the segmentation head a pretrained segmentation model, or something else? A strong segmentation model could favor the MAE metric (see the sketch after these questions).
- What is the impact of the number of object queries on the results? An ablation study would be beneficial to see this impact.
- Would a stronger text decoder lead to better performance?
- What is the reason for the decreased performance of the retrained QAGNet? Did the authors use different training details, different evaluation settings, or something else?
- Intuitively, the proposed method could also improve performance on the image captioning task. I am wondering whether salient object ranking could help with image captioning; it would be interesting to see results compared with SOTA image captioning methods.
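On the MAE concern above: my assumption is that MAE in SOR evaluation is the mean absolute error between the predicted and ground-truth ranked saliency maps, so the predicted map is rendered from the predicted segmentation masks and sharper masks directly lower the error. A sketch under that assumption:

```python
import numpy as np

def sor_mae(pred_map, gt_map):
    """Assumed MAE for SOR evaluation (my reconstruction, not from the paper):
    mean absolute error between predicted and GT ranked saliency maps,
    both normalized to [0, 1]."""
    pred = pred_map.astype(np.float64) / max(pred_map.max(), 1e-8)
    gt = gt_map.astype(np.float64) / max(gt_map.max(), 1e-8)
    return float(np.abs(pred - gt).mean())
```

If this is how the metric is computed, gains in MAE could come from mask quality rather than ranking quality, which is why the provenance of the segmentation head matters.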
Fully human-written
Salient Object Ranking via Cyclical Perception-Viewing Interaction Modeling
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes a novel framework that models the cyclical interaction between perception and viewing for the Salient Object Ranking (SOR) task. The method introduces two key components: a Story Prediction (SP) module that simulates the human perceptual process through image caption generation, and a Guided Ranking (GR) module that predicts saliency rankings under the guidance of the SP module.
(1) Novel cognitive-inspired framework.
The paper introduces a cyclical perception–viewing model inspired by human visual cognition, which is strongly supported by established cognitive and psychological theories. The introduction is also easy to follow.
(2) Extensive experiments.
The paper conducts both qualitative and quantitative experiments, and also provides an analysis of inference time. Moreover, the visualized experimental results clearly and intuitively demonstrate the improvements achieved by the proposed method.
(1) The paper lacks a clear comparison with the recent top-down method, Language-Guided Salient Object Ranking (CVPR 2025), and its performance remains inferior to the results reported in that study.
(2) In the Method Overview section, the symbols used in the equations do not correspond to those shown in Figure 2, which makes the inputs and outputs of each module confusing to follow.
(3) The experimental section mainly provides data and setup details but offers limited analysis or discussion to explain the observed results.
(1) Please explain the differences between the proposed method and Language-Guided Salient Object Ranking (CVPR 2025). Moreover, the performance of the proposed method is still inferior to that existing work.
(2) In Eq. (1), when $l=1$, what does $Q_{l-1}$ refer to? (See the sketch after this list.)
(3) In Table 2, for Setting II (“independent caption generation”), there is a performance improvement even without interaction between caption and visual features, which is confusing. Could the authors clarify this behavior?
(4) How is the ground-truth (GT) caption obtained?
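Regarding question (2): in standard query-based decoders, the usual convention is that $Q_0$ is a set of learnable query embeddings fed into the first layer. A minimal sketch of that convention, which I assume (but cannot confirm) the paper follows:

```python
import torch.nn as nn

class QueryDecoder(nn.Module):
    # Hypothetical illustration of the common convention for Eq. (1):
    # Q_0 is a learnable embedding table, and layer l maps Q_{l-1} to Q_l.
    def __init__(self, dim=256, num_queries=100, num_layers=6):
        super().__init__()
        self.q0 = nn.Embedding(num_queries, dim)  # Q_0: learnable initial queries
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, memory):
        # memory: (B, N, dim) encoder features
        q = self.q0.weight.unsqueeze(0).expand(memory.size(0), -1, -1)  # Q_0
        for layer in self.layers:  # Q_l = DecoderLayer_l(Q_{l-1}, memory)
            q = layer(q, memory)
        return q
```

If the paper follows this convention, stating it explicitly would resolve the ambiguity.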
Fully human-written
Salient Object Ranking via Cyclical Perception-Viewing Interaction Modeling
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The authors propose a Salient Object Ranking (SOR) approach that consists of two modules: the Guided Ranking (GR) module and the Story Prediction (SP) module, whose interaction enhances the overall performance of SOR.
The design of this model aligns well with theories of the human cognitive system, such as predictive coding, and the writing is clear.
Some experiments and details are not clearly explained. For example, in the experimental section, how was the choice of 24,000 epochs determined, and why is such a large number needed? Could this lead to overfitting?
In addition, it would be helpful to qualitatively present the interaction between object queries and text features, as well as the results under different values of K.
What training data are used for the segmentation head? Was it pre-trained on the COCO segmentation dataset? If it was trained only on the SOR dataset, would its segmentation generalization ability be affected?
Does the random selection of captions influence the results? It is recommended to include a discussion: for example, are the salient objects in the image always located in the main subject position described in the caption?
Lightly AI-edited |