ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 4 (100%) | 5.00 | 4.00 | 2558 |
| Fully human-written | 0 (0%) | N/A | N/A | N/A |
| Total | 4 (100%) | 5.00 | 4.00 | 2558 |
TIPS: A Text-Image Pairs Synthesis Framework for Robust Text-based Person Retrieval

Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes the TIPS framework, an automated system for synthesizing text-image pairs, designed to address the problem of data scarcity in text-based person retrieval tasks under zero-shot, few-shot, and cross-domain scenarios. Its core innovation lies in two efficient diffusion-based generators: SPG, which generates seed images, and IDPG, which expands images while preserving identity consistency. Additionally, it includes a comprehensive LLM/MLLM-integrated pipeline and a test-time augmentation strategy.

Strengths:
1. This paper is of practical value, as it addresses the issues of identity consistency and diversity in data synthesis, thereby expanding text-image pairs for text-based person retrieval.
2. The framework is comprehensive, covering the entire process from text generation to final training data synthesis, and can be extended to other multimodal synthesis tasks.
3. The experiments on dataset quality evaluation are convincing, as demonstrated by the results under zero-shot, few-shot, and generalization scenarios presented in the paper.

Weaknesses:
1. The paper presents qualitative results but lacks a quantitative evaluation of the identity consistency of IDPG-generated images (e.g., using a pretrained face or ReID model to compute feature similarity between generated image pairs; a sketch of such a check follows this review).
2. The overall pipeline quality relies on the accuracy of the MLLM serving as a “judge.” However, the potential biases and errors of the MLLM itself may be introduced into the synthesized data, which has not been thoroughly discussed.
3. The generation cost is relatively high; although the model is lightweight, producing 400k pairs of samples still requires considerable time and computational resources.

Questions:
1. Could a quantitative evaluation of identity consistency be conducted for the set of images generated by IDPG?
2. Does the TTA significantly increase the inference overhead?
3. The MLLM may make mistakes during the filtering and annotation process. Have you investigated potential errors and analyzed how these errors might affect the quality of the synthesized data and, consequently, the performance of downstream TPR models?
4. Does the generated data exhibit any “background or pose patterning” issues? Could you provide diversity statistics to illustrate this?

EditLens Prediction: Lightly AI-edited
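For concreteness, below is a minimal sketch of the kind of identity-consistency check the reviewer suggests: embedding all IDPG-generated images of one synthetic identity and reporting their mean pairwise cosine similarity. The folder layout (`synthetic_ids/id_0001`, `*.jpg`) is hypothetical, and a torchvision ResNet-50 with ImageNet weights stands in for the pretrained person-ReID or face encoder that an actual evaluation would use.

```python
# Minimal sketch of an identity-consistency check over IDPG outputs.
# Assumptions: images of one synthetic identity live in a folder; a torchvision
# ResNet-50 (ImageNet weights) stands in for a pretrained ReID/face encoder.
from pathlib import Path

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()          # keep 2048-d pooled features
backbone = backbone.eval().to(device)

preprocess = transforms.Compose([
    transforms.Resize((256, 128)),         # typical person-ReID aspect ratio
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return F.normalize(backbone(batch.to(device)), dim=-1)

def identity_consistency(folder):
    """Mean pairwise cosine similarity among images of one synthetic identity."""
    paths = sorted(Path(folder).glob("*.jpg"))
    feats = embed(paths)
    sim = feats @ feats.T                   # cosine similarities (features are L2-normed)
    off_diag = sim[~torch.eye(len(paths), dtype=torch.bool, device=sim.device)]
    return off_diag.mean().item()

# Example (hypothetical path): print(identity_consistency("synthetic_ids/id_0001"))
```

Averaging this score over all synthetic identities, and contrasting it with the similarity between images of different identities, would give the quantitative evidence the reviewer asks for in Question 1.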
TIPS: A Text-Image Pairs Synthesis Framework for Robust Text-based Person Retrieval

Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes a Text-Image Pairs Synthesis (TIPS) framework to address practical challenges of TPR in real-world scenarios, including zero-shot adaptation, few-shot adaptation, and robustness issues. Two person image generators, SPG and IDPG, are introduced to synthesize realistic, identity-consistent pedestrian images. Additionally, TIPS incorporates a caption generator and a filtering mechanism to enhance data quality. Furthermore, a test-time augmentation (TTA) method is proposed to further improve retrieval accuracy.

Strengths:
1. The paper provides a comprehensive exploration of practical challenges in TPR tasks, such as zero-shot adaptation, few-shot adaptation, and robustness, which are critical for real-world applications.
2. The experiments are extensive and the analysis is in-depth, providing valuable empirical insights.

Weaknesses:
1. **Logical Inconsistency**: In the Introduction, the paper argues that existing methods typically rely on real person images, limiting extensibility and scenario diversity. However, in the methodology, the collection of training data in this work also involves gathering real-person images, which appears inconsistent with the stated motivation.
2. **Presentation and Reproducibility**: The descriptions of SPG and IDPG in the methodology section are rather opaque, making it difficult to fully understand the specific generation processes. Moreover, the correspondence between the textual descriptions and Figure 2 is unclear, which further hinders comprehension. Additionally, the writing in the methods section lacks technical rigor and professionalism. For example, in the S3 stage, the paper merely states that the outputs are "further evaluated for identity and outfit consistency with the seed image," but does not specify how MLLMs are utilized for evaluation, what the evaluation criteria are, or how generation quality and identity consistency are measured. Such methodological details are essential for reproducibility and for ensuring the technical soundness of the proposed approach.
3. **Novelty**: The proposed framework is largely an engineering integration of existing generation techniques for data augmentation under different scenarios. While practically valuable, the paper lacks substantial methodological innovation, which may limit its impact on future research.
4. **Experimental Setup**: The zero-shot setting samples images from CUHK03, CUHK02, Market-1501, MSMT17, and VIPeR. However, the downstream dataset CUHK-PEDES contains images from CUHK03, Market-1501, and VIPeR, while ICFG-PEDES and RSTPReid contain images from MSMT17. This setup may lead to identity overlap, which contradicts the claimed zero-shot setting. Additionally, there is a concern that test images from these datasets may inadvertently be included in the training set, potentially leading to information leakage (a simple overlap check is sketched after this review).

Questions:
1. In Table 3 (generalization scenario), what is the difference between "raw" and "ours" in the training data? Please clarify this in the paper to avoid confusion.
2. The Introduction states that existing datasets suffer from poor text-image alignment, yet recent works [1,2] have focused on person image captioning. How does the proposed caption generation method ensure higher quality and greater diversity compared to these methods?

[1] Jiang J, Ding C, Tan W, et al. Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification[C]//Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 9220-9230.
[2] Tan W, Ding C, Jiang J, et al. Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 17127-17137.

EditLens Prediction: Lightly AI-edited
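The leakage concern in Weakness 4 could be probed with a coarse image-level duplicate check between the sampled training pool and the downstream test splits. The sketch below uses perceptual hashing via the `imagehash` library; the directory paths are placeholders, and this is only one possible check under the assumption that both splits are available as image folders.

```python
# Minimal sketch of an image-level leakage check between a training pool and a
# downstream test split, via perceptual hashing. Paths are placeholders.
from pathlib import Path

from PIL import Image
import imagehash

def hash_folder(folder, pattern="*.jpg"):
    """Map perceptual-hash string -> file path for every image under a folder."""
    hashes = {}
    for p in Path(folder).rglob(pattern):
        hashes[str(imagehash.phash(Image.open(p)))] = p
    return hashes

train_hashes = hash_folder("tips_training_pool")   # placeholder path
test_hashes = hash_folder("cuhk_pedes/test")       # placeholder path

overlap = set(train_hashes) & set(test_hashes)
print(f"{len(overlap)} near-duplicate images shared between train pool and test split")
for h in list(overlap)[:10]:
    print(train_hashes[h], "<->", test_hashes[h])
```

Exact pHash matches only catch (near-)duplicate frames; a fuller audit of the identity-overlap issue would compare hashes by Hamming distance and, where labels exist, map person IDs between the source ReID datasets and the TPR test sets.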
TIPS: A Text-Image Pairs Synthesis Framework for Robust Text-based Person Retrieval

Soundness: 3: good
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The authors propose a fully automated text-image pair synthesis framework, TIPS, to address critical challenges in Text-based Person Retrieval (TPR), such as poor zero-shot adaptability and the low quality and limited practicality of existing synthetic data. For the first time, they generate a high-fidelity, identity-consistent pedestrian image dataset with controllable resolution solely from textual descriptions, and further introduce a complementary enhancement strategy, Test-Time Augmentation (TTA).

Strengths:
1. This paper clearly identifies the current bottlenecks in the TPR task, and the authors' motivation for proposing an automated text-image pair generation pipeline is well-justified.
2. The authors also propose a plug-and-play Test-Time Augmentation (TTA) strategy that enhances the performance of existing methods, and experimental results demonstrate the superiority of their approach.

Weaknesses:
1. Regarding the proposed TTA module, although the experiments demonstrate its effectiveness, the introduction of this component appears somewhat abrupt relative to the overall motivation of the paper. The TTA mechanism seems not to be conceptually aligned with the core objective of the work.
2. For the proposed dataset, the paper (including the supplementary material) does not provide detailed statistical information or descriptive analysis. This lack of dataset characterization limits the reader's understanding of its scale, diversity, and quality.
3. The ablation study section is rather limited. For instance, one distinctive feature of TIPS is the ability to control the pixel quality of generated images. However, the paper does not investigate whether image resolution or pixel-level control affects the final retrieval performance.

Questions:
Detailed comments can be found in the weaknesses section.

EditLens Prediction: Lightly AI-edited
TIPS: A Text-Image Pairs Synthesis Framework for Robust Text-based Person Retrieval

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes the TIPS framework, which uses two diffusion generators (SPG/IDPG) together with an MLLM to fully automate the synthesis of high-fidelity, diverse text–pedestrian image pairs. At inference time, TTA fuses the visual features of synthesized preview images with text features, delivering steady gains without modifying the model. The method shows improvements on CUHK-PEDES, ICFG-PEDES, and RSTPReid under zero/few-shot and cross-domain settings.

Strengths:
1. Provides a fully automated, scalable data-generation pipeline, from prompt generation to synthesis, data filtering, and automatic description, capable of batch-producing high-quality text–pedestrian image pairs.
2. Achieves significant gains in zero/few-shot settings with strong sample efficiency.
3. Requires no network modifications: at inference, fusing "preview image" features with text features enhances consistency and boosts performance.

Weaknesses:
1. The pipeline is relatively complex overall, relying on LLMs/MLLMs and generators, which raises implementation complexity.
2. Each scenario requires training the generators first; expanding to 400k pairs incurs substantial computation and time costs.
3. A preview image must be generated at inference, adding 2.75s per query; latency increases markedly for methods without reranking.
4. SPG may produce appearance/identity inconsistencies across runs under the same prompt, so it depends on IDPG and MLLM filtering; failures in these stages can degrade quality.
5. The TTA fusion coefficient α requires empirical tuning and may need to be adjusted across methods/datasets (see the fusion sketch after this review).
6. Data scoring and description are produced by an MLLM, so stylistic biases or preferences may be injected into the synthetic data, affecting downstream distributions.

Questions:
1. If identity drift occurs, what is its frequency, and what proportion of cases is corrected by the IDPG + MLLM filtering?
2. Is the per-scenario training time cost excessively high?
3. Please quantify the stylistic diversity and similarity of the generated texts, compare results across different LLMs/templates, and evaluate whether MLLM-produced descriptions introduce stylistic bias that affects downstream retrieval.
4. In scenarios with available annotations, SPG can be trained directly on target-domain image–text pairs. Please explain how you prevent leakage or overlap with the test distribution/identities.

If these concerns are addressed, I will raise my score.

EditLens Prediction: Lightly AI-edited
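The review describes TTA as fusing the embedding of a synthesized preview image with the query's text embedding, weighted by a coefficient α. The paper's exact formulation is not reproduced in the review, so the following is only a minimal sketch assuming a convex combination of L2-normalized embeddings; `text_encoder`, `image_encoder`, and `generate_preview` are placeholders for whatever retrieval model and generator are actually used.

```python
# Minimal sketch of the TTA-style query construction described in the review:
# a convex combination (weight alpha) of the text embedding and the embedding
# of a preview image synthesized from the same query. The exact fusion rule in
# the paper is not specified in the review; this assumes a simple linear blend.
import torch
import torch.nn.functional as F

@torch.no_grad()
def fused_query_embedding(query_text, text_encoder, image_encoder,
                          generate_preview, alpha=0.8):
    f_text = F.normalize(text_encoder(query_text), dim=-1)    # (1, d) text feature
    preview = generate_preview(query_text)                    # synthesized preview image
    f_img = F.normalize(image_encoder(preview), dim=-1)       # (1, d) preview feature
    fused = alpha * f_text + (1.0 - alpha) * f_img            # convex combination
    return F.normalize(fused, dim=-1)                         # re-normalize for cosine retrieval

# Retrieval then ranks gallery images by cosine similarity to the fused query:
# scores = fused_query_embedding(q, enc_t, enc_i, gen) @ gallery_feats.T
```

Because α blends two embeddings of different reliability, its best value plausibly depends on the preview generator's quality and on the retrieval backbone, which is consistent with the tuning concern raised in Weakness 5.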