ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (33%) | 6.00 | 3.00 | 3665 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (33%) | 4.00 | 4.00 | 2372 |
| Lightly AI-edited | 1 (33%) | 4.00 | 3.00 | 2362 |
| Fully human-written | 0 (0%) | N/A | N/A | N/A |
| Total | 3 (100%) | 4.67 | 3.33 | 2800 |
Boost the Identity-Preserving Embedding for Consistent Text-to-Image Generation

Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The authors propose BIPE (Boost Identity-Preserving Embedding), a training-free method for consistent text-to-image generation. The approach focuses on identity-preserving embeddings (IPemb) and introduces two techniques: Adaptive Singular-Value Rescaling (adaSVR) and Union Key (UniK). AdaSVR applies singular value decomposition to amplify identity-related components. UniK enhances consistency by concatenating cross-attention keys from all frame prompts. BIPE uses SDXL as the base model and is evaluated on ConsiStory+ and a newly proposed DiverStory benchmark.

Strengths:
1. The paper analyzes and visualizes the relationship between identity-preserving embeddings and attention mechanisms focused on the subject.
2. The authors design the DiverStory benchmark, which employs varied natural-language prompt formulations rather than relying on a single fixed template as in ConsiStory+.
3. The paper provides numerous visual examples to illustrate results.

Weaknesses:
1. The motivation is not entirely clear. The authors claim that previous works overlook the fact that identity-relevant embedding components are already implicitly encoded within the aggregated textual embeddings of a full frame-prompt sequence. However, this limitation does not seem particularly significant, nor is it obvious that it would strongly affect results.
2. The description of the method is difficult to follow and not clearly structured. It required substantial time and effort to understand the novelty of BIPE and how it differs from 1Prompt1Story. Nevertheless, the proposed approach appears quite similar to 1Prompt1Story. For instance, the UniK component in BIPE seems analogous to Prompt Consolidation (PCon) in 1Prompt1Story, as both combine all prompts. Likewise, adaSVR in BIPE appears to resemble Singular-Value Reweighting (SVR) in 1Prompt1Story. The primary difference seems to lie in the explicit use of IPemb in BIPE. However, the distinction between explicit use in adaSVR and implicit use in SVR is not clearly explained.
3. The paper contains several typos. For example, $\bar{V}_i$ should be $\tilde{V}_i$ in Eq. (5). In Table 1, “Train-Free” should be “Train”, or “√” and “×” should be interchanged.

Questions:
Since BIPE is applied to video generation in the experiments, would it be more accurate to use the term “consistent visual generation” rather than “consistent text-to-image generation”?

EditLens Prediction: Lightly AI-edited
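To make the adaSVR mechanism summarized in this review more concrete, the following is a minimal, hypothetical sketch of SVD-based rescaling applied to a block of prompt-token embeddings. The hard top-k boost, the boost factor, and the energy-matching step are illustrative placeholders and are not taken from the paper, which the reviews describe as using an adaptive, temperature-weighted rescaling (Eq. (3)) rather than a fixed cutoff.

```python
import torch

def svd_rescale(token_emb: torch.Tensor, boost: float = 2.0, top_k: int = 2) -> torch.Tensor:
    """Amplify the leading singular directions of a block of token embeddings.

    token_emb: (num_tokens, dim) embeddings of the selected subject/[EoT] tokens.
    boost:     factor applied to the top-k singular values (placeholder schedule).
    top_k:     number of leading directions treated as identity-preserving.
    """
    U, S, Vh = torch.linalg.svd(token_emb, full_matrices=False)
    S_boosted = S.clone()
    S_boosted[:top_k] *= boost
    # Energy-matched renormalization (assumed form): keep the total spectral
    # energy unchanged so the rescaled embeddings stay on a comparable scale.
    S_boosted *= S.norm() / S_boosted.norm()
    return U @ torch.diag(S_boosted) @ Vh

# Example: rescale 8 subject-token embeddings of an SDXL-like width.
emb = torch.randn(8, 2048)
emb_boosted = svd_rescale(emb)
print(emb.norm().item(), emb_boosted.norm().item())  # overall energy roughly preserved
```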
Boost the Identity-Preserving Embedding for Consistent Text-to-Image Generation

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper tackles identity preservation in text-to-image (T2I) diffusion. It observes that a cross-frame, identity-bearing direction exists in the text-encoder embeddings. Building on this, the authors propose BIPE, a training-free, plug-and-play framework with two parts: Adaptive Singular-Value Rescaling (adaSVR) and Union Key (UniK). The paper also proposes DiverStory, a benchmark using varied natural-language prompts (not a single template), and reports gains on ConsiStory+ and DiverStory with moderate runtime/memory overhead.

Strengths:
- Operating purely on text embeddings makes BIPE easy to attach to SDXL-like pipelines; the paper also shows a video case (Wan 2.2).
- The IPemb observation (leading singular directions capture identity) is plausible and supported by attention-map probes.
- On ConsiStory+, BIPE achieves the best CLIP-T and VQA scores, with identity metrics close to the best and better efficiency than training-heavy baselines; ablations indicate complementary roles for adaSVR and UniK.
- DiverStory highlights robustness to varied natural-language prompts, a realistic setting that is often under-tested.

Weaknesses:
- Evidence suggests BIPE helps on both template-based and diverse prompts, but several core claims and implementation details are insufficiently justified (see “Questions”).
- The empirical methodology is mostly standard, but the ablations do not fully isolate design choices (e.g., sensitivity to the weighting temperature, the role of per-layer SVD).

Questions:
1. Is any finetuning performed anywhere (text encoder/adapters)? If truly training-free, please correct the Table-1 flag; if not, specify what is trained and where.
2. Do UniK keys/values come from adaSVR-enhanced embeddings (as in the main text) or the original embeddings (as suggested in the appendix)? Please standardize and report the performance delta between the two setups.
3. Provide per-layer SVD dimensions and a wall-clock/VRAM profile that separates the costs of adaSVR vs. UniK, and how these scale with the number of frames (N) and the subject token count.
4. Include sweeps for the temperature $\tau$ in Eq. (3) and the number of frames (N); additionally, report robustness to token selection ([EoT] vs. subject tokens) and layer-wise on/off.
5. Are there any human studies on identity consistency under Diverse Prompts? What is the release timeline/spec for DiverStory to enable community verification?

EditLens Prediction: Moderately AI-edited
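The UniK component, as described across these reviews, gathers cross-attention keys from every frame prompt while keeping values tied to the current frame. The sketch below is one hypothetical reading of that design: the `to_k`/`to_v` projections stand in for a diffusion U-Net's cross-attention layers, and the log-bias down-weighting of borrowed keys is an assumption about how the 1/N weighting mentioned by the third reviewer might be realized, not the paper's implementation.

```python
import math
import torch
import torch.nn.functional as F

def unik_cross_attention(q, txt_embs, frame_idx, to_k, to_v, extra_weight=None):
    """Cross-attention with keys gathered from ALL frame prompts but values
    taken only from the current frame's prompt (one reading of UniK).

    Assumes all prompts are padded to the same token length, as with CLIP's
    fixed 77-token context.

    q:            (L_img, d) image-token queries for the current frame.
    txt_embs:     list of N tensors (L_txt, d_txt), one per frame prompt.
    frame_idx:    index of the frame currently being denoised.
    to_k, to_v:   the usual cross-attention key/value projections.
    extra_weight: weight on keys borrowed from other frames; defaults to 1/N
                  (the exact weighting used in the paper is not reproduced here).
    """
    n = len(txt_embs)
    w = 1.0 / n if extra_weight is None else extra_weight
    d = q.shape[-1]

    keys = torch.cat([to_k(emb) for emb in txt_embs], dim=0)   # (N*L_txt, d)
    # Values are tiled from the current frame only, so content from other
    # frames is not injected directly; only the attention pattern is shared.
    v_own = to_v(txt_embs[frame_idx])                          # (L_txt, d)
    values = v_own.repeat(n, 1)                                # (N*L_txt, d)

    logits = q @ keys.T / math.sqrt(d)                         # (L_img, N*L_txt)
    # Adding log(w) to the logits of borrowed keys multiplies their softmax
    # mass by w -- an assumed realization of the 1/N down-weighting.
    bias = torch.full((n, v_own.shape[0]), math.log(w))
    bias[frame_idx] = 0.0
    logits = logits + bias.reshape(-1)

    return F.softmax(logits, dim=-1) @ values
```

With `torch.nn.Linear` layers standing in for `to_k`/`to_v` and random tensors for the queries and prompt embeddings, the function returns a `(L_img, d)` context for the current frame while sharing the attention pattern across prompts.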
Boost the Identity-Preserving Embedding for Consistent Text-to-Image Generation

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes BIPE, a training-free, plug-and-play framework that improves subject identity consistency in multi-image text-to-image generation by operating purely on text embeddings. BIPE has two components: Adaptive Singular-Value Rescaling (adaSVR), which spectrally amplifies identity-preserving directions in the subject and [EoT] token embeddings across every layer of the text encoder, and Union Key (UniK), which concatenates cross-attention keys from all prompts while using per-frame values to align attention without leaking full values across frames. Experiments on ConsiStory+ and a new Diverse Prompts benchmark, DiverStory, show strong text alignment and competitive identity consistency with low memory and runtime overhead, and the authors also illustrate integration into Wan 2.2 for cross-video consistency.

Strengths:
Originality is solid: rather than new networks or retraining, the work identifies and boosts an intrinsic identity-preserving component in text embeddings and enforces consistency via key-sharing in cross-attention, which is simple and broadly applicable. Quality is supported by clear math for adaSVR with energy-matched normalization, principled token selection for subject and padding tokens, and a practical 1/N weighting of extra key-value pairs to control dominance and cost. Clarity is generally high, with an end-to-end pipeline and ablations that isolate the contributions of adaSVR vs. UniK. Significance is promising since BIPE is architecture-agnostic, requires no additional data or training, and achieves strong alignment and competitive identity metrics with near-base latency, while DiverStory broadens evaluation beyond templated prompts.

Weaknesses:
The paper claims BIPE is training-free, yet Table 1 marks BIPE as not training-free on both ConsiStory+ and DiverStory, which conflicts with the text and should be corrected or explained. The evaluation emphasizes SDXL as the default backbone and shows case studies with Wan 2.2, but broader quantitative tests on additional backbones would better support the architecture-agnostic claim. Identity consistency is mostly measured by CLIP-I and DreamSim with background removal; a small human study or per-attribute identity analysis would strengthen conclusions on visual identity. Finally, while the method uses only a subset of tokens in UniK to cap compute, sensitivity to the number and type of shared tokens, and scaling with the number of frames N, are not systematically profiled.

Questions:
a) Please reconcile the training-free claim with Table 1, which currently lists BIPE as not training-free. If this is a typesetting error, clarify and update; if not, explain what part of BIPE requires training.
b) How does BIPE scale in runtime and memory with the number of frames and with the count of shared keys in UniK? A plot of latency and VRAM vs. N and vs. the number of shared subject/[EoT] tokens would help practitioners.
c) Beyond SDXL and the Wan 2.2 illustration, can you report quantitative results on at least one non-CLIP text encoder or a DiT-based T2I backbone to substantiate the architecture-agnostic claim?
d) Could you add sensitivity studies for adaSVR’s temperature and the decision to include [EoT] alongside subject tokens, plus an ablation on the 1/N weighting strategy?
e) DiverStory is valuable; can you provide statistics on prompt diversity and subject types, along with plans and licensing for release, so the community can reproduce and extend your results?
f) The limitations note that BIPE does not accept external identity references; can you outline how BIPE would integrate with reference-image encoders or identity embeddings while retaining the training-free property?

EditLens Prediction: Fully AI-generated
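This review also mentions "principled token selection for subject and padding tokens" as a prerequisite for adaSVR. As a hypothetical illustration of that selection step (not the paper's code), the indices of the subject tokens and the [EoT] token can be located in a CLIP tokenization roughly as follows; the naive subsequence match and the specific tokenizer checkpoint are assumptions made for the example.

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def find_token_indices(prompt: str, subject: str):
    """Return the positions of the subject tokens and the [EoT] token.

    The subsequence search below is deliberately naive and only illustrates
    the idea of restricting the rescaling to identity-bearing tokens.
    """
    ids = tokenizer(prompt, padding="max_length", max_length=77,
                    truncation=True).input_ids
    subj_ids = tokenizer(subject, add_special_tokens=False).input_ids

    subject_pos = []
    for start in range(len(ids) - len(subj_ids) + 1):
        if ids[start:start + len(subj_ids)] == subj_ids:
            subject_pos = list(range(start, start + len(subj_ids)))
            break

    eot_pos = ids.index(tokenizer.eos_token_id)  # first end-of-text token
    return subject_pos, eot_pos

print(find_token_indices("a watercolor painting of a corgi surfing at sunset", "corgi"))
```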