ICLR 2026 - Reviews



Summary Statistics

EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars)
--- | --- | --- | --- | ---
Fully AI-generated | 0 (0%) | N/A | N/A | N/A
Heavily AI-edited | 0 (0%) | N/A | N/A | N/A
Moderately AI-edited | 0 (0%) | N/A | N/A | N/A
Lightly AI-edited | 1 (25%) | 4.00 | 4.00 | 1676
Fully human-written | 3 (75%) | 3.33 | 4.00 | 1790
Total | 4 (100%) | 3.50 | 4.00 | 1762
Reviews
Title: Chimera: Compositional Image Generation using Part-based Concepting
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain.

Summary: This paper proposes Chimera, a framework for compositional image generation based on parts. The authors propose a technique based on learning an IP-Adapter for each part, whose outputs are then composed to obtain the required image. The authors use the HiDream-I1 model to generate a dataset for training.

Strengths:
- The paper is clearly written and understandable.
- The problem is interesting to solve.

Weaknesses:
- Results are unnatural and contain artifacts: in Fig. 1, the results in rows 1, 2, and 4 look quite unnatural; further, there are artifacts in the head region of row 1.
- Novelty: I find the proposed approach to be very similar to the approach in Piece-It-Together, so the novelty of the proposed method is unclear.
- Soundness: the dataset used for learning itself contains artifacts (Fig. 1, showing the aeroplane and sunflower), so it is unclear to me how it can be used to train a model for the part-composition task with such noisy supervision.
- The performance of OmniGen2 and Chimera is very similar, so the exact advantage of the method needs to be clarified.
- The identity of the generated parts does not match the input images provided (Fig. 6). Could the authors provide a reason for that?
- Some recent works, such as [21, 11, 15], are not considered as baselines for comparison. Could the authors provide reasoning for that?

EditLens Prediction: Fully human-written
Title: Chimera: Compositional Image Generation using Part-based Concepting
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain.

Summary: The paper proposes Chimera, a personalized image generation model that creates novel hybrid objects by combining parts from different input images. The main contribution of this paper is a part taxonomy containing 464 unique <part, subject> semantic atoms, which enables the generation of a large amount of part-combination data. However, the paper lacks innovation in aspects such as model structure and experimental setup.

Strengths:
- The proposed <part, subject> taxonomy covers 6 semantic domains and 464 semantic atoms, avoiding the limitations of baselines (e.g., PartCraft is restricted to birds/dogs, and PiT focuses on toy creatures).
- The paper proposes a new metric that leverages MLLMs to evaluate part-influenced generation.

Weaknesses:
- The second innovation claimed in the paper, "the model does not require masks," is meaningless: the authors rely on Grounded SAM for image segmentation, which offers no novelty whatsoever.
- The paper provides insufficiently clear descriptions of the training details.
- Qualitatively, the consistency between the parts of the generated images and the input parts is not good.

Questions:
- During model training, are the input part images segmented from ground-truth images using Grounded SAM? If so, how is the model's generalization ability ensured?
- Qualitatively, the generated images show very weak adherence to the input parts and text prompts in the 4-part composition setting. This is particularly evident in the numerous cases presented in Figure 11, where the generated content barely captures the input features. A further explanation of this issue is expected.

EditLens Prediction: Lightly AI-edited
Title: Chimera: Compositional Image Generation using Part-based Concepting
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain.

Summary: The paper introduces Chimera, a pipeline for compositional image generation that aims to combine specified parts from different source images based on text prompts. The core methodology relies on three key components: (1) SAMv2 for identifying parts, (2) a Diffusion Prior (IP-Prior) trained on a large, synthetic corpus of hybrid concepts, and (3) a decoder. The authors also propose a metric, PartEval, which uses Gemini-Flash to evaluate the semantic compositional quality of the generated images.

Strengths:
1. By intelligently integrating a strong segmentation model (SAMv2), the approach eliminates the primary usability bottleneck of previous part-based generation models.
2. PartEval provides a novel metric that is essential for measuring the success of compositional tasks. This moves the field beyond general image-quality metrics (like FID), which are insufficient for judging correct part blending.
3. It is interesting to see how the proposed approach successfully focuses the difficult learning task (compositional blending) onto a small, dedicated model (the IP-Prior) while leveraging the decoding power of a much larger, pre-trained model.

Weaknesses:
1. An important step in validating the PartEval metric is missing. The paper does not report running PartEval on the HiDream-I1 ground-truth images used for training. This omits the essential check required to verify that the Gemini pipeline assigns a perfect score to the ideal compositional result (which is used as ground truth during training).
2. The SAMv2 masks play a critical role in the practical utility of the proposed method. The authors should perform an analysis demonstrating the IP-Prior's robustness to segmentation errors.
3. The training structure suggests the model may be relying on memorized structural templates (slots for head, body, wheels, etc.) derived from the limited taxonomy, rather than learning general blending logic. The authors should discuss its ability to extrapolate to novel or conflicting spatial configurations in order to demonstrate the robustness of the approach.
4. In the Fig. 6 qualitative results, some results are not good: e.g., in the first column, the generated image seems to be influenced by the zebra's body (see the thicker black tail) and not just the zebra's head. I would appreciate it if the authors could provide an explanation for this.

Questions:
1. How reliable are the HiDream-I1 generations? For the examples shown in Figure 3, for the prompt mentioning a plant with the petals of a rose and the leaves of a sunflower, the generated image also has the yellow petals of the sunflower, which may not be desirable.
2. Is the IP-Prior sensitive to perturbations or minor errors in the SAMv2 masks?
3. I would appreciate some intuition for how the IP-Prior would resolve potential spatial conflicts when multiple input parts are used in the composition. For instance, if the input had a cow and a horse and the user wants both of them to play a role in how the head of the compositionally generated object looks, would the model perform some blending, or pick one of the two?

EditLens Prediction: Fully human-written
Title: Chimera: Compositional Image Generation using Part-based Concepting
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain.

Summary: This paper proposes a personalized image generation model that generates new objects by combining specified parts from different source images according to textual instructions. A training dataset is built from <part, subject> pairs, and a metric pipeline, PartEval, is introduced to assess fidelity and compositional accuracy.

Strengths:
- The task of compositional generation of new objects, combining specified parts from different source images according to textual instructions, is investigated.
- Experiments are reported and analyzed.

Weaknesses:
- The qualitative result comparisons are insufficient.
- Failure cases are not provided.
- The number of compositional objects is small, restricting generalization to complex, real-world scenes with diverse element combinations.
- The code is not provided, hindering reproducibility and further validation of the proposed approach.

Questions: See above.

EditLens Prediction: Fully human-written