Enabling Your Forensic Detector Know How Well It Performs on Distorted Samples
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Conventionally, fake image detectors assign a real/fake label. However, detecting fake images in the wild involves post-processing operations that modify the original signal and hence affect detection. The authors develop a method that also predicts the confidence of the detector, which is especially challenging since neural networks are not well-calibrated. To do this, they leverage image quality as a proxy for confidence. The change in image quality depends on the reference image, which is conventionally not available at test time, so the authors train a neural network to predict it. Experiments show improved calibration, and the authors also show how the confidence can be used to detect fake images more effectively.
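For concreteness, this is how I understand the intended downstream use of the predicted confidence (a minimal sketch with a simple abstention threshold; the function and names are mine, not the authors'):

```python
def classify_with_abstention(p_fake, confidence, tau=0.5):
    """Abstain when the predicted confidence is too low; otherwise threshold
    the detector's fake probability as usual."""
    if confidence < tau:  # the detector is expected to be unreliable on this sample
        return "abstain"
    return "fake" if p_fake >= 0.5 else "real"

# hypothetical usage: p_fake from the detector, confidence from the proposed model
print(classify_with_abstention(p_fake=0.92, confidence=0.81))  # -> fake
print(classify_with_abstention(p_fake=0.92, confidence=0.30))  # -> abstain
```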
1. The problem of uncertainty estimation in fake image detection is both interesting and practically relevant. It is also an understudied problem.
2. The focus of the study on distortions also makes the paper practically relevant.
3. Multi-detector routing and confidence-based filtering are good use cases for the method.
1. The experiment reported in Section 3 would benefit from more details, e.g., what data is used and how strongly each post-processing operation is applied.
2. My main concern is that the method seems to account only for single distortions. However, distortions can be composed (e.g., resize first, then blur). It is currently unclear to me how the method would work in this setting.
3. The limitations should be discussed in more detail. The issues the current method has with data coming from different sources would be interesting and insightful to the community.
4. The plots have extremely small text and can be hard to follow; the text should be considerably larger than it currently is, especially in the appendix plots and Fig. 2.
Minor Weaknesses
1. Lines 42-43: This statement is not correct. Many detectors use common post-processing operations as part of their training [1, 2].
References:
1. Wang, S. Y., Wang, O., Zhang, R., Owens, A., & Efros, A. A. (2020). CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8695-8704).
2. Gragnaniello, D., Cozzolino, D., Marra, F., Poggi, G., & Verdoliva, L. (2021). Are GAN generated images easy to detect? A critical analysis of the state-of-the-art. arXiv preprint arXiv:2104.02617.
1. Does the method currently account for multiple (composed) distortions? If not, can it be made to handle this case?
2. For training, why does Equation 7 use an MSE loss rather than a binary cross-entropy loss?
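For reference, these are the two losses I am contrasting, with \(\hat{y}\) the predicted score and \(y \in [0,1]\) the target (notation mine, not necessarily the paper's):

```latex
\mathcal{L}_{\mathrm{MSE}}(\hat{y}, y) = (\hat{y} - y)^2
\qquad
\mathcal{L}_{\mathrm{BCE}}(\hat{y}, y) = -\bigl[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\bigr]
```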
Fully human-written

Enabling Your Forensic Detector Know How Well It Performs on Distorted Samples
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes DACOM, a neural regressor that predicts the probability that a given forensic detector will correctly classify a (possibly distorted) image. The key insight is that FR-IQA scores correlate monotonically with detector accuracy conditioned on distortion type. Training labels are obtained by bucketing FR-IQA scores per distortion type and mapping bin-wise balanced accuracies to [0, 1]. At inference, DACOM uses detector features, NR-IQA features, and a distortion-type embedding to estimate sample-level confidence without references. The score enables "selective abstention" and "top-1 routing" among detectors and improves overall accuracy on several benchmarks.
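For concreteness, a minimal sketch of how I read the label-construction step (bin FR-IQA scores per distortion type, compute bin-wise balanced accuracy of the target detector, map it to [0, 1]); function and variable names are mine and the details are assumptions, not the authors' exact procedure:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def detectability_labels(fr_iqa, y_true, y_pred, n_bins=10):
    """fr_iqa: FR-IQA scores of distorted samples for ONE distortion type (NumPy array);
    y_true / y_pred: ground-truth and detector labels for the same samples.
    Returns a per-sample detectability target in [0, 1]."""
    edges = np.quantile(fr_iqa, np.linspace(0.0, 1.0, n_bins + 1))
    bin_idx = np.clip(np.digitize(fr_iqa, edges[1:-1]), 0, n_bins - 1)
    targets = np.zeros_like(fr_iqa, dtype=float)
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            bacc = balanced_accuracy_score(y_true[mask], y_pred[mask])
            # one monotone map from balanced accuracy to [0, 1];
            # the paper's exact mapping may differ
            targets[mask] = 2.0 * abs(bacc - 0.5)
    return targets
```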
1. The paper introduces a novel framework that tackles reliability estimation, which is orthogonal and complementary to existing works.
2. It provides a principled, detector-conditioned confidence definition and a practical solution that avoids requiring reference images at test time.
3. Extensive experimental validation on diverse detectors and a broad spectrum of distortions is conducted. Ablations clearly show each component’s contribution.
4. The paper is well written and easy to follow.
1. Evaluations rely mainly on Balanced Accuracy and EER. In practice, real and fake samples are imbalanced. Would precision-recall-based measures (e.g., AUC-PR, F1) change the conclusions? (See the short sketch after this list.)
2. The inference pipeline runs QualiCLIP, ARNIQA, and the detector for every input, which may be costly on edge devices. A detailed analysis of timing/FLOPs is missing.
3. While abstention can improve safety, it also offloads decisions to human operators and potentially offers adversaries a mechanism to trigger systematic abstention. The paper lacks discussion of these risks and possible mitigations.
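To make weakness 1 concrete, this is the kind of comparison I have in mind, using standard scikit-learn metrics on an intentionally imbalanced split (the data below is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, average_precision_score

rng = np.random.default_rng(0)
# illustrative imbalanced test set: ~95% real (0), ~5% fake (1)
y_true = (rng.random(10_000) < 0.05).astype(int)
scores = np.where(y_true == 1, rng.beta(5, 2, 10_000), rng.beta(2, 5, 10_000))
y_pred = (scores >= 0.5).astype(int)

print("Balanced Acc:", balanced_accuracy_score(y_true, y_pred))
print("F1 (fake = positive):", f1_score(y_true, y_pred))
print("AUC-PR:", average_precision_score(y_true, scores))
```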
1. How would the results change if Balanced Accuracy is replaced with AUC-PR or F1 measures?
2. Please report the inference overhead of DACOM and compare it to baselines.
3. Please add more discussion of the ethical aspects.
4. Minor issue: the text in Fig. 2 and Fig. 6 is tiny; please enlarge it.
Lightly AI-edited

Enabling Your Forensic Detector Know How Well It Performs on Distorted Samples
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper addresses the problem that forensic detectors for AI-generated images produce predictions without indicating reliability when test images undergo various distortions such as compression and noise. The authors propose DACOM (Distortion-Aware Confidence Model), which uses full-reference image quality assessment metrics to label training data with detectability scores during training, then learns to predict sample-level confidence using detector features, no-reference image quality descriptors, and distortion-type information at inference time.
1. The paper's motivation and perspective are well-founded and novel. Few existing works approach AI-generated image detection from the lens of image quality assessment to evaluate detector reliability under distortions. This angle provides a fresh and meaningful contribution to the forensics community.
2. The paper is well-organized with clear logical flow. The authors systematically progress from problem identification (Section 3 analysis) to method design (Section 4), making the work easy to follow. The empirical analysis establishing the correlation between FR-IQA scores and detection accuracy provides solid justification for the proposed approach.
3. The experimental evaluation is comprehensive and thorough. The authors provide extensive ablation studies (Section 5.5), test on multiple distortion types (both seen and unseen), evaluate across different datasets (Evaluation-dataset and Cross-dataset), and include detailed supplementary results in the appendix, all of which substantiate their claims with adequate evidence.
1. Concerns regarding the use of detector performance for labeling in Stage A. The authors use the detector's balanced accuracy on each bin to generate detectability labels (Equation 3-4). This raises several concerns: (a) If FR-IQA scores already exhibit monotonic correlation with detection performance (as shown in Section 3), why is the additional step of computing detector accuracy necessary? Could the FR-IQA scores themselves serve as supervision? (b) More critically, this design may limit generalizability—if training data is labeled using Detector A's performance, will DACOM trained on this data generalize well to Detector B? This detector-specific labeling could hinder practical deployment across different detection models. (c) The requirement to evaluate detector performance on large distorted datasets during Stage A significantly increases the computational cost of the training pipeline.
2. Limited discussion on robustness training and its interaction with the observed monotonicity. The authors address distortion robustness by applying light data augmentation (10% JPEG compression and blur) during detector training. However: (a) Only two distortion types are used for augmentation, which seems insufficient given the diversity of real-world distortions. (b) It remains unclear whether the monotonic relationship between FR-IQA and detection accuracy (Section 3) still holds when detectors are trained with more aggressive data augmentation strategies. If extensive augmentation flattens the performance curve across distortion levels, would DACOM's premise still be valid? This interaction between robustness training and the proposed method deserves further investigation.
3. Limited training data scale may restrict generalization. The model is trained on only 2,500 base images from a single dataset (ProGAN subset). While distortion augmentation expands this to 200K+ samples, the underlying content diversity remains limited. This could lead to: (a) Overfitting to the specific visual patterns in these 2,500 images. (b) Poor generalization to different content types, as evidenced by the noticeable performance drop in Cross-dataset evaluation (Table 4). Experimenting with larger and more diverse training sets would strengthen the claims about generalizability.
4. The proposed applications have practical limitations that merit further exploration. While the paper demonstrates two uses of DACOM—selective filtering and multi-detector routing—both have notable drawbacks: (a) Selective abstention necessarily reduces coverage, which may be unacceptable in applications like content moderation where all samples must be processed. (b) Multi-detector routing requires maintaining and running multiple detectors (6× computational cost in experiments), which may be prohibitively expensive for real-time systems. I suggest the authors explore alternative applications that leverage DACOM more seamlessly, such as incorporating confidence-aware calibration directly into a single detector's training or inference process, or using confidence scores to dynamically adjust decision thresholds rather than completely abstaining from prediction.
Overall assessment: despite these concerns, I view this as a valuable contribution that introduces a novel perspective on detector reliability. If the authors can adequately address the concerns above, I would be inclined to raise my score.
1. The formula y = 2×|BAcc - 0.5| assumes equal difficulty in improving accuracy across the entire range (e.g., 50%→75% vs. 75%→100%). However, achieving near-perfect accuracy is typically much harder than reaching moderate levels, suggesting a non-linear relationship.
Have you experimented with non-linear transformations (e.g., y = (2×|BAcc - 0.5|)^α with α > 1, or logarithmic scaling) that better reflect the diminishing returns at higher accuracy? Alternatively, why not use BAcc directly as the label without any transformation? An ablation comparing different label functions would clarify whether the linear design is optimal or just a convenient choice (a small numeric sketch of these alternatives follows the questions below).
2. You train four DACOM variants with different FR-IQA metrics (Table 6) and all perform similarly. Does this mean the choice of FR-IQA is not critical? If so, why present four variants instead of selecting one? What guidance do you offer to practitioners on choosing the FR-IQA metric?
3. What is the parameter count and FLOPs of DACOM? Since it must run alongside the detector, efficiency matters. How does DACOM's overhead compare to the detector itself (e.g., DACOM adds X% latency)?
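To illustrate question 1, a small numeric comparison of candidate label functions (α = 2 and the specific alternatives are my suggestions, not the paper's):

```python
import numpy as np

bacc = np.array([0.55, 0.65, 0.75, 0.85, 0.95, 1.00])

linear = 2 * np.abs(bacc - 0.5)          # the paper's mapping, y = 2|BAcc - 0.5|
power  = (2 * np.abs(bacc - 0.5)) ** 2   # α = 2: grows fastest near perfect accuracy
direct = bacc                            # no transformation at all

for b, l, p, d in zip(bacc, linear, power, direct):
    print(f"BAcc={b:.2f}  linear={l:.2f}  power(α=2)={p:.2f}  direct={d:.2f}")
```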
Fully AI-generated

Enabling Your Forensic Detector Know How Well It Performs on Distorted Samples
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The authors propose to train a predictor of calibration-type confidence, in the sense of the probability of the prediction being wrong. They do this specifically for GAN-generated image detection.
To do this, they take detector features, features from a predictor of distortion type, and features from a no-reference image-quality predictor, and feed them into an MLP to predict the confidence via regression.
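A minimal sketch of the architecture as I understand it (feature dimensions and layer sizes are placeholders I invented for illustration):

```python
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Regress a confidence score from concatenated detector, NR-IQA and
    distortion-type features (all dimensions here are placeholders)."""
    def __init__(self, d_det=768, d_iqa=512, d_dist=64, d_hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_det + d_iqa + d_dist, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
            nn.Sigmoid(),  # confidence in [0, 1]
        )

    def forward(self, f_det, f_iqa, f_dist):
        return self.mlp(torch.cat([f_det, f_iqa, f_dist], dim=-1)).squeeze(-1)
```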
They perform experiments on the correlation between the predicted and the true calibration, and they show the usefulness of the predictor for top-1 routing among GAN-generated-image detectors and for confidence-based vote abstention, evaluated by a ranking measure.
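And the top-1 routing / abstention usage, roughly (the threshold and names are mine):

```python
import numpy as np

def route_top1(confidences, predictions, abstain_below=0.3):
    """confidences, predictions: one entry per candidate detector for a single image.
    Pick the detector with the highest predicted confidence; optionally abstain."""
    best = int(np.argmax(confidences))
    if confidences[best] < abstain_below:
        return "abstain", best
    return predictions[best], best

label, chosen = route_top1(np.array([0.42, 0.88, 0.17]), ["real", "fake", "real"])
print(label, chosen)  # -> fake 1
```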
- They perform experiments beyond just training the calibrator.
- It shows its usability for top-1 routing and for vote abstention / low-confidence flagging.
- If one wanted to hide the fact that an image was generated by a deep learning model, image distortions are a natural candidate for obfuscation, so the setting makes sense.
- Clear idea
- Very readable paper
- The novelty is not the greatest; it is not what one would expect to be presented as an oral.
- The dataset used for the main experiments, ProGAN, is from the pre-diffusion-model era.
It would be better to also see the distortion results on diffusion-model datasets. The authors do this for the cross-evaluation in Section 5.4, but it would be good to have it for Section 5.3 and for the confidence evaluation as well.
none
Fully human-written |