ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (50%) | 3.00 | 4.50 | 4824 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 6.00 | 3.00 | 1293 |
| Fully human-written | 1 (25%) | 0.00 | 5.00 | 1216 |
| Total | 4 (100%) | 3.00 | 4.25 | 3040 |
Review 1

Title: Blur to Focus Attention in Fine-Grained Visual Recognition
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes DEFOCA, a stochastic defocus layer that applies Gaussian blur to image patches to encourage fine-grained visual recognition (FGVR) models to focus on discriminative regions. The method is theoretically grounded, with analyses of label-safety, representation drift, and generalization, and can be easily integrated into existing architectures. Experiments show competitive results across multiple FGVR and ultra-fine-grained datasets.

Strengths:
1. Simple, lightweight, and architecture-agnostic design requiring no additional supervision.
2. Solid theoretical analysis connecting stochastic blurring with representation stability and generalization.
3. Comprehensive experimental evaluation with clear visualization of attention and feature clustering.
4. Good writing and clear motivation for tackling the challenges of fine-grained visual recognition.

Weaknesses:
1. Performance is weak compared to the state of the art. As shown in Table 1, the mean accuracy of DEFOCA-Tiny ViT (91.9%) is significantly below the current best (93.5%) despite a similar or larger model scale.
2. Marginal gain in parameter efficiency. Even accounting for model size, ACNet-R50 (25M parameters, published five years ago) achieves 91.7%, only 0.2% lower, suggesting the improvement is not statistically significant at this scale.
3. Unbalanced model comparison. The authors compare their Tiny ViT variant against larger baselines (e.g., ViT-Base) but do not report smaller configurations of those methods. A fairer comparison would use matched model sizes (e.g., ViT-Tiny or equivalent small-scale backbones).
4. The paper lacks a strong justification for why DEFOCA's regularization mechanism should yield better generalization than mild smoothing effects, especially when similar results could arise from existing data augmentation or attention regularizers.

Questions:
1. How statistically significant are the reported improvements across runs and datasets?
2. Can DEFOCA outperform simple augmentations like CutMix or DropBlock under controlled comparisons?
3. Have you evaluated smaller versions of other SOTA methods for a fair model-scale comparison?
4. How sensitive is DEFOCA to the number of stochastic views (V = 8) and the blur strength $\sigma$?

EditLens Prediction: Fully AI-generated
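To make the mechanism the reviews are evaluating concrete, here is a minimal sketch of a training-time, patch-wise stochastic blur layer as the reviews describe it: partition the image into a grid, Gaussian-blur a random subset of patches, and act as the identity at test time. This is an illustrative reconstruction, not the authors' code; the class name `StochasticPatchBlur` and all defaults (`grid`, `blur_ratio`, `kernel_size`, `sigma`) are assumptions.

```python
# Hypothetical sketch of the patch-wise stochastic blur described in the
# reviews; names, defaults, and structure are assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF


class StochasticPatchBlur(nn.Module):
    """Blur a random subset of grid patches during training; identity at eval."""

    def __init__(self, grid=4, blur_ratio=0.25, kernel_size=9, sigma=2.0):
        super().__init__()
        self.grid = grid              # P: image is split into a P x P grid
        self.blur_ratio = blur_ratio  # n/N: fraction of patches to blur
        self.kernel_size = kernel_size
        self.sigma = sigma

    def forward(self, x):             # x: (B, C, H, W); assumes H, W divisible by grid
        if not self.training:
            return x                  # removed at test time, per the paper
        b, c, h, w = x.shape
        ph, pw = h // self.grid, w // self.grid
        # blur the whole batch once, then copy in the selected patches
        blurred = TF.gaussian_blur(x, self.kernel_size, [self.sigma, self.sigma])
        n_patches = self.grid * self.grid
        n_blur = max(1, int(self.blur_ratio * n_patches))
        out = x.clone()
        for i in range(b):
            # independent random patch selection per image
            idx = torch.randperm(n_patches)[:n_blur]
            for p in idx.tolist():
                r, col = divmod(p, self.grid)
                ys, xs = r * ph, col * pw
                out[i, :, ys:ys + ph, xs:xs + pw] = blurred[i, :, ys:ys + ph, xs:xs + pw]
        return out
```

A contiguous variant, which the reviews report the paper prefers, would replace the `randperm` selection with a block of adjacent patch indices (see the label-safety sketch after Review 3).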
Review 2

Title: Blur to Focus Attention in Fine-Grained Visual Recognition
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 0
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
Submission 1220 proposes DEFOCA, a soft, attention-like mechanism for FGVR. It works by blurring randomly selected patches of the image during training, aiming to strengthen the learning of subtle discriminative features.

Strengths:
DEFOCA achieves fair results with a small parameter count relative to existing baseline methods.

Weaknesses:
1. FGVR is a relatively old topic that was already explored thoroughly during the CNN-dominated period before 2019. The authors need to justify the necessity and practical value of additional research investment in this area.
2. The references are generally out of date. Most are more than three years old, and some are more than ten.
3. The method is still highly similar to MIM. Merely changing masking to blurring introduces no essential difference.
4. It is hard to see a significant improvement in Tables 1 and 2. In Table 1, DEFOCA fails to achieve SOTA results, and even against a baseline with a similar parameter count (e.g., FIDO) there is no significant advantage. In Table 2, it is confusing that DEFOCA is marked in boldface in some SoyAging columns without achieving the best result.

Questions:
Please see the weaknesses part.

EditLens Prediction: Fully human-written
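Review 2's MIM analogy can be stated concretely: both schemes degrade a random subset of patches, and only the per-patch operation differs. The following is a hypothetical illustration of that single point of difference; `degrade_patch` is an invented helper, not code from either line of work.

```python
# Illustrative contrast between MIM-style masking and DEFOCA-style
# blurring: the selection machinery is shared, only this operation differs.
import torch
import torchvision.transforms.functional as TF


def degrade_patch(patch: torch.Tensor, mode: str = "blur") -> torch.Tensor:
    """Apply the per-patch degradation. `patch` is a (C, h, w) tensor."""
    if mode == "mask":   # MIM: erase the patch content entirely
        return torch.zeros_like(patch)
    if mode == "blur":   # DEFOCA: low-pass the patch, keeping coarse structure
        return TF.gaussian_blur(patch, kernel_size=9, sigma=[2.0, 2.0])
    raise ValueError(f"unknown mode: {mode}")
```

Whether retaining low-frequency content (blur) versus removing it entirely (mask) constitutes an essential difference is exactly what this reviewer disputes.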
Review 3

Title: Blur to Focus Attention in Fine-Grained Visual Recognition
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes DEFOCA, a lightweight “blur-to-focus” layer that partitions images into patches and stochastically applies Gaussian blur to selected patches during training, encouraging the network to rely on the unblurred, discriminative regions. The authors supply a theoretical analysis and extensive experiments on fine-grained and ultra-fine-grained datasets.

Strengths:
- DEFOCA is architecture-agnostic, applied on-the-fly after the usual augmentations, and removed at test time, which makes it easy to adopt.
- Results reported across standard FGVR datasets (CUB-200, Stanford Cars, NABirds, FGVC-Aircraft) and several ultra-fine-grained datasets show consistent gains.
- The paper gives a concise theoretical grounding (label-safety probability, Lemma 1 on expected representation drift, Proposition 1's SNR argument) that clarifies why contiguous patch selection is beneficial.

Weaknesses:
- Lemma 1 and the subsequent bounds rely on an L-Lipschitz assumption on the feature map and an ad hoc large constant M for the case where discriminative patches are altered. The analysis is conceptually fine, but it is not clear how realistic the Lipschitz assumption is for modern deep networks or how large M is in practice.

Questions:
- Could DEFOCA be applied to current SOTA baselines and yield better performance? Please refer to the weakness above.

EditLens Prediction: Lightly AI-edited
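Reviews 3 and 4 both touch on the label-safety probability, i.e., the chance that no blurred patch hits the hidden discriminative set S, and Review 4 (below) asks for an empirical check of the claim that contiguous selection maximizes it. A minimal Monte Carlo sketch of such a check might look like the following; P, n, and S are chosen purely for illustration, not taken from the paper.

```python
# Monte Carlo estimate of the "label-safety" probability discussed in the
# reviews: the chance that none of the n blurred patches in a P x P grid
# intersects a hypothetical discriminative set S.
import random


def p_safe(select, S, P=8, n=9, trials=100_000):
    return sum(1 for _ in range(trials) if not (set(select(P, n)) & S)) / trials


def dispersed_patches(P, n):
    # dispersed selection: n patches drawn uniformly without replacement
    return random.sample(range(P * P), n)


def contiguous_block(P, n):
    # contiguous selection: n assumed to be a perfect square k * k,
    # placed as one k x k block at a uniformly random grid position
    k = int(n ** 0.5)
    r0, c0 = random.randrange(P - k + 1), random.randrange(P - k + 1)
    return [(r0 + dr) * P + (c0 + dc) for dr in range(k) for dc in range(k)]


if __name__ == "__main__":
    # hypothetical 3 x 3 discriminative region (rows 2-4, cols 2-4)
    S = {r * 8 + c for r in range(2, 5) for c in range(2, 5)}
    print("dispersed :", p_safe(dispersed_patches, S))
    print("contiguous:", p_safe(contiguous_block, S))
```

Under this toy setup the contiguous layout leaves S untouched more often than the dispersed one, matching the reported preference, though Review 4 rightly notes that the conclusion depends on assumptions about S's size and geometry that the paper does not verify.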
Review 4

Title: Blur to Focus Attention in Fine-Grained Visual Recognition
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
The paper proposes a method called DEFOCA, a training-time, patch-wise Gaussian blur layer placed after standard augmentations. At each iteration it selects image patches (random, dispersed, or contiguous) and blurs them to create stochastic views, aiming to suppress background/high-frequency noise and nudge the network toward discriminative regions. The authors provide an interpretation as implicit attention, a combinatorial label-safety argument, a bound on representation drift, and a PAC-Bayes generalization bound. Empirically, DEFOCA is plugged into ResNet and Tiny-ViT backbones, evaluated on various FGVC and UFGVC datasets, and removed at test time. Results are generally competitive against Tiny-ViT baselines.

Strengths:
- The motivation from FGVC fragility (small discriminative area, pose/occlusion/clutter) is well framed, and the figures are illustrative.
- A simple, architecture-agnostic mechanism: patch-blur as a layer is easy to adopt and integrates with standard backbones, with no labels or architectural edits needed.
- Consistent gains over Tiny-ViT/ResNet baselines across multiple datasets; qualitative maps/t-SNE show tighter clusters and more focused regions.
- The paper varies patch layouts (random/contiguous/dispersed), operations (low-/high-pass, noise, colour jitter), and key hyperparameters (grid size P, ratio n/N, blur $\sigma$), with contiguous low-pass emerging as best.
- No test-time cost or architectural brittleness; ablations suggest robustness over reasonable hyperparameter ranges.

Weaknesses:
- Lack of novelty: the method is close to known ideas (Cutout [1], Hide-and-Seek [2], DropBlock [3], Random Erasing [4], etc.). The core is a localized-degradation augmentation. The differentiator is "blur not mask" plus "contiguous patches", but this is still an augmentation variant rather than a new learning principle. The paper should position against these families much more rigorously (same backbones, same training budgets, tuned strengths), not only via narrative. To claim principled superiority, DEFOCA should be compared head-to-head against Cutout [1], Hide-and-Seek [2], DropBlock [3], Random Erasing [4], Mixup [5], CutMix [6] (with their FGVC-tuned strengths), RandAugment [7], TrivialAugment [8], AugMix [9], and a global Gaussian-blur-probability baseline under the same schedule and backbone.
- The comparison tables exclude state-of-the-art patch-driven methods [10-12] whose FGVC performance is significantly higher than the proposed approach's (Aircraft > 96%, CUB-200 > 92%, Cars > 96%, NABirds > 92%), making the SOTA positioning ambiguous. Many SOTA approaches covered in [10] likewise outperform the proposed method and are excluded; a direct comparison is needed to justify the claims.
- While the attention maps look sharper, there is no causal test that DEFOCA improves causal localization (e.g., counterfactual part perturbations, deletion/insertion metrics).
- The theory assumes a hidden set of discriminative patches S and argues that contiguous selection maximizes label-safety. But S is unobserved, no estimator is used, and no sensitivity to misspecification is analyzed. The claim that a contiguous layout maximizes $P_{\mathrm{safe}}$ is asserted, not proven under realistic image statistics.
- The drift bound assumes per-patch Lipschitz behaviour and a gap $M \gg Ln$ when discriminative patches are hit. This is plausible but not verified empirically. The PAC-Bayes inequality is boilerplate and does not yield a sharper or measurable bound tied to DEFOCA's specifics (e.g., to $\sigma$, n/N, or contiguity).
- There is no human-part or saliency-overlap evaluation showing that blurred regions avoid critical parts more often with a contiguous layout than with a random one.
- Claims about occlusion/pose/clutter robustness would be stronger with corruption/occlusion suites (e.g., defocus/zoom blur, occluders) and partial-visibility tests, plus calibration under shift. Qualitative maps are suggestive but insufficient.
- The paper sets V = 8 views but shows neither the accuracy-versus-compute curve as V varies nor the effect of $\sigma$ when the blur does touch discriminative parts. Also missing: train-time cost, and an analysis of interactions with standard augmentations (is DEFOCA still helpful on top of RandAugment + Mixup + CutMix?).
- It is not fully clear that every baseline uses the same augmentation stack, epochs, parameter count, and tuning budget. Without a unified training-protocol table, fairness is hard to judge.

References:
[1] DeVries, T., & Taylor, G. W. (2017). Improved regularization of convolutional neural networks with Cutout. arXiv:1708.04552.
[2] Singh, K. K., Yu, H., Sarmasi, A., Pradeep, G., & Lee, Y. J. (2018). Hide-and-Seek: A data augmentation technique for weakly-supervised localization and beyond. arXiv:1811.02545.
[3] Ghiasi, G., Lin, T.-Y., & Le, Q. V. (2018). DropBlock: A regularization method for convolutional networks. NeurIPS.
[4] Zhong, Z., Zheng, L., Kang, G., Li, S., & Yang, Y. (2020). Random Erasing data augmentation. AAAI.
[5] Zhang, H., Cisse, M., Dauphin, Y., & Lopez-Paz, D. (2018). Mixup: Beyond empirical risk minimization. ICLR.
[6] Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). CutMix: Regularization strategy to train strong classifiers with localizable features. ICCV.
[7] Cubuk, E. D., Zoph, B., Shlens, J., & Le, Q. V. (2020). RandAugment: Practical automated data augmentation with a reduced search space. CVPR Workshops.
[8] Müller, S. G., & Hutter, F. (2021). TrivialAugment: Tuning-free yet state-of-the-art data augmentation. ICCV.
[9] Hendrycks, D., Mu, N., Cubuk, E. D., Zoph, B., Gilmer, J., & Lakshminarayanan, B. (2020). AugMix: A simple data processing method to improve robustness and uncertainty. ICLR.
[10] Sikdar, A., Liu, Y., Kedarisetty, S., Zhao, Y., Ahmed, A., & Behera, A. (2025). Interweaving insights: High-order feature interaction for fine-grained visual recognition. International Journal of Computer Vision, 133(4), 1755–1779.
[11] Behera, A., Wharton, Z., Hewage, P., & Bera, A. (2021). Context-aware attentional pooling (CAP) for fine-grained visual classification. AAAI, 929–937.
[12] Bera, A., Wharton, Z., Liu, Y., Bessis, N., & Behera, A. (2022). SR-GNN: Spatial relation-aware graph neural network for fine-grained image categorization. IEEE Transactions on Image Processing, 31, 6017–6031.

Questions:
1. How does DEFOCA compare to the SOTA approaches in references [10-12]?
2. How does DEFOCA compare, under identical backbones/schedules, to Cutout, Hide-and-Seek, DropBlock, Random Erasing, Mixup/CutMix, RandAugment/TrivialAugment/AugMix, and a global random Gaussian blur with tuned probability and kernel?
3. What is the accuracy-versus-V curve (e.g., 1/2/4/8/16 views)? What is the train-time overhead (wall-clock, GPU hours) attributable to DEFOCA as V increases?
4. Your theory hinges on avoiding discriminative patches. What happens when the blur does overlap S (e.g., high $\sigma$, small patches)? Please provide sensitivity sweeps ($\sigma$, n/N, P) with attention-overlap metrics (e.g., % of CAM/Grad-CAM mass blurred) and the associated accuracy drops.
5. Can you add corruption/occlusion evaluations (defocus/zoom blur, cutout occluders) and partial-visibility protocols to quantify robustness?
6. You assert that a contiguous layout maximizes label-safety. Can you formalize the distributional assumptions under which this holds and add an empirical check?

EditLens Prediction: Fully AI-generated
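For context on Review 4's "boilerplate" remark: the standard McAllester-style PAC-Bayes bound, which the reviewer presumably has in mind, reads in one common generic form as follows (an assumed form, not necessarily the paper's exact statement).

```latex
% Generic McAllester-style PAC-Bayes bound for bounded loss (assumed
% form; the paper's instantiation may differ). With probability at least
% 1 - \delta over an i.i.d. sample of size m, simultaneously for all
% posteriors Q over hypotheses, given a data-independent prior P:
\[
  \mathbb{E}_{h \sim Q}\!\left[L(h)\right]
  \;\le\;
  \mathbb{E}_{h \sim Q}\!\left[\widehat{L}(h)\right]
  \;+\;
  \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\!\frac{2\sqrt{m}}{\delta}}{2m}}
\]
```

The reviewer's objection is that DEFOCA's parameters ($\sigma$, n/N, contiguity) do not appear in such a bound unless the KL term is explicitly instantiated in terms of them, so the inequality by itself certifies nothing specific to the method.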