Blur to Focus Attention in Fine-Grained Visual Recognition
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary:
The paper proposes DEFOCA, a training-time, patch-wise Gaussian blur layer placed after standard augmentations. At each iteration it selects image patches (random, dispersed, or contiguous) and blurs them to create stochastic views, aiming to suppress background/high-frequency noise and nudge the network toward discriminative regions. The authors provide an interpretation as implicit attention, a combinatorial label-safety argument, a bound on representation drift, and a PAC-Bayes generalization bound. Empirically, DEFOCA is plugged into ResNet and Tiny-ViT backbones and evaluated on various FGVC and UFGVC datasets; the layer is removed at test time. Results are generally competitive with the Tiny-ViT baselines.
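For concreteness, my reading of the mechanism reduces to roughly the following patch-blur sketch (a hypothetical re-implementation from the paper's description; function and parameter names are mine, not the authors'):

```python
import torch
import torchvision.transforms.functional as TF

def patch_blur(img, P=4, n=3, sigma=2.0, contiguous=True):
    """Blur n of the P*P grid cells of a float (C, H, W) tensor, assuming
    H and W divisible by P. Hypothetical sketch, not the authors' code."""
    C, H, W = img.shape
    ph, pw = H // P, W // P
    N = P * P
    if contiguous:
        # one simple notion of contiguity: a random run of n cells in raster order
        start = int(torch.randint(0, N - n + 1, (1,)))
        chosen = range(start, start + n)
    else:
        chosen = torch.randperm(N)[:n].tolist()
    k = 2 * int(round(3 * sigma)) + 1  # odd kernel covering ~3*sigma
    out = img.clone()
    for c in chosen:
        r, q = divmod(int(c), P)
        cell = out[:, r * ph:(r + 1) * ph, q * pw:(q + 1) * pw]
        out[:, r * ph:(r + 1) * ph, q * pw:(q + 1) * pw] = TF.gaussian_blur(
            cell, kernel_size=[k, k], sigma=[sigma, sigma])
    return out
```

As described, this is applied per view at train time only (V such views per image) and dropped at inference.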
Strengths:
The motivation for FGVC fragility (small discriminative area, pose/occlusion/clutter) is well framed, and the figures are illustrative.
A simple, architecture-agnostic mechanism. Patch-blur as a layer is easy to adopt and integrates with standard backbones, with no labels or architectural edits needed.
Consistent gains over Tiny-ViT/ResNet baselines across multiple datasets; qualitative maps/t-SNE show tighter clusters and more focused regions.
The paper varies patch layouts (random/contiguous/dispersed), operations (low-/high-pass, noise, colour jitter), and key hyperparameters (grid size P, ratio n/N, blur $\sigma$), with contiguous low-pass emerging as best.
No test-time cost or architectural brittleness; ablations suggest robustness to reasonable hyperparameter ranges.
Weaknesses:
Limited novelty: the method is close to known ideas (Cutout [1], Hide-and-Seek [2], DropBlock [3], Random Erasing [4], etc.). The core is a localized degradation augmentation; the differentiator is "blur not mask" plus "contiguous patches," but this is still an augmentation variant rather than a new learning principle. The paper should position against these families much more rigorously (same backbones, same training budgets, tuned strengths), not only via narrative.
To claim principled superiority, DEFOCA should be compared head-to-head against Cutout [1], Hide-and-Seek [2], DropBlock [3], Random Erasing [4], Mixup [5], CutMix [6] (with their FGVC-tuned strengths), RandAugment [7], TrivialAugment [8], AugMix [9], and a global Gaussian-blur baseline with tuned probability, all under the same schedule/backbone.
The comparison tables exclude state-of-the-art patch-driven methods [10-12] whose FGVC performance is substantially higher (Aircraft > 96%, CUB-200 > 92%, Cars > 96%, NABirds > 92%) than the proposed approach, making the SOTA positioning ambiguous; [10] alone lists several such methods that are omitted. A direct comparison with these approaches is needed to justify the claims.
While the attention maps look sharper, there is no causal test that DEFOCA improves localization (e.g., counterfactual part perturbations, deletion/insertion metrics); one such test is sketched below.
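A minimal deletion-metric sketch (my own hypothetical helper, assuming `model` returns logits and `cam` is any saliency map over the input):

```python
import torch

@torch.no_grad()
def deletion_score(model, img, cam, label, steps=20, baseline=0.0):
    """Deletion metric: remove pixels in decreasing saliency order and
    average the true-class probability; lower = better localization."""
    C, H, W = img.shape
    x = img.clone().contiguous()
    flat = x.view(C, -1)                       # shares storage with x
    order = cam.flatten().argsort(descending=True)
    chunk = max(1, order.numel() // steps)
    probs = []
    for i in range(steps + 1):
        p = torch.softmax(model(x.unsqueeze(0)), dim=1)[0, label]
        probs.append(float(p))
        flat[:, order[i * chunk:(i + 1) * chunk]] = baseline
    return sum(probs) / len(probs)             # ~ area under the deletion curve
```

Lower deletion scores for DEFOCA-trained models relative to baselines would support the localization claim; the insertion variant is symmetric.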
The theory assumes a hidden set of discriminative patches S and argues that contiguous selection maximizes label-safety. But S is unobserved, no estimator for it is given, and no sensitivity analysis under misspecification is provided. The "contiguous maximizes $P_{\text{safe}}$" statement is asserted, not proven, under realistic image statistics (see the check sketched below).
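Even without an estimator for S, a toy Monte Carlo check is easy: model S as a random contiguous run of cells on the flattened grid (a stand-in for a localized part) and estimate $P_{\text{safe}}$ under both layouts. A minimal sketch, with all modelling choices mine:

```python
import random

def p_safe(P=8, n=8, s=4, contiguous=True, trials=20000):
    """Monte Carlo estimate of P(blurred cells miss S) on a flattened P*P grid,
    with S a random contiguous run of s cells (toy model of a localized part)."""
    N = P * P
    hits = 0
    for _ in range(trials):
        s0 = random.randrange(N - s + 1)
        S = set(range(s0, s0 + s))
        if contiguous:
            b0 = random.randrange(N - n + 1)
            blurred = set(range(b0, b0 + n))
        else:
            blurred = set(random.sample(range(N), n))
        hits += not (blurred & S)
    return hits / trials

# compare p_safe(contiguous=True) vs p_safe(contiguous=False)
```

Whether the contiguous layout wins here depends entirely on how S is distributed, which is exactly the assumption the paper needs to state.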
The drift bound assumes per-patch Lipschitz behaviour and an $M \gg Ln$ gap when discriminative patches are hit. This is plausible but not verified empirically. The PAC-Bayes inequality is boilerplate and does not yield a sharper or measurable bound tied to DEFOCA's specifics (e.g., to $\sigma$, n/N, or contiguity).
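For reference, the McAllester-style bound such arguments typically instantiate is, with probability $1-\delta$ over $m$ samples, for any prior $P$ and posterior $Q$:

$$\mathbb{E}_{h \sim Q}\left[R(h)\right] \;\le\; \mathbb{E}_{h \sim Q}\left[\hat{R}(h)\right] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\!\left(2\sqrt{m}/\delta\right)}{2m}}$$

Nothing on the right-hand side depends on $\sigma$, n/N, or contiguity unless the posterior $Q$ is explicitly tied to them, which is the gap the paper would need to close.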
There is no human-part or saliency-overlap evaluation showing that blurred regions avoid critical parts more often under a contiguous layout than under a random one.
Claims about occlusion/pose/clutter robustness would be stronger with corruption/occlusion suites (e.g., defocus/zoom blur, occluders) and partial-visibility tests, plus calibration under shift. Qualitative maps are suggestive but insufficient.
The paper sets V=8 views but does not show the accuracy-versus-compute curve as V varies, or the effect of $\sigma$ when the blur does touch discriminative parts. Also missing: train-time cost, and an analysis of interactions with standard augmentations (is DEFOCA still helpful on top of RandAugment+Mixup+CutMix?).
It is not fully clear that every baseline uses the same augmentation stack, epoch budget, parameter count, and tuning budget. Without a unified training-protocol table, fairness is hard to judge.
References:
[1] DeVries, T., & Taylor, G. W. (2017). Improved regularization of convolutional neural networks with Cutout. arXiv:1708.04552.
[2] Singh, K. K., Yu, H., Sarmasi, A., Pradeep, G., & Lee, Y. J. (2018). Hide-and-Seek: A data augmentation technique for weakly-supervised localization and beyond. arXiv:1811.02545. (Conference version: ICCV 2017.)
[3] Ghiasi, G., Lin, T.-Y., & Le, Q. V. (2018). DropBlock: A regularization method for convolutional networks. NeurIPS.
[4] Zhong, Z., Zheng, L., Kang, G., Li, S., & Yang, Y. (2020). Random Erasing data augmentation. AAAI.
[5] Zhang, H., Cisse, M., Dauphin, Y., & Lopez-Paz, D. (2018). Mixup: Beyond empirical risk minimization. ICLR.
[6] Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). CutMix: Regularization strategy to train strong classifiers with localizable features. ICCV.
[7] Cubuk, E. D., Zoph, B., Shlens, J., & Le, Q. V. (2020). RandAugment: Practical automated data augmentation with a reduced search space. CVPR Workshops.
[8] Müller, S. G., & Hutter, F. (2021). TrivialAugment: Tuning-free yet state-of-the-art data augmentation. ICCV.
[9] Hendrycks, D., Mu, N., Cubuk, E. D., Zoph, B., Gilmer, J., & Lakshminarayanan, B. (2020). AugMix: A simple data processing method to improve robustness and uncertainty. ICLR.
[10] Sikdar, A., Liu, Y., Kedarisetty, S., Zhao, Y., Ahmed, A., & Behera, A. (2025). Interweaving insights: High-order feature interaction for fine-grained visual recognition. International Journal of Computer Vision, 133(4), 1755–1779.
[11] Behera, A., Wharton, Z., Hewage, P., & Bera, A. (2021). Context-aware attentional pooling (CAP) for fine-grained visual classification. AAAI, 929–937.
[12] Bera, A., Wharton, Z., Liu, Y., Bessis, N., & Behera, A. (2022). SR-GNN: Spatial relation-aware graph neural network for fine-grained image categorization. IEEE Transactions on Image Processing, 31, 6017–6031.
Questions:
How does DEFOCA compare to the SOTA approaches in references [10-12]?
How does DEFOCA compare, under identical backbones/schedules, to Cutout, Hide-and-Seek, DropBlock, Random Erasing, Mixup/CutMix, RandAugment/TrivialAugment/AugMix, and a global random Gaussian blur with tuned probability and kernel size?
What is the accuracy vs V curve (e.g., 1/2/4/8/16 views)? What is the train-time overhead (wall-clock, GPU hours) attributable to DEFOCA as V increases?
Your theory hinges on avoiding discriminative patches. What happens when blur does overlap S (e.g., high $\sigma$, small patches)?
Please provide sensitivity sweeps ($\sigma$, n/N, P) with attention-overlap metrics (e.g., % of CAM/Grad-CAM mass blurred) and accuracy drops.
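The overlap metric asked for here is trivial to report; a minimal sketch, assuming `cam` is a non-negative saliency map and `blur_mask` marks blurred pixels (both names are mine):

```python
import torch

def blurred_cam_mass(cam, blur_mask):
    """Fraction of saliency mass falling inside blurred regions.
    cam: (H, W) non-negative map; blur_mask: (H, W) bool, True where blurred."""
    cam = cam.clamp(min=0)
    total = cam.sum()
    return float(cam[blur_mask].sum() / total) if total > 0 else 0.0
```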
Can you add corruption/occlusion evaluations (defocus/zoom blur, cutout-style occluders) and partial-visibility protocols to quantify robustness?
You assert that contiguous selection maximizes label-safety. Can you formalize the distributional assumptions under which this holds and add an empirical check?