ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 2 (50%) 5.00 4.50 12364
Heavily AI-edited 0 (0%) N/A N/A N/A
Moderately AI-edited 0 (0%) N/A N/A N/A
Lightly AI-edited 0 (0%) N/A N/A N/A
Fully human-written 2 (50%) 3.00 4.00 2632
Total 4 (100%) 4.00 4.25 7498
Title Ratings Review Text EditLens Prediction
HSIC Bottleneck for Cross-Generator and Domain-Incremental Synthetic Image Detection Soundness: 4: excellent Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces a new synthetic image detector with contributions to the model architecture, adaptations to continual learning, and a new synthetic image benchmark that contains 3D Gaussian Splatting (3DGS) rendered images. The evaluation is twofold, including both a binary supervised detection task and its continual learning variant with an HSIC-Guided Replay (HGR) adaptation. The proposed model achieves state-of-the-art performance in cross-generator evaluation, generalizing between diffusion-based and GAN-based images, with an improvement of over 5 percentage points. Moreover, it demonstrates strong continual learning capability when incrementally trained to detect 3DGS-generated images. - The method achieves state-of-the-art performance, especially in the cross-generator evaluation setup. Furthermore, results in the continual learning setup improve over the single-dataset training baseline and, in some cases, even surpass those obtained by jointly training on the additional 3DGS datasets. - The inclusion of a 3DGS benchmark is a valuable addition, introducing a new family of synthetic image generation methods beyond the GANs and diffusion models that have dominated detection research. - The paper reinforces the effective use of HSIC in both supervised and continual learning setups. - The HSIC bottleneck is not entirely novel; it can be seen as a combination of the RINE [1] and DualHSIC [2] approaches. RINE’s performance is missing from Table 1, which could potentially narrow the gap between the current model and the top-performing prior works reported in the same table. - The performance gains of the HSIC term in HGR are uncertain. An ablation on the performance gains due to the inclusion of the \( 1 - \mathcal{N}(r_i) \) term in Equation 10 would help justify its contribution to the overall performance and clarify its impact. Typo: in Table 5 b), the Cosine kernel achieves the highest mACC on ProGAN and should be bolded instead of the median variant of the RBF kernel. - While the paper improves performance on the GenImage benchmark, the core method relies heavily on existing approaches and therefore provides limited new contributions to the synthetic image detection community. However, the inclusion of 3DGS samples in the continual learning setup represents a significant strength supporting acceptance. To further strengthen the paper, the authors should better motivate the method’s novelty and its relevance to the community. Additionally, a useful way to justify the method’s performance would be to compare its cross-generator performance on 3DGS with previous methods (using the base method results from Tables 3 and 4, which show a significant gap between 3DGS and diffusion or GAN generators). [1] Christos Koutlis and Symeon Papadopoulos. “Leveraging representations from intermediate encoder-blocks for synthetic image detection.” ECCV 2024. [2] Zifeng Wang, Zheng Zhan, Yifan Gong, Yucai Shao, Stratis Ioannidis, Yanzhi Wang, and Jennifer Dy. “DualHSIC: HSIC-bottleneck and alignment for continual learning.” ICML 2023. - What’s the difference between your method (referred to as Ours in Tables 1 and 2) and DualHSIC with a CLIP backbone? - What architecture does the classifier \( g_{\theta_g} \) have? - Where do the real samples in the 3DGS datasets come from? Fully human-written
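For context on the loss the reviews refer to: an HSIC-bottleneck objective of the kind used by DualHSIC [2], and, as far as the reviews describe it, by the submission's Equation 6, generically takes the form
\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{BCE}}\big(y,\, g_{\theta_g}(z)\big) \;+\; \lambda_x\,\mathrm{HSIC}(x, z) \;-\; \lambda_y\,\mathrm{HSIC}(y, z),
\]
where \( z \) is the bottleneck representation, the \( \lambda_x \) term suppresses dependence on the input \( x \), and the \( \lambda_y \) term encourages dependence on the label \( y \); the exact terms and weights used in the paper may differ.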
HSIC Bottleneck for Cross-Generator and Domain-Incremental Synthetic Image Detection Soundness: 1: poor Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose a new bottleneck loss for synthetic image detection, based on HSIC. The method computes the Hilbert-Schmidt Independence Criterion (HSIC) on the encoded image and label embeddings, and adds it to the binary cross-entropy loss already used. For the domain-incremental setting, HSIC is used to guide replay. The work includes experiments comparing methods on cross-generator generalization and continual adaptation, as well as an ablation on the HSIC components and an analysis of domain-incremental learning. S1) The intuition portions of the paper are fairly easy to read. S2) The included t-SNE plots are a nice piece of analysis. S3) The method seems mathematically well-grounded. W1) In the DIL setting, the comparison methods are out-of-date (the newest being from 2020). This makes it unclear how the presented method compares with SOTA. W2) In the cross-generator generalization setting, the chosen models are also out-of-date (the newest being from 2022). It would be much more relevant to test on the SOTA generative models being used today, to understand how applicable this method is in practice (e.g., FLUX, Qwen-Image, etc.). W3) The related work section is also not in-depth enough and outdated in some places, making it difficult to place the work within contemporary literature. For example, in Section 2.2 (Continual Learning related works), the newest method is from 2021, while much newer work exists, e.g. [A]. W4) The mathematical background section (2.3) does not explicitly define some of the mathematical notation (variables and functions), which would be useful for improving clarity for readers. Most notably \( \mathbf{1} \), but also e.g. \( \operatorname{tr} \) and \( I \), would be useful to define for completeness. W5) An ablation over the choice of replay would be useful in understanding its role, given it is part of the proposed methodology. [A] Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need, Wang et al., CVPR 2025. None at this time Fully human-written
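For completeness, the notation flagged in W4 appears in the standard biased empirical HSIC estimator of Gretton et al. (2005): for \( n \) paired samples with Gram matrices \( K_{ij} = k(x_i, x_j) \) and \( L_{ij} = l(z_i, z_j) \),
\[
\widehat{\mathrm{HSIC}}(X, Z) \;=\; \frac{1}{(n-1)^2}\,\operatorname{tr}(K H L H),
\qquad H \;=\; I_n - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top},
\]
where \( I_n \) is the \( n \times n \) identity matrix and \( \mathbf{1} \) the all-ones vector. Defining these symbols explicitly in Section 2.3 would address the concern.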
HSIC Bottleneck for Cross-Generator and Domain-Incremental Synthetic Image Detection Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper addresses two critical challenges in synthetic image detection: poor cross-generator generalization and catastrophic forgetting in domain-incremental learning. To tackle these issues, the authors propose two core components: (1) an HSIC (Hilbert-Schmidt Independence Criterion) bottleneck applied to intermediate CLIP ViT features, which suppresses text-image alignment semantics (irrelevant to authenticity) while enhancing discriminative representations for real vs. synthetic images; (2) HSIC-Guided Replay (HGR), a rehearsal strategy that selects per-class exemplars via a hybrid score combining HSIC relevance (information centrality) and k-center coverage (spatial diversity), mitigating forgetting during domain adaptation. Additionally, the authors curate a 3D Gaussian Splatting (3DGS) head avatar benchmark dataset, covering multi-view reconstruction, single-view reconstruction, and generative pipelines, to support domain-incremental evaluation. Empirical evaluations are conducted in two phases: Phase I tests cross-generator transfer between diffusion and GAN models, and Phase II assesses sequential adaptation to 3DGS domains. Results show the HSIC bottleneck improves cross-generator generalization, while HGR sustains prior-domain accuracy during 3DGS adaptation. The paper's main contributions include the HSIC bottleneck design, the HGR rehearsal mechanism, and the 3DGS benchmark dataset. (1) The HSIC bottleneck innovatively leverages intermediate CLIP features to resolve the interference of text-image alignment semantics (a key limitation of CLIP-based detectors), and its combination with information-theoretic regularization (minimizing input dependence, maximizing label dependence) is theoretically grounded and practically effective. (2) HGR addresses the inefficiency of traditional replay methods by fusing HSIC relevance (ensuring exemplar informativeness) and k-center coverage (ensuring diversity), achieving compact memory usage while mitigating forgetting--filling a gap in domain-incremental synthetic image detection. (3) The 3DGS head avatar dataset (with identity-disjoint splits and standardized preprocessing) addresses the lack of benchmarks for rendered synthetic images, supporting research on domain-incremental adaptation to 3D-generated content. (4) The two-phase evaluation (cross-generator generalization + domain-incremental learning) covers diverse scenarios (diffusion, GAN, 3DGS). The authors compare against 6+ baselines (e.g., CNNSpot, LGrad, UniFD, iCaRL) and conduct detailed ablations (HSIC components, kernel choices, intermediate features), verifying the necessity of each module. (5) Ablation studies confirm the role of HSIC(x,z) (suppressing input shortcuts) and HSIC(y,z) (aligning with labels), while t-SNE visualizations qualitatively demonstrate that the HSIC bottleneck reshapes features into more separable real/synthetic clusters--strengthening the credibility of the proposed method. (1) The paper lacks critical implementation specifics. 
For example, regarding the HSIC bottleneck: the authors mention a "64-D projection" but do not specify the projection layer's structure (e.g., fully connected layer with activation function? Number of neurons in hidden layers, if any?). For training parameters: the learning rate is set to \(10^{-4}\) (SGD), but no details are provided on batch size, number of training epochs, weight decay, or learning rate scheduling (e.g., step decay, cosine annealing)--parameters that directly impact model convergence and performance. For data preprocessing: while "standardized preprocessing" is mentioned for the 3DGS dataset, there is no description of specific steps (e.g., image resizing resolution, normalization mean/std values, whether face cropping is applied for head avatars). Without these details, other researchers cannot replicate the experiments, violating the reproducibility principles of academic research. (2) The paper's theoretical foundation for the HSIC bottleneck and HGR is insufficient. For the HSIC bottleneck: Equation (6) defines the loss function, but the authors do not analyze its convergence properties (e.g., whether the loss decreases monotonically during training, or under what conditions the model converges to a global optimum). There is also no discussion of why minimizing HSIC(x,z) (input-feature dependence) effectively suppresses text-alignment semantics--only qualitative t-SNE results are provided, lacking quantitative evidence (e.g., semantic similarity scores between features and text captions before/after applying the bottleneck). For HGR: the authors claim the hybrid score (HSIC relevance + k-center coverage) improves exemplar selection, but there is no theoretical justification for why this combination outperforms single-criterion methods (e.g., pure HSIC or pure k-center). For instance, no proof is given that HSIC relevance correlates with exemplar informativeness, or that k-center coverage effectively reduces redundancy. This weakens the method's theoretical rigor. (3) The evaluation is limited to specific scenarios, failing to test the method's robustness across broader conditions. First, **dataset scope limitation**: The 3DGS benchmark only focuses on head avatars, with no evaluation on 3D-generated non-face scenes (e.g., 3DGS-rendered landscapes, objects). This raises questions about whether the method generalizes to other 3D-rendered content. Second, **synthetic image diversity limitation**: Cross-generator evaluation only includes classic diffusion models (e.g., SDV1.4, ADM) and GANs (e.g., ProGAN, StyleGAN), but not recent variants (e.g., Stable Diffusion 3, GANformer) or hybrid models (e.g., diffusion-GAN hybrids). Third, **image quality robustness**: There is no evaluation of detection performance on low-resolution synthetic images (e.g., 32×32, 64×64) or images subjected to post-processing (e.g., JPEG compression, Gaussian blur, rotation)--common in real-world scenarios. Fourth, **adversarial robustness**: The paper does not test whether the method retains performance under adversarial attacks (e.g., FGSM, PGD attacks on synthetic images to evade detection), a critical consideration for practical deployment. (4) While the paper compares against multiple baselines, several critical comparisons are missing or insufficient. 
First, **latest method omissions**: The paper cites baselines up to 2025 (e.g., VIB-Net, 2025) but does not compare against any 2025-post methods (e.g., diffusion-specific detectors or 3DGS-focused detection methods) that may have addressed similar problems. Second, **unclear baseline parameter consistency**: For baselines like UniFD and NPR, the authors do not confirm whether they used the official implementations or default parameters--if the baselines were not optimized (e.g., using suboptimal hyperparameters), the comparison results may overstate the proposed method's advantages. Third, **incomplete cross-method ablation**: For example, when comparing HGR with iCaRL and CBRS, the paper does not conduct ablation studies on combining HGR with other rehearsal strategies (e.g., iCaRL's class-mean herding) to test for synergies. Fourth, **computational efficiency comparison**: No comparison of inference time or training memory usage between the proposed method and baselines is provided--critical for practical deployment (e.g., on edge devices). (5) Key results are presented unclearly or incompletely, hindering result verification. First, **table data gaps**: Tables 1 and 2 (cross-generator generalization results) contain empty cells (e.g., Table 1's "| 61.61/83.59 60.74/90.24 48.82/47.51 61.43/82.74 | 58.65/51.77 60.30/83.30 89.70/96.59 97.54/99.64 99.49/99.99 88.60/98.44 | | | |") and missing dataset labels for some columns, making it impossible to determine which targets the results correspond to. Second, **lack of quantitative clustering analysis**: While t-SNE visualizations (Figures 2, 6, 7) show qualitative improvements in real/synthetic separation, no quantitative metrics (e.g., silhouette coefficient, Davies-Bouldin index, or inter/intra-cluster distance ratios) are provided to measure clustering quality--weakening the evidence for feature reshaping. Third, **insufficient statistical significance**: Most results report mean accuracy/mAP but lack standard deviations (except in Figure 3) or confidence intervals. For example, Table 3 and 4 (domain-incremental results) do not specify how many runs were averaged, or whether differences between methods are statistically significant (e.g., via t-tests). Fourth, **parameter sensitivity analysis gaps**: The HSIC bottleneck uses λx=900/500 and λy=700/600 for SDV1.4/ProGAN training, but no sensitivity analysis is provided (e.g., how performance changes when λx/λy varies by ±20%, ±50%). Similarly, HGR's λkc (controlling k-center weight) only tests "λkc=0" and "larger values"--no gradient-based analysis of optimal λkc for different datasets. (6) The domain-incremental phase only evaluates adaptation to 3DGS domains, with several limitations. First, **limited domain diversity**: No adaptation to other emerging synthetic domains (e.g., text-to-video frame extracts, neural radiance field (NeRF)-rendered images) is tested, raising questions about HGR's generalizability to non-3DGS domains. Second, **short adaptation sequence**: Only 3 3DGS sub-domains are used (GHA, SA, GAGAvatar)--no evaluation of long-sequence adaptation (e.g., 5+ domains) to test cumulative forgetting. Third, **fixed memory budget**: The paper uses a fixed keep_frac=0.01 (1% of training samples) for the replay buffer but does not test how memory size impacts performance (e.g., keep_frac=0.005, 0.02) or compare against dynamic memory allocation strategies. 
Fourth, **no backward transfer analysis**: Backward transfer (improvement in prior-domain performance after adapting to new domains) is a key metric for continual learning, but the paper only reports "preserving prior accuracy" without quantifying backward transfer--failing to fully demonstrate HGR's advantages over baselines. (7) The paper does not acknowledge or discuss the proposed method's inherent limitations. First, **backbone dependence**: The method relies on CLIP ViT features, but no analysis is provided of performance degradation when using lighter backbones (e.g., MobileNet, EfficientNet) for edge deployment. Second, **data imbalance impact**: The 3DGS dataset uses balanced real/synthetic splits (e.g., GHA: 45,772 real / 45,772 synthetic), but no test of imbalanced splits (e.g., 1:10 real:synthetic) is conducted--common in real-world scenarios where synthetic images may be more abundant. Third, **modal limitation**: Only single-image detection is supported, with no extension to multi-modal synthetic data (e.g., synthetic images with text overlays, audio-synced synthetic video frames). Fourth, **computational overhead of HSIC**: HSIC calculation requires Gram matrix construction and centering, which increases computational complexity--no quantification of training/inference time overhead compared to non-HSIC methods (e.g., how much slower the HSIC bottleneck is than a standard CLIP linear probe). (8) The related work section has gaps and superficial comparisons. First, **HSIC application gaps**: The paper cites HSIC (Gretton et al., 2005) but does not discuss recent HSIC applications in computer vision (e.g., HSIC for domain adaptation, few-shot learning) or compare how its HSIC bottleneck differs from existing HSIC-based feature regularization methods. Second, **continual learning omissions**: Key rehearsal-based methods (e.g., Memory Replay GANs, Contrastive Replay) are not cited, and no discussion of how HGR differs from contrastive exemplar selection methods is provided. Third, **3DGS detection gaps**: No discussion of existing 3DGS/rendered image detection methods (if any) is provided--failing to position the paper's 3DGS benchmark within the broader literature. Fourth, **superficial baseline analysis**: For baselines like VIB-Net (2025), the paper only states it "uses a variational information bottleneck" but does not compare the HSIC bottleneck (information-theoretic) with VIB (probabilistic) in terms of theoretical framework or performance--missing an opportunity to highlight the HSIC bottleneck's advantages. (9) The paper states the HSIC bottleneck "concatenates features from 24 intermediate CLIP ViT layers and the final layer" but provides no details on aggregation. First, **layer selection rationale**: No explanation is given for choosing 24 intermediate layers (e.g., why not 12, 36 layers?) or which specific layers (e.g., early, middle, late) are selected. Second, **aggregation method**: Concatenation may lead to high dimensionality (e.g., 25 layers × 768 dim (ViT-B) = 19,200 dim), but no dimensionality reduction (e.g., PCA, t-SNE) or feature fusion (e.g., attention-based fusion) is mentioned--raising questions about computational efficiency and redundancy. Third, **layer-wise contribution**: No ablation of individual layer contributions (e.g., removing early layers) is conducted--failing to identify which layers are most critical for detection. (10) Qualitative results (e.g., t-SNE, sample images) are not fully analyzed. 
First, **t-SNE interpretation gaps**: Figures 6 and 7 (t-SNE of CLIP vs. HSIC features) show "tighter clusters" but do not explain why some datasets (e.g., GauGAN in Figure 7) still have overlapping clusters--failing to address the method's limitations for specific generators. Second, **no failure case analysis**: No discussion of misclassified samples (e.g., why some real images are mislabeled as synthetic) or analysis of common artifacts in misclassified synthetic images--critical for guiding future improvements. Third, **3DGS sample visualization**: The paper mentions Figure 1 (3DGS sample images) but does not provide qualitative comparisons of detection performance across 3DGS sub-domains (e.g., why SA has higher accuracy than GAGAvatar)--missing insights into domain-specific challenges. **To facilitate discussions during the Rebuttal phase, authors are advised to respond point-by-point (indicating the question number).** (1) Could you provide the following critical implementation details to ensure reproducibility? (a) The exact architecture of the HSIC bottleneck's projection layer (e.g., number of fully connected layers, activation functions, output dimension); (b) Full training hyperparameters (batch size, number of epochs, weight decay, learning rate scheduler, optimizer momentum); (c) Specific data preprocessing steps (image resolution, normalization parameters, face cropping logic for 3DGS avatars); (d) Code for HSIC calculation (e.g., Gaussian RBF kernel bandwidth calculation via median heuristic, Gram matrix centering implementation). (2) (a) Could you provide a formal analysis of the HSIC bottleneck's convergence (e.g., proof of loss monotonicity or bounds on generalization error)? (b) How do you quantitatively verify that the HSIC bottleneck suppresses text-alignment semantics? For example, using cosine similarity between CLIP features and text embeddings (e.g., "face" captions) before/after applying the bottleneck. (c) Could you provide a theoretical justification for combining HSIC relevance and k-center coverage in HGR (e.g., a bound on the expected error reduction compared to single-criterion selection)? (3) (a) Did you use official implementations and default hyperparameters for baselines (e.g., UniFD, NPR, VIB-Net)? If not, what modifications were made, and why? (b) Could you add comparisons with 2025-post synthetic image detection methods (e.g., any new diffusion-specific detectors or 3DGS detection methods)? (c) Could you provide computational efficiency metrics (inference time per image, training memory usage) for your method and baselines on the same hardware (e.g., NVIDIA RTX 4090)? (4) (a) Could you fill in the missing cells in Tables 1 and 2 and clarify dataset labels for all columns? (b) Could you add quantitative clustering metrics (silhouette coefficient, Davies-Bouldin index) for t-SNE visualizations (Figures 2, 6, 7) to quantify real/synthetic separation? (c) Could you provide standard deviations and 95% confidence intervals for all reported mean accuracy/mAP values, along with the number of independent runs (e.g., 5 runs)? (5) (a) Could you conduct a sensitivity analysis of HSIC's λx and λy (e.g., λx=300, 700, 1100; λy=500, 700, 900) for SDV1.4 and ProGAN, and plot performance trends? (b) Could you test multiple values of HGR's λkc (e.g., 0.1, 0.5, 1.0, 2.0) and analyze how it impacts exemplar selection and domain-incremental performance? 
(c) Could you explain the rationale for choosing keep_frac=0.01 and test keep_frac=0.005, 0.02 to show memory-size impact? (6) (a) Could you extend the domain-incremental evaluation to include non-3DGS domains (e.g., NeRF-rendered images, SDv3-generated images) to test HGR's generalizability? (b) Could you evaluate long-sequence adaptation (e.g., 5+ domains) and report cumulative forgetting curves? (c) Could you quantify backward transfer (using the formula: Backward Transfer = (Accuracy after new domain - Accuracy before new domain) / Accuracy before new domain) for all prior domains? (7) (a) Could you evaluate detection performance on low-resolution synthetic images (32×32, 64×64) and post-processed images (JPEG compression: quality 20, 50; Gaussian blur: σ=1, 3)? (b) Could you test adversarial robustness using FGSM/PGD attacks (ε=0.01, 0.03) and report accuracy degradation? (c) Could you test performance on imbalanced real/synthetic splits (1:5, 1:10) and compare against rebalancing strategies (e.g., class weights)? (8) (a) Could you test the HSIC bottleneck on lighter backbones (MobileNetV3, EfficientNet-B0) and report performance vs. efficiency trade-offs? (b) Could you extend the method to multi-modal data (e.g., synthetic images with text overlays) by fusing text and image features in the HSIC bottleneck? (c) Could you provide ablation results for CLIP layer selection (e.g., only late layers, only middle layers) to identify the most critical layers for detection? (9) (a) What are the main limitations of the HSIC bottleneck in practical deployment (e.g., computational cost, backbone dependence)? How do you plan to address them in future work? (b) How does the method perform when synthetic images are designed to mimic real-image statistics (e.g., adversarial synthetic images)? (c) Could you discuss scenarios where the method fails (e.g., specific generators, image types) and provide failure case examples? (10) (a) Could you explain the "identity-disjoint split" implementation for the 3DGS dataset (e.g., how identities were labeled, tools used for identity verification)? (b) Could you provide the exact sample counts for each sub-dataset in the GAN/diffusion evaluation (e.g., ProGAN: 10k samples, SDV1.4: 15k samples)? (c) Could you release the 3DGS dataset (or a sample subset) and provide access links to facilitate further research? (11) (a) Could you quantify the training time overhead of the HSIC bottleneck compared to a standard CLIP linear probe (e.g., % increase in epochs per hour)? (b) Could you propose optimizations for HSIC calculation (e.g., batch-wise Gram matrix computation) to reduce overhead? (12) (a) Could you compare HGR's forward transfer (performance on new domains) with baselines (iCaRL, CBRS) using quantitative forward transfer metrics? (b) Could you analyze why HGR performs better on SA than GAGAvatar in Table 4? Are there domain-specific artifacts that HGR captures more effectively? (13) (a) Could you test alternative feature aggregation methods (e.g., attention-based fusion, average pooling) for CLIP intermediate layers and compare performance with concatenation? (b) Could you provide a dimensionality analysis of the concatenated features (24 intermediate + 1 final layer) and explain how you avoid overfitting due to high dimensionality? (14) (a) Could you discuss the method's potential deployment scenarios (e.g., social media content moderation, forensics) and any practical challenges (e.g., real-time inference, scalability)? 
(b) Could you test the method on a real-world dataset (e.g., Reddit synthetic image subsets) with uncurated synthetic/real images? (15) (a) Could you conduct a direct comparison of the HSIC bottleneck and VIB-Net's variational bottleneck (e.g., performance on the same test sets, computational cost, robustness to noise)? (b) Could you explain why the HSIC bottleneck is more effective at suppressing text-alignment semantics than VIB? Fully AI-generated
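As a point of reference for question (1)(d) above, the standard biased HSIC estimator with Gaussian RBF kernels and median-heuristic bandwidths can be computed as in the sketch below. This is a generic reference implementation, not the submission's code; the authors' actual choices (kernel, bandwidth, normalization) may differ.

```python
# Generic reference sketch of the biased empirical HSIC estimator
# (Gretton et al., 2005) with Gaussian RBF kernels and median-heuristic
# bandwidths. Not the submission's code; only the standard computation
# referred to in question (1)(d).
import numpy as np


def pairwise_sq_dists(x):
    """Squared Euclidean distances between all rows of x, shape (n, n)."""
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    return np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0)


def rbf_gram(x, sigma=None):
    """RBF Gram matrix; bandwidth via the median heuristic if not given."""
    d2 = pairwise_sq_dists(x)
    if sigma is None:
        off_diag = d2[d2 > 0]
        sigma = np.sqrt(np.median(off_diag)) if off_diag.size else 1.0
    return np.exp(-d2 / (2.0 * sigma ** 2))


def hsic_biased(x, z):
    """Biased HSIC estimate: tr(K H L H) / (n - 1)^2."""
    n = x.shape[0]
    k_mat, l_mat = rbf_gram(x), rbf_gram(z)
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix H = I - (1/n) 11^T
    return float(np.trace(k_mat @ h @ l_mat @ h)) / (n - 1) ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(128, 512))                       # stand-in image features
    z_dep = x[:, :64] + 0.1 * rng.normal(size=(128, 64))  # dependent projection
    z_ind = rng.normal(size=(128, 64))                    # independent noise
    print(hsic_biased(x, z_dep))   # noticeably larger than the value below
    print(hsic_biased(x, z_ind))
```

In a training loop this would typically be evaluated per mini-batch, since the Gram matrices scale quadratically with the batch size, which is also relevant to the overhead questions raised in points (7) and (11).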
HSIC Bottleneck for Cross-Generator and Domain-Incremental Synthetic Image Detection Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes an HSIC-based bottleneck to enhance the generalization of CLIP features for synthetic image detection across diverse generator families. The method encourages representations to retain label-relevant information while suppressing spurious correlations with input semantics, and further introduces HSIC-Guided Replay (HGR) to mitigate catastrophic forgetting in domain-incremental learning. Experiments on diffusion, GAN, and 3D Gaussian Splatting (3DGS) models demonstrate improved cross-generator performance and continual learning stability compared to recent baselines. - The paper articulates the challenge of detectors overfitting to generator-specific artifacts and semantics, which is a timely and relevant problem for synthetic image forensics. - The continual learning setting further shows practical awareness of evolving generative models, making the work meaningful for long-term applicability. - The adaptation of HSIC into the CLIP feature pipeline is implemented in a straightforward and principled manner, and the loss formulation is coherent with prior HSIC-based bottleneck approaches. - Ablation studies on HSIC components, use of intermediate ViT features, and kernel options provide supportive empirical evidence that each design choice is beneficial. - The evaluation spans distinct generative paradigms, and the method shows consistent gains across them, indicating improved robustness of the learned representations. - The 3DGS results highlight the method’s applicability to emerging synthetic formats beyond classical image generators. - While effective, the core contribution largely builds on established concepts such as HSIC-based information bottlenecks and CLIP feature refinement. The paper does not sufficiently articulate what is fundamentally novel beyond applying HSIC to a different backbone and combining it with a replay mechanism. As a result, the contribution may be perceived as incremental rather than conceptually innovative. - The paper provides qualitative intuition and t-SNE visualizations but lacks deeper analysis of what specific semantic attributes are suppressed or preserved through HSIC regularization. A more detailed investigation into feature disentanglement, representation shift, or artifact suppression would strengthen interpretability and scientific value. Without such analysis, the method may appear as a black-box regularizer rather than a principled representation intervention. - The experiments focus on ProGAN, SDv1.4, and 3DGS-based generators, which do not reflect the current state of generative technology, such as Stable Diffusion 3+, Midjourney, Sora 2, or FLUX models. Since newer generators produce more photorealistic and harder-to-detect outputs, evaluation on these models is essential to demonstrate real-world utility. The absence of such results weakens the strength of the claimed “generalization” capability. - How does HSIC specifically reshape CLIP features at different semantic granularity levels?
Can the authors provide more concrete evidence—beyond t-SNE—that illustrates which types of semantic or generator-specific correlations are suppressed? - How well does the method scale when extended to modern, highly realistic generators such as SD3.5, Midjourney, Sora2, or FLUX? Have the authors tested whether the model remains effective with these more challenging sources? - In continual learning scenarios with more than 6–8 sequential domains, does HGR maintain long-term stability? Could the authors provide results over longer task horizons to support claims of scalability and robustness? - What is the computational overhead of using 24-layer CLIP features + HSIC loss during training and inference? Could the model be made more efficient without sacrificing performance? - Several recent approaches build upon CLIP to improve cross-generator generalization, and comparing against such methods would better contextualize the contribution of the proposed HSIC bottleneck. Fully AI-generated
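Both AI-generated reviews ask for quantitative separation evidence beyond t-SNE (e.g., silhouette or Davies-Bouldin scores, as in question (4)(b) of the previous review). A minimal illustrative sketch of how such a report could be produced is shown below; the toy Gaussian features are placeholders for the detector's real/synthetic embeddings and are not from the submission.

```python
# Illustrative sketch (not from the submission): quantifying real vs. synthetic
# feature separation with standard clustering indices, as a complement to t-SNE.
# The toy Gaussian features below stand in for detector embeddings.
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score


def separation_report(features, labels):
    """Higher silhouette and lower Davies-Bouldin scores indicate better
    separation between the real (0) and synthetic (1) classes."""
    return {
        "silhouette": float(silhouette_score(features, labels)),
        "davies_bouldin": float(davies_bouldin_score(features, labels)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(0.0, 1.0, size=(200, 64))   # stand-in for real-image features
    fake = rng.normal(1.5, 1.0, size=(200, 64))   # stand-in for synthetic features
    feats = np.vstack([real, fake])
    labels = np.array([0] * 200 + [1] * 200)
    print(separation_report(feats, labels))
```

Reporting such indices on features before and after the HSIC bottleneck would directly support the claimed feature reshaping without relying on t-SNE alone.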