Dig2DIG: Dig into Diffusion Information Gains for Image Fusion
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes Dig2DIG, a dynamic image fusion framework built upon diffusion models, introducing Diffusion Information Gains (DIG) to dynamically guide the fusion process at each denoising step. The core idea is to weight modality contributions by quantifying how much information each modality can provide at each step, theoretically linking this approach to a provable reduction of the generalization error bound in multimodal fusion. Theoretical analysis, practical algorithmic realization, and extensive experimental validation are provided.
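For concreteness, the reviewer's reading of the pipeline can be sketched as follows. This is a minimal, assumption-laden illustration: the function names, tensor shapes, and guidance rule are hypothetical placeholders, and the paper's actual realization (described below as standard DDPM sampling with softmax weighting for fusion guidance) differs in detail.

```python
import torch
import torch.nn.functional as F

def dig_guided_step(x_t, t, modalities, denoiser, dig_fn, guidance_scale=0.1):
    """One denoising step with DIG-weighted multimodal guidance (illustrative only).

    modalities: list of source images c_k, each a tensor broadcastable to x_t.
    dig_fn(c_k, x_t, t): per-pixel information-gain map for modality k at step t.
    denoiser(x_t, t): the pre-trained diffusion model's denoising step.
    """
    # 1) Quantify each modality's information contribution at this step.
    gains = torch.stack([dig_fn(c_k, x_t, t) for c_k in modalities])   # (K, ...)

    # 2) Normalize the gains across modalities into region-wise fusion weights.
    weights = F.softmax(gains, dim=0)                                  # sums to 1 over K

    # 3) Build a weighted guidance image from the sources.
    guidance = sum(w * c_k for w, c_k in zip(weights, modalities))

    # 4) Take the ordinary denoising step, then nudge it toward the guidance.
    x_prev = denoiser(x_t, t)
    return x_prev + guidance_scale * (guidance - x_prev)
```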
1. This paper thoroughly analyzes the theoretical motivation behind dynamic fusion in denoising diffusion models.
2. This paper systematically compares Dig2DIG with strong baselines over multiple challenging datasets.
1. This paper misses discussion of and comparison with several recent works, e.g., [R1, R2].
2. While the paper’s theoretical and algorithmic contributions are clear, the system-level architecture largely repurposes standard DDPM sampling with softmax weighting for fusion guidance.
3. While the mathematics generalizes to $K>2$, the paper presents no empirical or even synthetic evidence for how the framework scales to larger numbers of modalities or to more heterogeneous modality combinations, nor does it discuss failure points in such scenarios.
4. Ablations that could further clarify the incremental importance of each design choice are only briefly touched upon, e.g., what if the weighting is region-wise but not dynamic in time? What if different normalization or activation functions are applied to the DIG? (A sketch of the latter is given below.)
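As an illustration of the normalization ablation suggested in point 4, here is a hedged sketch of alternatives that could be applied to the raw DIG maps before fusion; the function, tensor shapes, and temperature are hypothetical and are not the paper's implementation, which (per point 2) reportedly uses softmax weighting.

```python
import torch
import torch.nn.functional as F

def normalize_dig(gains: torch.Tensor, mode: str = "softmax", tau: float = 1.0) -> torch.Tensor:
    """Turn raw DIG maps of shape (K, H, W) into fusion weights that sum to 1 over the K modalities."""
    if mode == "softmax":      # softmax weighting, as described in point 2 above
        return F.softmax(gains / tau, dim=0)
    if mode == "linear":       # simple ratio of non-negative gains
        g = gains.clamp(min=0.0)
        return g / (g.sum(dim=0, keepdim=True) + 1e-8)
    if mode == "sigmoid":      # per-modality saturation, then renormalization
        g = torch.sigmoid(gains)
        return g / g.sum(dim=0, keepdim=True)
    raise ValueError(f"unknown mode: {mode}")
```

Comparing such variants, together with a time-averaged (region-wise-only) version of the weights, would isolate how much of the gain comes from temporal dynamics versus spatial weighting.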
References:
[R1] Y. Deng, T. Xu, C. Cheng, et al. MMDRFuse: Distilled Mini-Model with Dynamic Refresh for Multi-Modality Image Fusion. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024: 7326-7335.
[R2] B. Yang, Z. Jiang, D. Pan, et al. LFDT-Fusion: A Latent Feature-Guided Diffusion Transformer Model for General Image Fusion. Information Fusion, 2025, 113: 102639.
See the weaknesses above.
Lightly AI-edited
***

Dig2DIG: Dig into Diffusion Information Gains for Image Fusion
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Due to the spatiotemporal imbalance of information gain contributed by different image modalities during the denoising diffusion process, this paper proposes a diffusion-model-based dynamic fusion method that computes a distinct set of fusion coefficients at each diffusion time step, thereby optimizing modality-specific information gain across the different stages of the diffusion process.
- Overall, the discussion on dynamic fusion during the diffusion denoising phase is meaningful, as it aims to enhance the information gain achieved through fusion.
- The key idea of the proposed method for tightening the upper bound of fusion generalization is to ensure that the guidance weight assigned to each modality is positively correlated with the residual information of that modality that has not yet been integrated into the current image, and theoretical support for this is provided.
- Modality residual information is proxied by the information gain obtained during the diffusion process.
- Using DIG as a proxy for dynamic weighting is not entirely reasonable, since DIG only measures the information loss during the forward noising process. Intuitively, it is more correlated with the intrinsic information richness of the original image. Consequently, modality images containing richer information regions would consistently receive higher fusion weights at any diffusion timestep. This contradicts the methodological insight that different diffusion stages exhibit distinct generation dynamics, where low-frequency structures and high-frequency textures evolve at different rates.
- The illustration in Figure 1 may cause misunderstanding, as the information gain shown in the figure is different from the Diffusion Information Gain (DIG). It would be helpful for the authors to additionally indicate the computation method of the information gain depicted in the figure, so as to facilitate clearer understanding and distinction between the two concepts.
- An important question to consider is how the authors construct the loss when computing DIG, and whether using only the ℓ2 loss is reasonable. The dynamic weights derived solely from the ℓ2 loss merely reflect pixel-level differences, and it remains questionable whether directly employing such differences as dynamic weights to modulate other types of losses, such as gradient losses, is theoretically and practically sound (see the illustrative contrast after this list).
- The authors have considered the dynamic weights in isolation, whereas in practice the fusion weights are largely dependent on the formulation of the fusion loss constraints. However, the authors did not design targeted conditional constraints; instead, they directly adopted the conditional constraint scheme from DDFM.
- The selection of comparison methods in the experiments appears somewhat inconsistent. For example, Text-IF is used as a comparison method for multi-focus and multi-exposure image fusion, even though it is not a unified fusion framework. In contrast, CCF, which is indeed a unified approach, is only compared in the task of infrared–visible image fusion.
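To make the ℓ2 concern above concrete, here is a purely illustrative contrast (this is not the paper's definition of DIG): a pixel-wise discrepancy versus a gradient-aware one,

$$\ell_2(a,b) = \lVert a - b \rVert_2^2 \qquad \text{vs.} \qquad \ell_{\nabla}(a,b) = \lVert \nabla a - \nabla b \rVert_1 .$$

A DIG driven only by the former can assign similar weights to regions whose intensities agree even when their textures (gradients) differ substantially, which is exactly the mismatch questioned above when such weights are used to modulate gradient-based fusion losses.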
- Is it reasonable to use DIG as a substitute for the dynamic weights?
- Is it reasonable to compute DIG using only the ℓ2 loss? For instance, can it effectively measure the information loss in texture gradients?
- It is not clearly explained how the dynamic weights are incorporated into the EM conditions; this part requires further clarification. In addition, it remains unclear whether such dynamic weights are compatible with the conditional constraint process—for example, with the gradient penalty term in the EM algorithm.
- Is the selection of comparison algorithms fair?
Lightly AI-edited
***

Dig2DIG: Dig into Diffusion Information Gains for Image Fusion
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper proposes a novel dynamic image fusion framework called **Dig2DIG**, which leverages **Diffusion Information Gain (DIG)**. Traditional diffusion-based fusion methods employ fixed fusion weights, ignoring the varying rates of information contribution among different image regions and modalities during denoising.
Dig2DIG addresses this issue by quantifying each modality’s information contribution at every diffusion step and dynamically assigning fusion weights based on DIG. Theoretically, the authors claim that aligning fusion weights with residual modality information can reduce the generalization error of the fusion model.
Experiments on multiple datasets (e.g., LLVIP, M3FD, MSRS, MFFW, MEFB) demonstrate that Dig2DIG achieves state-of-the-art performance in both fusion quality and computational efficiency.
1. **Novel Theoretical Foundation** – The paper provides a formal analysis suggesting that dynamic weighting via DIG can theoretically improve generalization compared to static fusion. The proof is sufficiently detailed.
2. **No-Training Framework** – The method reuses pre-trained diffusion models without additional training or fine-tuning, enhancing generality and reproducibility across tasks.
3. **Efficiency Improvement** – The inference time is significantly reduced (up to 70%) by skipping low-information diffusion steps, while maintaining comparable accuracy, demonstrating potential for practical efficiency gains.
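For reference, a minimal sketch of how such DIG-based step-skipping could be realized as a simple ranking rule; the function name, the keep_ratio parameter, and the criterion are hypothetical, and the paper's actual schedule (e.g., the DIG-25 variant discussed in the questions) may differ.

```python
def select_informative_steps(dig_per_step, keep_ratio=0.25):
    """Keep only the denoising steps with the largest aggregate DIG; skip the rest.

    dig_per_step: list of scalars, the aggregate information gain at each step t.
    Returns the (sorted) indices of the steps that should actually be run.
    """
    k = max(1, int(len(dig_per_step) * keep_ratio))
    ranked = sorted(range(len(dig_per_step)), key=lambda t: dig_per_step[t], reverse=True)
    return sorted(ranked[:k])
```

As noted in weakness 4 below, such a rule is a heuristic: skipping steps this way carries no formal guarantee that the diffusion trajectory remains consistent.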
1. **Limited Applicability Beyond Diffusion Models** – The theoretical proof relies only on L-smoothness assumptions and does not strictly require a diffusion process. However, DIG itself is closely tied to diffusion-based sampling. Its generalization to non-diffusion or Transformer-based fusion frameworks remains unclear.
2. **Practical Interpretability** – Although theoretically well-founded, the practical estimation of “information gain” is not intuitively interpretable for non-expert users, and the DIG computation may be sensitive to design choices such as the metric function.
3. **Insufficient Experimental Validation** – Several theoretical or qualitative claims—such as the correlation between residual information and reduced generalization error, or the spatio-temporal imbalance illustrated in Figure 1—lack quantitative or ablation experiments for confirmation. Many statements remain conceptual rather than empirically supported.
4. **Heuristic Nature of Step-Skipping** – The claim that “steps with low information gain can be skipped” is based on empirical observation rather than theoretical proof. While steps with small DIG values contribute minimally and can be skipped with negligible quality loss, there is no formal guarantee that diffusion consistency is preserved. Thus, the approach should be viewed as a heuristic efficiency optimization, not a theoretically grounded one.
5. **Unclear Definition of the Residual Information Term $I_{k,t}$** – This term plays a central role in the theoretical derivation (Eq. 4), yet its formulation is only symbolic. The function $\Delta I(c_k, x_t, x^*(c))$ is introduced without specifying its exact mathematical form, dimensionality, or relationship to observable quantities.
6. **Unexplained Link Between DIG and $I_{k,t}$** – The paper states that since $I_{k,t}$ is unobservable, it is replaced by the empirical DIG metric in Section 3.3. However, the connection between this substitution and later equations (e.g., Eq. 37 in the appendix) is not clearly explained.
7. **Inconsistent Probabilistic Assumptions** – The derivation first assumes independent modalities in order to justify the linear weighting of $x_t$, but later adopts the DDFM framework, which is based on a joint hierarchical Bayesian model optimized via EM (i.e., it retains the joint distribution). These two assumptions contradict each other, and the latter is theoretically more appropriate. This inconsistency undermines the claimed theoretical rigor.
8. **Problematic Weighted Approximation (Eq. 16)** – The authors initially assume conditional independence (Eq. 15), then relax this by introducing arbitrary weights $w_k$ without redefining the joint distribution $p(c_1, \dots, c_K | x_t)$. This heuristic substitution lacks probabilistic justification. When modalities are correlated (which is usually the case, since all source images come from the same scene), the gradient of the joint log-likelihood contains cross terms that cannot be represented by a simple weighted sum of marginal gradients. As a result, Eq. 16 violates Bayes' theorem consistency and weakens the "provable" nature of subsequent derivations (see the sketch after this list).
9. **Redundant Description in Section C.1** – The paper states, “We follow the EM algorithm of DDFM,” but the process described is nearly identical to DDFM's. The section should either explicitly detail the differences in the E- and M-step formulations compared to DDFM's, or be removed for conciseness.
10. **Inconsistent Equation Formatting** – Equation references are inconsistently labeled (“eq.”, “Eq.”, “equation”, “eq. equation”), which detracts from clarity and professionalism.
11. **Problematic Geometric Analysis** – The analysis is not mathematically well-grounded, starting from $v = a\,u_{IR} + b\,u_{RGB} + c\,u_{S}$. The paper does not specify the underlying metric space, ignores the stochastic dependency of the diffusion variables, and conflates statistical covariance with geometric cosine similarity. As a result, the claimed geometric intuition may be conceptually appealing but lacks rigorous justification or empirical verification.
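To spell out the concern in weakness 8 (illustrative notation only): under the conditional-independence assumption of Eq. 15,

$$\nabla_{x_t}\log p(c_1,\dots,c_K \mid x_t) \;=\; \sum_{k=1}^{K} \nabla_{x_t}\log p(c_k \mid x_t),$$

whereas the weighted form $\sum_{k} w_k \nabla_{x_t}\log p(c_k \mid x_t)$ of Eq. 16 is the gradient of the reweighted objective $\sum_{k} w_k \log p(c_k \mid x_t)$, which is no longer the log of a properly normalized joint likelihood; and when the modalities are correlated, neither expression recovers the cross terms of the true joint gradient.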
1. Can the proposed DIG mechanism be extended to **non-diffusion generative models**, such as Transformer- or GAN-based image fusion systems?
2. Regarding **step-skipping (weakness 4)**: does “skipping low-information steps” correspond to using $S = 10$ or the DIG-25 variant for inference? What is the precise relationship between these two settings?
3. Could the authors provide additional **ablation or sensitivity analyses** to confirm how DIG values correlate with residual information and generalization error reduction?
4. How does the choice of metric $l(\cdot, \cdot)$ in Eq. (5) affect the stability of DIG computation?
5. How could the DIG framework be modified to capture **nonlinear inter-modality interactions** beyond simple time-dependent linear weighting?
6. In what space or norm is this “angle” defined, and how is the stochastic nature of the diffusion variables handled?
Heavily AI-edited
***

Dig2DIG: Dig into Diffusion Information Gains for Image Fusion
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper introduces a novel dynamic image fusion framework for diffusion models, addressing the limitations of static fusion strategies. The authors first identify a "spatio-temporal imbalance" in the denoising process, where information from source modalities emerges at unequal rates across different steps and regions. To leverage this, they propose a metric called Diffusion Information Gains (DIG) to quantify the per-modality information contribution at each denoising step. These DIG values are then used as dynamic weights to guide the fusion, ensuring that only the most informative regions from each source are integrated at the appropriate time. The authors provide a theoretical proof that this dynamic weighting scheme provably reduces the upper bound of the generalization error, and experimental results confirm that the method achieves superior fusion quality and inference efficiency compared to existing diffusion-based approaches.
The paper is based on an interesting observation of spatio-temporal imbalance in the denoising process. The proposed dynamic weighting scheme is an intuitive response to this identified issue.
1. The qualitative comparisons appear incomplete and inconsistent with the quantitative evaluation. While numerous methods are benchmarked in the quantitative tables, the qualitative results in the figures only feature a select subset. For instance, in the visible-infrared fusion task (Figure 3), several strong baselines such as SwinFusion, DIVFusion, MoE-Fusion, CDDFuse, and DDFM are notably absent from the visual comparison. This omission is also observed for the MFFW and MEFB datasets. The authors should provide a rationale for this selective comparison or include qualitative results for all methods to ensure a fair and comprehensive evaluation.
Additionally, DCEvo [1], a SOTA method for infrared–visible image fusion, is also recommended for inclusion in the comparative analysis.
[1] DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion. CVPR 2025.
2. The definition and application of the Generalization Error (GError) in Equation 3 are confusing in the context of image fusion. Generalization error is a concept fundamentally rooted in supervised learning, where a ground truth is available for evaluation. However, image fusion is an inherently unsupervised task that lacks a single, well-defined ground truth. Therefore, the direct application of this GError formulation seems unjustified. The authors must provide more substantial evidence or a more rigorous justification to demonstrate why this generalization bound is applicable and meaningful for an unsupervised task like image fusion (see the generic form sketched after this list).
3. The paper's central claim, namely that the weights $w_k$ derived from DIG guarantee a smaller generalization error bound, is not sufficiently proven. The theoretical connection between the proposed Diffusion Information Gain (DIG) and the reduction of the generalization error is tenuous. The manuscript lacks a formal proof demonstrating how optimizing for DIG directly leads to the tightening of this bound. Without this crucial link, the motivation for introducing DIG appears ad hoc, and its relationship to the theoretical framework is not well-founded.
4. The visualization of the information gains in Figure 2 is perplexing and counterintuitive. The 'vi information gain' and 'ir information gain' maps appear to be almost perfectly complementary. The expectation is that the information gain for a modality should highlight unique information present in that modality. However, the 'vi information gain' map seems to highlight information behind the smoke, which is not visible in the original visible image and is instead a key feature of the infrared image. This is a significant contradiction. The authors must clarify the calculation and meaning of these gain maps and provide additional visual examples to resolve this apparent inconsistency.
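For reference on weakness 2, a generic textbook form of the quantity in question (the paper's Eq. 3 may differ in detail) is

$$\mathrm{GError}(f) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f(x),y)\big] \;-\; \frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x_i),y_i\big),$$

which presupposes a ground-truth target $y$; it is precisely this supervised ingredient that has no obvious counterpart in unsupervised image fusion.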
Please see Weaknesses
Heavily AI-edited