Bridging the Distribution Gap to Harness Pretrained Diffusion Priors for Super-Resolution
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper introduces a novel framework for single-image super-resolution that leverages pretrained diffusion models without fine-tuning them or running iterative denoising steps. DM-SR trains an image encoder to map LR inputs into a latent distribution aligned with the diffusion model’s training space. The method adaptively predicts the appropriate noise level for each input, ensuring optimal alignment with the diffusion model’s timestep-dependent distribution. Extensive experiments on synthetic and real-world benchmarks demonstrate that DM-SR achieves state-of-the-art perceptual results.
1. This paper proposes a novel way to leverage a pre-trained diffusion model for SR, by aligning the latent representation of an LR image with an intermediate noisy latent representation of the diffusion model. It does not require fine-tuning the pre-trained model, which allows it to take full advantage of the pre-trained knowledge.
2. The paper is well written, and the work and contribution are clearly presented.
3. Extensive experiments are conducted, showing superior no-reference performance compared to existing pretrained-diffusion-based methods.
1. It would strengthen the paper to include a more in-depth analysis comparing the proposed model with single-step diffusion-based super-resolution methods such as InvSR. Since both approaches share similar motivations and framework designs, a detailed explanation of the differences, and of why the proposed method achieves superior results, would help clarify its unique contributions and advantages.
2. Most of the comparison methods in the experiments are single-step diffusion-based super-resolution approaches. However, there are other ways to leverage pretrained diffusion models for super-resolution. Including comparisons with a broader range of diffusion-based SR methods would provide a more comprehensive understanding of how different strategies for utilizing pretrained diffusion models impact SR performance.
3. The proposed method does not demonstrate strong performance on reference-based metrics, and in some cases, it underperforms compared to certain single-step diffusion-based super-resolution approaches. Providing a more detailed explanation for these results would be very helpful in understanding the limitations and potential areas for improvement.
Please respond to my concerns in the Weaknesses section.
Fully human-written
Bridging the Distribution Gap to Harness Pretrained Diffusion Priors for Super-Resolution
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper, "BRIDGING THE DISTRIBUTION GAP TO HARNESS PRE-TRAINED DIFFUSION PRIORS FOR SUPER-RESOLUTION," proposes DM-SR, a novel framework for single-image super-resolution (SR) that leverages the strong generative prior of a fixed, pretrained diffusion model (SD-Turbo is used). The core idea is to address the distribution mismatch between low-resolution (LR) inputs and the diffusion model's native training space (Gaussian-corrupted images) by training only an image encoder ($E_{\theta}$). This encoder adaptively maps the LR image into a timestep-dependent latent distribution, where the timestep $\hat{t}$ is predicted based on the input's degradation level. The resulting latent is decomposed into an image part ($Z_{SR}$) and a noise part ($\epsilon_{SR}$) using the fixed denoiser ($\mu_{\psi}$). The encoder is optimized using a comprehensive loss function that includes a novel conditional adversarial loss, a distribution matching loss, and a noise regularization loss. DM-SR achieves superior perceptual quality with a single, highly efficient step.
1. Preservation of Generative Prior: The core strength is that the method fully utilizes the fixed, pretrained denoiser ($\mu_{\psi}$), maintaining its powerful generative capabilities without compromise.
2. Efficiency: DM-SR is a single-step method, demonstrating the fastest runtime among competitive diffusion-based SR approaches while delivering outstanding perceptual results.
3. Noise-Adaptive Alignment: The dynamic prediction of a timestep ($\hat{t}$) based on the input image's degradation level is a clever mechanism to allocate the appropriate amount of generative power.
1. Dependency on Timestep Estimator: The success of the method critically depends on the accuracy of the Timestep Estimator ($T$), which is trained to predict a degradation level (normalized LPIPS). The justification for LPIPS as the "ground truth" for the degradation level could be elaborated on further in the main paper.
2. Fixed Text Condition: The denoiser $\mu_{\psi}$ is conditioned using a fixed, generic text prompt ("High-quality, high-contrast, photo-realistic, clean..."). This design choice is simple but potentially limits the full capacity of the diffusion model, which is typically capable of content-aware generation via text conditions.
1. Timestep Estimator Analysis: The reliance on the LPIPS-derived score, mapped to the range [0, 500], as the ground truth for $\hat{t}$ is novel but requires further analysis. Please show how the performance is affected if the estimated timestep $\hat{t}$ is replaced by a fixed, arbitrary timestep (e.g., $t=250$) across all samples. This would directly isolate the contribution of the adaptive alignment mechanism.
2. Multi-Step Comparison: The single-step generation is highly efficient. However, since $X_{SR}^{\hat{t}}$ is meant to align with the noisy latent $X_{HR}^{\tilde{t}}$, it is a valid input for a multi-step DDPM process. Please provide a comparison between DM-SR-1 (decoding $Z_{SR}$) and running a full multi-step sampler (e.g., 50 steps) initialized from $X_{SR}^{\hat{t}}$. This would confirm whether the single-step output is truly optimal.
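To make the fixed-versus-adaptive comparison in point 1 concrete, the ablation I have in mind could be organized as in the following sketch. This is hypothetical Python: `estimate_timestep`, `encode`, and `denoise_once` are stand-ins for the paper's components, not its actual interfaces.

```python
# Hypothetical sketch of the requested ablation. All three callables are
# stand-ins for the paper's timestep estimator, encoder, and frozen denoiser.

def run_ablation(lr_images, estimate_timestep, encode, denoise_once,
                 fixed_t=250):
    """Produce SR outputs under adaptive and fixed timestep settings."""
    adaptive_outputs, fixed_outputs = [], []
    for x_lr in lr_images:
        # Adaptive: timestep predicted from the input's degradation level.
        t_hat = estimate_timestep(x_lr)
        adaptive_outputs.append(denoise_once(encode(x_lr, t_hat), t_hat))
        # Fixed: the same arbitrary timestep for every sample.
        fixed_outputs.append(denoise_once(encode(x_lr, fixed_t), fixed_t))
    return adaptive_outputs, fixed_outputs
```

Evaluating both output lists with the paper's metrics would directly isolate the contribution of the adaptive alignment mechanism.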
Fully AI-generated
Bridging the Distribution Gap to Harness Pretrained Diffusion Priors for Super-Resolution
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper proposes DM-SR, a diffusion-based super-resolution (SR) framework designed to bridge the distribution mismatch between low-resolution (LR) images and the Gaussian-corrupted image space used by pretrained diffusion models.
Instead of fine-tuning the diffusion model (which may degrade its generative prior), the authors train a lightweight image encoder that projects LR inputs into a diffusion-aligned latent distribution while keeping the pretrained denoiser fixed.
This paper proposes a diffusion-based SR method that does not require modifying the pre-trained diffusion model weights, and achieves promising results.
Q1. The second paragraph identifies the limitations of existing diffusion-based SR methods, such as sampling from pure Gaussian noise and the corruption of generative priors during fine-tuning. This paper aims to address these issues. However, similar challenges have already been discussed in several prior works, even though they have not been completely resolved. Therefore, it is essential for the authors to analyze and discuss these earlier studies to properly position their contribution and demonstrate respect for prior research efforts. Without such discussion, the motivation appears insufficiently justified.
Q2. The paper states that the ground truth for the timestep predictor is obtained using the LPIPS metric. More technical details about this process are needed — for example, how LPIPS is used to determine the optimal timestep.
Q3. The proposed method employs a combination of multiple loss terms. From Table 5, it seems that the performance gain mainly originates from the adversarial loss $L_{adv}$. Given $L_{adv}$, the necessity of the other two loss components, $L_{dm}$ and $L_{eps}$, should be further validated.
Q4. The reference metric values for the ImageNet dataset should be reported for completeness. Additionally, the synthesis process used to construct the ImageNet dataset for the experiments should be clearly described, including degradation types, resolution settings, and data splits. This information is crucial for ensuring experimental transparency and reproducibility.
Q5. Since this work closely follows the recent InvSR framework, it would be beneficial to include a direct comparison with InvSR in Table 4. It seems that the authors deliberately weakened the relation of the proposed method to InvSR, as pointed out in Q1.
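Regarding Q2, even a minimal specification of the LPIPS-to-timestep mapping would help. For example, if the mapping is a simple linear rescaling, it could be stated as in the following hypothetical sketch (this is my guess at one plausible construction, not the paper's actual procedure):

```python
def lpips_to_timestep(lpips_score, max_t=500):
    # Hypothetical: clip a normalized LPIPS score to [0, 1] and scale it
    # linearly into the stated timestep range [0, max_t]. The paper should
    # specify the actual normalization and mapping it uses.
    clipped = min(max(lpips_score, 0.0), 1.0)
    return round(clipped * max_t)
```

Whether the authors use such a linear scaling, a quantile-based mapping, or something else entirely is exactly the detail Q2 asks for.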
NA
Lightly AI-edited
Bridging the Distribution Gap to Harness Pretrained Diffusion Priors for Super-Resolution
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper tackles the "distribution gap" that arises when applying pre-trained diffusion models (DMs) to image super-resolution (SR). The authors note that DMs are trained on Gaussian-corrupted images, a distribution that low-resolution (LR) inputs do not match. Existing methods attempt to solve this by fine-tuning the DM, which can weaken its powerful generative prior, or by using multi-step inference, which is computationally expensive. The authors propose DM-SR, a novel framework that keeps the pre-trained diffusion model completely frozen. Instead, it trains only an image encoder. This encoder is designed to do two things: (1) map the LR input image into a latent distribution aligned with the frozen DM, and (2) adaptively predict the appropriate noise level (timestep) based on the degradation of the input image. This approach allows the framework to harness the full generative power of the pre-trained model and achieve state-of-the-art perceptual quality in a single inference step. Extensive experiments show the method is highly efficient (achieving the fastest runtime among diffusion-based methods) and effective on both synthetic and real-world datasets.
The prevailing paradigms for diffusion-based SR involve either fine-tuning the denoiser (which compromises the prior) or complex multi-step conditioning. The proposed idea of keeping the denoiser frozen and instead learning to map the input into its native distribution is an elegant and powerful alternative. This approach simultaneously solves two major problems: it preserves the integrity of the powerful generative prior and enables extremely efficient single-step inference. This is a valuable conceptual contribution that will likely be influential for the ICLR community and future work in diffusion-based image restoration.
**Major Weaknesses:**
(1) My primary concern is the complexity of the objective function: the final loss is a combination of five different loss terms. The weighting parameters are simply stated as "empirically chosen", and the paper provides no clear explanation or justification for why these specific hyperparameters were set. This makes the method seem brittle and difficult to reproduce. While the ablation in Table 5 shows that all components contribute to the final result, the paper lacks a strong justification for why this specific combination is necessary, especially for the two loss terms.
(2) The paper acknowledges the method's poor performance on distortion metrics (PSNR/SSIM) and frames it as a standard perceptual-distortion trade-off. However, the limitation shown in Figure 5 (changing the eye's pupil color from gray to black) is a clear example of sacrificing semantic fidelity for perceptual realism. This is a significant issue. While the black pupil may appear more "realistic" (which is why the NR metrics are high), it is an uncontrolled hallucination: the model is generating a feature not present in the input. There is no evidence that this "fix" is semantically correct, and no guarantee that it can be correctly generated every time. This lack of faithfulness deserves a more in-depth discussion.
**Minor Weaknesses:**
(1) The analysis in Table 4 regarding the number of steps is confusing. While the paper explains that too many steps (10+) hurt performance due to the SD-Turbo backbone, it fails to explain the trend for 1, 2, and 5 steps. For some metrics (like LIQE), performance peaks at 2 or 5 steps and is slightly worse at 1 step. The paper does not address this fluctuation, making it seem like random noise rather than a significant, well-understood trend.
(1) Could you provide more justification for using the LPIPS score as the regression target for the timestep estimator? Why was this metric chosen over other, simpler metrics (like PSNR) or a learned degradation classifier?
(2) In Table 4, the model performance varies with the number of steps. The paper explains why many steps (10+) hurt performance due to the SD-Turbo backbone. However, it does not explain the trend for 1, 2, and 5 steps, where performance seems to peak at 2 or 5 steps and then slightly decrease at 1 step (e.g., for LIQE). Is this slight variation statistically significant, or is it more likely a random fluctuation within a stable performance range for low step counts?
Lightly AI-edited |