|
PRISM: Controllable Diffusion for Compound Image Restoration with Scientific Fidelity |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors argue that the unified processing pipeline currently adopted for image restoration in the scientific field may introduce redundant restoration types, which can counterproductively lead to the loss of authentic information. The authors therefore aim to provide a controllable diffusion mechanism that enables experts to select specific restoration types and thus avoid this dilemma.
A. The authors present a novel perspective in addressing the issue.
B. The amount of work undertaken is solid.
>- They constructed a brand-new dataset and employed the latest relevant methods (mostly works from 2024) to demonstrate the superiority of the proposed approach.
I think subjective assumptions and the unconvincing problem-solving approach are the most prominent issues.
A. Subjective assumptions
>- A.1 Take underwater image restoration, mentioned in the article, as an example. The authors identify three types of distortions: low light, scattering, and wavelength dependency. Since these distortions act on the same image, any single restoration method may inadvertently damage useful information, so the core of fidelity preservation should lie in the precision of each individual restoration. However, equating this precision with controllability over the number and types of restorations is an unconvincing, subjective assumption.
>- A.2 It is possible that the most appropriate combination of restoration types has already been implicitly learned by the model during training. However, the authors neither refute this possibility, experimentally or theoretically, nor provide substantive support for their own hypothesis.
>- A.3 The authors arbitrarily set the number of distortion types to 3, which is overly simplistic and lacks justification; the approach may fail to handle complex scenarios flexibly.
B. Unconvincing problem-solving approach
>- B.1 The core idea of artificial intelligence is to use machine intelligence to replace human thinking. When a problem arises, the focus should be on solving it directly, rather than, in the opposite direction, introducing human labor costs to empower the machine.
>- B.2 Although the experimental results show some improvement, the gains are not significant. Moreover, there is no analysis of how much of the improvement stems from the added human labor and whether that investment is cost-effective. In addition, the ablation experiments do not fully elucidate the impact of the distortion-type settings or the method's dependence on accurate expert guidance.
>- B.3 Assuming the hypothesis is correct, why not solve the problem in a software-engineering way? Let a dedicated model specialize in removing each single type of degradation, and design control codes so that experts can manually compose several such operations to process an image (a minimal sketch of this alternative is given below). What, then, is the point of incorporating NLP models to process expert opinions and coupling everything together into a single trained model?
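To make this alternative concrete, a minimal sketch is given below; the restorer names and control codes are purely illustrative assumptions, not components of PRISM.

```python
# Minimal sketch of the suggested modular alternative: independent single-task
# restorers selected by expert control codes. All names here are hypothetical.
from typing import Callable, Dict, List
import numpy as np

Restorer = Callable[[np.ndarray], np.ndarray]

# Registry mapping a control code to a dedicated single-task restoration model.
RESTORERS: Dict[str, Restorer] = {
    "denoise":  lambda img: img,  # placeholder for a dedicated denoising model
    "dehaze":   lambda img: img,  # placeholder for a dedicated dehazing model
    "lowlight": lambda img: img,  # placeholder for a dedicated low-light model
}

def restore(image: np.ndarray, control_codes: List[str]) -> np.ndarray:
    """Apply only the restorations the expert explicitly selects, in order."""
    for code in control_codes:
        image = RESTORERS[code](image)
    return image

# Example: the expert chooses to remove haze and low light but keeps noise.
# restored = restore(raw_image, ["dehaze", "lowlight"])
```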
The same as the above analysis in Weaknesses. |
Lightly AI-edited |
|
PRISM: Controllable Diffusion for Compound Image Restoration with Scientific Fidelity |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces PRISM (Precision Restoration with Interpretable Separation of Mixtures), a prompted conditional diffusion framework for expert-in-the-loop controllable restoration of images with compound degradations. PRISM integrates two key components: (1) compound-aware supervision trained on mixtures of distortions, and (2) a weighted contrastive disentanglement objective that aligns compound distortions with their constituent primitives. This approach enables high-fidelity joint restoration.
1. The paper is well-structured and the motivation is clear.
2. The framework leverages compound-aware supervision and a contrastive disentanglement objective across a diverse set of primitive tasks. This produces separable embeddings of distortions, enabling robust restoration, even for unseen real-world mixtures.
3. This work proposes a novel benchmark for scientific utility, spanning remote sensing, ecology, biomedical, and urban domains.
1. The method is largely built on existing components, namely CLIP and a Latent Diffusion Model, and the Semantic Content Preservation Module is also relatively simple, so the technical novelty is limited.
2. No physical priors are incorporated into the training process (see the example after this list); in other words, this is a general-purpose method applied to scientific domains.
3. The dataset is constructed by integrating existing datasets, which weakens the contribution.
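As one example of the kind of physical prior meant here (my illustration, not taken from the paper), haze and underwater degradations are often described by the atmospheric scattering / underwater image-formation model
$$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr), \qquad t(x) = e^{-\beta d(x)},$$
where $I$ is the observed image, $J$ the clean scene radiance, $A$ the ambient light, and $t$ a transmission map governed by the scattering coefficient $\beta$ (wavelength-dependent underwater) and the scene depth $d$. Constraining or supervising the restoration with such a model is what is meant by incorporating physical information.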
1. Assembling the dataset from existing datasets is not very convincing. Did the authors collect any new data themselves?
2. The method is a general-purpose model applied as a unified solution to scientific and environmental images. More domain-specific priors, such as physical imaging models, should be considered.
3. The text prompts are also too brief to convey the domain-specific information that would benefit these fields. |
Fully human-written |
|
PRISM: Controllable Diffusion for Compound Image Restoration with Scientific Fidelity |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a framework for controllable restoration of images that have undergone multiple primitive distortions. All degradations are restored at once, rather than through consecutive restorations that may introduce artifacts, while still allowing the user to define which degradations to restore so that necessary signals are preserved.
Training uses a restoration benchmark, which is made public, consisting of tuples of a clean image, a degraded image, and a natural-language prompt describing the degradation. Given a prompt, the framework uses a frozen CLIP text encoder for multi-label classification over a set of primitive degradations, which are then formatted into a prompt of a predefined form. This prompt is used to restore the image with a fine-tuned version of SD 1.5, where the CLIP image encoder was fine-tuned to cluster embeddings by degradation, followed by an additional model trained to preserve semantic content.
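For concreteness, a minimal sketch of how I understand the prompt-handling step is given below; the label list, similarity threshold, and prompt template are my own assumptions, not taken from the paper.

```python
# Sketch of the prompt -> primitive-degradation labels -> formatted-prompt step,
# as I understand it; labels, threshold, and template are illustrative only.
import torch
import clip  # OpenAI CLIP package

PRIMITIVES = ["noise", "blur", "haze", "low light", "rain"]

model, _ = clip.load("ViT-B/32", device="cpu")
model.eval()

def classify_degradations(user_prompt: str, threshold: float = 0.25) -> list:
    """Multi-label matching of a free-form prompt against primitive degradations."""
    with torch.no_grad():
        tokens = clip.tokenize([user_prompt] + PRIMITIVES)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
    sims = emb[0] @ emb[1:].T  # cosine similarity of the prompt to each primitive
    return [p for p, s in zip(PRIMITIVES, sims.tolist()) if s > threshold]

def format_prompt(labels: list) -> str:
    """Map the selected labels into a restoration prompt of a predefined form."""
    return "remove " + " and ".join(labels) if labels else "no restoration"

# e.g. format_prompt(classify_degradations("the photo is grainy and too dark"))
```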
The overall framework is evaluated on a benchmark with images that underwent the same kinds of distortions as in the training process (up to 3 primitive distortions) and on unseen distortions. Furthermore, its use for four downstream tasks is evaluated, demonstrating scientific utility.
* Image restoration is a critical task, particularly for scientific applications. This paper demonstrates the method's effectiveness through general purpose image restoration, evaluated using fidelity and perceptual metrics, and its application for downstream scientific tasks.
* The motivation is written clearly, and the figures (although Figures 1 and 2 are not referenced in the text) support the understanding of the general approach.
* Although the number of combined distortions in the training set is limited to 3, Section 4.2 shows that the method also achieves good results on unseen degradations that are not necessarily a combination of the degradations in the training set, or that combine more than 3 primitive distortions.
* While the fine-tuning of the CLIP image encoder is explained thoroughly, the subsequent steps, namely how SD 1.5 is used as the backbone and the proposed SCPM module, are explained only briefly. This impairs understanding of the entire framework, and although the code is submitted, the text itself is not sufficient to reproduce the method.
* The concept of automatic restoration needs clarification. While the paragraph on prompting (line 207) describes the automatic transformation of natural-language prompts into the required format, how these automatic prompts are generated remains unclear. Possibly related, it is unclear which prompts were used in the experiments in Sec. 4.1.
* The motivation of controlled restoration for expert-in-the-loop scenarios could be further evaluated by comparing images that underwent multiple distortions but for which PRISM is prompted to restore only a partial set (as done synthetically in Sec. 4.3.1), in settings where the control calls for different restorations for specific images or use cases rather than a predefined subset for all images in the same domain.
**Minor:**
* Indices are not explained and are somewhat confusing. It seems $d_{i_j}$ in Eq. 1 denotes a specific distortion and $d^i$ in line 180 denotes a set of distortions, yet the notations are explained only after being used ($d_j$ is explained in line 200).
* Using the Jaccard distance between degradation sets neglects the fact that some distortions are more similar than others; for example, any two disjoint degradation sets are at distance 1, regardless of whether their distortions are perceptually close.
* The paper should mention how the prompts in the dataset are auto-generated (line 157).
* In addition to the PSNR reported in Fig. 3, what was the effect on the other discussed metrics?
* What value is used for the number of variants $m$? And what is the minibatch size? If these values differ substantially, there might be added value in weighting the two terms in the denominator of the per-variant contrastive loss, to control the effect of repelling from other degradations versus repelling from other images (see the sketch after this list).
* Is there a difference between the dataset described in Sec.3.1 and the benchmark described in 3.2 (MDB)? If so, what is included in MDB?
* In Sec. 4.2, did PRISM identify the same set of primitive distortions for different images from the same domain, which presumably underwent similar degradations?
* Which images were used to create the visualization of the scatter plot in Fig. 5? |
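* To make the per-variant loss weighting suggested above concrete, one possible form (in my own notation, not necessarily the paper's) is
$$\mathcal{L}_i = -\log \frac{\exp(s_i^{+}/\tau)}{\exp(s_i^{+}/\tau) + \lambda_{\mathrm{deg}} \sum_{k \in \mathcal{N}_{\mathrm{deg}}} \exp(s_{ik}/\tau) + \lambda_{\mathrm{img}} \sum_{k \in \mathcal{N}_{\mathrm{img}}} \exp(s_{ik}/\tau)},$$
where $\mathcal{N}_{\mathrm{deg}}$ contains negatives that are other degradation variants of the same image, $\mathcal{N}_{\mathrm{img}}$ contains negatives from other images in the minibatch, and $\lambda_{\mathrm{deg}}, \lambda_{\mathrm{img}}$ balance the two repulsion effects when $m$ and the batch size differ.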
Fully human-written |
|
PRISM: Controllable Diffusion for Compound Image Restoration with Scientific Fidelity |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents PRISM, a prompted, controllable diffusion framework for restoring images suffering from compound degradations. The training setup uses mixtures of up to three distortions and introduces a weighted contrastive disentanglement stage to make embeddings separable and compositional. Inference accepts free-form prompts that are mapped to a fixed set of restoration labels; a latent-diffusion backbone performs restoration and a Semantic Content Preservation Module (SCPM) refines fine detail. Experiments cover a new Mixed Degradations Benchmark (MDB), zero-shot evaluation on real datasets (UIEB, under-display camera, and fluid lensing), and downstream tasks across remote sensing, ecology, microscopy, and urban scenes.
1) Clear objective and method design. The paper argues for simultaneous rather than sequential restoration, emphasizes expert control, and focuses on scientific fidelity rather than aesthetics. The architecture coherently combines contrastive disentanglement, prompt-conditioned latent diffusion, and SCPM for detail recovery.
2) Good reported performance and breadth. On MDB, PRISM outperforms representative all-in-one and diffusion/composite baselines (e.g., AirNet, Restormer, NAFNet, PromptIR, OneRestore, DiffPlugin, MPerceiver, AutoDIR) on PSNR/SSIM and perceptual metrics.
3) Generalization beyond the synthetic training setup. The paper reports zero-shot gains on real distortions (underwater, under-display camera, and fluid distortions) and shows that performance scales well as the number of simultaneous degradations increases.
4) Practical value for scientific analysis. Selective, prompt-guided restoration improves downstream tasks in several domains, supporting the utility of controllability.
1) Control granularity and evaluation scope:
The evaluation largely uses manual prompting with a pre-defined set of distortion types, not open-ended language or fine-grained controls. The paper itself notes that extending controllability beyond “which distortions to remove” to specifying intensity and spatial extent is left for future work. This leaves unanswered how robust the system is to realistic prompt variations or local/severity-aware edits.
2) Synthetic-to-real gap and capped composition complexity:
Training relies on synthetic mixtures and explicitly caps each sample at up to three distortions for efficiency and interpretability. The authors acknowledge that these synthetic augmentations cannot fully capture real distortions. While results on real datasets are encouraging, this cap and reliance on synthetic compositing may underrepresent harder real-world compound effects.
3) Efficiency and deployability are not quantified in the main text:
The paper does not provide main-text wall-clock, throughput, or memory comparisons versus strong diffusion baselines. Without standardized timing/FLOP/peak-memory profiles at a fixed resolution, it is difficult to assess practical deployability or the overhead of the added control and SCPM modules.
Fully AI-generated |
|
PRISM: Controllable Diffusion for Compound Image Restoration with Scientific Fidelity |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces PRISM, a controllable diffusion framework for compound image restoration, primarily targeting scientific applications. The method involves a two-stage process: first, fine-tuning a CLIP image encoder using a novel weighted contrastive loss to create a compositional embedding space for degradations, and second, training a latent diffusion model conditioned on these embeddings and user prompts to perform selective restoration.
The work compellingly argues for the necessity of controllable, selective restoration over automated 'full' restoration for scientific applications, demonstrating significant gains in downstream task utility.
The primary methodological concern is the limited novelty. The core idea of fine-tuning a CLIP encoder to be degradation-aware heavily relies on prior work (e.g., DA-CLIP). The main novelty appears to rest on the Jaccard distance weighting in the contrastive loss, but the paper lacks a direct ablation comparing this to an unweighted compound contrastive loss, making it difficult to isolate its true impact.
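To clarify the requested ablation, in my own notation (the paper's exact formulation may differ): if each negative pair in the compound contrastive loss is weighted by the Jaccard distance between the corresponding degradation sets,
$$\mathcal{L} = -\log \frac{\exp(s^{+}/\tau)}{\exp(s^{+}/\tau) + \sum_{j \in \mathcal{N}} w_j \exp(s_j/\tau)}, \qquad w_j = 1 - \frac{|d \cap d_j|}{|d \cup d_j|},$$
then the unweighted baseline simply sets $w_j = 1$ for all negatives; a direct comparison between these two settings, with everything else fixed, would isolate the contribution of the weighting.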
Second, the two-stage training pipeline is computationally complex, and the choice of a diffusion backbone introduces significant inference latency. This efficiency trade-off is not sufficiently justified, especially as the performance gains over other recent diffusion methods are notable but not transformative.
Finally, the framework's generalization to real-world, unseen degradations is questionable. The model is trained exclusively on synthetic composites, and it is unclear whether the model is truly learning compositional physics or merely performing a powerful interpolation across its massive synthetic training domain when faced with the complex, non-linear physics of real-world degradations.
What is the inference time of PRISM compared to the baselines (e.g., MPerceiver, AutoDIR, and the non-diffusion NAFNet), and how do the authors justify this computational cost for the observed performance gain?
Given the reliance on synthetic data, how can we be sure the model is learning robust compositional reasoning about real-world physics rather than performing a complex interpolation? |
Moderately AI-edited |