|
SP-MoMamba: Superpixel-driven Mixture of State Space Experts for Efficient Image Super-Resolution |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes SP-MoMamba, a superpixel-driven state space modeling framework that addresses a fundamental limitation of Mamba-based image restoration models—namely, the semantic distortion introduced by flattening 2D images into 1D scan sequences. To preserve spatial coherence, the authors introduce SP-SSM, which aggregates semantically homogeneous pixels into superpixels prior to state-space modeling, thereby maintaining regional semantic consistency.
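To make the aggregation step concrete, here is a minimal sketch of pooling per-pixel features into superpixel tokens. It assumes hard superpixel labels are already given. The function name `superpixel_mean_pool` and the toy data are illustrative and not taken from the submission, whose SP-SSM may differ (for example, in how the assignments are learned).

```python
from collections import defaultdict


def superpixel_mean_pool(features, labels):
    """Average per-pixel feature vectors within each superpixel.

    features: list of feature vectors, one list of floats per pixel
    labels:   list of integer superpixel ids, same length as features
    Returns a dict mapping superpixel id -> mean feature vector.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        for i, v in enumerate(vec):
            sums[lab][i] += v
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


# Toy example: 4 pixels with 2-dim features; labels are assumed, not learned.
feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
labs = [0, 0, 1, 1]
tokens = superpixel_mean_pool(feats, labs)
# tokens == {0: [2.0, 0.0], 1: [0.0, 3.0]}
```

The state-space model would then scan the shorter token sequence rather than every pixel, which is where the claimed efficiency and regional consistency would come from under this reading.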
The overall architecture is composed of stacked Layers of Experts (LoE), each consisting of a Semantic-Guided Mamba Expert (SGME) for global structure modeling followed by a Local Spatial Modulation Expert (LSME) for fine-grained texture reconstruction. Within SGME, the model incorporates a Multi-Scale Superpixel Mixture-of-Experts (MSS-MoE) module, which performs sparse routing across multi-scale SP-SSM experts. All experts participate during training, while only the Top-k experts are activated at inference, making the inference cost essentially independent of the total number of experts.
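My reading of the Top-k routing can be sketched as follows. This is an assumption-laden illustration, not the authors' implementation: the softmax gate, the renormalization over the selected experts, and the name `top_k_route` are all hypothetical, and the experts are toy functions standing in for SP-SSM experts at different superpixel scales.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(x, router_logits, experts, k):
    """Evaluate only the k highest-scoring experts and mix their outputs.

    Gate weights are renormalized over the selected experts (an assumption).
    Returns the mixed output and the sorted indices of the experts that ran.
    """
    if not 1 <= k <= len(experts):
        raise ValueError("k must be between 1 and the number of experts")
    gates = softmax(router_logits)
    chosen = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)  # unselected experts are never called
        w = gates[i] / norm
        out = [o + w * v for o, v in zip(out, y)]
    return out, sorted(chosen)


# Three toy experts that scale the input by 1, 2, and 3.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
mixed, used = top_k_route([1.0, 2.0], [0.0, 2.0, 2.0], experts, k=2)
# used == [1, 2]; mixed is the equal-weight average of experts 1 and 2
```

Because only `k` experts are called per input, the per-input cost in this sketch depends on `k` rather than on `len(experts)`, which matches the efficiency argument summarized above.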
Overall, the paper’s contribution lies in embedding semantic priors into state-space modeling via “superpixel representation + multi-scale MoE + sparse routing,” enabling improved long-range structural consistency while preserving local detail, and delivering strong performance under constrained computational budgets.
This paper introduces a novel perspective for improving state-space models in image super-resolution by addressing the semantic disruption caused by 2D-to-1D scanning. The proposed superpixel-driven state-space formulation preserves regional semantic consistency and represents a creative integration of perceptual grouping principles with efficient sequential modeling. The use of multi-scale superpixel mixture-of-experts and sparse routing further strengthens the approach, enabling an effective balance between global structure modeling and fine-grained detail restoration. Empirical results across standard benchmarks demonstrate consistent gains over strong lightweight baselines, highlighting both the technical quality and practical significance of the contribution.
While the paper presents a compelling framework, several areas warrant further development to strengthen its contribution. First, the reliance on pre-defined superpixel scales introduces sensitivity to hyperparameter choices and may limit robustness across diverse visual domains; an adaptive mechanism or learning-based superpixel module could improve generalization. Second, the evaluation focuses primarily on bicubic degradation, leaving open questions regarding performance under real-world or unknown degradation settings, where superpixel consistency may be more fragile. Third, although the MoE routing scheme is conceptually sound, deeper analysis of expert specialization (e.g., visualization of expert roles, load-balancing behavior) would clarify the functional contribution of the mixture structure. Finally, while the paper situates its contributions within the Mamba-based SR literature, stronger comparison to recent frequency-aware and hybrid prior-guided SR models [1][2][3] would further contextualize the novelty and demonstrate robustness across broader architectural trends.
[1] Dai, T., Wang, J., Guo, H., Li, J., Wang, J., & Zhu, Z. (2024, August). FreqFormer: Frequency-aware transformer for lightweight image super-resolution. In Proceedings of the International Joint Conference on Artificial Intelligence (pp. 731-739).
[2] Huang, F., Liu, H., Chen, L., Shen, Y., & Yu, M. (2025). Feature enhanced cascading attention network for lightweight image super-resolution. Scientific Reports, 15(1), 2051.
[3] Wang, Y., Liu, Y., Zhao, S., Li, J., & Zhang, L. (2024). CAMixerSR: Only Details Need More "Attention". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 25837-25846).
1. Could you elaborate on how the superpixel clustering is integrated into the end-to-end training pipeline? For instance, is the clustering fully differentiable, and how is stability ensured when using Gumbel-Softmax for hard assignment? Additionally, were alternative soft segmentation strategies (e.g., learnable token pooling or differentiable k-means variants) evaluated, and how do they compare in terms of convergence and gradient behavior?
2. Can you provide more detail on the computational breakdown of SP-SSM during inference? While the paper emphasizes the overall efficiency, it would be helpful to quantify the latency contributions from superpixel generation, routing, and SSM inference separately. Also, how does the method scale on high-resolution inputs and resource-constrained devices, relative to Transformer-based SR models and MambaIR?
3. In your discussion on multi-scale expert routing, could you share empirical evidence that different experts specialize in distinct spatial scales or semantic structures? For example, visualizing expert activation patterns or analyzing usage frequency across datasets would help clarify whether the mixture-of-experts contributes meaningful functional diversity beyond parameter expansion.
4. To better substantiate the claimed advantages, could you add comparisons or re-train baselines against recent lightweight and frequency-aware SR models (e.g., FreqFormer [1], FECAN/FECA [2], and CAMixerSR [3]) under the same training protocol and evaluation settings (×2/×4, identical data and metrics)? Reporting PSNR/SSIM as well as perceptual metrics (LPIPS/DISTS) and latency would help position your method against the current state of the art.
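The usage-frequency analysis requested in question 3 could be as simple as the following sketch. It assumes the routing decisions have been logged as tuples of selected expert indices, one tuple per routed token or patch. The helper names `expert_usage` and `normalized_entropy` and the logged values are hypothetical; nothing here comes from the submission's code.

```python
import math
from collections import Counter


def expert_usage(selections, num_experts):
    """Fraction of all routing slots that went to each expert."""
    counts = Counter(i for chosen in selections for i in chosen)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no routing decisions were logged")
    return [counts.get(i, 0) / total for i in range(num_experts)]


def normalized_entropy(freqs):
    """1.0 means perfectly balanced usage; 0.0 means one expert takes everything."""
    n = len(freqs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in freqs if p > 0)
    return h / math.log(n)


# Hypothetical log: Top-2 selections for four routed patches, four experts.
logged = [(0, 1), (0, 2), (0, 1), (0, 3)]
freqs = expert_usage(logged, num_experts=4)
balance = normalized_entropy(freqs)
# freqs == [0.5, 0.25, 0.125, 0.125]; balance is about 0.875
```

Reporting such frequencies per dataset or per image region, together with a balance score, would show whether experts specialize or whether a few experts absorb most of the routing.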
[1] Dai, T., Wang, J., Guo, H., Li, J., Wang, J., & Zhu, Z. (2024, August). FreqFormer: Frequency-aware transformer for lightweight image super-resolution. In Proceedings of the International Joint Conference on Artificial Intelligence (pp. 731-739).
[2] Huang, F., Liu, H., Chen, L., Shen, Y., & Yu, M. (2025). Feature enhanced cascading attention network for lightweight image super-resolution. Scientific Reports, 15(1), 2051.
[3] Wang, Y., Liu, Y., Zhao, S., Li, J., & Zhang, L. (2024). CAMixerSR: Only Details Need More "Attention". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 25837-25846). |
Fully AI-generated |
|
SP-MoMamba: Superpixel-driven Mixture of State Space Experts for Efficient Image Super-Resolution |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper identifies a fundamental issue in applying Mamba-based State Space Models (SSMs) to single-image super-resolution (SR): the standard 1D scanning process disrupts the 2D semantic structure of images, impairing the model's ability to capture local details. To address this, the authors propose SP-MoMamba, a novel framework that introduces superpixels as primary semantic units to preserve spatial relationships. The key innovation is the Superpixel-driven State Space Model (SP-SSM), which operates on superpixel regions to maintain semantic consistency. The method employs a hierarchical expert architecture to balance global semantic modeling with local detail reconstruction, aiming for an improved efficiency-performance trade-off.
* Novel and Well-Motivated Problem Formulation: The paper pinpoints a fundamental yet overlooked issue in adapting Mamba for vision tasks: the "semantic disruption" caused by flattening 2D images into 1D sequences. This provides a compelling and well-justified motivation for the work.
* Creative Solution: The core idea of leveraging superpixels as foundational units for a State Space Model (SSM) is novel. The proposed Superpixel-driven SSM (SP-SSM) addresses the identified problem by inherently preserving 2D semantic relationships.
* Comprehensive and Convincing Experiments: The proposed method is rigorously compared against a wide range of existing state-of-the-art approaches across standard benchmarks. Extensive ablation studies convincingly validate the contribution of each core component, such as the SP-SSM and the hierarchical expert architecture, demonstrating their necessity and effectiveness.
* Insufficient Justification for Claims: The assertion that previous methods (e.g., multi-directional scanning) "fail to address the fundamental problem" is somewhat strong without providing quantitative evidence or a specific metric that directly measures "semantic disruption" to support this claim convincingly.
* Limited Scope of Evaluation: A primary concern is that the experimental evaluation is conducted primarily on synthetic datasets (e.g., with bicubic degradation). The paper does not demonstrate the method's performance on real-world images with complex, unknown degradations ("blind" image restoration). This omission significantly limits the claim of the method's practical applicability and generalizability, which is crucial for real-world scenarios.
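As an illustration of the kind of evidence the first weakness asks for, one hypothetical proxy for "semantic disruption" is the average distance, in the 1D scan sequence, between spatially adjacent pixels that belong to the same region. The metric, the function names, and the region-grouped ordering below are my own assumptions for the sake of the example. They are not proposed in the paper, and the grouped ordering is not SP-SSM.

```python
def raster_scan(h, w):
    """Row-major ordering of pixel coordinates."""
    return [(r, c) for r in range(h) for c in range(w)]


def region_grouped_scan(labels):
    """Visit pixels region by region, row-major within each region."""
    h, w = len(labels), len(labels[0])
    return sorted(raster_scan(h, w), key=lambda rc: (labels[rc[0]][rc[1]], rc))


def mean_same_region_gap(labels, scan):
    """Mean 1D scan distance between 4-adjacent pixels sharing a label."""
    h, w = len(labels), len(labels[0])
    pos = {rc: i for i, rc in enumerate(scan)}
    gaps = []
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbors
                r2, c2 = r + dr, c + dc
                if r2 < h and c2 < w and labels[r][c] == labels[r2][c2]:
                    gaps.append(abs(pos[(r, c)] - pos[(r2, c2)]))
    return sum(gaps) / len(gaps) if gaps else 0.0


# Toy 2x4 label map with two regions (labels assumed given).
label_map = [[0, 0, 1, 1],
             [0, 0, 1, 1]]
gap_raster = mean_same_region_gap(label_map, raster_scan(2, 4))
gap_grouped = mean_same_region_gap(label_map, region_grouped_scan(label_map))
# gap_raster == 2.5, gap_grouped == 1.5
```

A lower value means that neighbors within a region stay closer together in the sequence. Reporting a statistic of this kind for multi-directional scanning and for the proposed method would give the "fail to address the fundamental problem" claim a measurable basis.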
The paper identifies "semantic disruption" as a key limitation of Mamba-based SR. Beyond qualitative illustrations (e.g., Figure 1), are there any quantitative metrics proposed to directly measure and compare the "semantic preservation" capability of different methods? |
Fully AI-generated |
|
SP-MoMamba: Superpixel-driven Mixture of State Space Experts for Efficient Image Super-Resolution |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes SP-MoMamba for the super-resolution (SR) task, which combines super-pixels and mixture of experts with state space models (SSM). The potential contributions are:
(1) an SP-SSM that combines superpixel sampling and SSM through a gating mechanism,
(2) an MSS-MoE that uses a routing mechanism to dynamically select features from multiple SP-SSM experts.
The method is evaluated on five standard SR benchmarks and demonstrates a compelling performance-efficiency trade-off, achieving competitive results with lower computational overhead.
**Clarity and Reproducibility:** The paper is generally well-written and easy to follow. The inclusion of pseudo-code and the source code significantly strengthens the paper's credibility and reproducibility.
**Technical Interest:** The idea of exploring structured representations (super-pixels) with SSMs is interesting.
**Empirical Performance:** The results are impressive. The method achieves state-of-the-art or highly competitive performance across multiple datasets, with lower computational costs, including GMACs, inference times, GPU memory usage, etc. Figures 5, 7, and 8 provide compelling visual results.
**Weak Motivation:**
The paper positions itself as the first work to combine super-pixels with SSMs for SR. However, the motivation for this specific combination is not sufficiently developed. Super-pixels have already been explored with transformers for SR (Zhang et al., 2023). Given that SSMs are a popular substitute for transformers, and that Mamba-based SR methods already exist (e.g., Guo et al., 2024; Qiao et al., 2024), combining super-pixels with SSMs instead of transformers is not a significant step by itself.
A strong narrative should answer: What specific limitations of existing SSM-based SR methods (e.g., high frequency details and/or semantic inconsistency) can be effectively addressed by superpixels? Meanwhile, what limitations of super-pixel-based methods are addressed by the SSM? What are the challenges of combining these two? The current paper does not provide a new insight, making the core idea feel more like a clever stacking of existing techniques than a novel solution to a well-defined problem.
**Unclear Novelty and Justification of Architectural Complexity:**
SP-SSM and MSS-MoE are claimed as two major contributions/novelties. However, these two modules are complex compositions of established techniques. The SP-SSM is a combination of super-pixel sampling (Jampani et al., 2018), SSM, and Gumbel-Softmax (Jang et al., 2016) through a gating mechanism. The MSS-MoE is mainly a combination of SP-SSM and the routing mechanism. The paper will need to address the following:
(1) Explain the necessities of the designs. For example, why is this gating mechanism in SP-SSM the best choice for combining super-pixels and SSM? What is the insight behind it?
(2) Demonstrate their effectiveness over simple baselines. For example, how does the MSS-MoE perform against a simple baseline of a non-dynamic ensemble of experts? Does the routing mechanism play a key role, or does the performance benefit from a large aggregation model?
(3) Quantify individual contributions. The ablation study (Table 3) is incomplete. The contributions of many design choices, such as the GatedFFN, the residual scaling parameters, and the specific formulations of Eqs. (1, 5-7) are not reflected.
The core issue is that this method feels over-engineered, with many components whose individual necessity and novelty are not rigorously established. This buries the true potential research contribution under a layer of engineering complexity. The strong results may be attributed to the overall capacity and designs, but the scientific insight remains unclear.
**Justification:** This paper proposes a new method for super-resolution whose efficiency and performance results are compelling. The major concerns are the insufficient articulation of its conceptual novelty and the lack of rigorous ablations justifying the necessity of its complex design. The contribution seems to stem more from strong engineering effort than from a clear scientific insight. The authors will need to address these concerns during the rebuttal.
1. Beyond the empirical results, what is the core scientific insight from combining superpixels with SSMs? Can the authors explain why this combination is particularly powerful for SR (but not other dense prediction tasks)?
2. Could the authors provide ablations comparing SP-SSM and MSS-MoE to their corresponding strong/simpler baselines? For example, SP-SSM vs. SP-SSM (w/o gating), and MSS-MoE vs. a direct fusion of experts.
3. Please consolidate the ablation studies into a single table in the main paper (e.g., expanding Table 3 to include ablations mentioned above and those from Table 9). This is essential for the readers to understand the contribution of each part of this method.
4. The method appears to be a general-purpose dense prediction architecture. Have the authors validated it on other tasks (e.g., segmentation, enhancement)?
5. Please provide a discussion on any failure cases and analyze the intrinsic limitations of this method in the context of SR. |
Fully human-written |
|
SP-MoMamba: Superpixel-driven Mixture of State Space Experts for Efficient Image Super-Resolution |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes SP-MoMamba, an image super-resolution method that integrates Superpixel and State Space Model (SSM), including the Superpixel-driven State Space Model (SP-SSM) and Multi-Scale Superpixel Mixture of State Space Experts (MSS-MoE) modules. The method enhances modeling efficiency while preserving semantic structure integrity, and achieves superior performance over existing lightweight methods on multiple mainstream datasets.
1. The English expression in this paper is of high quality.
2. The motivation section reasonably identifies the disruption of texture and semantic information in existing Mamba-based methods, and the proposed solution is inspiring and novel.
3. The objective metrics (PSNR, SSIM), number of parameters, and GMACs all compare favorably with those of existing methods, demonstrating excellent performance.
4. The experiments are highly complete, with thorough comparative analyses and relatively comprehensive ablation studies.
1. Potential Lack of Novelty:
SPIN [1] has already employed superpixels + attention mechanisms for super-resolution (SR). This paper merely replaces attention with an SSM (State Space Model) without sufficiently justifying the unique advantages of SSMs in superpixel modeling. The authors should emphasize why the combination of Mamba and superpixels is particularly reasonable. Relying solely on experimental results for justification is not persuasive.
2. There are some textual errors. For example, in Line 90, "Technically, our SP-MoMamba is composed of stacked Layer of Experts": "Layer" should be "Layers".
In Table 1 (Line 325), the PSNR of CARN-M on Set14 should read "33.26", not "33..26".
Reference: [1] Zhang, Aiping, et al. "Lightweight image super-resolution with superpixel token interaction." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
I kindly ask the authors to focus their rebuttal on addressing Weakness 1. |
Lightly AI-edited |