|
Teach to Reason Safely: Policy-Guided Safety Tuning for MLRMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors present evidence that multimodal reasoning models significantly outperform their multimodal base models on complex tasks, but that this improvement also increases the generation of harmful content. Through experimental analysis, they identify two primary causes: 1) visual attention drift and 2) unsafe reasoning patterns. They note that existing methods focus primarily on teaching models to refuse harmful requests without providing guidance on how to reason safely. To address this, the authors propose a two-stage alignment framework called Policy-Guided Safety Tuning (PST). Evaluation on various multimodal safety benchmarks shows that the method significantly reduces the rate of harmful content generation while also performing well on Visual Question Answering tasks, without exhibiting over-sensitivity.
1. The paper is well-written, demonstrating clarity and coherence throughout.
2. The authors provide a thorough comparison of multiple baseline methods, testing their proposed approach against a diverse array of benchmarks.
3. The paper explores the relationship between multimodal attention mechanisms and safety considerations, contributing valuable insights to the field.
4. The motivation is compelling and clearly presented.
The proposed method relies heavily on the quality of the training data, which is entirely generated and evaluated by AI models; this may introduce significant bias.
1. It would help if the authors clarified their contribution in terms of the constructed dataset.
2. Could you explain how the paper addresses the potential biases introduced by AI-generated data, for example through manual annotation or other data quality control measures?
3. JailbreakV [1] shows that even randomly generated images paired with harmful questions can easily attack MLLMs. Could the authors explain the effectiveness of PST under these conditions (a sketch of such a probe is given after the reference below)? Additionally, how do unrelated image descriptions affect the model's responses?
[1] JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks |
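To make the suggested probe concrete, here is a minimal sketch of a JailbreakV-style test: harmful questions paired with unrelated noise images. The `generate(image, prompt)` call is a hypothetical wrapper around the PST-tuned model (not an API from the paper), and the questions are placeholders rather than actual harmful content.

```python
# Minimal sketch of a noise-image probe; generate(image, prompt) is a
# hypothetical wrapper around the PST-tuned model, not an API from the paper.
import numpy as np
from PIL import Image

def make_noise_image(size=(336, 336), seed=0):
    """Create a uniformly random RGB image that is unrelated to the question."""
    rng = np.random.default_rng(seed)
    pixels = rng.integers(0, 256, size=(*size, 3), dtype=np.uint8)
    return Image.fromarray(pixels)

# Placeholder queries; in practice these would be drawn from the text portion
# of a benchmark such as JailbreakV.
harmful_questions = ["<harmful question 1>", "<harmful question 2>"]

for i, question in enumerate(harmful_questions):
    image = make_noise_image(seed=i)
    # response = generate(image, question)  # hypothetical model call
    # Check whether the reasoning trace still invokes the relevant safety policy.
```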
Lightly AI-edited |
|
Teach to Reason Safely: Policy-Guided Safety Tuning for MLRMs |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper studies a safety–reasoning trade-off in Multimodal Large Reasoning Models (MLRMs): reasoning-tuned models produce more harmful outputs than their base counterparts. It analyzes two mechanisms—(i) visual attention drift that weakens visual grounding and (ii) unsafe reasoning patterns such as flawed reasoning initiation and chain-of-thought safety attenuation. The authors also propose Policy-guided Safety Tuning (PST), a two-stage framework: Policy-Guided Supervised Fine-Tuning (PST-SFT) that embeds explicit safety policies into reasoning trajectories, and Safety Reasoning Preference Optimization (SRPO) that aligns toward safe yet informative responses via a preference loss. Experiments on BeaverTails-V, MM-SafetyBench, SPA-VL, SIUO, and general VL benchmarks show reduced harmful rate (HR) and refusal rate (RR) while maintaining competitive VQA/MathVista performance.
1. The paper addresses a timely problem: how to teach reasoning models to think safely.
2. Experiments indicate that PST attains reasonable safety gains without severely harming general capabilities.
3. The paper is well written.
1. **Limited depth of insight.** One of the paper’s central mechanisms, **Visual Attention Drift**, appears to be directly taken from [1] rather than newly discovered or substantially extended in this work.
2. **Motivation and experiments are not well aligned.** The Introduction (Section 1) and the Analysis (Section 3) devote substantial space to Visual Attention Drift and Unsafe Reasoning Patterns, yet the experiments do not demonstrate how PST resolves or alleviates these two phenomena. I did not find corresponding analyses in the main text or the appendix (a sketch of one possible measurement is given after the reference below).
3. **Method novelty is incremental.** The two key components, **PST-SFT** and **SRPO**, are essentially direct transfers of widely used techniques (**SFT** and **DPO**). The overall **SFT + DPO** training pipeline is now mainstream; thus, much of the observed improvement likely stems from established methods rather than a fundamentally new algorithmic contribution.
4. **Data and reproducibility.** Following point 3, the main innovation appears to lie in data processing and dataset construction. Building the dataset depends on additional models (e.g., Qwen-VL, DeepSeek-R1, GPT-4o), yet the time, storage, and token costs are unclear, and it is not stated whether the data or code will be released.
[1] Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? assessing amplified hallucination in multimodal reasoning models. arXiv preprint arXiv:2505.21523, 2025a.
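As a concrete illustration of the analysis requested in Weakness 2 and Question 1 below, here is a minimal sketch of how the attention mass placed on image tokens could be compared between the base and PST-tuned models. It reflects assumptions about the setup (a HuggingFace-style model returning per-layer attentions, with image tokens occupying the first input positions), not the paper's own protocol.

```python
# Minimal sketch (an assumption about the setup, not the paper's protocol) of
# measuring how much attention the text places on image tokens, assuming a
# HuggingFace-style vision-language model that returns per-layer attentions
# and whose image tokens occupy the first `num_image_tokens` input positions.
import torch

def visual_attention_mass(model, inputs, num_image_tokens):
    """Mean fraction of attention that query positions place on image tokens."""
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    # outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
    attn = torch.stack(outputs.attentions)        # (layers, batch, heads, seq, seq)
    attn = attn.mean(dim=(0, 2))                  # average over layers and heads
    mass = attn[0, :, :num_image_tokens].sum(-1)  # per-query mass on image tokens
    return mass.mean().item()

# Run on the same harmful image-text pairs with the base reasoning model and the
# PST-tuned model, then correlate the change in this quantity with the reduction
# in harmful rate across samples.
```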
1. How, quantitatively, does PST reduce Visual Attention Drift, and how do before–after attention changes correlate with harmful-rate reductions?
2. Do Unsafe Reasoning Patterns decrease after PST, and how are these patterns operationalized and measured across reasoning steps?
3. Beyond standard SFT and DPO, what are the specific algorithmic novelties of PST-SFT and SRPO, and how sensitive are gains to these components?
4. What are the token, time, compute, and storage costs of constructing $D_{\text{SFT}}$ and $D_{\text{SRPO}}$? |
Lightly AI-edited |
|
Teach to Reason Safely: Policy-Guided Safety Tuning for MLRMs |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper investigates the safety-reasoning trade-off in Multimodal Large Reasoning Models (MLRMs), highlighting that improved reasoning often coincides with increased harmful output rates. The authors identify two key mechanisms behind safety degradation—visual attention drift and unsafe reasoning patterns. To address these, they propose Policy-Guided Safety Tuning (PST), a two-stage alignment framework consisting of Policy-Guided Supervised Fine-Tuning (PST-SFT) and Safety Reasoning Preference Optimization (SRPO). Experiments on multiple benchmarks show that PST significantly reduces harmful outputs without major sacrifices in general reasoning capabilities.
1. The paper offers a clear and organized examination of safety degradation in MLRMs, identifying two representative mechanisms—visual attention drift and unsafe reasoning patterns. While parts of the analysis are qualitative, the work provides a useful foundation for understanding how reasoning-oriented tuning may influence safety performance across benchmarks.
2. The proposed two-stage PST framework effectively integrates explicit safety policies into reasoning, moving beyond refusal-based alignment toward interpretable, reasoning-centered safety control.
3. The dataset construction is detailed and replicable, with Figure 6 making the process and logic transparent.
1. The motivation underlying the claimed "reasoning–safety trade-off" is not entirely sound. In Section 3.1, the authors compare reasoning-tuned models with their corresponding base models and attribute the observed safety degradation to reasoning itself. However, this conclusion lacks causal rigor. The performance gap may simply stem from the absence of safety-aware data during reasoning-tuning rather than an inherent conflict between reasoning and safety. Moreover, the analysis does not control for data composition or fine-tuning objectives, which weakens the validity of the stated correlation.
2. The failure analysis in Section 3.2 appears rather subjective. The authors summarize two types of failure modes but do not describe the process or key data used to derive them. For example, it remains unclear what proportion of the failed cases fall under *Visual Attention Drift* versus *Unsafe Reasoning Patterns*. Moreover, no empirical evidence is provided to show that *Visual Attention Drift* directly causes unsafe behavior; the logical connection between the two is not clearly established, even though they are observed to co-occur statistically.
3. The proposed Policy-Guided Safety Tuning (PST) framework is conceptually similar to existing approaches (e.g., Sun et al., 2023; Guan et al., 2024; Wang et al., 2025). The paper does not clearly articulate how PST substantively differs from or advances beyond these prior frameworks beyond its multimodal setting.
[1] Sun, Z., Shen, Y., Zhou, Q., et al. Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems, 36: 2511–2565, 2023.
[2] Guan, M. Y., Joglekar, M., Wallace, E., et al. Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339, 2024.
[3] Wang, H., Qin, Z., Shen, L., et al. Safety reasoning with guidelines. In Forty-second International Conference on Machine Learning, 2025.
1. The SRPO objective (Eq. 7) appears mathematically identical to standard DPO, with the same log-ratio formulation and sigmoid preference loss. Since both $y_w$ and $y_l$ already include reasoning traces, could the authors clarify what differentiates SRPO conceptually or algorithmically from conventional DPO? |
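For reference, the standard DPO objective, written in its usual notation (which may differ superficially from the paper's Eq. 7), is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

If SRPO differs only in how the preference pairs $(y_w, y_l)$ are constructed rather than in the loss itself, it would help to state this explicitly.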
Moderately AI-edited |
|
Teach to Reason Safely: Policy-Guided Safety Tuning for MLRMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper
- Illustrates that reasoning VLMs generate harmful outputs at a higher rate than their non-reasoning counterparts.
- Introduces a multi-step data generation pipeline that produces policy-specific reasoning traces from image-text samples in the BeaverTails-V dataset.
- The data generation pipeline produces both SFT and preference data.
- Fine-tunes existing reasoning VLMs on the generated data and evaluates the results on a range of safety and general vision-language reasoning benchmarks
- Extensive experiments across multiple benchmarks show improved safety performance for PST-fine-tuned models while maintaining strong general vision-language reasoning capabilities.
- The method seems to outperform recent comparable methods such as MSR-align and Think-in-Safety.
- The paper claims that reduced attention to visual tokens is a cause of safety degradation in reasoning models, but it does not show whether PST fine-tuning actually changes that behavior.
- The conceptual contribution is fairly limited, as the paper follows the well-established pattern of using a multi-step pipeline with powerful third-party models to curate a dataset, distilling the in-context capabilities of these models into smaller models and improving their performance on the respective benchmarks.
- The data generation processes of MSR-align, Think-in-Safety, and the proposed method are all very similar, which makes it difficult to understand why PST outperforms these prior works. Is it a different choice of prompts in the pipeline? A different choice of models? Specific steps in the pipeline? |
Fully human-written |