PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
**Summary:** This paper proposes PAG, a multi-turn reinforcement learning framework in which the model alternates between generation and verification roles to perform self-correction. The key idea is to condition revision on the model's own verification outcome through a selective revision mechanism, combined with turn-independent optimization and RoleAdvNorm, which stabilizes training and prevents the verifier from collapsing. The authors evaluate PAG on multiple mathematical reasoning benchmarks, including MATH500, MinervaMath, and AIME (with some coding benchmarks in the Appendix), demonstrating gains over prior self-correction methods and single-turn RL.
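For concreteness, my reading of the selective verify-then-revise loop at inference time is sketched below; the helper callables are hypothetical stand-ins for the same LLM prompted in its policy and verifier roles, not code from the paper.

```python
from typing import Callable, Optional

def selective_self_correct(
    question: str,
    generate: Callable[[str, Optional[str]], str],  # same LLM, policy role
    verify: Callable[[str, str], bool],             # same LLM, verifier role
    max_turns: int = 3,
) -> str:
    """Only revise when the model's own verification flags an error."""
    solution = generate(question, None)             # first attempt
    for _ in range(max_turns - 1):
        if verify(question, solution):              # verifier accepts -> stop, no redundant pass
            break
        solution = generate(question, solution)     # revise conditioned on the flagged attempt
    return solution
```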
**Strengths:**
**[S1]** The paper is well written, and the proposed method (PAG) is conceptually sound.
**[S2]** The experiments are generally well executed across multiple datasets.
**Weaknesses:**
**[W1: Unclear benefit over single-turn RL].** While multi-turn verification sounds intuitively useful, it remains unclear what concrete benefit it provides over strong single-turn RL methods (e.g., GRPO or PPO with a single verification step), which already achieve strong reasoning and self-verification performance [1]. Prior works have shown that single-turn RL-trained models can already conduct implicit self-checks or “rethink” their answers during inference [2]. Furthermore, the authors should clarify how their “single-turn” baseline is trained and ensure a fair comparison with scaled GRPO/PPO baselines. It would also be valuable to show empirically whether multi-turn RL provides unique advantages beyond marginal performance gains.
**[W2: Missing baseline].** The paper lacks several important baselines. In particular, ReVISE [3], which is both efficient and effective (though based on offline RL), should be included for comparison against SCoRe [4]. Evaluating PAG against ReVISE would provide a fairer picture of its cost-effectiveness and practical advantage.
**[W3: Limited contribution of proposed components].** The proposed components (e.g., RoleAdvNorm and the reward shaping) appear to have limited impact. The ablation study suggests that their improvements are marginal and sometimes within the noise level, as results under a single seed show only very minor differences.
**[W4: Marginal overall gain despite increased methodological complexity].** The overall improvement appears too small relative to the complexity of the proposed framework (e.g., compared to SCoRe or direct multi-turn RL). In particular, the combined effect of RoleAdvNorm and reward shaping yields less than a 1% improvement (up to 0.9% when used together), which does not seem to justify the added algorithmic complexity. Notably, the improvement in the coding domain is even more marginal.
**[W5: Experiments conducted only at small scale (larger reasoning models may show different trends)].** Since the experiments are mostly conducted on relatively small models, the observed improvements might be within noise, especially on benchmarks such as AIME. It would strengthen the paper to include experiments on mid-sized models (e.g., 32B) to confirm whether the observed trends persist at larger scales.
**[W6: Somewhat limited novelty].** The novelty of the work is somewhat limited. The concepts of self-verification and revision have already been explored in prior works [3, 4]. This paper mainly introduces engineering extensions to enable multi-turn training. The work would be more compelling if the authors could better highlight the unique advantages that arise specifically from the multi-turn formulation.
References\
[1] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, arXiv 2024\
[2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, arXiv 2024\
[3] ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification, ICML 2025\
[4] Training Language Models to Self-Correct via Reinforcement Learning, ICLR 2025
**Questions:** See the questions raised in the weaknesses above.
Fully human-written
PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
**Summary:** The paper proposes PAG, a multi-turn reinforcement learning framework that enables an LLM to iteratively generate, verify, and revise its own reasoning outputs. By introducing a selective verify-then-revise mechanism, PAG avoids redundant second passes and stabilizes training compared to prior methods like SCoRe. The framework extends PPO with turn-independent optimization, a self-correction reward bonus, and RoleAdvNorm for separate policy and verifier updates. Experiments on various math reasoning benchmarks with Qwen-2.5 and Llama-3 show that PAG improves both reasoning accuracy and verifier quality, achieving superior test-time scaling through self-verifying Best-of-N sampling.
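To make the RoleAdvNorm idea concrete, here is a minimal sketch of role-wise advantage normalization as I understand it; it is my own illustration under the assumption that advantages are standardized independently per role, not the authors' implementation.

```python
import numpy as np

def role_wise_advantage_norm(advantages: np.ndarray, roles: np.ndarray) -> np.ndarray:
    """Standardize advantages separately for each role (e.g., 0 = policy, 1 = verifier),
    so one role's reward scale cannot swamp the other's policy-gradient updates."""
    normalized = np.empty_like(advantages, dtype=np.float64)
    for role in np.unique(roles):
        mask = roles == role
        group = advantages[mask]
        normalized[mask] = (group - group.mean()) / (group.std() + 1e-8)
    return normalized

# Toy example: policy-turn and verifier-turn advantages on very different scales.
adv = np.array([1.0, -1.0, 10.0, -10.0])
roles = np.array([0, 0, 1, 1])
print(role_wise_advantage_norm(adv, roles))  # each role ends up zero-mean, unit-variance
```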
**Strengths:**
1. The paper is well written and clearly presented, making it easy to follow.
2. The model introduces three concrete and well-motivated components—turn-independent optimization, bonus reward, and RoleAdvNorm—which are thoughtfully engineered to stabilize multi-turn RL training.
3. The empirical results are strong against baselines, particularly in test-time scaling, where self-verifying Best-of-N sampling outperforms majority voting.
4. The contributions are empirically well-demonstrated through detailed ablation studies and analysis.
**Weaknesses:**
1. The evaluation is limited to mathematical reasoning (Table 2), lacking experiments on logical or commonsense reasoning tasks that could better demonstrate generality.
2. The most significant concern is the absence of strong single-turn training baselines such as GRPO, DAPO, or DPO. To convincingly argue for the efficiency and effectiveness of training, comparisons with these methods are needed.
3. In Figure 5, the test-time scalability comparison between self-verify BoN and majority voting may be unfair, since self-verify incurs additional latency and token generation due to the verification step. A fair comparison should control for the same latency or total generation tokens.
**Questions:**
1. What is the training computational cost of PAG compared to strong single-turn RL methods such as GRPO or DAPO? For a fair comparison, it would be helpful to report training time, FLOPs, or total generation tokens used during both training and test-time scaling.
2. How does PAG behave when extended beyond mathematical reasoning? For example, would the verify–revise dynamic still hold on logical or commonsense reasoning tasks where correctness is less binary?
3. In Figure 5, could the authors provide a comparison under equal latency or token budget conditions to make the self-verify vs. majority voting comparison fairer?
Lightly AI-edited
PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
**Summary:** This paper proposes a method for training LLMs to self-correct by training the same model in two different roles: 1) a generative verifier that takes in a question and a solution and verifies the correctness of the solution, and 2) a solution policy that attempts to find the correct solution. The LLM is trained using multi-turn RL, where trajectories consist of an initial solution to the given question from the solution policy, a verification by the generative verifier, another solution attempt by the policy, and so on. Because two roles are present in the same trajectory and multiple turns are taken, the paper proposes several modifications to the standard PPO algorithm: 1) turn-independent optimization, 2) a self-correction bonus reward, and 3) role-based advantage normalization. The proposed method does not require any initial SFT or RL fine-tuning of the base model, which is an improvement over existing works on self-correction.
Extensive experiments on 3 base models (Qwen and Llama) across several math, logic, and code generation benchmarks show that the method improves final performance. Results also show that the training procedure yields both a strong verifier and improved self-correction capability, both improvements over existing works.
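As a reading aid, the self-correction bonus can be illustrated by a toy reward function like the one below; the exact shape and constants are my assumptions, not the paper's.

```python
def shaped_turn_reward(first_correct: bool, final_correct: bool,
                       base: float = 1.0, bonus: float = 0.5) -> float:
    """Reward correctness of the final answer, and pay a bonus only when a wrong
    first attempt is genuinely fixed, so cosmetic edits or repeated answers earn
    nothing extra (toy constants, for illustration only)."""
    reward = base if final_correct else 0.0
    if final_correct and not first_correct:
        reward += bonus  # genuine self-correction bonus
    return reward

# Example: fixing a wrong first attempt is rewarded more than repeating a correct one.
assert shaped_turn_reward(first_correct=False, final_correct=True) > \
       shaped_turn_reward(first_correct=True, final_correct=True)
```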
**Strengths:**
1. The proposed method simplifies over existing works on LLM self-correction by unifying verification and solving within the same model. As mentioned in the summary, this removes the need for any SFT or RL fine-tuning as a warm start to RL training, which is a significant improvement.
2. The changes made to RL training mentioned in the summary are reasonable and minor. They are also shown through ablations to be necessary for getting the proposed method to work well. They confirm findings from prior works (such as the self-correction bonus) while avoiding additional complexity.
3. Good breadth of models and benchmarks were studied in the paper, with consistent results. Baselines selected for study are appropriately strong.
4. Evidence of test-time improvement with increasing number of turns, even beyond the number of turns used during training.
**Weaknesses:**
1. The improvements from PAG are modest, especially in Tables 2 (math) and 9 (code generation) when compared to SCoRe, although this could still be considered interesting since the method is simpler.
**Questions:**
Section 3.1: what does the \hat{} notation denote? It seems unnecessary, since there are no corresponding symbols without \hat{} to distinguish from.
Table 1: what is the evaluation benchmark? Were there only 2 turns in this setup (after the initial solution)?
Fully human-written
PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
**Summary:** The paper proposes Policy as Generative Verifier (PAG), a multi-turn RL framework that trains a single LLM to alternate between two roles: (i) a policy that produces a solution, and (ii) a generative verifier that inspects that solution and decides whether a revision is needed ("verify-then-revise"). Unlike prior multi-turn methods that always produce a second attempt, PAG only revises when the model's own verification flags an error, aiming to avoid model collapse due to cosmetic edits or repeated answers while improving both reasoning and verification.
The method introduces three PPO-style adaptations: turn-independent optimization across turns, a bonus reward that encourages genuine improvements between attempts, and RoleAdvNorm, which normalizes advantages separately for the policy and verifier to prevent gradient interference. On mathematical reasoning (MATH500, MinervaMath, AIME24/25), PAG improves self-correction accuracy and shows strong verifier quality; notably, self-verify Best-of-N (generate N candidates, then let the unified verifier pick) outperforms majority voting.
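For reference, my understanding of self-verify Best-of-N selection, contrasted with majority voting, is sketched below; `verifier_score` is a hypothetical callable standing in for the unified model in its verifier role.

```python
from collections import Counter
from typing import Callable, Sequence

def majority_vote(final_answers: Sequence[str]) -> str:
    """Baseline: pick the most frequent final answer among N samples."""
    return Counter(final_answers).most_common(1)[0][0]

def self_verify_best_of_n(question: str,
                          candidates: Sequence[str],
                          verifier_score: Callable[[str, str], float]) -> str:
    """Self-verify BoN: the same model, acting as a generative verifier, scores
    each candidate solution and the highest-scoring one is returned."""
    return max(candidates, key=lambda solution: verifier_score(question, solution))
```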
**Strengths:**
1. Originality lies in unifying the policy and verifier into one model with a *selective* verify-then-revise loop, plus stable multi-turn RL adaptations (turn-independent advantages, RoleAdvNorm, and an improvement bonus). The selective revision explicitly tackles collapse and yields better revision efficiency than always-revise baselines.
2. On Qwen2.5-7B, PAG reaches 82.3% Acc.@final on MATH500 and the best average 38.3%, outperforming Single-Turn, Direct Multi-Turn, and SCoRe; verifier performance on RewardBench also improves markedly (e.g., 86.6).
3. The self-verify BoN vs. majority voting result is practically useful.
**Weaknesses:**
1. Core results focus primarily on math; extensions to logic/coding appear only in the appendix, with much less emphasis.
2. The baselines are a bit "tricky": SCoRe is re-implemented (I understand the code is not released), and non-RL baselines such as self-consistency with majority voting at large N (policy-only BoN) are not evaluated under the same compute budget and decoding settings.
3. The appendix shows that dropping the reward on either the first policy turn or the final output collapses one of the roles, which raises concerns about robustness.
**Questions:**
1. Could you add main-paper results on at least one non-math domain (e.g., code repair or science QA)?
2. Could you include a small experiment comparing training with and without the KL term?
Fully human-written |