ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 2 (50%) 6.00 3.50 6994
Heavily AI-edited 0 (0%) N/A N/A N/A
Moderately AI-edited 0 (0%) N/A N/A N/A
Lightly AI-edited 0 (0%) N/A N/A N/A
Fully human-written 2 (50%) 5.00 4.00 1760
Total 4 (100%) 5.50 3.75 4376
VeriRole: Verifiable Role-Awareness through Hint-Guided Reinforcement Learning

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes VeriRole, a framework that uses GRPO to train role-play agents by explicitly incentivizing them to produce a hint stage before thinking and replying. VeriRole designs reward functions for hints, accuracy, and format, which are combined and fed into GRPO for optimization. The authors empirically demonstrate that VeriRole outperforms baselines in terms of final accuracy and human-centered metrics on RAIDEN and CharacterEval.

Strengths:
* The paper is mostly well-written, clearly organized, and easy to follow. The workflow of VeriRole is made very clear, as are the three reward components.
* The authors have conducted extensive experiments across datasets, model families, and model sizes, all of which confirm the superiority of their method.

Weaknesses:
* The design of the hint reward is rather ad hoc: it consists of several components that are simply summed. There is no explanation of how the weights are picked, why the components should be added up, or why these particular components were chosen.
* The reward functions in the proposed system involve many hyperparameters, such as the weights in the hint reward and the format reward's 0.6/0.0 values, but no experiment shows how sensitive the system is to these hyperparameters or how they should be set.
* All the benchmarks involve role-playing, so it is unclear how this seemingly general method transfers to broader reasoning tasks.
* An ablation with no hint reward (instead of no accuracy reward) is needed to validate the incorporation of the hint module.

Questions:
1. Is there anything that incentivizes the response to "use" the hint that is generated?
2. For Qwen3 models with thinking mode on, what exactly does the response look like? Is it still producing <hint> before <think>?
3. Can the hint part be viewed as part of the thinking trace? Is it possible to directly incentivize the thinking traces to include these hints?
4. In Fig. 4a the hint reward tops out near 0.7, but Fig. 4b's x-axis goes to 1.0. Is 4a truncated, or are the scales/normalizations different? Please clarify.

EditLens Prediction: Fully human-written
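The 0.6/0.0 format reward questioned in the weaknesses above can be read as a binary structural check on the tagged output. A minimal sketch in Python follows, assuming the output layout is a <hint> block, then a <think> block, then a <reply> wrapper; the <reply> tag name and the regex are assumptions for illustration (the reviews only confirm <hint> and <think>), and this is not the authors' implementation.

```python
import re

# Hypothetical expected structure: <hint>...</hint><think>...</think><reply>...</reply>.
# Only <hint> and <think> are confirmed by the reviews; <reply> is assumed.
_FORMAT_PATTERN = re.compile(
    r"\s*<hint>.*?</hint>\s*<think>.*?</think>\s*<reply>.*?</reply>\s*",
    flags=re.DOTALL,
)

def format_reward(response: str) -> float:
    """Return 0.6 if the response follows the expected tag structure, else 0.0
    (the 0.6/0.0 values are taken from the review's description)."""
    return 0.6 if _FORMAT_PATTERN.fullmatch(response) else 0.0
```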
VeriRole: Verifiable Role-Awareness through Hint-Guided Reinforcement Learning

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper tackles the "non-verifiability challenge" inherent in optimizing Role-Playing Conversational Agents (RPCAs) via Reinforcement Learning (RL). Due to the creative and open-ended nature of role-playing, designing verifiable reward signals is difficult. The authors propose VeriRole, a framework that introduces a structured reasoning process. The core contribution is a "hint mechanism" that first compels the agent to extract deterministic cues from the context (profile, history). These hints are then used to calculate a Verifiable Role-Awareness Reward (VRAR), a multi-component reward function that scores the quality of the hint (via comparison to a ground truth), the factual accuracy of the final response (where applicable), and the adherence to a structured format. The framework is optimized using Group Relative Policy Optimization (GRPO) on a combination of the RAIDEN benchmark and a newly introduced Situation Puzzle Dataset. Experiments demonstrate significant gains on the RAIDEN (+18.9%) and CharacterEval (+4.55%) benchmarks over baseline models.

Strengths:
The paper's primary strength is its original and pragmatic approach to the well-known "non-verifiability challenge" in applying RL to creative, open-ended tasks. Instead of trying to create a subjective reward model for the entire creative response (which is difficult), the authors clearly decompose the problem into a verifiable fact-extraction step (the "hint") and a creative generation step. This decomposition is significant as it allows for the design of an objective, verifiable reward function (VRAR) that can be reliably optimized. The quality of the execution is high, with a well-designed multi-component reward, the introduction of a new reasoning-focused dataset, and strong empirical results (+18.9% on RAIDEN) that validate the framework's effectiveness over both standard instruction-tuning and SFT baselines.

Weaknesses:
1. **Conflation of Hint Reward with SFT Loss:** My primary concern is with the **Hint Reward ($R_{hint}$)**. This reward is computed by measuring the semantic and lexical similarity (e.g., $Sim_{cos}$, $S_{ROUGE}$) between the generated hint ($H_{gen}$) and a ground-truth hint ($H_{gt}$). This objective seems functionally very similar to a standard Supervised Fine-Tuning (SFT) loss (e.g., cross-entropy) on the hint tokens. While the paper compares GRPO to an SFT baseline on the *entire* structured output, it doesn't convincingly disentangle the gains. It is unclear whether the improvement comes from the RL algorithm (GRPO) itself or simply from having a very strong, explicit optimization signal on the hint-extraction part, which an SFT loss could also provide. A more informative baseline would have been a mixed-training approach (e.g., SFT loss on hints, RL on the reply).
2. **Scalability and Reliance on Ground-Truth Hints ($H_{gt}$):** The entire framework is predicated on the availability of a ground-truth hint ($H_{gt}$) for every training sample to compute $R_{hint}$. This annotation process was feasible for the structured, fact-based RAIDEN dataset and the solution-oriented Situation Puzzles. However, it appears to be a major annotation bottleneck that severely limits the framework's scalability. The paper does not discuss the feasibility or cost of acquiring these $H_{gt}$ annotations for more general, open-domain role-playing corpora, where the "correct" cues to extract are often subjective and not explicitly stated.
3. **Reward "Hole" for Creative Tasks:** The **Accuracy Reward ($R_{acc}$)** is only applied to sub-tasks with definitive answers (SBK, SCK, CM, and puzzles). The paper explicitly states that for more open-ended categories like Role-Cognition Boundary (RCB) and Topic Advancement (TA), samples "are not evaluated for accuracy reward computation". This is a significant limitation. It means that for the *most creative* and abstract role-playing skills, the agent is *only* being rewarded for $R_{hint}$ and $R_{format}$. It receives no reward signal on the *actual quality of its final creative reply* (e.g., how "in-character" a refusal was, or how "engaging" a topic shift was). This seems to undercut the very purpose of using RL, as it fails to optimize the final output for the tasks where SFT is weakest.

Questions:
1. **($R_{hint}$ vs. SFT Loss):** Regarding the Hint Reward ($R_{hint}$), which relies on similarity to $H_{gt}$: could the authors clarify how this is fundamentally different from a standard SFT loss on the hint portion of the generation? What are the benefits of this RL formulation over a simpler mixed-objective model (e.g., SFT loss on $\langle\text{hint}\rangle$ + SFT loss on $\langle\text{reply}\rangle$, or SFT on $\langle\text{hint}\rangle$ + a different policy-gradient objective on $\langle\text{reply}\rangle$)?
2. **($H_{gt}$ Scalability):** The reliance on ground-truth hints ($H_{gt}$) seems to be a significant annotation bottleneck. How do the authors envision this framework scaling to more general, open-domain role-playing datasets where a single, verifiable $H_{gt}$ may not exist or may be subjective? Does the framework collapse if $H_{gt}$ is not available for a large portion of the training data?
3. **($R_{acc}$ on Creative Tasks):** The Accuracy Reward ($R_{acc}$) is not applied to open-ended categories like Role-Cognition Boundary (RCB) or Topic Advancement (TA). This implies that for these creative tasks, the agent is *not* being rewarded on the quality of its final response, only on its ability to extract a hint (which might be "empty") and follow the format. Doesn't this fail to optimize for the *actual* creative role-playing skill in these scenarios? How does the model learn to generate *better* (e.g., more engaging, in-character) refusals or topic shifts if the final reply isn't part of the reward signal?
4. **(Ablation Follow-up):** Following on Q3, the ablation study (Table 1) shows that training on "Raiden-Only" (which includes RCB/TA data) outperforms "Situation-Puzzle-Only" on the RCB and TA metrics. Since $R_{acc}$ isn't used for this data, what reward signal is driving this improvement? Is it purely the $R_{hint}$ associated with the RCB/TA samples? This would be a very strong claim: that just learning to extract *requirement-based hints* is enough to improve the *final creative output*.
5. **(Reasoning Pitfall):** The paper finds that general reasoning ability (from CoT) can *hurt* role-playing, leading to "overly-formal language". However, this framework *adds* an explicit reasoning step (`<think>`). How does the VeriRole framework, particularly with the logic-heavy Situation Puzzle dataset, avoid this same pitfall?

EditLens Prediction: Fully AI-generated
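For concreteness, the hint reward this review compares to an SFT loss is a similarity score between the generated and ground-truth hints ($Sim_{cos}$ and $S_{ROUGE}$). The sketch below is a minimal illustration under stated assumptions: a bag-of-words cosine stands in for the paper's semantic similarity term (which presumably uses embeddings), ROUGE-L is computed via longest common subsequence, and the 0.5/0.5 weights are hypothetical; the review itself notes the paper does not justify how the components are combined.

```python
from collections import Counter
import math

def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine, a stand-in for the paper's semantic similarity term."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rouge_l_f1(a: str, b: str) -> float:
    """ROUGE-L F1 over tokens, via longest common subsequence."""
    xs, ys = a.split(), b.split()
    if not xs or not ys:
        return 0.0
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(xs), lcs / len(ys)
    return 2 * precision * recall / (precision + recall)

def hint_reward(h_gen: str, h_gt: str, w_cos: float = 0.5, w_rouge: float = 0.5) -> float:
    """Weighted combination of the two similarity terms; the weights are hypothetical."""
    return w_cos * cosine_sim(h_gen, h_gt) + w_rouge * rouge_l_f1(h_gen, h_gt)
```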
VeriRole: Verifiable Role-Awareness through Hint-Guided Reinforcement Learning

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This submission proposes VeriRole, a framework consisting of a hint mechanism and a Verifiable Role-Awareness Reward (VRAR) optimized by GRPO. It addresses the challenge of maintaining role-awareness in role-playing conversational agents (RPCAs). The results show superior performance compared to several LLMs on the RAIDEN benchmark for factual and role consistency and on the Situation Puzzle dataset for complex reasoning.

Strengths:
1/ The paper clearly identifies the non-verifiability gap in reward design for RPCAs.
2/ The methodology is well thought out, comprising three verifiable components that focus not only on the hint itself but also on maintaining high accuracy and the desired formatting.
3/ The experiments are comprehensive, covering multiple benchmarks for both role-playing and basic reasoning ability.

Weaknesses:
1/ Accuracy and quality judgements rely heavily on GPT-4o/Claude 3.5, which could introduce bias. Although the authors also include CharacterEval, it is in Chinese and may not be fully reliable for evaluating other languages.
2/ The ground-truth generation depends on data quality: the pipeline for producing the ground truth seems complex, while the policy is highly dependent on it.

Questions:
1/ What are the magnitudes of the different component rewards? Is there any weighting between the components to balance them?
2/ Related to the second weakness, the design of the hint reward seems heuristic to me; how do you convincingly demonstrate that the current ground-truth hints are the best ones?

EditLens Prediction: Fully human-written
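The reviewer's first question, about component magnitudes and weighting, amounts to asking how the three VRAR terms are combined into the scalar fed to GRPO. A hypothetical combination is sketched below; the equal default weights and the rule of skipping the accuracy term for open-ended categories are assumptions drawn from the reviews, not the paper's reported configuration.

```python
def vrar_total(r_hint: float, r_acc: float, r_format: float,
               w_hint: float = 1.0, w_acc: float = 1.0, w_format: float = 1.0,
               has_definitive_answer: bool = True) -> float:
    """Hypothetical weighted sum of the three VRAR components.
    The weights are illustrative only; the reviews note the paper does not
    report how the components are balanced. The accuracy term is skipped for
    open-ended categories (e.g., RCB, TA) that are not scored for accuracy."""
    total = w_hint * r_hint + w_format * r_format
    if has_definitive_answer:
        total += w_acc * r_acc
    return total
```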
VeriRole: Verifiable Role-Awareness through Hint-Guided Reinforcement Learning

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper addresses a core challenge in Role-Playing Conversational Agents (RPCAs): maintaining role-awareness. The authors argue that the creative and open-ended nature of role-playing makes it difficult to design verifiable reward signals for Reinforcement Learning (RL). To solve this "non-verifiability challenge," the paper proposes VeriRole, a novel framework. The core of this framework is a "hint" mechanism, which first extracts deterministic, verifiable factual cues from the context (including character profiles, dialogue history, and role-playing requirements) before generating the main response. Building on these hints, the paper introduces a "Verifiable Role-Awareness Reward" (VRAR). This reward function is composed of three parts: a hint reward ($R_{hint}$), an accuracy reward ($R_{acc}$), and a format reward ($R_{format}$). This VRAR signal is then used to optimize the model via Group Relative Policy Optimization (GRPO). To support this framework, the authors also constructed two specialized datasets: a filtered subset based on the RAIDEN benchmark and a novel "Situation Puzzle Dataset" designed for complex reasoning. Experimental results show that the Qwen2.5-32B model, optimized with VeriRole, achieves significant performance increases of 18.9% and 4.55% on the RAIDEN and CharacterEval benchmarks, respectively, demonstrating the method's effectiveness in quantifying and improving role-awareness.

Strengths:
1. Problem Importance and Novelty: The paper tackles a very important and difficult problem in the RPCA domain: how to define a reliable and verifiable reward for RL in a creative task. The approach of decoupling the open-ended "role-playing" task into two stages, "verifiable fact extraction" (hint) and "creative response generation", is a novel and insightful entry point.
2. Methodological Soundness: The VeriRole framework is well designed.
- The "hint" mechanism provides an excellent bridge connecting factual constraints with creative generation.
- The VRAR reward function design is comprehensive. It rewards not only the accuracy of fact extraction ($R_{hint}$) but also the content correctness of the final response ($R_{acc}$) and the compliance of the generation structure ($R_{format}$). This multi-faceted reward signal provides effective guidance for RL training.
3. Strong Empirical Support: The paper conducts comprehensive experiments on multiple models (Qwen series, Peach) and two key benchmarks (RAIDEN, CharacterEval). The results (e.g., +18.9% on RAIDEN) show that VeriRole significantly outperforms SFT and standard Instruct models, providing strong evidence for the proposed method's effectiveness.
4. Valuable Analyses:
- The ablation studies (Section 3.4) clearly demonstrate the contributions of different datasets (RAIDEN vs. Situation Puzzle) and reward components (removing $R_{acc}$), validating the integrity of the framework's design.
- The reward analysis (Figure 4b) reveals a strong positive correlation between "hint reward" and "final accuracy," offering direct evidence for the method's core hypothesis (i.e., high-quality hints lead to high-role-awareness responses).
- The comparison against SFT (Section 3.3) also highlights the advantage of RL in learning abstract skills (like Topic Advancement).

Weaknesses:
1. Dependency on Ground-Truth Hints: The success of the entire framework (especially $R_{hint}$) relies heavily on the high-quality "ground-truth hints" described in Section 2.2. These hints are generated using LLMs and a series of complex heuristics (like Question-Type Filtering and Cardinality Constraint).
- Scalability: How high are the cost and complexity of this annotation pipeline? Is it feasible to build such high-quality hint datasets when scaling to new characters or domains?
- Sensitivity: To what extent is the framework's performance sensitive to the quality of these ground-truth hints? If the hint-extraction annotations are noisy or incomplete, the reward signal could be misleading and misguide the RL training.
2. Trade-off between Creativity and Constraint:
- The paper claims the method "preserves creativity," but the experimental results (Table 2) show that VeriRole's improvement on CharacterEval (+4.55%) is much smaller than its improvement on RAIDEN (+18.9%).
- RAIDEN appears to focus more on factual consistency (e.g., SBK, SCK, CM), while CharacterEval focuses more on engagingness and overall dialogue ability. Does this imply that VeriRole's gains in "strengthening factual constraints" are much larger than its gains in "protecting or enhancing creativity"?
- The $R_{acc}$ and $R_{format}$ rewards (especially the latter) might overly penalize responses that are structurally "non-compliant" but creatively valuable, leading to more conservative and homogeneous model behavior.
3. Generalization of the Situation Puzzle Dataset: Introducing the "Situation Puzzle Dataset" to enhance complex reasoning is an interesting idea. However, how does this "logical puzzle-solving" type of reasoning relate to the "emotional reasoning" or "narrative reasoning" more commonly required in RPCAs? The paper lacks validation of whether the abilities trained on this dataset can effectively generalize to common role-playing scenarios that require empathy, plot advancement, or handling subtle social dynamics.
4. Justification of Algorithmic Choices: The paper uses GRPO as the optimization algorithm but does not sufficiently justify why GRPO was chosen over more standard algorithms (like PPO). What specific advantages does GRPO offer for this task compared to PPO? If the goal was merely to use a $D_{KL}$ penalty, PPO could also achieve this. The lack of a comparison with PPO+VRAR makes the necessity of GRPO unclear.

Questions:
1. Regarding Hint Data Construction: Could the authors elaborate on the estimated cost and difficulty of building "ground-truth hint" data for new characters or domains? Furthermore, how sensitive is the VeriRole framework's performance to the quality of these hint annotations?
2. Regarding the Creativity Trade-off: As mentioned in the Weaknesses, does the large improvement on RAIDEN versus the smaller one on CharacterEval suggest that VeriRole is geared more towards improving "factual accuracy" than "role-playing creativity"? How do the authors view the potential (negative) constraints that the $R_{acc}$ and $R_{format}$ rewards might impose on creative expression?
3. Regarding the Situation Puzzle Dataset: How does the reasoning ability trained by this dataset (puzzle-solving) differ from the "role-awareness" needed for conventional role-playing (e.g., emotional support, chit-chat)? Is there evidence that training on this specific task can generalize to broader RPCA scenarios beyond just improving factual accuracy (like SBK, CM)?
4. Clarification of the Ablation Study: In the ablation experiments in Table 1, "GRPO-Raiden-Only" (Avg 0.8143) performs better than "GRPO-No-Accuracy-Reward" (Avg 0.8061). Does this imply that the contribution of the $R_{acc}$ reward (which primarily comes from the Situation Puzzle data and RAIDEN's SBK/SCK/CM) is relatively small? This seems slightly at odds with the conclusion at the end of Section 3.4 about "the importance of the accuracy signal in ensuring factual consistency." How do the authors explain this observation?

EditLens Prediction: Fully AI-generated
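Several of the reviews ask why GRPO rather than PPO. The practical distinction is that GRPO replaces a learned value/critic baseline with a group-relative one: each sampled response's combined reward is normalized against the other rollouts drawn for the same prompt. A minimal sketch of that normalization step is below; the reward values in the usage example are made up, and this is only an illustration of the group-relative advantage idea, not the paper's training code.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages as used in GRPO: rewards for a group of
    rollouts sampled from the same prompt are centered on the group mean and
    scaled by the group standard deviation, so no separate value network is
    needed (one practical difference from PPO raised in the reviews)."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + 1e-6)

# Hypothetical example: four rollouts for one prompt, each scored with a
# combined hint + accuracy + format reward (numbers are made up).
rewards = np.array([1.4, 0.9, 1.4, 0.2])
print(grpo_advantages(rewards))
```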