A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models
Soundness: 1: poor
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary:
The paper introduces Motivation-enhanced Reinforcement Finetuning (MeRF), a method that injects a natural language description of the reward function into the prompt during RLVR training to make LLMs aware of the optimization objective. This leverages in-context learning to improve training efficiency over standard RLVR. Contributions include empirical evaluations on logic puzzles and math benchmarks showing performance gains, and analyses of motivation-reward consistency.
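For concreteness, a minimal sketch of what such motivation injection could look like; the template wording and helper names below are hypothetical illustrations, not the authors' actual implementation.

```python
# Hypothetical sketch of motivation injection for RLVR rollouts (template
# wording and function names are illustrative, not the paper's exact code).

MOTIVATION = (
    "Scoring rules: a fully correct final answer receives the highest reward; "
    "a well-formatted but wrong answer receives a penalty; an unparseable "
    "answer receives the largest penalty."
)

def build_merf_prompt(question: str, motivation: str = MOTIVATION) -> str:
    """MeRF-style prompt: prepend a natural-language description of the
    reward function so the policy 'knows' the objective during rollouts."""
    return f"{motivation}\n\n{question}\n\nThink step by step, then give your final answer."

def build_rlvr_prompt(question: str) -> str:
    """Standard RLVR prompt: the policy never sees the reward description."""
    return f"{question}\n\nThink step by step, then give your final answer."

# Training rollouts are sampled from build_merf_prompt(q); at evaluation time
# the motivation can be dropped and the gains reportedly persist.
```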
Strengths:
1. The approach creatively combines in-context learning with RL by explicitly providing reward rules as "motivation," offering a simple extension to existing RLVR paradigms that could inspire hybrid training methods.
2. Experiments cover multiple models (e.g., the Qwen2.5 series), with consistent comparisons to baselines, providing some evidence of improved accuracy and efficiency.
3. The paper is well-structured, with clear illustrations of the method, prompts, and results, making the core idea accessible.
Weaknesses:
1. The method is overly simplistic and lacks rigorous theoretical justification; it is unclear how the specific reward scoring rules (e.g., +2 for correctness, -1.5 for understandable but wrong answers; see the illustrative sketch after this list) mechanistically influence the model's generation of correct reasoning trajectories, relying too much on intuition without deeper analysis.
2. Extensive experimental evidence is provided mainly for logic puzzles, but for more general tasks such as mathematics and code generation the motivation descriptions appear ineffective or irrelevant, as evidenced by smaller gains on MATH benchmarks and the absence of code-specific results.
3. Experiments rely heavily on simple numerical comparisons (e.g., accuracy curves) and lack in-depth qualitative analysis, such as case studies of trajectory changes or failure modes; this fails to convincingly support the paper's motivation and leaves readers questioning the underlying mechanisms.
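To make point 1 concrete, here is an illustrative sketch of the kind of graded scoring scheme referred to above, contrasted with a binary alternative. The +2 and -1.5 values follow the description in this review; the penalty for unparseable output and the exact matching logic are assumptions, not the paper's implementation.

```python
# Illustrative rule-based reward vs. a binary reward (values other than
# +2 / -1.5 are assumptions; the paper's exact rules may differ).

def graded_reward(prediction: str, gold_answer: str, is_parseable: bool) -> float:
    """Graded scheme: +2 for a correct answer, -1.5 for an understandable
    (parseable) but wrong answer, and an assumed larger penalty otherwise."""
    if not is_parseable:
        return -2.0  # assumed penalty for format-violating / unparseable output
    if prediction.strip() == gold_answer.strip():
        return 2.0
    return -1.5

def binary_reward(prediction: str, gold_answer: str) -> float:
    """Binary alternative raised in Question 1: 1 for correct, 0 otherwise."""
    return 1.0 if prediction.strip() == gold_answer.strip() else 0.0
```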
Questions:
1. Could the authors provide a theoretical explanation or ablation on how the reward rules in the motivation prompt causally affect trajectory generation? For instance, why do negative scores for "understandable but wrong" answers guide the model better than a binary reward?
2. Why are gains on math tasks (e.g., only 3-4% average improvement) much smaller than on puzzles? Please elaborate on why the method may not generalize to code or other domains, and suggest experiments to test this.
3. The analysis section mentions Q1-Q4 but seems incomplete in the provided document. Can you expand on deeper insights, such as visualizing prompt-motivation interactions or reward hacking examples, to better convince readers of the method's value?
Heavily AI-edited
A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
This paper proposes a simple, low-cost intervention for RL with verifiable rewards (RLVR): append a natural-language description of the reward ("motivation") to each training prompt. The claim is that exposing the policy to the reward structure during rollouts improves exploration (higher entropy, stronger pass@k) and yields consistent gains on K&K logic puzzles, several math benchmarks (AIME/AMC/MATH), and CountDown, even when the motivation text is removed at evaluation. The method is properly evaluated and ablated.
Strengths:
Simple, well-motivated idea that's easy to implement; the paper reads clearly.
Consistent improvements over RLVR across two model families (Qwen2.5, DeepSeek-R1-Distill) and multiple reasoning benchmarks; importantly, performance holds without motivation at test time.
The method achieves better performance in fewer training steps. For example, in one experiment, MeRF achieved better pass@4 and pass@8 performance at step 140 than the final RLVR model did at step 280.
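For reference, pass@k in such comparisons is usually computed with the standard unbiased estimator over n sampled completions per problem (Chen et al., 2021); whether the paper uses exactly this estimator is an assumption. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for a single problem: n completions sampled,
    c of them correct. Returns 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with 16 samples per problem and 3 correct, estimated pass@8 is 0.9.
print(pass_at_k(n=16, c=3, k=8))  # 0.9
```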
Weaknesses:
Currently, the MeRF variant is compared only to the RLVR baseline. Given that MeRF consists of injecting the reward description into the instruction, consider comparing against tuned-prompt variants via DSPy (https://github.com/stanfordnlp/dspy) to see whether the benefit comes from better prompting alone.
The method's effectiveness is tied to tasks where the reward function is verifiable and describable in simple natural language. This limits the scope of MeRF, making it unclear how it would apply to tasks with more complex or non-describable reward signals, such as human preference scores.
The entropy analysis (Figure 4) shows higher entropy for MeRF, interpreted as "better exploration," but higher entropy could also indicate increased uncertainty. Alternative explanations aren't ruled out.
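For concreteness, the quantity being compared is presumably the average per-token entropy of the policy over generated tokens; a minimal PyTorch sketch (the paper's exact computation may differ):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, gen_mask: torch.Tensor) -> torch.Tensor:
    """Average per-token entropy of the policy distribution over generated tokens.
    logits: (batch, seq_len, vocab); gen_mask: (batch, seq_len), 1 on generated
    (non-prompt, non-padding) positions."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    return (token_entropy * gen_mask).sum() / gen_mask.sum()
```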
Questions:
Can you provide evidence that models actually use the motivation during generation (e.g., attention analysis, probing)?
Catastrophic forgetting / over-alignment: After MeRF training, how does the model perform on unrelated general-purpose tasks? Any drop vs. base/RLVR?
Have the authors analysed how complex the motivation prompt needs to be? How sensitive are results to motivation wording, length, or position? Any robustness sweep?
How much improvement comes from simply having better prompts vs. reward-specific information?
How does performance scale with reward function complexity?
Fully human-written |
A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
By providing the evaluation criteria in the system prompt (a fine-grained guideline), the RLVR pipeline can be trained more effectively.
Strengths:
Through a series of experiments, the paper demonstrates that when training a model in a domain where the grading criteria are clearly defined, providing information about the grading scheme alongside the data can significantly accelerate the model's learning process.
Weaknesses:
* As mentioned in the limitations, in situations where there is no prior knowledge of how the grading will be done, an approach that dynamically identifies the motivation and effectively solves the problem would seem more scalable.
* Since adding even a suboptimal motivation still provides non-zero additional information, it is unsurprising that the performance improves compared to RLVR. An interesting phenomenon is shown in Fig. 9, where performance increases after 500 steps when an adversarial motivation is provided. It would be important to check whether this result consistently appears across multiple runs, as the paper does not seem to report repeated experiments for this particular setting. Furthermore, when adversarial motivation is given as a guideline, it may be worth analyzing whether the RLVR training process includes any mechanisms that allow the model to ignore or correct such misleading guidance.
Questions:
Naturally, an ML model can align its answers with the evaluation criteria when grading guidelines are provided rather than only being told whether an answer is correct or not. But does that constitute a discovery?
Lightly AI-edited
A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
This paper proposes MeRF, a clever approach to improve RLVR for LLMs by injecting a natural language description of the reward function directly into the training prompt, termed "in-context motivation". This makes the model aware of the optimization objective during generation, aligning its outputs with desired behaviors more efficiently than the traditional trial-and-error RLVR paradigm, which relies solely on random exploration. Empirical results across benchmarks, including K&K Logic Puzzles, MATH datasets, and CountDown number games, demonstrate substantial gains and better exploration, as evidenced by higher entropy during training. Ablation studies show that the performance benefit comes primarily from the training process rather than inference-time motivation, and that MeRF is robust to suboptimal or even adverse motivations.
Strengths:
- A novel, simple, and very practical approach to improving RLVR that also makes intuitive sense
- Interesting experimental design and results on Q4
- Well presented (in terms of design), making the paper easy to read
Weaknesses:
- The experimental results are scattered around the paper and do not seem complete:
  - Figure 1 includes results on four different LLMs and Figure 3 includes results on DeepSeek, but most of these are not presented in Table 1.
  - The results in Figure 2 (right) have no details; what dataset is this?
  - Figures 1, 3, 5, 6, 7, and 8 all show performance increasing over training steps but are grouped differently (some by metric, some by dataset); they feel very repetitive, are scattered all around the paper, and need to be better organized.
  - Figure 2 (right) and experiment Q3 (Figure 8) basically tell the same story and feel repeated.
- A number of analysis experiments use different base models and datasets, while there is only a single base model for the main results; this makes it feel as though the models were cherry-picked for the analysis experiments.
- The main results need to be complemented with a number of different base models to show that the method is robust across LLM choices.
- Many figures repeat the same message, the system prompt sits in the main text, there are only a handful of main results, and analyses are repeated; all of this makes the paper less dense in terms of how much information it conveys.
- I would expect the improvement from MeRF to be highly dependent on the reward function used for the dataset. What happens if the reward function is much denser (having many different criteria)? What happens if the reward function is harder to express in natural language (e.g., the MATH dataset)? There is no analysis of where the proposed approach would benefit most and where it would benefit least.
- System prompts other than the one for the K&K puzzles are not provided.
- Some questionable analyses, continued in the questions below.
Questions:
- Why does MeRF have high entropy, and is that even a good thing? The paper motivates MeRF by arguing that we need structured exploration instead of the naive exploration of standard RLVR. But entropy is more of a metric for "unstructured exploration"; I would usually expect smaller entropy when moving from unstructured to structured exploration. Why is that not the case here? What happens if we control the temperature to increase entropy: does it help RLVR?
- It is interesting to see pass@8 saturate quickly with RLVR, so that it is soon outperformed by MeRF. However, to claim that such early saturation is due to better exploration, I think the authors should also show that the samples generated by RLVR rarely have high rewards (higher than what the LLM is currently getting in expectation); see the sketch below. The early saturation might instead be caused by different training dynamics arising from the different prompts.
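A minimal sketch of the check suggested in the last point, assuming hypothetical `sample_completions` and `reward_fn` helpers: if RLVR's early pass@k saturation reflects an exploration failure, the fraction of rollouts scoring above their group's mean reward should shrink as training proceeds.

```python
def above_mean_fraction(prompts, sample_completions, reward_fn, n=8):
    """Hypothetical diagnostic: for each prompt, sample n rollouts and measure
    how often a rollout scores above the mean reward of its own group, i.e.
    above what the policy currently gets in expectation. Tracking this over
    checkpoints would help separate 'exploration collapse' from prompt-induced
    differences in training dynamics."""
    hits, total = 0, 0
    for prompt in prompts:
        completions = sample_completions(prompt, n=n)          # hypothetical sampler
        rewards = [reward_fn(prompt, c) for c in completions]  # hypothetical verifier
        mean_reward = sum(rewards) / len(rewards)
        hits += sum(r > mean_reward for r in rewards)
        total += len(rewards)
    return hits / total
```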
Fully human-written |