LACONIC: Length-Aware Constrained Reinforcement Learning for LLM
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces a new method, LACONIC, which aims to improve RL-based reasoning algorithms such as GRPO by enforcing target token budgets during training. LACONIC is a primal-dual method that alternates between optimizing the policy and updating the dual variable $\lambda$. Experiments on math reasoning benchmarks such as AIME'24 and MATH500 and on general reasoning benchmarks such as GPQA and MMLU show that the method, applied to 1.5B base models, matches the performance of GRPO variants while substantially cutting response lengths, thereby reducing inference-time costs.
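For concreteness, my reading of the underlying formulation (the notation here is mine and may differ from the paper's) is the standard constrained RL setup with a token-budget constraint:

$$
\max_{\pi}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\big[R(x,y)\big]
\quad\text{s.t.}\quad
\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\big[\,|y|\,\big]\le B,
$$

with Lagrangian $\mathcal{L}(\pi,\lambda)=\mathbb{E}\big[R(x,y)\big]-\lambda\big(\mathbb{E}\big[\,|y|\,\big]-B\big)$, where $B$ is the target token budget. The primal step updates the policy on $\mathcal{L}$ with $\lambda$ held fixed; the dual step then adjusts $\lambda$.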
- The paper is well written and presented.
- The method is well motivated: the growth of inference-time costs, driven largely by increasingly long responses, is an important problem, and the paper tackles it with an appropriate solution.
- The proposed LACONIC method, which introduces a primal-dual optimization strategy, is technically novel.
- All experiments are conducted on small-scale (1.5B) models. Previous work such as Sober Reasoning (https://arxiv.org/abs/2504.07086) has shown that RL results on small-scale models may not be reliable. Further experiments on larger and more diverse models (7B or larger) are required to ensure that the results are conclusive and transfer to real-world scales.
- The numbers in Table 1 are consistently lower than the baseline numbers reported in the Sober Reasoning paper. Since that work argues that careful tuning of baselines is essential for making correct claims about the gains from RL, proper baseline reporting is imperative. It would be important to re-evaluate all the model checkpoints in Table 1 under the fair evaluation strategy proposed by that paper to convincingly show that the results also hold under fair baselining.
- Some important baselines that also aim to shorten response lengths are missing: https://arxiv.org/pdf/2502.04463, https://arxiv.org/pdf/2504.01296. The paper should either add results comparing against these baseline methods or discuss why such comparisons are not valid or required.
- The conclusion section claims that the LACONIC method can be used flexibly across RL algorithms such as PPO and GRPO; however, the paper only shows results for GRPO. Does the method indeed improve other RL algorithms such as PPO or GSPO? If there are no additional experiments to back this up, the claim in the conclusion should be toned down.
All my questions are in the weaknesses section.
Fully human-written
LACONIC: Length-Aware Constrained Reinforcement Learning for LLM
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes a method, Laconic, to limit the output length of LLMs during training on reasoning problems. Laconic formulates the length restriction as a constraint imposed upon the usual reward maximization objective. The model first performs an update in the direction of reward maximization; then, the Lagrange multiplier governing the constraint is updated using a heuristic rule. The approach is tested on various reasoning benchmarks, including AIME-24, MATH, and Olympiad, where the model performs as well as or better than its counterparts while producing more concise outputs.
- The proposed idea is notably simple and highly effective. Various problems in RL (for example, safety) are naturally formulated with constraints, and this paper effectively borrows from those constrained-RL approaches by casting the length restriction as a constrained optimization problem.
- The proposed approach requires only a minor modification to the standard training process for LLMs on reasoning tasks. Empirically, it achieves marginally better or similar performance compared to its baselines, while using fewer output tokens.
- The experiments are thorough, and the ablation studies are well-executed. The inclusion of OOD (out-of-distribution) results is a valuable addition, especially since Laconic does not suffer a significant drop in performance when evaluated on reasoning problems outside the training dataset's distribution.
- The primary weakness of the paper lies in the update rule for the Lagrange multiplier proposed in Eq. 6. It is unclear how this update rule is obtained; it does not appear to follow the standard procedure of constrained optimization, in which the partial derivatives of the Lagrangian are set to zero and the resulting system of equations is solved simultaneously, nor a standard dual-ascent scheme (see the sketch after this list for the update I would have expected). Furthermore, the cost expression in Eq. 3 is unbounded, and in certain cases it might outweigh the reward term, resulting in undesirable updates. Finally, the approach prevents the LLM from exceeding the specified token budget only in expectation and does not necessarily yield succinct responses; for instance, “what is 2+3?” could still elicit an unreasonably long response that stays within the token budget yet is unnecessary.
- Another significant weakness is that the Lagrange multiplier and the token budget are prompt-agnostic. This may have an unwanted effect in cases where a longer response is required for some questions, especially when the questions are sampled from a heterogeneous mixture of datasets (also as noted above in the “what is 2+3?” example).
- On the experimental side, the paper would benefit from results where the Lagrange multiplier is treated as a fixed hyperparameter and selected via tuning; such results would help validate the benefit of the proposed update rule. The experiments section could be further strengthened with additional baselines, such as prompt-based baselines that instruct the LLM to produce shorter reasoning traces and responses.
- A surprising finding is the performance improvement on certain benchmarks when the response length is restricted, as the expectation would be to maintain the same performance or see a slight degradation. The authors should offer a hypothesis for this unexpected result.
- The paper should also address the counterintuitive performance drop observed across all datasets when the response length is increased from 2000 to 3000 tokens. This contradicts the improvement seen from 1500 to 2000 tokens and the general intuition that a larger budget should lead to better performance. An explanation for this anomaly is needed.
- The justification for the ceiling on the Lagrange multiplier is unclear. It appears arbitrary and lacks an obvious mathematical derivation. The authors should explain how this was determined.
- The paper would be strengthened by a discussion on selecting an appropriate token budget, particularly for unknown tasks. The authors should elaborate on the observed effects when the budget is set either too low or too high.
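To make the concern about Eq. 6 concrete, the update I would have expected is the textbook projected dual-ascent step (notation mine; the paper's Eq. 6 may differ):

$$
\lambda_{t+1}=\max\!\Big(0,\;\lambda_t+\eta\,\big(\mathbb{E}_{y\sim\pi_{\theta_t}}\big[\,|y|\,\big]-B\big)\Big),
$$

where $B$ is the token budget and $\eta$ a dual step size, so that $\lambda$ grows only while the average response length exceeds the budget and is projected back to zero otherwise. Note also that holding $\lambda$ fixed reduces the Lagrangian to a reward with a fixed linear length penalty, which is precisely the fixed-hyperparameter baseline suggested above.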
Lightly AI-edited
LACONIC: Length-Aware Constrained Reinforcement Learning for LLM
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper introduces LACONIC, a reinforcement learning (RL) method designed to control the output length of large language models (LLMs) during fine-tuning. The authors identify that RL-tuned LLMs often generate verbose reasoning traces that inflate inference cost and latency. To address this, LACONIC reformulates RL fine-tuning as a constrained optimization problem that maximizes task reward while enforcing an average token budget constraint. Empirical results across multiple reasoning benchmarks demonstrate that LACONIC achieves similar or better accuracy compared to standard RL methods like GRPO. The paper further validates LACONIC's generalization to out-of-domain tasks and shows ablations confirming robustness to hyperparameters and computational efficiency improvements.
- The proposed solution is interesting. The method elegantly reinterprets length control as a constrained optimization problem rather than as heuristic reward shaping.
- The paper is easy to follow.
- The major concern is that the experiments are conducted only on 1.5B-scale models. In addition, the results for Qwen2.5-Math-1.5B-Instruct are not very strong compared with GRPO.
- While the primal-dual approach is conceptually sound, the paper lacks formal convergence analysis.
- Baselines are mainly GRPO and L1-based methods. It would strengthen the paper to include comparisons with more recent efficiency-oriented RL approaches (e.g., GFPO [1]).
[1] Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning.
Besides the weaknesses above, further questions are as follows:
- Why does LACONIC (3000) have higher performance than LACONIC (2000) in Table 3?
Lightly AI-edited |