ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (100%) | 4.00 | 3.67 | 3238 |
| Total | 3 (100%) | 4.00 | 3.67 | 3238 |
Reinforcement Learning from Dynamic Critic Feedback for Free-Form Generations

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
Post-training methods for large language models (LLMs) constitute an active area of research. However, reinforcement learning (RL)-based fine-tuning is very challenging due to the generative nature of LLM outputs; in particular, the design of efficient reward functions is difficult. To address this issue, the authors introduce Reinforcement Learning from Dynamic Critic Feedback (RLDCF), a framework that frames post-training as an adversarial game between a generator and a critic. In RLDCF, each instruction/prompt provided to the LLM is associated with a set of rubrics representing task-specific requirements that the output should satisfy. The objective is then to train a generator that maximizes the probability of producing correct outputs. The critic, modeled as a stochastic policy, aims to provide the worst-case criterion for a given instruction–action pair. The generator is then trained by solving a min-max optimization problem. Both the generator and the critic are implemented as LLMs fine-tuned with a DPO loss. The proposed approach is evaluated on text and code generation tasks. For factual text generation, experiments are conducted on the Wikipedia Biography Dataset using the base generators Qwen3-4B and Qwen3-8B, compared against three baseline methods. For code generation, the authors employ the AceCode-87K-hard subset, with base generators Qwen2.5-Coder-7B-Base and Qwen2.5-Coder-7B-Instruct, also benchmarked against three baselines.

Strengths:
Post-training of large language models is crucial for improving the task-specific use of generative models and for providing robust LLMs. The presentation of the paper is clear, the problem is well motivated, and the overall description of the method is good. Experimental results demonstrate that RLDCF yields competitive results in both text and code generation quality, highlighting the effectiveness of adversarial critic feedback for fine-tuning.
- In the text generation experiment, RLDCF achieves the same level of factuality as FactTune-FS with fewer verification calls; it also improves the KL divergence along epochs with monotonic FactScore gains.
- In the code generation task, the proposed approach outperforms the enumerative method and the static reward model method on most benchmarks.

Weaknesses:
The results are interesting and promising on the two proposed tasks. However, as the paper is mostly experimental, I would expect more discussion of the methodological choices, for instance the way the critic and generator are updated. The influence of K (the number of candidate outputs for each instruction) and N (the number of criteria sampled from the critic) should be strong on the results. Even if no theoretical guarantees are provided (which is probably a very hard question), I would expect more discussion of these "hyperparameters"; see the sketch below for how I read their role. In the experiments, it is hard to assess the statistical significance of the results, as there is no uncertainty quantification (e.g., standard deviation of the metrics over independent runs).
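To make this concrete, here is the adversarial objective as I read it; the notation below is mine and may not match the paper's exactly. The generator $\pi_g$ maximizes, and the critic $\pi_c$ minimizes, the expected verdict of the external validator $V(s, a, c) \in \{0, 1\}$ over instructions $s$, outputs $a$, and criteria $c$:

$$\max_{\pi_g}\;\min_{\pi_c}\;\mathbb{E}_{s \sim \mathcal{D}}\,\mathbb{E}_{a \sim \pi_g(\cdot \mid s)}\,\mathbb{E}_{c \sim \pi_c(\cdot \mid s, a)}\bigl[V(s, a, c)\bigr]$$

Since the inner expectations are approximated by sampling K candidate outputs per instruction and N criteria per candidate, both quantities directly control how well the empirical objective tracks this population objective, which is why their influence deserves discussion.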
Although adding additional experiments or simulations is not strictly necessary to demonstrate the soundness of RLDCF, the contribution being primarily methodological, its impact would be strengthened if the empirical evaluation, especially for the factual text generation task, were illustrated with a broader variety of performance results. Minor weakness: the paper would benefit from careful proofreading to remove typos (for instance, the Appendix and some Tables are wrongly referenced).

Questions:
- In both experiments N and K are fixed, yet these parameters should have a strong impact on the results. Can you discuss this? These hyperparameters also differ between the two experiments; is there a reason for this?
- In the experiments, can you add some quantification of uncertainty (e.g., standard deviation over several independent runs)?
- Is it possible to report the performance of RLDCF on other text generation tasks to support the applicability of the method, for instance using datasets from the papers associated with the baselines (e.g., the medical question answering task of [Tian et al., 2024])?
- Does changing the backbone model for the generator and critic influence the experimental conclusions (factuality level / number of verification calls, dynamics of the KL divergence along epochs)?

EditLens Prediction: Fully human-written
Reinforcement Learning from Dynamic Critic Feedback for Free-Form Generations

Soundness: 2: fair
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper tackles the important problem of identifying the most likely failure case of generated free-form text by introducing a novel adversarial learning scheme for training a generator and a critic. The generator outputs multiple candidate solutions and the critic, given a candidate, finds the case most likely to fail for that generation. The authors employ a DPO loss for both the generator and the critic. They report consistent results on factual text generation and code generation, showing improved performance over baseline methods.

Strengths:
- The paper is well-written and easy to follow.
- The method outperforms the baseline methods in both factual text generation and code generation.
- The proposed approach is novel, applying adversarial learning to the important, significant problem of identifying failure cases in free-form generation.
- I find the use of DPO fairly interesting and novel.

Weaknesses:
- For fact verification, I do not necessarily see the need for a critic to specify which fact to check. Can one not separate out all the facts in the generated response, either programmatically or with an LLM, and then run each one through the fact verifier? I understand from later in the paper that verification is costly, but verifying each fact would provide more accurate rewards for the generator, correct? For code generation, I understand this simplification is not possible because the test cases cannot be parsed from the generated code.
- If I understand correctly, given a candidate answer a_i to s, the reward used to fine-tune the generator is sparse. Therefore, if even one fact is incorrect or one test case fails, the whole candidate output is assigned a reward of 0, correct? If so, in the case of code generation this may be too strict, because realistically code is regularly updated to handle unseen test cases rather than being discarded as entirely wrong. For fact generation, if one fact is wrong, the sparse reward does not help the generator learn which fact was wrong. I understand the issue may be mitigated by sampling multiple candidates and using the DPO loss, so that similar candidate solutions with different rewards give the generator finer-grained feedback. However, how often, especially at the beginning of training, do you get a group of candidates containing both reward = 0 and reward = 1 to provide some distinction to the generator? (See the sketch after this review for how I understand this pairing step.)
- I'm not too familiar with FactScore, so I am not sure what the bottleneck cost for verification mentioned in L249 is. But for code generation, I do not see why the number of test cases executed is a good metric for the efficiency analysis. Test cases, especially the ones provided in the datasets used here, can be executed quickly.

Questions:
See Weaknesses 1 and 2.

Minor suggestions/weaknesses:
- L86 has two empty parentheses.
- Missing reference in L248.

EditLens Prediction: Fully human-written
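To make the second weakness above concrete, below is a minimal sketch of the sparse-reward and DPO-pairing step as I understand it. The function names (generator, critic, validator) and the pairing heuristic are my own assumptions for illustration, not the paper's implementation.

```python
import random

def sparse_reward(candidate, criterion, validator):
    # All-or-nothing: 1 if the sampled criterion is satisfied, else 0.
    return 1 if validator(candidate, criterion) else 0

def build_dpo_pairs(instruction, generator, critic, validator, K=8):
    # Sample K candidate outputs for one instruction.
    candidates = [generator(instruction) for _ in range(K)]
    rewards = []
    for a in candidates:
        c = critic(instruction, a)  # critic proposes a likely failure criterion
        rewards.append(sparse_reward(a, c, validator))
    winners = [a for a, r in zip(candidates, rewards) if r == 1]
    losers = [a for a, r in zip(candidates, rewards) if r == 0]
    # If all K candidates receive the same reward (e.g., all 0 early in training),
    # no preference pair can be formed and the generator gets no learning signal.
    if not winners or not losers:
        return []
    return [(random.choice(winners), loser) for loser in losers]
```

If the per-candidate pass rate is p, the probability that a group of K candidates yields at least one usable pair is 1 - p^K - (1 - p)^K, which becomes small precisely when p is close to 0 (early in training) or close to 1 (late in training).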
Reinforcement Learning from Dynamic Critic Feedback for Free-Form Generations

Soundness: 2: fair
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper tackles a real pain point in training large language models: how do you optimize for tasks where outputs need to satisfy tons of different criteria, and you can't possibly check them all? The authors propose RLDCF, which basically turns training into a game between two models. A generator tries to produce good outputs, while a critic tries to catch the generator's mistakes by proposing specific ways it might fail. An external validator then checks whether the critic found a real error. The critic gets better at spotting weaknesses, and the generator gets better at avoiding them. The idea is neat: instead of exhaustively checking every possible criterion or relying on a static reward model that can be gamed, you dynamically focus on the most likely failure points.

The paper suggests promising results. For biography generation, they hit a FactScore of 0.889 while doing 5.7 times less verification work than existing methods. For code generation, they claim the best scores despite using only 9% of the training data. However, a potential problem in the code generation experiments suggests circular logic: they essentially created "reference solutions" using the same model family they are training, so the model could just be learning what Qwen is already able to do. The Qwen-7B-Instruct training shows limited improvement over the base model, which could well be within variance, and the authors did not provide much detail on what is improved or how. The core adversarial training idea is not particularly novel either; similar approaches have appeared in recent work on generative verifiers. The biography experiments are more solid, and the overall problem they are solving matters. But between the circular validation issues, limited novelty, and some unfair experimental comparisons, this feels like a decent idea that needs another round of work, with more solid experiments and applications that can justify a sufficient contribution to the area.

Strengths:
- Important and natural problem to tackle for rubric-based reward modeling and RL training for LLM post-training.
- The theoretical formulation is solid.
- Strong factual text generation results.
- Good ablation studies.
- The 4- to 8-sentence comparison shows the method scales well with complexity.
- The presentation is clear.

Weaknesses:
- I have doubts about the code experiment setup, as mentioned above.
- Lack of theoretical or empirical analysis of the method and the experiment results.
- Could benefit from more analysis, learnings, and additional experiments.

I'm open to changing my score if the authors can provide sufficient justification or insights for both experiments, especially the coding one.

EditLens Prediction: Fully human-written