ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction    Count      Avg Rating   Avg Confidence   Avg Length (chars)
Fully AI-generated     1 (25%)    4.00         4.00             3451
Heavily AI-edited      0 (0%)     N/A          N/A              N/A
Moderately AI-edited   0 (0%)     N/A          N/A              N/A
Lightly AI-edited      0 (0%)     N/A          N/A              N/A
Fully human-written    3 (75%)    5.33         4.00             2191
Total                  4 (100%)   5.00         4.00             2506
Individual Reviews (4 total)
Review 1
Title: Toward Efficient Exploration by Large Language Model Agents
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

This paper proposes using large language models (LLMs) to implement a well-established reinforcement learning (RL) algorithm, Posterior Sampling for Reinforcement Learning (PSRL). The authors show through experiments in bandit, tabular, and natural-language MDP environments that this "LLM-based PSRL" framework achieves more data-efficient exploration than other agent designs.

1. While the paper does present regret plots, it omits statistical significance tests. Can the authors please report those?
2. From my point of view, the paper has limited novelty. The core contribution is an application of PSRL through LLMs rather than a theoretical or methodological advance in RL or LLM architectures. Please let me know if I have misunderstood or overlooked another source of novelty.
3. I understand that the authors are trying to present more transparent results to understand the importance of PSRL. But, to me, the experiments remain toy and fall short of demonstrating scalability or robustness in realistic open-ended tasks. Could the authors consider adding more realistic domains, perhaps from embodied AI?
4. I am not convinced that the uncertainty captured by the posterior the authors refer to is actually calibrated. There is no formal analysis of whether LLM-generated samples reflect valid uncertainty quantification or just linguistic variability.
5. Many different LLMs are used in this paper. I would like to ask the authors to show an efficiency comparison (e.g., token or monetary cost per unit of reward improvement) across the models that are used.
6. There are some claims I am not sure I understand, e.g., "one might hope to see an LLM-based implementation of PSRL exhibit similar robustness in practice." This is not substantiated very well in the paper: no new regret bounds are provided for the non-convex world of LLMs, so I am not sure how to interpret such claims. The LLM introduces stochasticity and additional approximations, so do the theoretical results still hold?

Please see above.

EditLens Prediction: Fully human-written
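The statistical-significance request in point 1 above can be made concrete with a simple resampling procedure. The sketch below is illustrative only and not from the paper: it assumes per-step regret has been logged per seed (the array shapes and values are synthetic) and reports a percentile bootstrap confidence interval on the mean final cumulative regret.

```python
# Illustrative only; not from the paper under review. Assumes regret has been
# logged as an array of shape (n_seeds, horizon); all values below are synthetic.
import numpy as np

def bootstrap_ci(final_regrets, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean final cumulative regret across seeds."""
    rng = np.random.default_rng(seed)
    means = np.array([
        rng.choice(final_regrets, size=len(final_regrets), replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(1)
per_step_regret = np.abs(rng.normal(0.1, 0.05, size=(20, 500)))  # synthetic data
cumulative = per_step_regret.cumsum(axis=1)                      # per-seed cumulative regret
lo, hi = bootstrap_ci(cumulative[:, -1])
print(f"mean final regret {cumulative[:, -1].mean():.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```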
Review 2
Title: Toward Efficient Exploration by Large Language Model Agents
Soundness: 3: good
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

This paper proposes a new LLM-based algorithm designed to address the exploration inefficiency observed in existing LLM-agent frameworks. The authors adapt the classical Posterior Sampling for Reinforcement Learning (PSRL) algorithm, which is well known for its exploration efficiency, by replacing its three key subroutines with LLM API calls: (1) posterior sampling that imitates Bayesian (Thompson) sampling, (2) deriving and executing an optimal policy under the sampled model, and (3) updating the posterior for the next episode. The approach is compared against the classical PSRL formulation and other LLM-based agents such as Reflexion and ICRL. Across several simple, mostly tabular environments (e.g., Wordle, RiverSwim, and bandits), the LLM-PSRL agent achieves lower cumulative regret, attributed to more directed exploration. Empirically, the work demonstrates that an LLM can approximate Bayesian reasoning behavior through structured prompting and temperature-controlled stochasticity.

The paper is clearly written and transparent about its limitations, providing sufficient implementation details to reproduce the results. The structure and presentation are excellent, and each component of the algorithm and experiment is explained in a straightforward and understandable manner. Conceptually, the idea of embedding a classical reinforcement learning framework (PSRL) within an LLM-driven architecture is both interesting and original. It demonstrates that language models can approximate Bayesian reasoning through prompting alone. This makes the paper an insightful contribution for researchers exploring how LLMs can emulate principled decision-making methods.

The main limitation of the paper lies in its scope and the strength of its claims. While the proposed method demonstrates improved exploration efficiency, (1) its scalability to complex or real-world RL settings remains unclear. (2) The paper assumes that LLM agents inherit the exploration inefficiency of classical RL, *but this assumption is not empirically established and may not hold in the same way for language-based agents.* Consequently, the core motivation of addressing "poor exploration" in LLMs feels somewhat speculative. In addition, the decision to restrict comparisons to the LLM-agent design space (Reflexion and ICRL) limits the broader impact of the results. Including more diverse or recent LLM-agent baselines, once available, would make the contribution more convincing. Overall, the novelty is interesting, but the empirical and conceptual scope remains narrow.

1. You assume that LLM-based agents, like classical RL agents, suffer from poor exploration leading to data inefficiency. Could you provide stronger justification or empirical evidence supporting this assumption? How do you rule out other possible causes such as context limitations?
2. Have you verified that the LLM's stochasticity aligns with true Bayesian sampling behavior, or is this assumption based on temperature-induced randomness alone?
3. Is your method scalable to higher-dimensional or continuous RL environments? If not, what are the main bottlenecks (e.g., token cost, prompt size, sampling variability)?
4. You mention that practical PSRL implementations beyond tabular MDPs face computational hurdles, yet your experiments are conducted only on tabular-like environments. Could you clarify why you chose not to include a non-tabular or higher-dimensional benchmark?

EditLens Prediction: Fully AI-generated
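For readers unfamiliar with PSRL, the three-subroutine structure described in the summary above can be pictured as the episode loop below. This is a minimal sketch under assumptions, not the paper's actual implementation: the `sample_posterior`, `act`, and `update` callables are hypothetical stand-ins for prompted LLM calls, and a gym-style `env` interface (reset/step) is assumed.

```python
# A minimal sketch of the episode loop described above; not the paper's code.
# `sample_posterior`, `act`, and `update` are hypothetical stand-ins for prompted
# LLM calls, and a gym-style `env` (reset/step) is assumed.
def run_llm_psrl(env, sample_posterior, act, update, n_episodes, horizon, posterior_text):
    episode_returns = []
    for _ in range(n_episodes):
        # (1) posterior sampling: draw one plausible world model from the textual posterior
        sampled_model = sample_posterior(posterior_text)
        obs, episode_return, transcript = env.reset(), 0.0, []
        for _ in range(horizon):
            # (2) act (approximately) optimally with respect to the sampled model
            action = act(obs, sampled_model)
            next_obs, reward, done = env.step(action)
            transcript.append((obs, action, reward))
            episode_return += reward
            obs = next_obs
            if done:
                break
        # (3) posterior update: fold the observed trajectory back into the textual posterior
        posterior_text = update(posterior_text, transcript)
        episode_returns.append(episode_return)
    return episode_returns
```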
Review 3
Title: Toward Efficient Exploration by Large Language Model Agents
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

This submission studies how large language models can be used to solve text-based decision-making problems under uncertainty. While prior work has largely focused on prompting or fine-tuning language models to solve decision-making tasks, this work instead uses LLMs to implement certain aspects of classical RL algorithms. In particular, the authors show how to use language models to apply the classical Posterior Sampling for Reinforcement Learning (PSRL) algorithm (an extension of Thompson sampling from bandits to the RL setting) to text-based tasks. At a high level, they do so by instantiating PSRL with three LLM oracles: one to sample from a "textual posterior", one to select actions given the current state that maximize value in a way that is consistent with the posterior-sampling LLM, and one to update the PSRL agent's knowledge and residual uncertainty about the world, akin to an (approximate) posterior update. The authors empirically evaluate their LLM-based PSRL algorithm both on tabular RL tasks and on text-based tasks to which classic tabular RL algorithms are not applicable. They find that their algorithm outperforms other methods of using LLMs for solving these tasks (particularly when instantiated with the o1-mini reasoning model).

The authors' proposed idea intuitively makes a lot of sense: we should try to leverage all of the great existing work on RL when trying to solve text-based decision-making tasks under uncertainty, instead of coming up with new algorithms. The choice to use PSRL also makes sense, as it is naturally amenable to being instantiated with LLM oracles. From a practical implementation standpoint, it is nice that LLM-based PSRL can be easily upgraded as newer models are released by just "plugging them in" to the algorithm. The paper's writing was clear, and the experiments were fairly comprehensive (more on this below).

While there are no major weaknesses, the ideas in this paper are not particularly deep, and the empirical results are not terribly surprising. It makes sense that LLM-based PSRL would outperform other methods on text-based tasks, although I would have liked to see some experiments on a larger task to really show off the power of this method (e.g., a text-based role-playing game like Dungeons & Dragons). With that being said, I do not believe that these criticisms significantly detract from the merits of this submission.

Do you have any thoughts on how one would extend other RL algorithms to textual settings?

EditLens Prediction: Fully human-written
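The "plugging in" point raised above (that newer models can be swapped in without changing the algorithm) essentially amounts to keeping the three oracles behind one narrow interface. The sketch below is a hypothetical illustration, not the paper's API: `chat` stands in for whatever LLM client is actually used, and the prompt strings are placeholders.

```python
# Hypothetical illustration of the "plug in a newer model" point above; not the
# paper's API. `chat` is a stand-in for whatever LLM client is actually used.
from dataclasses import dataclass
from typing import Callable

@dataclass
class PSRLOracles:
    model: str
    chat: Callable[[str, str], str]  # (model_name, prompt) -> completion text

    def sample_posterior(self, posterior_text: str) -> str:
        return self.chat(self.model, f"Sample one plausible world model consistent with:\n{posterior_text}")

    def act(self, observation: str, sampled_model: str) -> str:
        return self.chat(self.model, f"World model:\n{sampled_model}\nObservation:\n{observation}\nBest action:")

    def update(self, posterior_text: str, transcript: str) -> str:
        return self.chat(self.model, f"Prior beliefs:\n{posterior_text}\nNew evidence:\n{transcript}\nUpdated beliefs:")

# Upgrading the agent is then just, e.g., PSRLOracles(model="o1-mini", chat=my_chat_fn).
```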
Review 4
Title: Toward Efficient Exploration by Large Language Model Agents
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper introduces an implementation of PSRL (Posterior Sampling for RL) that uses LLMs to approximate each step (posterior sampling, optimal policy rollout, and posterior update) purely via prompting. Extensive experiments on a variety of small-scale tasks (e.g., Bernoulli bandit, Wordle) show the effectiveness of LLMs at minimizing cumulative regret across increasing horizon lengths, outperforming related works. This work highlights the effectiveness of using LLMs to implement traditional RL algorithms that have been well studied for their performance benefits.

Strengths:
- The paper is well motivated and the reasoning is sound.
- The mathematical formulation is clear, and alternatives (IDS) are explained and tested.
- Strong results on a variety of tasks, and performance improves with stronger LLMs.

Weaknesses:
- Generalization to harder tasks is unclear. The formulation depends on using the LLM as an approximate optimal policy, but, as we have seen, LLMs are often not the optimal policy. Thus it is unclear how well this method would work if the LLM is especially weak at the given task.
- Significance of contribution. The authors advocate for implementing existing RL algorithms (like PSRL) via LLMs. However, this comes with the overhead of prompt engineering (plus some assumptions, as mentioned in the first bullet point) that may make it harder to scale to new environments/tasks.
- Effectiveness of exploration for RL training. PSRL is a well-studied exploration algorithm; however, current work on LLMs + RL focuses on RL fine-tuning of LLMs, mainly using policy-gradient methods. Are the trajectories generated by the proposed method useful for training LLMs with RL on the task (e.g., faster convergence, higher performance ceiling)?

Questions:
1) How well does the algorithm perform if the base LLM is especially weak at the given task?
2) Are the trajectories generated by the proposed method useful for training LLMs with RL on the task (e.g., faster convergence, higher performance ceiling)?

EditLens Prediction: Fully human-written
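As background for the Bernoulli-bandit results discussed in the review above, the sketch below shows classical Thompson sampling with conjugate Beta posteriors, the special case that PSRL generalizes to full RL. It is purely illustrative and not from the paper; the arm probabilities and horizon are made up.

```python
# Classical Thompson sampling on a Bernoulli bandit (the special case PSRL generalizes).
# Purely illustrative; the arm probabilities and horizon are made up.
import numpy as np

def thompson_bernoulli(true_probs, horizon, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    alpha, beta = np.ones(k), np.ones(k)      # Beta(1, 1) priors per arm
    cumulative_regret, best = 0.0, max(true_probs)
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)          # posterior sample for each arm
        arm = int(np.argmax(theta))            # act greedily w.r.t. the sample
        reward = rng.binomial(1, true_probs[arm])
        alpha[arm] += reward                   # conjugate posterior update
        beta[arm] += 1 - reward
        cumulative_regret += best - true_probs[arm]
    return cumulative_regret

print(thompson_bernoulli([0.2, 0.5, 0.8], horizon=2000))  # regret grows sublinearly
```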