What Matters for Batch Online Reinforcement Learning in Robotics?
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper investigates the problem of "batch online reinforcement learning," a paradigm where a robotic policy is improved by collecting large batches of autonomous data and then performing offline updates. The authors conduct a systematic empirical study to determine the key components for effective learning in this setting. They analyze performance across three axes: (i) algorithm class (imitation learning vs. value-based RL), (ii) policy extraction methods (explicit vs. implicit), and (iii) policy expressivity (Gaussian vs. diffusion policies). The main findings suggest that value-based methods significantly outperform imitation-based ones, implicit policy extraction is more effective than explicit extraction, and expressive diffusion policies are superior to less expressive Gaussian ones. Based on these results, the paper proposes a general "recipe" for effective batch online RL in robotics.
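For concreteness, my understanding of the batch online RL loop studied here is roughly the following minimal Python sketch; the function names (`collect_rollouts`, `train_offline`) are placeholders of mine, not the authors' implementation.

```python
def batch_online_rl(demo_data, collect_rollouts, train_offline,
                    num_iterations: int, episodes_per_batch: int):
    """Minimal sketch of the batch online RL loop as I understand it.

    collect_rollouts(policy, n) -> list of autonomous trajectories
    train_offline(dataset)      -> policy (and, for value-based methods, a Q-function)
    Both callables are placeholders, not the authors' implementation.
    """
    dataset = list(demo_data)            # initial (e.g., demonstration) dataset D_0
    policy = train_offline(dataset)      # initial offline training
    for _ in range(num_iterations):      # a small number of deployment iterations
        batch = collect_rollouts(policy, episodes_per_batch)  # frozen policy, no online updates
        dataset += batch                 # aggregate all data collected so far
        policy = train_offline(dataset)  # offline policy improvement before the next deployment
    return policy
```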
* The paper addresses a practical and important problem. The "batch online RL" setting, which involves collecting large batches of data for offline updates, is a sensible and scalable approach for real-world robotics, reducing the need for constant human supervision and avoiding the instability of purely online updates.
* The work is structured as a controlled study that ablates different components of the learning pipeline (algorithm, extraction, expressivity). This systematic approach helps to isolate the factors that contribute most to performance.
* A major weakness is the lack of motivation for choosing the three specific axes of analysis. The paper repeatedly refers to these axes but never explains why these are the most critical or representative components to study. The selection of only three algorithm classes (IL, filtered-IL, and value-based RL) also feels restrictive and lacks justification, especially when hybrid methods exist.
* Several of the key findings seem intuitive or are well-established principles in reinforcement learning. For instance, the conclusion that value-based methods lead to more diverse data and better performance than simple imitation learning is not a surprising result. The paper would be stronger if it more clearly situated these findings in the context of what is already known and highlighted what is genuinely new.
* The paper frames its final contribution as a general "recipe" for batch online RL, but the supporting experiments are conducted on a handful of simulation tasks and a single real-world task. This may not be sufficient evidence to support such a broad claim. Presenting the findings as a set of recommendations or case studies might be a more accurate and defensible framing. The claim that value-based RL is "necessary but not sufficient" is also very strong and may not be fully supported by the evidence.
* There appears to be a contradiction between the claims made in the text and the data presented in the figures. For example, the paper states that "we cannot get away with just doing IL or filtered-IL," but the results in Figure 5 suggest that these methods do show performance gains and scale with more data, even if they saturate earlier than value-based methods.
* Could the authors provide a stronger justification for why these three specific axes—algorithm class, policy extraction, and policy expressivity—were chosen as the focus of the investigation?
* The introduction states that current algorithms "struggle to fully leverage" autonomous data. Could you elaborate on the specific failure modes of prior works that motivate this study?
* Why was the comparison of algorithm classes limited to IL, filtered-IL, and value-based RL? There are other approaches, such as those that interpolate between an IL and RL loss, that could have served as relevant baselines.
* What was the rationale for using a diffusion-based policy as the default across all experiments?
* For the filtered-IL baseline, what was the specific threshold or criterion used to filter "low-quality" trajectories?
* In the experimental setup, the initial datasets are chosen to yield a base policy with a 30-65% success rate. How can we be sure that the conclusions drawn from this "realistic scenario" generalize to other starting conditions (e.g., starting from a much worse or better policy)?
* Regarding Figure 3, the plot lines are difficult to distinguish. Could the visualization be improved for clarity? It would also help to make the figure self-contained by explicitly stating that the value-based method shown is IDQL.
Moderately AI-edited
---
What Matters for Batch Online Reinforcement Learning in Robotics?
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper investigates algorithmic choices in the batch online RL setting, where a limited number of training and data-collection iterations are interleaved. Three aspects are evaluated: value-based methods vs. imitation learning, explicit vs. implicit policy extraction, and the expressivity of the policy. The evaluations are conducted on simulated robotics environments and a real-world task.
The topic---batch online RL---is relevant to current applications of RL.
The paper is well-written. The use of color for denoting the sections is a nice touch.
This paper does a good job at exploring some important algorithmic considerations for the batch online RL setting and the findings are presented clearly.
Some findings are quite interesting, such as value-based methods outperforming the filtered IL after several iterations of batch training but not initially.
My main concern is the lack of certain baselines in the experimental sections.
- What are the previous baseline methods for the benchmarks considered? While I see multiple comparisons in each section to examine the impact of various choices, I do not see any explicit comparison to prior methods. While achieving the best performance is not strictly necessary, it would be helpful to have some baseline numbers to get a sense of how well the evaluated algorithm is doing.
- In Fig. 7, it seems like a natural baseline would be to also include the diffusion policy with explicit policy updates. Is there a particular reason this was not included? The difference in policy class seems like a major reason why there is more diversity with the value-based method, which is presented as an advantage (line 282). It is not clear that Fig. 4 is a fair comparison if the filtered-IL method used a Gaussian policy instead.
- Fig. 4 is unclear. It is not entirely obvious that there is greater diversity with the value-based method, particularly on the "Square" task. Perhaps it would be better to add a numerical measure of the diversity of states. Keeping only the 2D or only the 3D plots might also be clearer.
- Is the temporally-extended noise that important? Some papers have shown that independent Gaussian noise can be just as effective [1, 2]; a minimal sketch of the comparison I have in mind is given after the references below.
- How does the value-based approach compare to some model-based methods? Line 285 posits that value-based methods outperform the policy-based ones due to utilizing bad trajectories more effectively. This property would seemingly be shared by model-based approaches.
[1] CleanRL implementation of DDPG.
[2] "Addressing Function Approximation Error in Actor-Critic Methods," Fujimoto et al.
Fully human-written |
---
What Matters for Batch Online Reinforcement Learning in Robotics?
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper empirically studies what matters for batch online reinforcement learning for robotic tasks. In batch online RL, where agent training and data collection are decoupled, the agent alternates between online data collection by rolling out the current policy and offline policy improvement trained on all data collected so far. The authors hypothesize that an effective batch online RL method should be able to collect diverse data throughout the process. They should also be effective in learning from the diverse dataset collected. The authors examine three key components of the method design: (1) IL-based or Value-based RL (IQL), (2) explicit (AWR) or implicit policy extraction, and (3) expressive (Diffusion-based) or non-expressive (Gaussian) policy. The authors confirm that expressive value-based RL with implicit policy extraction is a feasible recipe for batch online RL in robotics, which achieves both high performance and scaling ability on six simulation tasks and one real-world task.
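For reference, my understanding is that the value-based variant uses the standard IQL objectives; the paper's equation (1) should match the value loss below, with $V$ parameterized by $\psi$ (see also the Minor Issues):

$$\mathcal{L}_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\,\big|\tau - \mathbb{1}\{Q_\theta(s,a) - V_\psi(s) < 0\}\big|\,\big(Q_\theta(s,a) - V_\psi(s)\big)^2\Big],$$

$$\mathcal{L}_Q(\theta) = \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(r(s,a) + \gamma V_\psi(s') - Q_\theta(s,a)\big)^2\Big].$$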
- The paper is well-structured and easy to read.
- The problem setting, batch online RL, is practical and promising. And it makes sense to decompose the problem into the proposed 3 components.
__There are missing experimental details I would like to verify further:__
- For the experiments in Section 4.1, my understanding is that all three methods train a policy with the DDPM objective, and the value-based RL method additionally trains a Q-function that is used only to guide policy rollouts. If this is the case, why does the initial performance of value-based RL differ from the IL baseline when both of them learn from $\mathcal{D}_0$?
- In Section 4.2, what is the training objective when the authors apply AWR to a diffusion-based policy with the DDPM objective? Did the authors reweight the MSE by the advantages, as IDQL does in their paper (Appendix F.5)? A sketch of my reading is given after this list.
- For the real-world task, when applying the Steering baseline, why did the authors not train the Q-function in a batch online RL manner? I am aware of the Steering paper’s setting and how it is deployed in this paper, but a reasonable variant should consider iteratively improving the Q-function on new online rollouts (e.g., N=3, M=30). This would also help clarify whether the performance bottleneck of batch online RL lies in value learning.
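To make the second question concrete, my guess at an AWR-style reweighting of the DDPM objective (my reading, not necessarily what was implemented) is:

$$\mathcal{L}(\phi) = \mathbb{E}_{(s,a)\sim\mathcal{D},\,t,\,\epsilon\sim\mathcal{N}(0,I)}\Big[\exp\big(\beta A(s,a)\big)\,\big\|\epsilon - \epsilon_\phi\big(a^{(t)}, s, t\big)\big\|^2\Big],\qquad A(s,a) = Q_\theta(s,a) - V_\psi(s),$$

where $a^{(t)}$ is the noised action at diffusion step $t$. Confirming whether this (or a different objective) was used for explicit extraction would help interpret Section 4.2.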
__Some arguments made in the paper were not well-supported:__
- In my opinion, the central claim of this paper is: “policies that not only can collect diverse trajectories but also can learn from this diversity matter for batch online RL” (Lines 064-066). Intuitively, I agree that expressive RL policies with value guidance should satisfy this requirement and thus work well for batch online RL. However, the normalized score is an indirect metric for supporting this claim. As in Figure 4, I expected to see a more direct quantitative analysis or visualization of the diversity of the collected data in each iteration, along the three proposed axes (one possible metric is sketched after this list). For example, although an implicit Gaussian policy performs worse than the recipe, showing that it can nonetheless collect relatively more diverse data on the Threading and Square tasks would actually strengthen the central claim.
- The argument made in Lines 346-351 is not supported with empirical evidence. Could the authors confirm whether implicit policy extraction can generate and leverage more diverse data? Could the authors confirm that explicit policy extraction cannot adapt the policy to value functions trained on new data? Is value learning the performance bottleneck, or policy learning? Or is data collection the bottleneck?
- In Lines 395-402, could the authors provide direct evidence showing that an expressive policy can capture multimodality in the action distribution in the initial iteration? How does the ability to model the action distribution change across iterations?
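As one concrete, purely illustrative option (the function and variable names are my own), a simple coverage-style diversity measure over end-effector or object states could be reported per iteration:

```python
import numpy as np

def state_diversity(states: np.ndarray, bin_size: float = 0.02, max_pairs: int = 1000) -> dict:
    """Simple diversity measures for states visited in one batch of rollouts.

    states: array of shape (num_transitions, state_dim), e.g., end-effector
    or object positions collected during one batch online RL iteration.
    """
    # Coverage: number of unique occupied bins after discretizing the state space.
    bins = np.floor(states / bin_size).astype(np.int64)
    coverage = len({tuple(b) for b in bins})

    # Dispersion: mean pairwise Euclidean distance on a random subsample.
    idx = np.random.choice(len(states), size=min(len(states), max_pairs), replace=False)
    sub = states[idx]
    dists = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
    dispersion = float(dists[np.triu_indices(len(sub), k=1)].mean())

    return {"bin_coverage": coverage, "mean_pairwise_dist": dispersion}
```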
__Although the paper applies existing methods to a new problem setup, I feel its novelty is limited.__
- First of all, although I agree that diverse data collection and learning from diverse data are essential for batch online RL, I think the community has been aware of their necessity for offline RL, imitation learning, and online RL. Early prior studies have shown their importance in each setup, e.g., [1, 2, 3, 4]. Given that the batch online RL setup could be treated as consecutive offline RL or “low update-to-data ratio” online RL, what other unique insights can the reader take away?
- For the second aspect of policy extraction, I recall that IDQL found that explicit policy extraction via AWR does not bring benefit to the DDPM-trained diffusion policy.
__Minor Issues:__
- In equation (1), the V function should be parameterized by $\psi$.
- In Line 292, it should be “the amount of data collected”.
- In Line 261, the notation of N is abused.
- In Lines 147-149, the authors argue that online RL risks unsafe and unstable behavior. I wonder whether the agent in batch online RL also faces this risk, since unsafe and unstable behavior can still occur during the data collection stage if the offline training was insufficient. But I agree that this should happen less often than in the purely online RL paradigm.
- Compared with the network used for the diffusion-based policy, the network used for the Gaussian policy in Section 4.3 is smaller (a 3-layer MLP). Although I do not expect that increasing the model size for the Gaussian policy would significantly improve performance, the authors should mention this experimental detail in the main body.
__References:__
[1] Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning
[2] Behavior Transformers: Cloning k modes with one stone
[3] Synthetic Experience Replay
[4] The Primacy Bias in Deep Reinforcement Learning
Please refer to the questions above the __"Minor Issues"__ part.
Fully human-written |
---
What Matters for Batch Online Reinforcement Learning in Robotics?
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper studies the batch online RL setting for robot manipulation, where a fixed policy collects large batches of autonomous rollouts, and then the policy (and optionally a Q-function) is retrained offline before the next deployment. The authors systematically vary three axes: (i) algorithm class (imitation learning or filtered imitation learning versus value-based RL), (ii) policy extraction (explicit versus implicit), and (iii) policy expressivity (Gaussian versus diffusion). They conclude that value-based RL with an expressive policy and implicit action selection, implemented by sampling multiple actions from the policy and choosing the argmax under the learned Q, works best, and that adding small Ornstein-Uhlenbeck noise during rollout yields further gains. Experiments on six simulated manipulation tasks and a real robot tape-on-hook task show strong improvements and better scaling trends than imitation-learning baselines.
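For concreteness, my understanding of the implicit action-selection step is roughly the following sketch; the interfaces (`policy.sample`, `q_net`) are assumptions of mine, not the authors' code.

```python
import torch

@torch.no_grad()
def select_action(policy, q_net, obs: torch.Tensor, num_samples: int = 32) -> torch.Tensor:
    """Implicit policy extraction: sample candidate actions from the (diffusion)
    policy and execute the argmax under the learned Q-function."""
    obs_batch = obs.unsqueeze(0).repeat(num_samples, 1)   # (num_samples, obs_dim)
    candidates = policy.sample(obs_batch)                 # (num_samples, action_dim)
    q_values = q_net(obs_batch, candidates).squeeze(-1)   # (num_samples,)
    # During data collection, the paper additionally perturbs the executed
    # action with small Ornstein-Uhlenbeck noise.
    return candidates[q_values.argmax()]
```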
The paper is well structured, easy to follow, and clearly motivated. Problem setting, assumptions, and goals are stated upfront, which makes the experimental choices and takeaways straightforward to understand.
The work decomposes batch online RL into three axes (algorithm class, policy extraction, and policy expressivity) and systematically compares design choices within each axis. This comparative study is valuable to the community because it clarifies which ingredients matter in practice, turns ad hoc intuitions into evidence-backed guidance, and provides an actionable recipe that reliably beats imitation learning baselines.
1. Limited novelty: The paper’s primary contribution is a careful, systematic comparison of existing ingredients rather than a new objective or algorithm. The proposed recipe largely recombines known components and matches IDQL when run for a single iteration, so the conceptual contribution feels incremental even though the comparative insights are useful.
2. Unclear hyperparameter tuning and fairness: The paper does not clearly explain how hyperparameters for the different ingredients were selected, what ranges were explored, or whether tuning effort and budgets were balanced across methods and tasks. This makes it hard to attribute gains to the method rather than to choices like K (implicit samples), β or the expectile τ, the number of diffusion steps, model capacity, or critic settings. I would suggest documenting the tuning protocol (search space, selection criteria, per-task vs. global tuning, seeds).
See Weaknesses
Heavily AI-edited |