ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 4 (100%) | 5.00 | 4.25 | 3655 |
| Total | 4 (100%) | 5.00 | 4.25 | 3655 |
The Three Regimes of Offline-to-Online Reinforcement Learning

Soundness: 2: fair
Presentation: 4: excellent
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces an empirical framework to predict which kind of offline-to-online RL algorithm will perform better for a given dataset and policy, based on the relative performance of the data-collection policy and the policy to be fine-tuned. The proposed takeaways are that if the given policy performs better than the data-collection policy, methods that provide stability around the pre-trained policy will perform better, whereas if the data-collection policy is better, methods that provide stability around the dataset will be better.

Strengths:
- The authors provide very thorough empirical results. They test 21 dataset-task tuples, and 3 pre-training methods per tuple.
- The writing is generally clear and focused.
- Code is released, which is appreciated to support reproducibility.

Weaknesses:
I have key concerns regarding the definitions of the method families and the interpretations of the results, which leave me unconvinced about the central claims in the paper.

**Key concern**: why does offline RL regularization during fine-tuning count as stability with respect to the pre-trained policy $\pi_0$? If my understanding is correct, the same method used for the offline phase is used during the online phase for offline RL regularization. So, for example, in figures 3, 4, and 5, if CalQL was used for pre-training, then the same CalQL coefficient is used for online fine-tuning. Why do the authors interpret CalQL as providing stability with respect to the pre-trained policy $\pi_0$? It does not keep a frozen copy of this policy at all. I would actually argue it provides stability relative to the offline dataset $D$, since it increases the Q-values of state-action tuples in the dataset and decreases them for on-policy and random action samples. For BC-pretrained policies, does offline RL mean TD3+BC in this paper? If so, the same argument applies: this does not provide stability w.r.t. the pre-trained policy. With correct labeling of methods, the conclusions in the paper might change significantly.

**Clarity / other concerns:**
- Line 216: the Minimal baseline is introduced but never explained. It is mentioned again in line 268, but it is still unclear ("minimal baseline corresponds to maximum plasticity with no explicit stability mechanism" — so SAC?).
- Line 258: I don't think this perspective on RLPD is meaningful, since there is no pre-training at all. This should be rephrased to make that clear (right now it might read as "some networks are kept, while others are reset for plasticity").
- Results in figure 3: offline RL + offline data should be equivalent to CalQL, right? Why is performance significantly worse than the numbers reported in the CalQL paper? E.g. for halfcheetah-medium-v2 from a CalQL init (top middle plot), the green line plateaus at 60%, but CalQL reports 93% after fine-tuning. Likewise with hopper-random-v2, and to a lesser extent antmaze-large-diverse-v2.
- Line 356: "This aggregate outcome strongly supports our principle that, in the superior regime, …" I don't see strong support in figure 3. The green method tries to stick close to dataset behaviors, and it is the best-performing method in the top-left plot. The top-right plot shows large overlapping confidence intervals. Same with walker2d-medium-replay-v2, BC.
- The methods compared are not explained properly. E.g. in figure 3 for "walker2d-medium-replay-v2, BC", it is clear that the policy is pre-trained with BC. However, how is offline RL applied? Is a critic pre-trained, or is the critic started from scratch for the online phase? If the latter, then the orange line (offline RL without offline data) is not meaningful. What type of offline RL regularization is applied for the orange and green lines?
- In figure 5, it doesn't look like $\pi_0 \approx D$. The green line, a method that tries to stay close to the support of the dataset (i.e., a stability-relative-to-the-offline-dataset approach), seems to perform better than the other methods.
- Eqn 4: shouldn't it be $J_{\text{off}}^*$ instead of $\min_j J(\pi_j)$? Otherwise this does not measure how much we can improve during fine-tuning, but rather the range of values during fine-tuning, which might be much larger if stability is low. (A hedged reconstruction of the two definitions follows this review.)

EditLens Prediction: Fully human-written
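To make the Eqn 4 concern concrete: the paper's equation is not reproduced in this review, so the following is only a hedged reconstruction, assuming plasticity is defined as the gap between the best return reached online and a reference value, with $\pi_1, \dots, \pi_N$ the policies evaluated during fine-tuning and $J_{\text{off}}^{*}$ the return at the end of offline pre-training:

$$
\underbrace{\max_{1 \le j \le N} J(\pi_j) \;-\; \min_{1 \le j \le N} J(\pi_j)}_{\text{as written: range of returns during fine-tuning}}
\qquad \text{vs.} \qquad
\underbrace{\max_{1 \le j \le N} J(\pi_j) \;-\; J_{\text{off}}^{*}}_{\text{suggested: improvement over the offline starting point}}
$$

Under the first definition, an unstable run whose returns dip far below the starting point scores higher, which is the behaviour the reviewer objects to.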
The Three Regimes of Offline-to-Online Reinforcement Learning

Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper discusses three regimes in offline-to-online RL and analyses the design decisions in different algorithms to show which ones are suited for which regimes. The three regimes are defined by the difference between the performance of the offline RL policy and the performance of the data-collection policy. Superior regimes (pre-trained policy better than dataset) require the online policy to stay close to its initialization; inferior regimes need to retain the offline dataset, and parameter resets can also be beneficial; comparable regimes yield similar performance across methods.

Strengths:
- Offers a unique insight into the different regimes of offline-to-online RL, and offers a great discussion for the community of why certain RL algorithms are better than others in certain regimes. This is the first work that I am aware of that classifies regimes based on the performance difference between the pre-trained policy and the data-collection policy.
- The writing is super clear and clearly explains the different regimes and which methods excel. I appreciate the takeaway section at the bottom of every results section.
- Clear and comprehensive experiments to support the paper's claims.

Weaknesses:
- Section 3.2 offers definitions of new concepts such as stability, plasticity, and knowledge decomposition. However, as far as I can tell, these definitions are never used or referenced in the subsequent text (which only uses $J(\pi_0)$ and $J(\pi_D)$ to define the three regimes; a compact restatement of that taxonomy follows this review). Why define these concepts here? How are they useful? Why are they defined this way and not in other forms? With no discussion, their definition seems a bit random.
- This paper mainly focuses on off-policy RL algorithms for offline-to-online RL, yet it misses any discussion of one critical part of off-policy RL: learning a Q/value function. How does the quality of the Q function change in the three different regimes? And how does the quality of the Q function impact fine-tuning performance? Without discussion of the Q function, the paper draws conclusions only from policy performance ($J(\pi_0)$ and $J(\pi_D)$), but this is not a full picture of the three regimes. For example, in the "Inferior regime", is the reason we need to retain the dataset a bad policy initialization, or a bad pre-trained Q function (which needs a parameter reset)? For example, Fig 4 shows that BC initialization is really bad unless the dataset is kept: this is probably because there is no pre-trained Q function. This is the most unsatisfying part of the paper for me. The paper cites WSRL [1], which offers some discussion of how the quality of the pre-trained Q function impacts fine-tuning performance, and I would love to see more discussion of that in this paper.
- There is a missed opportunity to discuss more how the different pre-training methods influence fine-tuning results in the three different regimes.

[1] Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data

Questions:
- In the superior regime, why does plasticity not matter? Shouldn't plasticity matter in general for all three regimes?

EditLens Prediction: Fully human-written
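For reference, the regime taxonomy this review describes can be stated compactly. This is a hedged reconstruction from the review's description, not the paper's own equation; the tolerance $\epsilon$ separating "comparable" from the other regimes is an assumption:

$$
\Delta \;=\; J(\pi_0) - J(\pi_D), \qquad
\text{regime} \;=\;
\begin{cases}
\text{Superior}, & \Delta > \epsilon \quad (\text{stay close to the initialization } \pi_0),\\
\text{Comparable}, & |\Delta| \le \epsilon \quad (\text{methods perform similarly}),\\
\text{Inferior}, & \Delta < -\epsilon \quad (\text{retain the offline dataset; resets can help}).
\end{cases}
$$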
The Three Regimes of Offline-to-Online Reinforcement Learning

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces the stability-plasticity principle to explain the inconsistent empirical behavior observed in offline-to-online reinforcement learning. The authors propose a stability-plasticity principle for offline-to-online RL and a taxonomy of three regimes: Superior, Comparable, and Inferior. For each regime, the paper conducts an extensive empirical study and prescribes different fine-tuning tactics: prioritizing stability around the offline pre-trained policy, stability around the offline dataset, or a mixture of both. A large-scale empirical study is conducted to validate the framework, finding that the results align closely with the predicted regime-specific prescriptions.

Strengths:
- The paper is well written, and the stability-plasticity principle provides a unified explanation for previously conflicting results in offline-to-online RL.
- It clearly categorizes practical algorithmic choices (warm-up, replay, regularization, reset, etc.) within the framework.
- The empirical evaluation is extensive and supports the paper's recommendations.

Weaknesses and questions:
- The paper provides an empirical analysis of best practices across offline-to-online regimes, but it lacks a unified algorithm that automatically infers the current regime and selects the appropriate design choices. I recommend adding a simple, practical regime-detection algorithm that practitioners can use (a minimal sketch of such a rule follows this review).
- It is unclear whether approximating the dataset knowledge $J(\pi_D)$ by the dataset's average accumulated return is appropriate. Why not use behavioral-cloning performance as an alternative proxy for $J(\pi_D)$?
- The experiments are mostly carried out on D4RL state-based tasks. It would be great to further test on pixel-based tasks to see whether the framework also holds in high-dimensional settings.
- Since the paper notes that BC typically achieves performance comparable to the dataset itself, why are so many BC-pretrained settings classified as Inferior or Superior?
- How does your framework relate to dataset coverage? Offline RL usually does well in high-coverage regimes and poorly in narrow-data regimes; does this map onto the stability-plasticity explanation?

EditLens Prediction: Fully human-written
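As a minimal sketch of the kind of regime-detection rule this review asks for, under the paper's framing of comparing the pre-trained policy's return $J(\pi_0)$ against the dataset's average return as a proxy for $J(\pi_D)$. The function name `detect_regime`, the tolerance `eps`, and the relative normalization are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def detect_regime(policy_eval_returns, dataset_returns, eps=0.1):
    """Classify a fine-tuning setting as 'superior', 'comparable', or 'inferior'.

    policy_eval_returns: returns from evaluation rollouts of the pre-trained policy pi_0.
    dataset_returns: per-trajectory returns of the offline dataset (proxy for J(pi_D);
                     the review suggests BC performance as an alternative proxy).
    """
    j_pi0 = float(np.mean(policy_eval_returns))   # estimate of J(pi_0)
    j_pid = float(np.mean(dataset_returns))       # estimate of J(pi_D)
    delta = j_pi0 - j_pid
    tol = eps * max(abs(j_pid), 1e-8)             # relative tolerance; an assumption
    if delta > tol:
        return "superior"    # paper's prescription: stability around pi_0
    if delta < -tol:
        return "inferior"    # paper's prescription: stability around the dataset; resets can help
    return "comparable"

# Example usage with made-up numbers:
print(detect_regime(policy_eval_returns=[85.0, 90.0, 88.0],
                    dataset_returns=[40.0, 55.0, 60.0]))  # -> "superior"
```

A practical version would also have to decide how many evaluation rollouts are needed for a reliable estimate of $J(\pi_0)$, which the sketch leaves to the caller.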
The Three Regimes of Offline-to-Online Reinforcement Learning

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper conducts an empirical study over 21 D4RL tasks to analyze how the performance of offline pre-trained policies influences the performance of fine-tuning algorithms. In particular, the authors classify the fine-tuning algorithms by how much they rely on the pre-trained policy vs. the offline dataset (i.e., $\pi_0$-centric vs. $\mathcal{D}$-centric) and uncover a trend where $\pi_0$-centric fine-tuning algorithms tend to work well when the pre-trained policy outperforms the behavior policy, and vice versa (i.e., $\mathcal{D}$-centric fine-tuning algorithms tend to work well when the pre-trained policy performs poorly compared to the behavior policy). The authors also define notions of prior knowledge, stability, and plasticity to explain the trend. In particular, the authors propose the stability-plasticity principle, which suggests that (1) for tasks where the pre-trained policy performs poorly, the fine-tuning algorithm needs sufficient plasticity, and (2) for tasks where the pre-trained policy performs well, the fine-tuning algorithm needs sufficient stability. $\mathcal{D}$-centric and $\pi_0$-centric methods prioritize plasticity and stability respectively, explaining the empirical synergies with tasks with different pre-training performance.

Strengths:
- The empirical study is thorough and provides convincing evidence that the performance of the pre-trained policy is a relatively robust predictor (43 of 63 cases on D4RL tasks) of which class of algorithms ($\pi_0$-centric vs. $\mathcal{D}$-centric) will work best for online fine-tuning.
- The selected $\pi_0$-centric and $\mathcal{D}$-centric algorithms are strong algorithms from the literature, making the empirical study fair and reasonable, which is valuable to the community.

Weaknesses:
My main concerns about the paper are two-fold.

*(1) Regime classification of the tasks (Superior vs. Comparable vs. Inferior) is conceptually flawed and may be subtly incorrect in many scenarios.* The authors propose to classify the tasks based on the return of the pre-trained policy $J(\pi_0)$ and the return of the behavior policy $J(\pi_\mathcal{D})$, and then argue that a poorly performing policy needs data-centric algorithms with high plasticity (which often trades off stability). While this may work well empirically in many cases, it rests on a key assumption that the performance of the policy is indicative of its usefulness for generating data during online fine-tuning. This assumption can break down in many long-horizon, sparse-reward tasks where it is difficult to pre-train a policy to achieve a non-trivial success rate directly, but the policy can still progress far enough in the task to provide good exploratory data. This is where stability should be preferred over plasticity, which can be overlooked if only the returns are examined. While finding a very principled classification can be quite challenging, basing the classification entirely on expected returns is too simplistic/crude an approximation and is unlikely to be future-proof (e.g., as we begin to study more challenging and complex tasks).

*(2) The authors attempt to formalize the principle of plasticity and stability, the main intuition behind the performance predictions for online fine-tuning algorithms, with mathematical definitions, but these quantities are never evaluated in the empirical study.* The paper makes three formal definitions: (1) prior knowledge, (2) stability, and (3) plasticity. Prior knowledge is the larger of the pre-trained policy's return and the behavior policy's return; stability and plasticity measure how much the performance of the policy fluctuates over the course of training. In particular, the stability metric takes a minimum over the returns of $N$ policies over the course of online fine-tuning (at an unknown interval). This metric can be very sensitive to the choice of $N$, because it amounts to taking the minimum of $N$ random variables (the same holds for plasticity), which makes it hard to quantify to what degree an online fine-tuning algorithm exhibits stability vs. plasticity (a small numerical illustration of this sensitivity follows this review). Furthermore, I could not find any empirical values of these metrics in the paper to justify the definitions. Overall, without a more rigorous discussion of these definitions or empirical evidence that they correspond to the intuitions, the definitions in their current form add little value to the paper.

Questions:
1. What are the values of prior knowledge, stability, and plasticity for each class of algorithms (e.g., $\pi_0$-centric, data-centric)?
2. "We fixed a limited set of representative variants and hyperparameters across all settings": how were the hyperparameters chosen? For these analysis results, it is very important to properly tune all the algorithms in the study to mitigate hyperparameter bias. Otherwise, the results may simply be an artifact of a biased hyperparameter selection.

EditLens Prediction: Fully human-written
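A small numerical illustration of the sensitivity-to-$N$ point, under an assumed setup (Gaussian per-evaluation returns, not the paper's data): a metric that takes the minimum over $N$ evaluated checkpoints shrinks as $N$ grows, even when the underlying training process is unchanged.

```python
import numpy as np

# Assumed setup: returns fluctuate around a fixed mean with Gaussian noise.
rng = np.random.default_rng(0)
mean_return, noise_std, n_seeds = 80.0, 10.0, 1000

for n_evals in (5, 20, 100):
    # Simulate n_seeds fine-tuning runs, each evaluated at n_evals checkpoints.
    returns = rng.normal(mean_return, noise_std, size=(n_seeds, n_evals))
    # "Stability"-style statistic: minimum return over the run, averaged over seeds.
    min_return = returns.min(axis=1).mean()
    print(f"N = {n_evals:3d}: average min-over-N return = {min_return:.1f}")
```

With this toy setup the average minimum drops by more than ten return points as $N$ goes from 5 to 100, even though training is statistically identical, which is why the evaluation interval matters for the proposed metric.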