ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated   | 1 (33%)  | 6.00       | 3.00           | 2081
Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A
Moderately AI-edited | 0 (0%)   | N/A        | N/A            | N/A
Lightly AI-edited    | 0 (0%)   | N/A        | N/A            | N/A
Fully human-written  | 2 (67%)  | 4.00       | 3.50           | 2220
Total                | 3 (100%) | 4.67       | 3.33           | 2174
Title | Ratings | Review Text | EditLens Prediction
Title: Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes a method for style alignment in the offline RL setting using implicit Q-learning and advantage-weighted regression. Styles are defined using hard-coded functions, which are then used as a reward to learn a style value function. This value function is combined with a task value function (independent of style) to train a style-aligned policy. Experiments are conducted on the Circle and HalfCheetah environments, showing a significant performance advantage over baselines such as SORL and SCBC. Ablation experiments demonstrate how different temperature parameters prioritize task performance versus style alignment.

Strengths:
* The proposed solution is quite simple and sound.
* The effectiveness of the proposed method is clear.

Weaknesses:
* I think the presentation could be improved if the authors moved some of the plots in the appendix to the main paper.
* Some details in the method could be better explained.
* I find the need to tune the temperature parameter, and its sensitivity, a downside of the proposed method.

Questions:
* In (12), can you add a textual explanation of the equation? Is the gating saying that if the style advantage is high enough that the sigmoid output is 1, then you can incorporate the task advantage? In theory the advantage function at optimality is zero ($\max_{a}Q(s, a) = V(s)$), so the sigmoid output is 0.5 and you are still using a small weight on the task reward advantage (a numeric sketch of this point follows the review).
* In the results, you did not include an in-depth explanation of the different datasets. Can you explain how you expect the method to behave differently on different datasets? From Table 1, it looks like halfcheetah-vary performs worse under the baseline methods than the other halfcheetah datasets. Why?
* Can you comment on the sensitivity of the temperature parameters?
* I would suggest moving some of the plots in the appendix to the main paper so that readers understand what "style" means.
* (Minor) There are a lot of typos in the paper. Please fix them.

EditLens Prediction: Fully human-written
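To make the gating point above concrete, here is a minimal numeric sketch. The paper's actual equation (12) is not reproduced here; the weight form exp(beta_task * sigmoid(A_style / tau_style) * A_task) and the names gawr_weight, beta_task, and tau_style are assumptions made purely for illustration.

```python
import math

def gate(style_adv: float, tau_style: float = 1.0) -> float:
    """Sigmoid gate on the style advantage; note that sigmoid(0) = 0.5."""
    return 1.0 / (1.0 + math.exp(-style_adv / tau_style))

def gawr_weight(task_adv: float, style_adv: float,
                beta_task: float = 3.0, tau_style: float = 1.0) -> float:
    """Assumed gated-AWR weight: exp(beta_task * gate(A_style) * A_task)."""
    return math.exp(beta_task * gate(style_adv, tau_style) * task_adv)

print(gate(0.0))              # 0.5 -- the gate is half-open at style optimality
print(gawr_weight(1.0, 0.0))  # exp(3 * 0.5 * 1) ≈ 4.48
print(gawr_weight(1.0, 5.0))  # gate ≈ 0.99, so weight ≈ exp(3) ≈ 19.7
```

Under this assumed form, a zero style advantage does not close the gate; it halves the effective task temperature, which is exactly the reviewer's concern.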
Title: Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes SCIQL, an offline reinforcement learning algorithm designed to learn policies that optimize task reward while exhibiting specific behavioral styles. Building upon IQL, SCIQL extends it to the style-conditioned setting and introduces the GAWR mechanism to balance the two advantage terms.

Strengths:
The proposed GAWR mechanism and sub-trajectory labeling provide a simple yet effective way to integrate style supervision into offline RL. Empirical results on the Circle2D and HalfCheetah environments show that SCIQL consistently achieves higher style alignment scores than the baselines.

Weaknesses:
1. The problem formulation is conceptually unclear. If style alignment and task reward are inherently conflicting, the objective should be to balance the trade-off between the two. However, the current formulation seems to sacrifice task reward to increase style conformity, which raises the question of whether this trade-off is explicitly modeled.
2. Given that style alignment and task reward clearly conflict, as shown in Section 5.3, the evaluation might be better framed in a Pareto-optimality context rather than via single averaged metrics. Without such a discussion, it is difficult to interpret whether improving style alignment at the cost of lower reward constitutes genuine progress (a small sketch of such a Pareto comparison follows the review).
3. The paper defines style labels as discrete categories obtained via predefined labeling functions. Could the authors clarify why a discrete formulation was chosen over a continuous one? Using continuous representations might allow smoother interpolation between styles and potentially improve generalization to unseen or mixed style combinations.
4. The evaluation is restricted to the toy Circle2D and HalfCheetah environments, which are relatively simple and low-dimensional. It would strengthen the work to include results on more diverse environments, such as other MuJoCo or Atari tasks, or humanoid-style control domains where stylistic variations are more naturally expressed.
5. It would be valuable to assess whether the proposed method can extrapolate (or interpolate) to unseen style labels or novel combinations of style labels that were not encountered during training.

Questions:
1. Is z a multi-dimensional vector aggregating multiple criterion-specific labels, or a single discrete label? If it is the former, the description around lines 180-190 should be revised to clarify how multiple criterion labels are annotated and used in z.
2. Minor typos: line 453, "Twhile" -> "while"; line 169, the " is reversed.

EditLens Prediction: Fully human-written
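A minimal sketch of the Pareto framing suggested in weakness 2, assuming each method or temperature setting is summarized by a (task return, style alignment) pair; the pareto_front helper and all numbers below are illustrative placeholders, not results from the paper.

```python
def pareto_front(points):
    """Non-dominated (task_return, style_score) pairs; higher is better on both axes."""
    return [
        p for p in points
        if not any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
    ]

# Hypothetical (task return, style alignment) results for several settings.
results = {
    "tau=0.1": (820.0, 0.95),
    "tau=0.5": (910.0, 0.88),
    "tau=2.0": (980.0, 0.60),
    "baseline": (900.0, 0.55),
}
front = pareto_front(list(results.values()))
for name, point in results.items():
    print(name, point, "on front" if point in front else "dominated")
```

Reporting which configurations sit on this front, rather than a single averaged score, would make the reward/style trade-off explicit.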
Title: Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes a new view of the stylized policy learning problem as a generalization of goal-conditioned RL and introduces the SCIQL algorithm, which uses hindsight relabeling and a Gated Advantage Weighted Regression (GAWR) mechanism to optimize task performance.

Strengths:
This paper provides a unified formulation of behavioral style learning via programmatic sub-trajectory labeling, and introduces the SCIQL+GAWR framework that effectively balances style alignment and task performance in the offline RL setting.

Weaknesses:
The reliance on hand-crafted style labeling functions constrains scalability to more abstract or subtle styles, and may require domain expertise when applied to complex environments. The algorithmic pipeline is relatively intricate, increasing implementation burden, and evidence on large-scale real-world or high-dimensional robotic systems remains limited.

Questions:
* The proposed approach relies on hand-crafted sub-trajectory labeling functions; how scalable and generalizable is this design to tasks where styles are abstract, high-level, or difficult to encode programmatically? (A sketch of what such a labeling function might look like follows the review.)
* While the method demonstrates strong performance in simulated benchmarks, there is no evaluation on real-world systems or higher-dimensional robot control tasks. Can the authors comment on the expected practicality and robustness of SCIQL in real settings?
* The overall pipeline introduces multiple components and optimization stages; how sensitive is the method to hyperparameters, and can the authors provide an ablation isolating the contributions of each module to ensure that improvements are not due to increased model complexity?
* The approach assumes accurate style labels from the labeling functions. How does performance degrade under noisy or imperfect style annotations, and can the method handle ambiguous or overlapping style categories?
* The paper positions programmatic style labeling as scalable, but could the authors discuss potential avenues for extending the framework to automatically learn style representations, or to integrate human feedback when labeling heuristics are insufficient?

EditLens Prediction: Fully AI-generated
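For readers unfamiliar with what programmatic sub-trajectory labeling looks like in practice, here is a hedged sketch of the kind of hand-crafted rule being discussed; the function label_gait_style, the torso-height feature, and the thresholds are hypothetical and not taken from the paper.

```python
import numpy as np

def label_gait_style(sub_traj: np.ndarray, height_idx: int = 0,
                     crouch_thresh: float = 0.4, upright_thresh: float = 0.6) -> int:
    """Assign a discrete style label to a sub-trajectory of observations.

    sub_traj has shape (T, obs_dim); height_idx selects a torso-height
    coordinate. Returns 0 = "crouched", 1 = "normal", 2 = "upright" based on
    the mean height over the window. All thresholds are arbitrary placeholders.
    """
    mean_height = float(sub_traj[:, height_idx].mean())
    if mean_height < crouch_thresh:
        return 0
    if mean_height > upright_thresh:
        return 2
    return 1

# Example: label a 20-step window of stand-in observations.
window = np.random.uniform(0.3, 0.7, size=(20, 17))
print(label_gait_style(window))
```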