PRPO: Collaborative Online Policy Learning in Personalized RLHF
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
**Summary**
The paper proposes PRPO, a federated approach to online RL where agents aggregate reference policies instead of model weights. The idea is interesting (replace PPO's reference policy with a peer-averaged one), but the execution has serious problems: the theory doesn't match what's actually implemented, a critical piece of related work (FedRLHF) is missing entirely, and the "personalization" claims aren't backed up by the experiments. While I appreciate the scope of the experimental work, the foundational issues need substantial fixing before this can be published.
**Strengths**
- The problem is important and timely. Privacy-preserving personalized LLM training is something people care about.
- You've put together a substantial experimental evaluation spanning multiple domains. The classical RL experiments (even if I think they don't fit the narrative) and the RLHF experiments represent real work.
- The implementation seems solid. Detailed hyperparameter tables with multiple random seeds. This is good practice.
- The framework generalizes across TRPO/PPO/GRPO, which is nice.
- I appreciate the different communicator designs you explore (uniform, reward-based, similarity-based).
**Weaknesses**
### Major Issues
**1. Theory-experiment mismatch (this is the biggest problem)**
Your problem formulation (Eq. 2) optimizes a single policy $\pi$ on different initial state distributions: $f_i(\pi) = \mathbb{E}[f(s_0, \pi)]$ where $s_0 \sim \zeta_i$. But your algorithm (Eqs. 4, 7, 9a) optimizes different policies $\{\pi_{\theta_i}\}$ for each agent.
These seem to be contradictory. If the true goal is a single shared policy, why not just use standard federated averaging? If the goal is different personalized policies, why aggregate them at all? Aggregating personalized policies seems counterproductive to personalization. You need to pick one formulation and stick with it, and provide proper justification.
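To spell out the mismatch (under my reading that the global objective in Eq. 2 averages the per-agent objectives $f_i$), the two problems being conflated are
$$\max_{\pi}\; \frac{1}{n}\sum_{i=1}^{n} f_i(\pi) \qquad \text{vs.} \qquad \max_{\{\pi_{\theta_i}\}_{i=1}^{n}}\; \frac{1}{n}\sum_{i=1}^{n} f_i(\pi_{\theta_i}).$$
They have different optima whenever the $f_i$ favor different policies, which is exactly the heterogeneous setting the paper targets, so an analysis of one does not automatically justify the other.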
**2. Missing FedRLHF baseline**
FedRLHF (Fan et al., AAMAS 2025) addresses essentially the same problem:
- Federated, privacy-preserving
- Personalized RLHF
- With convergence guarantees
- Sample complexity showing linear speedup
The paper appeared in December 2024, well before the ICLR submission deadline. Not citing or comparing to this work is a critical oversight that immediately calls into question your novelty claims.
**3. Observation 1 is impractical**:
Equation (12) says $\varepsilon \le (\delta - \gamma\lVert \pi^{(k-1)} - \pi^{(k)} \rVert_\infty)/2$. But to compute this bound at step $k$, you need to know $\lVert \pi^{(k-1)} - \pi^{(k)} \rVert_\infty$, which is the quantity you're trying to control! Looking at your experiments (Table 2), you just use fixed "communication penalty coeff" values from a grid search. So Observation 1 doesn't actually inform your implementation at all. Why include it?
**4. Theorem C.1 doesn't support your claims**
The theorem assumes:
- Convexity (Assumption 1)
- Uniform averaging ($c_{ij} = 1/n$ for all $i, j$)
- Bounded gradients
Your experiments use:
- Neural networks (non-convex)
- Self-preferred averaging with $p \in \{0.2, 0.6, 1.0\}$
- No gradient bounds
More importantly, the convergence rate you prove matches standard single-agent mirror descent. You're not showing that collaboration provides any benefit—only that it doesn't break things. That's a much weaker claim than what the introduction suggests.
**5. "Personalization" is claimed but not demonstrated**
All your RLHF agents use the same reward model trained on tldr-preference. The pTLDR "heterogeneous" setup only varies the data (different subreddits)—not the preferences or objectives.
Data heterogeneity ≠ preference heterogeneity.
For real personalization, you'd need different reward models. Agent 1 might prefer concise summaries, Agent 2 might prefer detailed ones, Agent 3 might want humor, etc. You never test this scenario.
The improvements in Table 1 could easily be explained by:
- Regularization (averaging reduces overfitting)
- Variance reduction
- Better hyperparameters for PR-PPO vs baselines
- Implicit ensembling effects
Without measuring whether individual agents' policies remain distinct or serve their specific preferences, you can't claim you've demonstrated personalization.
**6. Policy aggregation vs weight aggregation is unjustified**
Your core novelty is aggregating policies ($\sum_j c_{ij}\,\pi_j(\cdot\mid s)$) instead of weights ($\sum_j c_{ij}\,\theta_j$). But you provide zero theoretical analysis of when/why this is better.
For linear policies $\pi_\theta(a\mid s) = \mathrm{softmax}(\theta^\top \phi(s,a))$, aggregating distributions is approximately equivalent to aggregating parameters when the per-agent parameters are close. For neural networks, the relationship is complex and you don't explore it.
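To make the contrast concrete, here is a toy construction of my own (not code from the paper): for a single state of a tabular softmax policy, compare the reference produced by weight-space averaging with the one produced by policy-space averaging as the agents' parameters drift apart.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4

# `scale` controls how far apart the per-agent logits are for one state.
for scale in [0.1, 1.0, 5.0]:
    theta = scale * rng.normal(size=(n_agents, n_actions))

    # FedAvg-style: average parameters, then map to a distribution.
    pi_weight_avg = softmax(theta.mean(axis=0))

    # PRPO-style (as I understand it): average the action distributions directly.
    pi_policy_avg = softmax(theta).mean(axis=0)

    gap = np.abs(pi_weight_avg - pi_policy_avg).max()
    print(f"scale={scale:>4}: max |difference| between the two references = {gap:.4f}")
```

For small dispersion the two references nearly coincide, consistent with the "approximately equivalent" remark above, but nothing in the paper characterizes the analogous gap for deep policies or relates it to the reported gains over FedAvg.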
The experimental comparison to FedAvg isn't obviously fair either. Table 2 shows different hyperparameter grids for different methods. Did you match the total search budget? How do we know FedAvg isn't just under-tuned?
### Moderate Issues
**7. Literature review gaps**
Beyond FedRLHF, you cite various multi-agent RLHF (MARLHF) works but don't properly discuss them.
Your related work claims "none apply RLHF techniques" and "no work addresses online RL in federated setting." Both claims are inaccurate given FedRLHF and the MARLHF works just mentioned.
**8. Classical RL experiments don't fit the narrative**
In MiniGrid and Atari, all agents solve identical tasks (same FourRooms environment, same BeamRider game). How does this relate to personalization?
If agents have the same objective, why would policy aggregation help? Maybe variance reduction? Exploration diversity? You don't analyze or explain this. The experiments are fine, but they don't support your main story about personalization.
**9. Evaluation methodology concerns**
- LLM-as-judge has known biases (length, verbosity, style)
- No human evaluation baseline
- No statistical significance testing
- No inter-rater reliability metrics
- EVP aggregates over hyperparameter searches with different search spaces (Table 2), making comparison potentially unfair
**10. The reward model defeats privacy claims**
You train a single reward model on centralized tldr-preference data. If you have privacy constraints preventing you from centralizing training data, wouldn't you also have constraints on centralizing preference data for the reward model?
True federated RLHF should involve learning personalized reward models locally (See FedRLHF, Fan et al. AAMAS 2025). That's the real challenge, and you don't address it.
**Questions**
1. **Problem formulation**: Equation (2) optimizes a single policy $\pi$, but Equations (4)/(7)/(9a) optimize different policies $\{\pi_i\}$. Which is correct? How do you reconcile this?
2. **FedRLHF**: Why isn't FedRLHF (arXiv:2412.15538) cited or compared? How does your approach differ?
3. **Observation 1 implementation**: How do you implement equation (12) given its dependency on $\lVert \pi^{(k-1)} - \pi^{(k)} \rVert_\infty$? Why do experiments use fixed penalty coefficients?
4. **Personalization**: Can you provide experiments where agents have genuinely different objectives (different reward models)? How do you measure whether policies remain personalized?
5. **Theorem C.1 applicability**: Your theorem assumes convexity and uniform averaging, but experiments use neural networks and self-preferred averaging. How does the theorem inform your work?
Fully AI-generated
---
PRPO: Collaborative Online Policy Learning in Personalized RLHF
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
**Summary**
This paper introduces Peer-Referenced Policy Optimization (PRPO), which turns the KL regularizer in proximal policy methods into a communication channel: each client updates its policy against a composite reference policy formed by weighted averaging of peers' action distributions, rather than against the client's own old policy as in standard proximal policy algorithms. The authors apply the peer-referenced KL to PPO and GRPO, and explore several non-expansive communication operators, including uniform averaging, self-preferred averaging, policy-similarity weighting, and reward-weighted averaging. Empirical results show the effectiveness of PRPO over isolated training and FedAvg on both classical RL and RLHF tasks. The paper provides theoretical intuition through a mirror-descent-style analysis, showing that if the communication operator is non-expansive, PRPO preserves the trust-region guarantees of TRPO and thus admits convergence under restricted conditions.
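To make the mechanism concrete, here is my own minimal sketch of the peer-referenced penalty as I understand it (not the authors' code; the tensor shapes, the direction of the KL, and the coefficient `beta` are my assumptions):

```python
import torch
import torch.nn.functional as F

def peer_reference(peer_logits, weights):
    """Mix peers' action distributions with a row of mixing weights.

    peer_logits: [n_agents, batch, n_actions] logits from each peer's policy head.
    weights:     [n_agents] mixing weights for this client (non-negative, sum to 1).
    Returns the composite reference distribution of shape [batch, n_actions].
    """
    peer_probs = F.softmax(peer_logits, dim=-1)            # distribution-level sharing
    return torch.einsum("j,jba->ba", weights, peer_probs)  # weighted average over peers

def pr_kl_penalty(client_logits, peer_logits, weights, beta=0.05):
    """KL(client || peer-averaged reference), replacing the usual KL to the old policy."""
    ref = peer_reference(peer_logits, weights)
    logp = F.log_softmax(client_logits, dim=-1)
    kl = (logp.exp() * (logp - torch.log(ref + 1e-8))).sum(-1)
    return beta * kl.mean()

# Toy usage: 3 agents, a batch of 2 states, 4 actions, uniform communicator row.
peer_logits = torch.randn(3, 2, 4)
client_logits = torch.randn(2, 4, requires_grad=True)
loss = pr_kl_penalty(client_logits, peer_logits, weights=torch.full((3,), 1 / 3))
loss.backward()  # in PPO/GRPO this term would be added to the clipped surrogate loss
print(loss.item())
```

If my reading is right, the row of `weights` comes from the chosen communicator (uniform, self-preferred, similarity-based, or reward-weighted), and the non-expansiveness condition in the analysis is a property of that mixing matrix.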
**Strengths**
1. The idea of recasting the KL penalty term as a coordination mechanism is simple yet effective, and can serve as a plug-and-play generalization of PPO-type algorithms to collaborative settings.
2. The authors provide detailed experimental settings and a reproducible codebase.
**Weaknesses**
1. The convergence proof in Appendix C assumes convex, L-smooth objectives with bounded gradients, which generally does not hold for RLHF with neural policies. Moreover, the theorem is proved only for the uniform-averaging communicator $C=(c_{ij})$ with $c_{ij}=\frac{1}{n}$, neglecting the other, more practical adaptive communicators (e.g., similarity-based or reward-weighted) used in the experiments. It is unclear whether the guarantees extend to those practical communicators.
2. LoRA adapters and distribution-level policy sharing are used for privacy preservation. However, it is questionable whether privacy can indeed be guaranteed, as the paper provides no formal privacy analysis, especially considering prior work on extraction and inversion attacks [1,2].
3. Most plots report average performance across agents, making it unclear whether every user benefits or whether some suffer negative transfer under a weighted reference policy.
4. The similarity-based communicator and reward‑based communicator are used in classical RL experiments, but not evaluated in the RLHF experiments, where only the self‑preferred mean aggregation is used.
5. The RLHF evaluation is limited to TL;DR summarization with a single base model (i.e., `Mistral-7B`). Robustness would be more convincing with additional tasks (e.g., dialogue such as UltraFeedback) and alternative/larger base models.
6. Some important related works that handle preference heterogeneity at the reward‑model level are missing [3,4].
[1] Carlini, Nicholas, et al. "Extracting training data from large language models." 30th USENIX security symposium (USENIX Security 21). 2021.
[2] Petrov, Ivo, et al. "Dager: Exact gradient inversion for large language models." Advances in Neural Information Processing Systems 37 (2024): 87801-87830.
[3] Park, Chanwoo, et al. "RLHF from heterogeneous feedback via personalization and preference aggregation." arXiv preprint arXiv:2405.00254 (2024).
[4] Liu, Renpu, et al. "A Shared Low-Rank Adaptation Approach to Personalized RLHF." arXiv preprint arXiv:2503.19201 (2025).
**Questions**
Please see the weakness section. In addition, I have the following questions:
1. Given multiple plausible communicators, how should practitioners select or adapt the communicator for a given task? Do you have criteria or an automated procedure for this choice?
2. How would PRPO behave when agents have different or even conflicting reward functions (e.g., divergent user preferences over the same prompt in RLHF)? As the current communicators perform linear averaging, could cross-agent influence harm some agents' local policies? If so, can PRPO detect or mitigate this?
Fully human-written
---
PRPO: Collaborative Online Policy Learning in Personalized RLHF
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
**Summary**
This paper studies personalization for online RLHF with heterogeneous agents. It proposes Peer-Referenced Policy Optimization (PRPO), which allows each client to perform on-policy RL updates while being regularized toward a peer-aggregated reference policy constructed via policy distribution communication, instead of relying only on its own past policy. The method is evaluated in both standard RL control tasks and multi-agent RLHF summarization.
**Strengths**
The main advantages are: (1) the problem formulation (decentralized, privacy-constrained online RLHF with heterogeneous clients) is largely unexplored compared to prior work that treats federated alignment as offline supervised fine-tuning or simple weight sharing; and (2) the proposed peer-referenced policy optimization algorithm is itself new, in that it uses a communicated, aggregated reference policy at the distribution level to shape each client's on-policy update.
**Weaknesses**
1. The problem formulation in Section 3.1 does not seem fully aligned with the stated motivation. The motivation is for multiple users to personalize LLMs with online reinforcement learning under privacy constraints on communication, but the formulation assumes one fixed reward for all users, whereas in practice different users would likely have different reward models.
2. The related work section lists connections to federated LLM fine-tuning, federated RLHF, and multi-agent RL, but it does not clearly explain why these directions are directly relevant to the proposed setting before emphasizing differences.
3. The classical RL experiments use three agents but do not clearly establish true heterogeneity between them, so it is unclear whether this setup actually reflects the “heterogeneous clients” described in the motivation.
**Questions**
1. Can the authors clarify how the single shared reward in Section 3.1 matches the goal of per-user personalized alignment? Is this addressed by the RLHF experiment design?
2. If multi-agent RL is considered closely related, why are the experimental comparisons limited to federated RL baselines and not MARL baselines? If communication is the main link, does the proposed communication operator directly draw from any existing MARL techniques?
3. In the GRPO-based methods, many recent GRPO-style approaches appear to downweight or remove the KL-divergence penalty in practice. If the KL term is not essential, could the authors report RLHF results for a GRPO variant without the KL penalty (e.g., “Dr. GRPO”) to isolate the effect of peer-referenced KL regularization?
4. For the “isolated” PPO / GRPO baselines, are these models trained separately per agent on that agent's own local data, or is there a single global model trained on all data pooled together?
Lightly AI-edited
---
PRPO: Collaborative Online Policy Learning in Personalized RLHF
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
**Summary**
This paper proposes Peer-Referenced Policy Optimization (PRPO), a distributed reinforcement learning (RL) framework applied to both classical RL tasks and large language model (LLM) personalization via reinforcement learning from human feedback (RLHF). The authors introduce policy-based communication protocols, namely PR-PPO, PR-GRPO, and PR-TRPO, which enable multiple agents to update their local policies using shared peer information without directly sharing private data. The main claims are that PRPO enables effective collaboration while preserving data privacy and that the resulting policies outperform standard baselines such as PPO, even in federated RLHF settings. Experiments cover classic continuous-control benchmarks and a summarization task using LLMs, with performance gains reported under decentralized communication.
**Strengths**
The policy-based communication scheme appears to be an original and promising alternative to weight-based federated learning, especially for RL settings where local policy distributions can capture useful structural information. The experimental results suggest performance improvements in both control and RLHF tasks, and the availability of open-source code enhances the reproducibility of the findings. The theoretical framework is grounded in existing policy optimization methods and offers generalization across TRPO, PPO, and GRPO variants.
**Weaknesses**
Several conceptual and applied weaknesses limit the impact of this work. The privacy assumptions and models are not well-defined: while communicating policies avoids sharing raw data, it is unclear whether this is preferable to sharing low-rank weight updates, such as LoRA adapters. The experimental evaluation lacks analysis of communication overhead, scalability, and stability in larger multi-agent systems. The claims about LLM personalization are underdeveloped: the paper does not address how user-specific preference data is incorporated or whether federated RLHF aligns with personalization goals. There are multiple notational inconsistencies and typos that detract from clarity, and the lack of clear discussion of key points (e.g., convergence) in the main text makes the work feel incomplete.
**Questions**
What is the precise privacy threat model being addressed in PRPO? How does sharing policies differ from sharing model weights with regard to data leakage or reconstruction risks, especially in the LLM case?
How is it possible for PR-PPO or PR-GRPO to outperform centralized PPO in RLHF tasks? Were the total number of training rounds or samples equivalent? Additional details would clarify whether the comparison is fair.
Can the authors provide more detail on the communication cost and memory footprint of PRPO in multi-agent LLM scenarios, particularly under scaling to dozens or hundreds of agents?
Since personalization is a key application claim, how does the method incorporate user-specific reward structures or feedback? Would aggregating in the policy space dilute personalization, especially in diverse preference settings?
Why do the RLHF experiments only use a single communication protocol? Would the other peer-referenced protocols satisfy the right-stochastic condition and be viable in the same scenario?
Fully AI-generated |