Design Principles for TD-based Multi-Policy MORL in Infinite Horizons
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary:
The paper studies multi-objective RL in the infinite-horizon regime and argues that TD-style, online updates can support learning and executing a set of Pareto policies when paired with specific design prescriptions. The method is trajectory-centric: it attaches color labels to transitions, uses a Policy-Transition Table to keep the agent on a chosen Pareto policy, handles both stationary and non-stationary behaviors, normalizes away the length/frequency biases that cause spurious dominance, and detects and encapsulates cycles so that undiscounted averages remain meaningful. Experiments are ablations on DeepSeaTreasure (MO-Gym) that isolate the effect of each ingredient rather than comparisons against external baselines.
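To make the spurious-dominance point concrete, here is a minimal sketch (my own illustration with hypothetical numbers, not the authors' code) of how Pareto comparisons over raw cumulative returns of trajectories with different lengths can differ from comparisons over per-step averages; the paper's normalization appears to target exactly this kind of artifact.

```python
import numpy as np

def dominates(u, v):
    """True if return vector u Pareto-dominates v (maximization):
    at least as good in every objective and strictly better in at least one."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return bool(np.all(u >= v) and np.any(u > v))

# Two hypothetical undiscounted trajectories of different lengths
# (objective 0: treasure value, objective 1: accumulated time penalty).
returns_short, len_short = np.array([10.0, -4.0]), 4
returns_long,  len_long  = np.array([12.0, -20.0]), 20

# On raw cumulative returns the two look like a genuine trade-off ...
print(dominates(returns_short, returns_long))   # False
print(dominates(returns_long,  returns_short))  # False

# ... but on per-step averages (length normalization) the short trajectory
# dominates, so the unnormalized comparison kept a point that the
# normalized comparison filters out.
print(dominates(returns_short / len_short, returns_long / len_long))  # True
```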
Strengths:
1. The paper identifies the missing link between deep supervised MORL and TD-based online methods and proposes a structured framework to bridge them.
2. The trajectory-centric policy-following mechanism offers a concrete and interpretable way to maintain consistent Pareto policies during learning.
3. Integrating stationary and non-stationary policy handling allows comprehensive Pareto-front reconstruction under infinite-horizon settings.
4. The ablation analysis effectively isolates how each design choice (e.g., SSM, cycle detection) contributes to policy stability and learning reliability.
Weaknesses:
- The paper does not compare against established baselines such as PCN, MPQ-Learning, or Pareto-DQN, all of which include experiments on broader environments (e.g., Minecraft-based or continuous-control benchmarks).
- The evaluation is limited to the tabular DeepSeaTreasure environment, so scalability and generalization remain untested.
- The theoretical grounding of the proposed principles is mostly heuristic; there is no formal convergence or optimality analysis supporting the modifications.
- While the framework is conceptually interesting, it lacks both the theoretical justification and large-scale empirical evidence needed to confirm its practical advantage.
Questions:
1. The proposed principles are largely intuitive and heuristic. Can you clarify why they work in practice? In particular, could you provide theoretical analysis or empirical evidence beyond ablations that explains their actual effectiveness?
2. How do the bias-correction terms (trajectory length, reward frequency) theoretically influence convergence or Pareto coverage?
3. Compared to existing MORL baselines (PCN, MPQ-Learning), how do you expect the proposed framework to scale to larger or continuous domains such as Minecraft or complex control environments?
Fully AI-generated
Design Principles for TD-based Multi-Policy MORL in Infinite Horizons
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
MORL addresses multiple conflicting objectives by approximating the Pareto front (PF), the set of non-dominated policies representing different trade-offs. The authors critique existing methods: supervised approaches such as Pareto Conditioned Networks (PCNs) require costly retraining and curated data, limiting online adaptation, while TD-based methods are confined to tabular, episodic tasks. They propose a trajectory-centric framework using colour-labelled transitions, tables for policy tracking, and mechanisms to handle stationary/non-stationary policies, spurious domination, and cycles. Ablation studies on an adapted DeepSeaTreasure environment validate the principles, demonstrating improved policy diversity and reliability. The paper concludes by positioning these principles as a foundation for deep RL extensions. Appendices detail algorithms and complexity analysis.
Strengths:
1. The paper is structured as a series of "design principles," and the methodology attempts to solve each identified problem in sequence (e.g., policy tracking, spurious domination, cycle detection). This is a logical way to build a complex algorithm.
2. The ablation study (Section 4) shows contribution of each "design principle" to the final performance (e.g., AS1 shows the need for policy tracking, AS5 shows the need for cycle detection). This provides clear empirical justification for the framework's internal components.
Weaknesses:
1. Clarity and Presentation: The paper introduces a dense vocabulary of new terminology (e.g., "color-label," "swirl trajectory," "Policy-Transition Table (PTT)," "Stationary Segments Mapping (SSM)") without sufficient formal definition or high-level intuition. Section 3 is exceptionally difficult to follow, making the core methodology hard to assess, reproduce, or build upon.
2. Disconnect Between Motivation and Future Extension: The paper motivates itself as laying a "principled foundation for future deep-RL extensions." However, the entire methodology is tabular, and the experiments are confined to a single 2D gridworld. The paper makes no attempt to explain how complex, discrete structures like the PTT and SSM could ever be scaled to the high-dimensional, continuous state spaces required by deep RL. This makes the primary motivation feel unsupported.
3. Insufficient Comparison to Baselines: While the internal ablation study is useful, the paper fails to compare its final algorithm against MORL baselines. Even if SOTA deep methods are unsuitable, comparisons against adapted tabular methods (e.g., Pareto Q-Learning https://jmlr.org/papers/volume15/vanmoffaert14a/vanmoffaert14a.pdf) are necessary to benchmark the algorithm's actual performance, sample efficiency, and computational cost.
4. Missing Theoretical Foundations (Markov Chains): The paper's unconventional treatment of "cycles" and "swirls" to manage infinite-horizon policies lacks rigor. A rigorous analysis of infinite-horizon problems requires a connection to established Markov chain theory (e.g., ergodicity, aperiodicity). The paper avoids this, making it unclear whether the proposed cycle-detection and policy-tracking mechanisms are robust; the sketch after this list illustrates the standard treatment I have in mind.
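To illustrate that standard treatment (a minimal sketch under my own assumptions, not the paper's algorithm): a deterministic stationary policy in a finite MDP eventually enters a cycle, and its undiscounted average-reward vector is simply the mean reward vector around that cycle, which is what Markov chain theory for recurrent/unichain structure already provides.

```python
def average_reward_of_cycle(start, step):
    """Follow a deterministic map state -> (next_state, reward_vector) until a
    state repeats, then average the reward vectors over the detected cycle."""
    visited, rewards = {}, []
    s = start
    while s not in visited:
        visited[s] = len(rewards)
        s_next, r = step(s)
        rewards.append(r)
        s = s_next
    cycle = rewards[visited[s]:]          # reward vectors along the limit cycle
    n = len(cycle)
    return [sum(r[d] for r in cycle) / n for d in range(len(cycle[0]))]

# Hypothetical 3-state loop with 2-objective rewards (illustrative only):
transitions = {"A": ("B", (1.0, 0.0)), "B": ("C", (0.0, 2.0)), "C": ("A", (1.0, 1.0))}
print(average_reward_of_cycle("A", lambda s: transitions[s]))  # [0.666..., 1.0]
```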
Questions:
1. Regarding the "color-label" mechanism: how is the color-label c formally defined, generated, and stored? Is c a discrete integer? Does the space of c grow unboundedly as new trajectories and segments are discovered?
2. The main premise is to build a foundation for deep RL. What is the explicit proposed path for scaling the PTT and SSM structures? Would this require approximating these discrete, graph-like tables with a GNN or a similar architecture? How would the "color-label" concept translate to a continuous state space?
3. The distinction between "cycles" (stationary) and "swirls" (non-stationary) is central. Is a "swirl" (Fig 2a) simply a non-stationary policy that revisits states? How does the "implicit" temporal encoding (Fig 2b) offer a concrete advantage over a standard formulation that includes a time-step t or history h as part of the state? (A sketch after these questions illustrates the formulation I mean.)
4. The paper states the environment (DeepSeaTreasure) was "adapted to an infinite-horizon setting (agent goes to the initial state after the end of an episode)." This sounds like an episodic task that is simply reset, not a true infinite-horizon, continuing task. Can the authors clarify this? If the task is truly episodic, it may undermine the paper's entire motivation about infinite-horizons.
5. The algorithmic complexity for stationary policies in Appendix A.3 is much higher than that for non-stationary policies, yet the authors state that "they may be faster to learn" (lines 163–164). This seems contradictory.
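On question 3, a minimal sketch of the standard formulation I am comparing against (my own construction, not the paper's mechanism): a behaviour that acts differently on repeated visits to the same state is non-stationary over raw states, but it becomes an ordinary stationary policy once the state is augmented with a visit counter or time index.

```python
# Hypothetical tabular policy keyed by an augmented state (state, visit_count):
# it takes a different action on the second visit to "s1", yet it is stationary
# with respect to the augmented state.
table = {("s1", 0): "right", ("s1", 1): "down"}

def augmented_policy(state, visit_count):
    return table[(state, visit_count)]

print(augmented_policy("s1", 0), augmented_policy("s1", 1))  # right down
```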
Lightly AI-edited
Design Principles for TD-based Multi-Policy MORL in Infinite Horizons
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary:
This paper presents a trajectory-centric framework for Temporal-Difference (TD)-based multi-policy MORL that aims to achieve stable and interpretable behavior in infinite-horizon settings.
Unlike prior deep multi-policy approaches such as Pareto Conditioned Networks (PCN), which depend on supervised retraining and curated data, the proposed framework incrementally learns multiple Pareto-optimal policies through TD updates.
The authors introduce a set of design principles—including trajectory-level policy tracking, unification of stationary and non-stationary policies, removal of spurious domination, and explicit cycle detection—to ensure reliable policy following and meaningful undiscounted returns.
Through eight ablation studies on the DeepSeaTreasure benchmark, each design element’s contribution to policy stability, interpretability, and diversity is analyzed.
The work positions itself as a conceptual and algorithmic foundation for future TD-based extensions to deep multi-objective reinforcement learning.
Strengths:
- The paper clearly articulates the gap between supervised deep MORL (e.g., PCN) and online TD-based methods and proposes a principled bridge through trajectory-centric design.
- The trajectory-level policy-following mechanism (color labeling and Policy-Transition Table) provides a novel and interpretable way to maintain consistency across multiple Pareto-optimal policies.
- The framework’s unified handling of stationary and non-stationary policies offers full Pareto-front recovery while maintaining predictable policy behavior.
- The ablation results convincingly demonstrate the functional contribution of key components such as spurious domination correction, SSM, and cycle detection to stability and interpretability.
Weaknesses:
- The paper does not compare against established baselines such as PCN, MPQ-Learning, or Pareto-DQN, all of which include experiments on broader environments (e.g., Minecraft-based or continuous-control benchmarks).
- The evaluation is limited to the tabular DeepSeaTreasure environment, so scalability and generalization remain untested.
- The theoretical grounding of the proposed principles is mostly heuristic; there is no formal convergence or optimality analysis supporting the modifications.
- Experimental justification is largely qualitative, relying on hypervolume and visual coverage metrics rather than statistical performance comparisons (see the sketch after this list).
- Algorithmic structure is complex (color labeling, PTT, SSM, cycle detection), potentially limiting reproducibility and computational efficiency.
- As a result, while the framework is conceptually interesting, it lacks both the theoretical justification and large-scale empirical evidence needed to confirm its practical advantage.
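On the hypervolume point above, a minimal sketch under my own assumptions (two objectives, maximization, a fixed reference point, illustrative numbers) of what the metric measures; because it collapses front coverage into a single area per run, it needs to be paired with seed-level statistics to support performance claims.

```python
def hypervolume_2d(front, ref):
    """Area dominated by a 2-objective maximization front with respect to a
    reference point `ref`; assumes `front` contains only non-dominated points."""
    hv, prev_x = 0.0, ref[0]
    for x, y in sorted(front):            # ascending obj 0 -> descending obj 1
        hv += (x - prev_x) * (y - ref[1])
        prev_x = x
    return hv

# Purely illustrative fronts (treasure value, negative time), shared reference:
full_front    = [(1.0, -1.0), (8.2, -3.0), (11.5, -5.0)]
partial_front = [(1.0, -1.0), (11.5, -5.0)]
ref = (0.0, -25.0)
print(round(hypervolume_2d(full_front, ref), 3),
      round(hypervolume_2d(partial_front, ref), 3))  # 248.4 234.0
```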
Questions:
The proposed framework introduces several heuristic yet intuitively reasonable design principles (e.g., trajectory coloring, stationary-segment mapping, and spurious domination correction).
While the motivation behind each component is clear, it remains uncertain why these heuristics consistently lead to better learning dynamics.
Could the authors provide theoretical justification or empirical evidence, beyond ablation comparisons, that explains why these mechanisms are effective or under what conditions they provably improve convergence or Pareto-front coverage?
Fully AI-generated
Design Principles for TD-based Multi-Policy MORL in Infinite Horizons
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
This paper presents a framework for temporal-difference (TD)-based multi-policy multi-objective reinforcement learning (MORL) in the infinite-horizon setting.
The authors propose a color-coded representation of policies and trajectories, introduce mechanisms for policy tracking and stationary segment mapping, and formalize several “design principles” (Sections 3.1–3.6) aimed at preventing issues such as spurious domination and non-stationary credit assignment.
The framework is implemented in a tabular DeepSea Treasure environment, and eight ablation studies (AS1–AS8) are reported to justify individual design components.
Strengths:
1. The notion of mapping stationary segments and detecting cycles to handle infinite horizons is conceptually novel and mathematically consistent.
2. Despite being primarily theoretical, the paper supports every design element with an ablation (AS1–AS8), providing empirical intuition for why each component matters.
Weaknesses:
1. The framework's setting (colored trajectories, segments, and swirls) is unconventional and somewhat under-explained.
A concise motivating example illustrating these concepts would make the paper much more approachable.
2. The framework remains fully tabular and is evaluated only on DeepSea Treasure.
Without evidence or discussion of scalability, it is difficult to assess real-world practicality.
3. The paper devotes substantial space to figures and tables, leaving limited room for interpretation.
For instance, Table 1 lists all ablations but uses many internal terms unfamiliar to newcomers; more textual reasoning or summary commentary would be preferable.
4. The Appendix follows immediately after References (page 11) without a clear break.
Moving large tables (e.g., Table 1) to the appendix and separating these sections with \section*{Appendix} would improve readability.
Questions:
1. Could the authors provide a small running example early in Section 3 to clarify the role of colors and how they differ from conventional policy identifiers?
2. How might the proposed tabular mechanisms (e.g., PTT, SSM) extend to function approximation or actor-critic settings?
3. In the ablation results (Section 4), are the reported improvements statistically significant over multiple seeds?
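To clarify question 3, here is a sketch of the seed-level comparison I have in mind (the numbers are placeholders for illustration, not taken from the paper):

```python
import numpy as np

def bootstrap_mean_diff_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(a) - mean(b),
    e.g. per-seed hypervolumes of the full method vs. an ablated variant."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diffs = [rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
             for _ in range(n_boot)]
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

# Placeholder per-seed metric values (purely illustrative):
full_method = [0.92, 0.95, 0.90, 0.93, 0.94]
ablation    = [0.84, 0.88, 0.86, 0.83, 0.87]
lo, hi = bootstrap_mean_diff_ci(full_method, ablation)
print(f"95% CI for the mean difference: [{lo:.3f}, {hi:.3f}]")  # excludes 0
```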
Moderately AI-edited |