|
Unsupervised Behavioral Tokenization and Action Quantization via Maximum Entropy Mixture Policies with Minimum Entropy Components |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes an online unsupervised approach for action quantization. The approach has the potential to reduce the complexity of the action space by learning a relatively small but useful action set, so that downstream tasks can use this more compact action set to learn more efficiently. In the first step, the agent maximizes the entropy of a mixture policy while minimizing the entropy of the individual components of the mixture, which yields distinct but more focused components. In the second step, a discrete-action algorithm treats the learned, frozen components (or their modes) as actions and maximizes reward. On the theory side, the paper 1) shows that the learned action modes of a mixture policy with enough capacity are lossless (they cover the whole action space) in the discrete setting, and 2) derives gradient estimators for the unsupervised learning of mixture policies. Empirically, the paper illustrates the impact of the strength of the entropy regularization and of the capacity of the mixture, and investigates the performance of the approach in continuous-control tasks.
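For concreteness, my reading of the pretraining objective in the first step is roughly the following (my own notation, with $K$ components, mixture weights $w_{k,\theta}(s)$, and a trade-off coefficient $\beta$; the exact weighting and the use of discounting in the paper may differ):

$$
\max_{\theta}\ \mathbb{E}_{s \sim d_{\pi_{m,\theta}}}\!\left[\ \mathcal{H}\big(\pi_{m,\theta}(\cdot\mid s)\big) \;-\; \beta \sum_{k=1}^{K} w_{k,\theta}(s)\,\mathcal{H}\big(\pi_{k,\theta}(\cdot\mid s)\big)\right],
\qquad
\pi_{m,\theta}(a\mid s) \;=\; \sum_{k=1}^{K} w_{k,\theta}(s)\,\pi_{k,\theta}(a\mid s).
$$

If this reading is wrong, the rest of my comments should be interpreted accordingly.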
This paper has the following strengths:
1. Learning useful sub-policies in reinforcement learning (RL) is an important problem and has the potential of scaling up RL to solve more complex problems.
2. The proposed unsupervised learning objective for learning diverse mixture components is novel and demonstrated to be effective.
3. It provides useful theoretical characterizations of the proposed method. While the theorems do not appear to be difficult to prove technically, they are useful to describe some of the fundamental properties of learning the mixture policy with the unsupervised learning objective.
4. The paper is well-written. It’s easy to follow and to find specific details in the appendix.
Despite the strengths, the paper has some weaknesses to be addressed:
1. The benefit of the proposed approach is not well demonstrated. Since pretraining is performed to extract a behavioral prior, one would expect the paper to show that this approach improves sample efficiency over methods trained from scratch (SAC/PPO), but it does not.
2. The current empirical evaluation is quite limited. 1) Experiments are only performed in a subset of MuJoCo environments and a few other standalone tasks; it would be beneficial to increase the coverage of tested tasks, for example with more MuJoCo locomotion and navigation environments. 2) There is no comparison to the natural, trivial baseline of uniform quantization (a sketch of what I mean follows this list).
3. Some limitations of the proposed approach are not discussed: 1) One limitation of the unsupervised learning approach is that it requires a state-dependent action space to encourage meaningful learned components; if such “available action sets” are not available, the learned components will likely just scatter randomly across the action space. 2) Another limitation is that the paper does not investigate how the components could be fine-tuned in downstream tasks when the pretrained components are suboptimal or even limiting.
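For weakness 2.2, the baseline I have in mind is a fixed per-dimension uniform grid over the action bounds, which the same discrete controllers (DQN / discrete PPO) could then use directly. A minimal sketch, where the bin count and the product-grid construction are my own choices and not anything from the paper:

```python
import itertools
import numpy as np

def uniform_action_grid(low, high, bins_per_dim=5):
    """Fixed, state-independent quantization of a box action space.

    low, high: per-dimension action bounds (e.g., env.action_space.low / .high).
    Returns an array of shape (bins_per_dim ** action_dim, action_dim),
    one row per discrete action.
    """
    per_dim = [np.linspace(l, h, bins_per_dim) for l, h in zip(low, high)]
    return np.array(list(itertools.product(*per_dim)))

# Example: a 2D action space in [-1, 1]^2 quantized into 5**2 = 25 discrete actions.
grid = uniform_action_grid(np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
print(grid.shape)  # (25, 2)
```

Even if this baseline scales poorly with action dimension, it would show how much of the reported gain comes from the learned tokens rather than from discretization per se.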
Here are questions that might impact the rating:
1. The paper mentions the learned components generalize across environment layouts (Line 96). Could the authors clarify if the mentioned result is in Table 1? Further, could the authors clarify the observation spaces in pretraining and downstream training?
2. Could the authors provide learning curves for the experiment results in Section 4?
3. Could the authors provide comparisons to the trivial uniform quantization?
4. Is DQN or PPO used for each experiment? It is unclear from the discussion in Lines 408-409.
5. As mentioned in the appendix, the proof of Theorem 4 appears to be similar to that of a theorem in He et al. (2025). In addition, I also found some parallels between Section 3 and Section 4 of He et al. Could the authors clarify the differences between the two works and highlight the contributions of this paper?
Other minor suggestions:
1. Line 362: It’s confusing to see the acronym SD. Similarly, it might be better to spell out USD in the caption of Table 2.
2. Standard errors are not proper confidence intervals, so it is difficult to tell the statistical significance of the reported results in the tables. It would be better to report proper confidence intervals (a possible computation is sketched after this list). |
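To illustrate what I mean by the second point, something along these lines would already help (a minimal sketch; the per-seed returns, the 95% level, and the t-based interval are my own placeholder choices):

```python
import numpy as np
from scipy import stats

def mean_and_ci(returns_per_seed, confidence=0.95):
    """t-based confidence interval for the mean return across independent seeds."""
    x = np.asarray(returns_per_seed, dtype=float)
    mean = x.mean()
    sem = stats.sem(x)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, df=len(x) - 1)
    return mean, (mean - half_width, mean + half_width)

# Hypothetical final returns from 5 seeds.
print(mean_and_ci([312.0, 287.5, 330.1, 298.4, 305.9]))
```

With few seeds, the t-based (or a bootstrap) interval is considerably wider than mean plus/minus one standard error, which is exactly the point.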
Fully human-written |
|
Unsupervised Behavioral Tokenization and Action Quantization via Maximum Entropy Mixture Policies with Minimum Entropy Components |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors present an unsupervised method based on online RL to learn an efficient tokenisation of actions in continuous control problems, by learning a mixture policy that maximises the entropy of the overall action distribution while each sub-policy in the mixture minimises its own entropy. The authors then use their method to derive quantised actions that can then be optimised with an actor-critic algorithm to solve control problems more efficiently, and they show how the same action quantisation generalises to different tasks. They provide empirical results showing that the method is comparable to or improves over existing work.
- The problem of how to compress complex RL problems into interpretable, efficient sub policies or skills is very relevant.
- The solution proposed by the authors (the maxi-mix-mini-com objective) is interesting and I think unsupervised methods are a reasonable approach to solve this problem.
- The empirical evaluation and the baselines compared are sufficient to demonstrate the efficacy of the method.
- I am not completely convinced by the method as is; see the box below for my questions.
- The method seems to rely on quite a few heuristics, which are not always clearly explained. For example, the authors state in the limitations that they need to downscale the variance of the learned component policies a posteriori. It is not very clear how much engineering the method needs to work at scale.
I write the questions in order of appearance in the text (not in order of relevance).
1. What do “long-lived” and “short-lived” mean, intuitively and formally?
2. Would the objective in (2) be maximised by a uniform mixture of deterministic policies?
3. If this is the case (and please correct me if I'm wrong), I’m not sure why this requires learning. Couldn't you construct these policies online with zero learning cost? At each step, you could construct the sub-policies as deterministic (or very low variance) policies, chosen sequentially so that each new policy is sufficiently different from the already generated ones, and then mix them with equal probability.
4. Continuing along this line, what prevents the method from converging to trivial (useless) policies? How can you ensure that the learned sub-policies are of any interest? A trivial solution to the objective (if I understand correctly), with e.g. 2 policies, would be for each policy to pick one action deterministically and for the two to be mixed with equal weights, but this is not necessarily interesting in many cases (see the worked example after these questions).
5. Shouldn't the unsupervised ‘tokenisation’ be somehow linked to other policy metrics, such as obtained rewards? In line 193 you mention that the method leads to a compression of the actions, but can this lead to an ‘arbitrary’ compression that does not take rewards into account (and in fact it will not take them into account)? Is it possible that the learned quantisation completely stops the agent from being able to get high rewards? I can think of toy examples where this would happen, and I'm not convinced it does not happen in general. If this is correct, then perhaps the quantised policies work in the examples you tested only because the reward function is expressive enough. Please correct me if I misunderstood some part.
6. What is the intuition for the method doing much better than PPO? I would expect an unconstrained algorithm (in terms of the sub-policies) to still do better, but at a higher complexity cost. |
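To make questions 2-4 concrete, here is the toy calculation I have in mind, written for a discrete action space with at least two actions and assuming a per-state objective of the form "mixture entropy minus component entropies" (both assumptions are mine):

$$
\pi_1(\cdot\mid s) = \delta_{a_1(s)},\quad
\pi_2(\cdot\mid s) = \delta_{a_2(s)},\quad
a_1(s)\neq a_2(s),\qquad
\pi_m = \tfrac{1}{2}\pi_1 + \tfrac{1}{2}\pi_2
$$
$$
\Rightarrow\quad
\mathcal{H}\big(\pi_m(\cdot\mid s)\big) = \log 2 \ \ \text{(the maximum for two components)},\qquad
\mathcal{H}\big(\pi_k(\cdot\mid s)\big) = 0 \ \ \text{(the minimum)}.
$$

This is attained for any choice of the two distinct actions, which is why I worry that the maximiser is highly non-unique and reward-agnostic, unless something else (the discounted visitation distribution, state-dependent available action sets, or the variance floor in the continuous Gaussian case) breaks the tie.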
Fully human-written |
|
Unsupervised Behavioral Tokenization and Action Quantization via Maximum Entropy Mixture Policies with Minimum Entropy Components |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes an unsupervised method to quantize the action space to learn ‘tokens’, specialized one-step policies, which are used by downstream policies to tackle a wide range of tasks. The key idea behind the proposal is the maxi–mix–mini–com entropy objective: maximize the entropy of the mixture policy to ensure coverage, while minimizing the entropy of each component for specialization. A tractable iterative algorithm is proposed and theoretically shown to converge to a unique solution in the tabular setting. Empirical results indicate that the learned tokens are task-agnostic and can be used to achieve performance comparable to methods explicitly designed for single-task optimization.
The paper is well-motivated and clearly structured for the most part. The experiments in the tabular domain effectively illustrate the core idea, while those in continuous control environments demonstrate the scalability of the proposed approach. Furthermore, the supplementary videos help convey the qualitative behavior of the learned representations. The paper presents four theorems that substantiate its main claims and experimental findings. All four proofs appear to be sound. I was able to follow and did not identify any errors in the proofs of Theorems 1 and 2. The proofs of Theorems 3 and 4 are comparatively straightforward, as they build directly upon the results established in [1] and [2].
However, the paper has three issues:
An important advantage of unsupervised behavioral tokenization is its potential to improve sample efficiency in downstream tasks. As stated in Lines 42–43, “By focusing on core representative tokens, behavioral tokenization can improve sample efficiency, accelerate convergence, and avoid wasteful exploration of irrelevant continuous actions”. However, the paper reports only the final downstream performance after 3 million training steps, without intermediate evaluations. Presenting learning curves or periodic evaluations for the proposed method and the baselines would provide clearer empirical support for this claim.
The paper does not specify how the hyperparameters were chosen for both the proposed method and the baselines. Were defaults used? If any hyperparameters were tuned, did each method get a fair opportunity to tune the same number of hyperparameters? Were hyperparameters tuned across environments?
The reported results are based on only five random seeds. Strong claims cannot be made based on such a small number. You could either increase the number of seeds, and then justify why it is sufficient, or you could aggregate across environments and only make claims at the aggregate level. You could then just show individual runs per environment, to qualitatively show behavior.
Finally, though not strictly a weakness, it was unclear why the modes of the component Gaussian policies were used as discrete actions. To quote:
“Action space is quantized using the modes of the learned component Gaussians, a method that provides the first unsupervised online AQ algorithm. Next, state-dependent quantized actions are used as discrete actions in a DQN (Mnih et al., 2013) or discrete PPO (Schulman et al., 2017). The learned components and quantization are not allowed to change during the optimization of cumulative reward using the discrete controller”
Couldn’t the component policies themselves be considered the tokens, with DQN learning to take the action of sampling from one of the component policies (see the sketch below)? This is not obviously better, but it would be useful to discuss this choice more.
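To be explicit about the alternative I am asking about, the contrast is roughly the following (a minimal sketch; the component parameters, shapes, and names are placeholders, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen tokens: K Gaussian components, shown state-independent for
# brevity. Shapes: (K, action_dim).
component_means = np.array([[0.8, 0.0], [-0.8, 0.0], [0.0, 0.8], [0.0, -0.8]])
component_stds = 0.1 * np.ones_like(component_means)

def act_with_mode(k):
    """What the paper does (as I read it): discrete action k is executed as the
    mode (mean) of component k, so the token is deterministic downstream."""
    return component_means[k]

def act_by_sampling(k):
    """The alternative: discrete action k means 'sample from component k', so the
    token stays stochastic and its learned variance is still used."""
    return rng.normal(component_means[k], component_stds[k])

# Either mapping turns a K-way discrete choice from DQN / discrete PPO into a
# continuous action sent to the environment.
print(act_with_mode(2), act_by_sampling(2))
```

The sampling variant would keep some of the exploration encoded in the components, at the cost of a noisier value target for the discrete controller; a short discussion of this trade-off would strengthen the paper.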
(Putting Minor Points here, as there is no separate box. These are not major issues)
1. It was not immediately clear what the state- or action-centric perspective referred to. The concept only became clear after reading Section A.1 (Lines 799–809), which appears too late in the paper. I would suggest moving parts of this explanation to the introduction for better clarity and accessibility.
2. The term d_{\pi_{m,\theta}} is introduced in Theorem 3 but defined only later in Theorem 4.
3. Line 329: citation formatting should be corrected to “from Jang et al., 2016” (without brackets).
4. The references section requires proper formatting.
5. Line 1309 – It appears that M in \mathbb{R}^M has not been defined earlier. Based on the context, it seems to correspond to | \mathcal{A} |.
6. In Section A.5, the paragraph on “Downstream Tasks” is missing the word “Table” when referring to the results.
References:
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction (2nd edn.). Adaptive Computation and Machine Learning series, 2018.
[2] He, Jiamin, et al. "Investigating Mixture Policies in Entropy-Regularized Actor-Critic."
Questions, summarized from the above weaknesses.
1. Could you provide results on learning efficiency?
2. Could you explain how hyperparameters were chosen?
3. Could you justify claims based on the current number of seeds, or provide updated results to support claims?
4. Can you explain the choice of using modes of the Gaussian component policies?
My current scores reflect my uncertainty on these points. For example, I cannot assess soundness without understanding how the hyperparameters were chosen, so even though the theory is sound, I had to give a lower rating there. I will adjust this based on the responses to my questions. |
Fully human-written |
|
Unsupervised Behavioral Tokenization and Action Quantization via Maximum Entropy Mixture Policies with Minimum Entropy Components |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes an algorithm called maxi-mix-mini-com that can learn transferable behaviour tokens in an unsupervised manner. Maxi-mix-mini-com aims to maximise the discounted sum of future entropies of the mixture policy while minimising the entropy of each behaviour token. By cleverly introducing a new optimisation variable $r$, the authors present a provably convergent algorithm that can learn diverse behaviours. The authors also empirically demonstrate that, by learning a high-level policy over the learned behaviour tokens, they can easily obtain a high-performing policy, manifesting the reusability and diversity of the learned behaviour tokens.
The paper is well-written and easy to understand. In particular, the introduction section provides a great overview of existing research and effectively articulates the significance of their work, making it one of the best introductions I've recently read. Also, the proposed algorithm is backed by rigorous mathematical theorems and extensive empirical analysis on various benchmarks.
Although the paper is, in general, easy to follow, there are some parts that need further clarification.
1. Denoting the right-hand side of (4) by Q is a bit misleading, because technically speaking, it is not a Q function.
2. Please add a cross-reference of Figure 4 to Section 2.1.
3. Lines 315-316 are difficult to understand. How is $\pi_{k,\theta}(a\mid s)$ defined?
4. Figures 3(c) and 3(d) need more explanation. The action distribution would be different for each state.
I suppose training upon the learned behaviour tokens would drastically speed up the high-level training process. Could you provide learning curves for the **Transfer of learned behavioral tokens across tasks** experiments? |
Fully human-written |