MoVE: Mixture-of-Vocabulary-Experts for Improved Representation Learning
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
- The paper explores the problem of scaling vocabulary size in language models, though the focus is specifically on input-side vocabulary scaling for encoder-only architectures.
- The core idea is to use a mixture-of-experts (MoE) architecture, where each expert is assigned to a group of tokens that were originally more fragmented under a smaller vocabulary and are now merged into less-fragmented units in the newly expanded vocabulary.
- Empirically, MoVE demonstrates strong performance across the MTEB benchmark, showing that models trained with vocabularies scaled up to 500K tokens consistently outperform naive vocabulary scaling methods, achieving better representation quality and efficiency without additional inference cost.
Strengths:
1. The motivation and objectives of the paper are clearly articulated, with a well-balanced discussion of the trade-offs between smaller and larger vocabularies, as well as the training data and computational overhead that come with scaling vocabulary size.
2. Adapting Mixture-of-Experts (MoE) for vocabulary expansion: the paper adapts the MoE framework well to the problem of vocabulary scaling, with experts specialized for subsets of tokens, enabling efficient representation learning under large-vocabulary settings.
3. The paper conducts extensive experiments and ablation studies to rigorously examine the effectiveness and efficiency of the proposed approach.
Weaknesses:
1. While the paper's idea of grouping over-fragmented tokens from a smaller vocabulary into less-fragmented units under a larger vocabulary is technically sound, this strategy is not novel. It has been widely explored in prior vocabulary adaptation and expansion research. However, the paper does not adequately acknowledge or situate itself within this broader line of work.
Missing references:
- WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models (Minixhofer et al., 2022)
- FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models (Dobler et al., 2023)
- Adapters for Altering LLM Vocabularies: What Languages Benefit the Most? (Han et al., 2025)
Also, the related work section is relatively narrow, focusing mainly on encoder-only models (the paper's chosen setting, but the core idea of vocabulary scaling and adaptation extends beyond this architecture) and a few recent studies, while omitting earlier, conceptually related approaches to vocabulary transformation and transfer. These should be discussed to contextualize MoVE's contribution more comprehensively.
2. The reported effectiveness of the proposed method seems to be driven more by the Mixture-of-Experts (MoE) mechanism itself than by the vocabulary expansion strategy, as suggested by Table 3 under the same expert dimension.
Questions:
1. What is the exact expert dimension (d) used for the "corresponding" configurations in Figure 3 and Table 1?
2. In Table 1, the naive baseline consistently outperforms all other methods on the Clustering datasets. Could the authors elaborate on the possible reasons for this behavior?
Heavily AI-edited

MoVE: Mixture-of-Vocabulary-Experts for Improved Representation Learning
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
The paper investigates the problem of scaling input vocabularies for encoder-only text embedding models. To address this, the paper proposes Mixture-of-Vocabulary-Experts, which generates embeddings for a very large "routing vocabulary" ($V_l$) by compositionally combining tokens from a smaller, well-trained "base vocabulary" ($V_s$). The paper further conducts experimental analyses to demonstrate the effectiveness of the method.
Strengths:
S1: The paper focuses on vocabulary scaling for language models and adapts the n-gram mechanism used in decoder-only models into a routing-based aggregation mechanism for encoder-only models. This transformation enables hierarchical vocabulary embeddings, allowing the encoder-only model to capture and encode content information more effectively.
S2: The paper conducts extensive experimental analyses to validate the effectiveness of the MoVE method and performs ablation studies to elucidate the rationale and design choices behind MoVE components.
Weaknesses:
W1: The authors repeatedly mention that the long tail of rare tokens is often under-trained because such tokens receive far fewer gradient updates. The authors should therefore thoroughly analyze the training dynamics of different token embeddings, for example the differences in gradient updates or embedding norms between hot and cold tokens, to support this claim (a minimal sketch of one such diagnostic is given after this list). Figure 3(a) only shows that overall performance improves; it does not give an intuitive sense of how much the under-training problem itself is mitigated.
W2: The paper would benefit from broader experimental configurations and more in‑depth analyses. In particular, reporting a more precise scaling curve (with finer-grained points and controlled settings) would produce stronger, more actionable insights and better motivate follow-up research.
W3: Figure 3 suggests that vocabulary size and number of experts are two critical hyperparameters. The authors should focus on these and provide principled guidance for selecting them (e.g., trade-offs, heuristics, or rules of thumb). Such analysis would substantially increase the practical value of the proposed design.
W4: The authors performed ablation experiments on the effects of different tokenization algorithms. However, in Figure 4, the original mapping function (TokenMonster) and the BPE algorithm should be plotted together in a single chart; this would allow a more effective comparison of their performance differences and of the trends in vocabulary scaling. In addition, the authors should briefly summarize the differences between TokenMonster and conventional BPE tokenization to provide further insight.
W5: Some conclusions appear premature. For example, in Table 5 the introduction of MoE on top of MoVE leads to degraded performance—this is plausibly attributable to the increased parameter count without a corresponding increase in training budget, rather than an inherent incompatibility. A more careful analysis (controlling for parameter count and training resources) is needed before drawing firm conclusions.
W6: The routing function likely has a substantial influence on the embedding quality discussed in the paper. The authors should conduct additional analyses and experiments focused on routing (e.g., different routing strategies, routing sparsity, and their interaction with embeddings). Such work could yield important additional insights and strengthen the paper’s contributions.
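To make the diagnostic requested in W1 concrete, here is a minimal sketch of one possible analysis. It assumes access to the trained input-embedding matrix and corpus token frequencies; the function name, the PyTorch framing, and the 1% head/tail cutoff are the reviewer's own illustrative choices, not anything taken from the paper.

```python
import torch

def embedding_norm_gap(embeddings: torch.Tensor,
                       token_counts: torch.Tensor,
                       frac: float = 0.01):
    """Compare embedding norms of frequent ('hot') vs. rare ('cold') tokens.

    embeddings:   (V, d) trained input-embedding matrix.
    token_counts: (V,) token frequencies measured on the training corpus.
    frac:         fraction of the vocabulary treated as head / tail.
    """
    order = torch.argsort(token_counts, descending=True)  # most frequent first
    k = max(1, int(frac * len(order)))
    norms = embeddings.norm(dim=-1)
    hot_norm = norms[order[:k]].mean().item()    # head tokens
    cold_norm = norms[order[-k:]].mean().item()  # tail (likely under-trained) tokens
    return hot_norm, cold_norm, hot_norm / cold_norm
```

Tracking this ratio (or per-token gradient-update counts) across vocabulary sizes, with and without MoVE, would quantify how much the under-training problem itself is mitigated, rather than only the downstream gain shown in Figure 3(a).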
Questions:
Q1: Since each expert includes two matrix parameters and involves dense matrix operations, this cost should be taken into account. Are these parameters included in Table 2?
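As a rough illustration of the scale of this concern (using assumed, not reported, values): if each of $E$ experts holds two matrices of sizes $d \times d_e$ and $d_e \times d$, the extra parameter count is $2 E d d_e$; for instance, $E = 64$, $d = 768$, $d_e = 192$ would already add roughly 18.9M parameters on top of the embedding table.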
Lightly AI-edited

MoVE: Mixture-of-Vocabulary-Experts for Improved Representation Learning
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
This paper first empirically investigates the effect of vocabulary sizes on the massive text embedding benchmark (MTEB) using a DistilBERT-like architecture, showing that (i) optimal vocabulary size depends on the size of training data, and (ii) larger vocabulary sizes tend to yield better performance when sufficient training data is available. Building on these observations, this paper explores a method for using large vocabulary sizes without requiring large training data. To this end, the paper proposes MoE-based vocabulary handling, which uses hash-based routing and aggregates base-vocabulary token representations into a larger source-vocabulary representation using mean pooling. The results on MTEB show that vocabulary scaling with MoE is promising; it outperforms several state-of-the-art approaches and provides lower inference latency than the vanilla approach using the base vocabulary.
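As a concrete reading of this mechanism, a minimal sketch follows. The composition via mean pooling, the hash-based routing, and the two-matrix expert are taken from the paper's description as summarized above; all names, dimensions, and the residual form of the refinement are the reviewer's assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MoVEEmbedderSketch(nn.Module):
    """Illustrative sketch of MoVE-style embedding construction (not the authors' code)."""

    def __init__(self, base_vocab: int = 32_000, d_model: int = 768,
                 n_experts: int = 64, d_expert: int = 192):
        super().__init__()
        # Embeddings for the small, well-trained base vocabulary V_s.
        self.base_emb = nn.Embedding(base_vocab, d_model)
        # Each expert: a two-matrix (down/up) projection.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_expert, bias=False),
                          nn.Linear(d_expert, d_model, bias=False))
            for _ in range(n_experts)
        ])
        self.n_experts = n_experts

    def embed(self, large_token_id: int, base_piece_ids: torch.LongTensor) -> torch.Tensor:
        # 1) Compose: mean-pool the base-vocabulary embeddings of the token's pieces.
        pooled = self.base_emb(base_piece_ids).mean(dim=0)
        # 2) Route: hash the large-vocabulary token id to a single expert (no learned router).
        expert = self.experts[large_token_id % self.n_experts]
        # 3) Refine: expert transformation (the residual connection is an assumption).
        return pooled + expert(pooled)
```

Under this reading, the large routing vocabulary never gets its own free embedding table, which is presumably where the data efficiency relative to naive vocabulary scaling comes from.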
Strengths:
1. The paper is well-written and easy to read, supported by a well-motivated background on model scaling, vocabulary scaling, and data constraints.
2. The analysis is quite extensive, covering many aspects of the MoE design.
3. The proposed method is novel in that it incorporates the MoE design into the vocabulary (embedding) of an encoder-based model for the first time.
4. The proposed method exhibits consistent performance gains over strong baselines, along with its superior inference latency over the vanilla approach.
Weaknesses:
1. Some of the notation in Section 3.1 makes the method hard to follow. I'd suggest using different notations for tokens in $V_s$ and those in $V_l$. Alternatively, the paper could use Figure 2(a) more effectively to specify which part each sentence refers to.
2. Almost all experiments consider only BERT-4L; only Section 4.1 includes a larger BERT-12L result, and BERT-4L is used exclusively afterwards without justification. In particular, as the comparison against contemporary baselines in Table 1 is the core comparison in this paper, it should also be conducted with BERT-12L to demonstrate that the method's effectiveness is consistent across model scales. Furthermore, Figure 3a lacks BERT-12L results with varying vocabulary sizes, which prevents us from examining the effect of model scale.
3. The differences between SCONE and the proposed method in Table 1 seem quite marginal, with only a 0.35 and 0.27 difference on average for the 200k and 500k vocab settings, respectively. Are these differences statistically significant? Moreover, while the paper states that the proposed method “consistently achieves the best results” (L311), SCONE often achieves the best results in the 500k-vocab setup. This also warrants clarification.
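For reference, one way to check significance would be a paired test over per-task MTEB scores, along these lines (illustrative only; the reviewer does not have the per-task numbers, and the function name is hypothetical):

```python
from scipy import stats

def paired_test(move_scores, scone_scores):
    """Paired t-test over per-task scores of two systems, aligned by task."""
    t_stat, p_value = stats.ttest_rel(move_scores, scone_scores)
    return t_stat, p_value
```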
Questions:
1. Related to Weakness 3, is there any advantage of the proposed method over SCONE other than downstream performance gains? For instance, does it achieve better inference latency than SCONE?
Fully human-written

MoVE: Mixture-of-Vocabulary-Experts for Improved Representation Learning
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
This paper proposes MoVE (Mixture-of-Vocabulary-Experts), a framework that extends subword vocabularies using a small base vocabulary and expert-based decomposition. Each large-vocabulary token is represented as a composition of base tokens, assigned to a routing expert for embedding refinement. The approach aims to combine the efficiency of small vocabularies with the representational richness of large ones.
Strengths:
1. The study addresses a real efficiency–representation trade-off in encoder-based models: small vocabularies yield longer sequences, large ones suffer from data sparsity.
2. The idea of decomposing rare tokens via base subwords and routing them to balanced experts is elegant and practical.
3. The paper systematically explores vocabulary scaling, routing strategies, and tokenizer choices, presenting broad results on MTEB and BEIR benchmarks.
4. The argument that larger vocabularies can shorten input sequences and thus reduce inference latency is intriguing and relevant for retrieval deployment.
Weaknesses:
1. The paper focuses only on BERT-style encoders. Since the vocabulary–expert idea is general, it would be important to verify whether similar benefits hold for decoder or seq2seq architectures (e.g., GPT, T5, LLaMA).
2. While the method is empirically validated, the paper provides little theoretical discussion on why vocabulary experts improve representation quality, e.g., how the decomposition affects embedding space geometry, token frequency bias, or gradient propagation.
Questions:
1. In Figure 3a, you claim that a 4-layer BERT with MoVE (500k vocab) matches a 12-layer BERT. Can you show exact metrics under identical data, training schedule, and seeds?
2. For the reported >35% inference speedup, what are the actual average sequence lengths (in tokens) for each vocabulary size, and were offline-cached embeddings allowed for baselines?
3. Figure 6 compares routing strategies (hash vs. learned vs. cluster). Can you provide standard deviations or multiple-seed averages for these bars?
Fully AI-generated