ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 2.00 | 4.00 | 9534 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (75%) | 4.67 | 3.67 | 2201 |
| Total | 4 (100%) | 4.00 | 3.75 | 4034 |
Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes Dynamic Experts Search (DES), a test-time scaling strategy tailored for Mixture-of-Experts (MoE) LLMs. The key idea is to treat the number of activated experts k at inference as a controllable search dimension, rather than a fixed architectural constant. DES has two components: (1) Dynamic MoE, which exposes k as a knob during generation to induce structural diversity in reasoning trajectories; and (2) Expert Configuration Inheritance, which keeps k consistent along a single trajectory but varies it across trajectories so the search can both maintain stability within a path and explore diverse configurations across paths. The empirical picture is fairly comprehensive: DES is evaluated across several MoE models (Qwen3-30B-A3B, Ling-lite-1.5, OLMoE-1B-7B-Instruct, DeepSeek-V2-Lite-Chat), two verifiers (Qwen2.5-Math-PRM-7B and Llama3.1-8B-PRM-Deepseek-Data), and multiple benchmarks in math (MATH500, AIME24/25, GSM8K, SVAMP), code (HumanEval, LiveCodeBench), and knowledge (LiveBench reasoning). Against Best-of-N, Beam Search, and DVTS, DES consistently improves accuracy and precision at similar reported cost. Ablations suggest that (i) exploring multiple k values increases the chance of hitting a configuration that yields a correct solution, and (ii) inheriting k along a trajectory filters out unpromising configurations as the verifier prunes candidates, increasing pass@N and final accuracy. The paper also argues that DES does not simply “use more experts,” showing the average activated experts during search does not exceed the default. Overall, the paper pushes a simple but compelling idea: leverage the latent structural flexibility of MoE at test time to unlock complementary solution sets that temperature-based sampling alone does not reach.

Strengths:
- The central idea is fresh and timely: moving beyond output-level diversity (temperature/top-p) to architecture-aware diversity by controlling MoE’s activated expert count. It’s a natural fit for the growing prevalence of MoE models.
- Methodologically simple and easy to reason about: treating k as a test-time knob plus a straightforward inheritance rule that gives implicit configuration-level credit assignment along the search.
- Empirical breadth and consistency: many models, two verifiers, and diverse benchmarks with sensible, consistent improvements over multiple baselines. The gains are not tied to one model family or evaluation domain.
- Thoughtful ablations: the paper separately probes the contributions of exploring different k and inheriting k; the violin plots and step-wise views help argue that improvements are not a byproduct of trivially activating more experts.
- Practical upside: if exposing k can be standardized in inference stacks (e.g., vLLM), DES could be a “drop-in” test-time enhancement that avoids retraining and scales with budget.

Weaknesses:
- Cost accounting is under-specified for MoE: the paper mainly uses generated tokens as a proxy for computation, but per-token cost scales with k in MoE. Even if average k doesn’t exceed the default, small shifts matter. A fairer comparison would report FLOPs or at least “token-count × average-k” (and preferably align average k across methods or normalize results by FLOPs).
- Reproducibility and implementation detail gaps: changing top-k routing at inference isn’t universally exposed; capacity factors, load-balancing, and token dropping can complicate things. The paper should document how k is controlled in practice (per-layer or global, capacity handling, any router noise) and quantify the throughput/latency/memory impact of varying k.
- Evidence on “complementary solution sets” could be tightened: the Jaccard analysis is persuasive, but I would like to see stronger statistics across seeds/difficulty strata and a low-temperature or fixed-seed setting isolating structural (k) diversity from sampling diversity.
- Baselines could be stronger on diversity: comparisons to more diversity-focused search variants (temperature schedules, top-p sweeps per branch, stochastic beam, nucleus-beam hybrids, or recent foresight/token-level diversity methods) would better establish DES’s advantage.
- Decision-rule inconsistency: the text alternates between PRM-best and majority vote for final selection. The main results should lock in one rule (and report the other as supplemental), along with an analysis of where the two diverge.
- Sensitivity analysis: results depend on the initial k set, T, M, and temperature. A more systematic study with CIs would help practitioners know how to set these knobs and how robust the gains are to reasonable changes.
- PRM reliance and potential leakage: the paper should explicitly discuss possible overlaps between PRM training data and the eval sets, and include a sanity check with an alternate verifier or a pure answer-checking setup to show robustness.
- Minor editorial issues: a duplicated paragraph in the appendix on math answer extraction; one appendix table labels a verifier as a “policy model”; and some notation (e.g., Top_M) could be made more precise. These are easy fixes but worth cleaning up.

Questions:
The idea is interesting and promising, but the implementation and cost accounting need to be clearer. For reproducibility and fair comparison, I’m especially looking for the following.

1) k implementation and control
- How exactly do you expose and tune the MoE router’s top‑k in your inference stack (vLLM/Transformers)? Is k global across all MoE layers or configurable per layer? When k changes, do you adjust capacity factor, token dropping, or router noise to keep routing stable?
- Does changing k require edits to model weights or router kernels? Please provide the minimal diffs/flags for reproducibility.

2) Compute cost accounting
- Since per‑token FLOPs scale with k in MoE, token count alone isn’t sufficient. Please report FLOPs or a reliable surrogate (e.g., tokens × average‑k × number of MoE layers), and plot accuracy vs. FLOPs under cost normalization.
- Provide throughput/latency/memory as a function of k (tokens/s, per‑sample latency, peak VRAM), and show how DES affects system efficiency across context lengths.

3) Capacity and routing behavior
- As k increases, do you see more capacity overflow or token dropping? What capacity factor and load‑balancing settings do you use? Please report drop rates vs. k and any mitigation.
- Do you add router noise at inference? If so, how does it interact with k and affect reproducibility?

4) Decision rule consistency
- The paper alternates between majority voting and PRM‑best. Which rule is primary for the main results? Please add an agreement analysis when they diverge, with an error taxonomy and sensitivity to budget N.

5) Structural vs. sampling diversity
- The Jaccard analysis is compelling. Can you isolate structural (k) diversity from sampling diversity by fixing seeds or using very low temperatures while varying only k? Please report statistics across multiple seeds and difficulty strata, with significance tests.
- Include per‑problem case studies where changing k flips failure to success, and describe the associated routing/load changes.

6) Diversity‑oriented baselines
- To strengthen claims over diversity‑aware methods, please add or clarify: stochastic beam variants, temperature/top‑p schedules per subtree, nucleus‑beam hybrids, token‑level diversity (e.g., Phi‑decoding), and stronger DVTS configurations. Match budgets under a common cost normalization.

7) Hyperparameter sensitivity and robustness
- Run systematic sweeps (with CIs) over the initial k set (range and granularity), T, M, and temperature. How sensitive are gains to these knobs? What happens as n (number of initial k values) is reduced?
- Report step‑wise average k and its variance under DES, and verify the claim that “average k does not exceed default” across datasets and budgets.

8) PRM dependence and potential leakage
- Audit for overlap between PRM training data and evaluation sets (MATH/AIME, GSM8K, LiveBench, etc.). If overlap exists, how do results change after filtering?
- Add robustness tests with an alternative verifier and a pure answer‑checker (no PRM) to assess DES’s reliance on PRM scoring quality.

9) When and why DES helps
- Can you correlate problem characteristics with preferred k (e.g., routing entropy, gate‑margin distributions, reasoning length/depth)? Any evidence that certain layers benefit more from larger/smaller k?
- Does per‑layer heterogeneous k (or a schedule that adapts k across steps) further improve performance, and is it stable?

10) Fairness of “thinking mode” comparisons
- Are comparisons equalized by FLOPs or only by tokens? Please report accuracy vs. FLOPs and pass@N, and time‑to‑first‑correct where applicable. Clarify stopping criteria and any length penalties.

11) Reproducibility details
- Will you release code/configs to toggle k at inference? Please provide seeds, prompts, router settings, vLLM flags, and any CUDA/kernel constraints. Also fix minor editorial issues (duplicated math answer‑extraction paragraph, verifier mislabeled as “policy model,” precise definition of Top_M in Eq. (4)).

12) Beyond MoE
- Do you see analogous “structural knobs” for dense models (e.g., dynamic depth/width, selective channels in grouped‑FFN) where the DES paradigm transfers? Any preliminary evidence?

Overall, I like the motivation and direction. If you firm up the implementation and cost details above, the conclusions will be stronger and the work easier to reproduce and extend.

EditLens Prediction: Fully AI-generated
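To make the search loop and the cost surrogate discussed in this review concrete, here is a minimal sketch of the DES-style procedure described in the summary, assuming a hypothetical `generate_step(prompt, steps, k)` interface that extends a partial trajectory with k experts activated and a hypothetical `prm_score` verifier call; it is an illustration, not the authors' implementation, and the accumulated `flops_surrogate` corresponds to the tokens × average-k × MoE-layers proxy requested in question 2.

```python
"""Minimal sketch of a DES-style search with a tokens x k x MoE-layers cost
surrogate. `generate_step` and `prm_score` are hypothetical placeholders."""
import random
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    k: int                                 # activated-expert count, inherited along the path
    steps: List[str] = field(default_factory=list)
    tokens: int = 0                        # generated tokens, for the cost surrogate


def des_search(
    prompt: str,
    generate_step: Callable[[str, List[str], int], str],
    prm_score: Callable[[str, List[str]], float],
    k_values: List[int],                   # initial set of expert counts to explore
    beams_per_k: int,
    keep_m: int,                           # beams kept after PRM pruning (Top_M)
    max_steps: int,
    moe_layers: int,
):
    # Seed the search with beams spread uniformly over the candidate k values.
    beams = [Trajectory(k=k) for k in k_values for _ in range(beams_per_k)]
    flops_surrogate = 0.0                  # tokens x k x MoE layers, as requested above

    for _ in range(max_steps):
        for b in beams:
            step = generate_step(prompt, b.steps, b.k)   # k stays fixed within a path
            b.steps.append(step)
            n_tok = len(step.split())                    # crude token count
            b.tokens += n_tok
            flops_surrogate += n_tok * b.k * moe_layers

        # Expert Configuration Inheritance: pruning by PRM score implicitly
        # filters out unpromising k values, since k travels with the beam.
        beams.sort(key=lambda b: prm_score(prompt, b.steps), reverse=True)
        beams = beams[:keep_m]

    best = max(beams, key=lambda b: prm_score(prompt, b.steps))
    return best, flops_surrogate


# Toy usage with dummy stubs (placeholders only, to make the sketch runnable).
if __name__ == "__main__":
    dummy_gen = lambda prompt, steps, k: f"step with k={k}"
    dummy_prm = lambda prompt, steps: random.random()
    best, cost = des_search("2+2=?", dummy_gen, dummy_prm,
                            k_values=[4, 8, 12], beams_per_k=2,
                            keep_m=3, max_steps=3, moe_layers=24)
    print(best.k, cost)
```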
Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The work proposes an inference-time scaling method that integrates two key components: 1) Dynamic MoE, which allows direct adjustment of the number of active experts during inference, and 2) Expert Configuration Inheritance, which maintains consistent expert usage within a reasoning trajectory while permitting variation across different runs. The paper systematically studies how the number of activated experts at test time influences the model’s reasoning behavior, finding that varying the number of activated experts produces complementary solution sets and thus offers a new source of diversity in addition to output sampling.

Strengths:
- The paper studies a new perspective on test-time search specifically for MoE models, which should have a large audience given the community's interest in MoE LLMs.
- The paper is well written.

Weaknesses:
- As shown in Table 1, the proposed method does not show significant improvement over the Best-of-N strategy.
- The work explores different expert counts at inference time. However, the number of activated experts is typically fixed during training, so changing it at inference introduces a parameter distribution shift that does not align with the training behavior. The paper does not demonstrate the significance of the proposed method through qualitative case studies, quantitative experiments, or theoretical proof.
- This inference-time scaling method is not applicable to closed-source models.

Questions:
The paper mentions that "DES enhances accuracy and stability without additional cost." How do we prove its stability compared to other baselines?

EditLens Prediction: Fully human-written
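The distribution-shift concern in the second weakness can be illustrated with a toy top-k router. The snippet below assumes one common routing scheme (softmax over router logits, keep the top-k gates, renormalize); actual MoE implementations differ in details, so this is only a sketch of why changing k at inference alters the mixture weights relative to the ones the model was trained with.

```python
"""Toy illustration of how inference-time k changes top-k routing weights.
The logits below are made-up values, not taken from any model."""
import numpy as np


def topk_route(router_logits: np.ndarray, k: int) -> np.ndarray:
    """Return per-expert mixture weights: softmax, keep top-k, renormalize."""
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()
    weights = np.zeros_like(probs)
    top = np.argsort(probs)[-k:]          # indices of the k largest gates
    weights[top] = probs[top] / probs[top].sum()
    return weights


logits = np.array([2.0, 1.5, 0.3, -0.2, -1.0, -1.3])  # toy router logits
for k in (2, 4):
    print(k, np.round(topk_route(logits, k), 3))
# The k=4 mixture puts mass on experts that received zero weight under the
# training-time k=2, which is the mismatch the reviewer points to.
```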
Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces a new Test-Time Scaling (TTS) strategy that leverages the structural flexibility of Mixture-of-Experts (MoE) models to improve reasoning performance. The proposed method, Dynamic Experts Search (DES), dynamically adjusts the number of activated experts during inference and maintains consistent configurations along reasoning paths through an inheritance mechanism, thus balancing stability and diversity in exploration. Experiments across multiple MoE-based LLMs and reasoning benchmarks show that DES achieves higher accuracy and precision than prior TTS baselines without substantially increasing computational cost, offering an architecture-aware perspective on improving LLM reasoning.

Strengths:
1. The paper is well written and easy to follow.
2. Substantial experimental results are presented to show the effectiveness and efficiency of the proposed method.
3. The proposed method is simple and novel.

Weaknesses:
1. Some technical details seem not fully logical; please see my questions below.
2. The proposed method must be used with a trained reward model, which is not available in all reasoning domains. This is an obvious limitation.

Questions:
1. Why do you think using different numbers of experts at each step of generation would waste compute? It is true that if we keep k fixed for a complete trajectory, different values of k induce different results. However, this does not mean that employing a different k at each step within a trajectory would lead to inferior results compared to an optimal fixed k. I feel that the motivation for expert configuration inheritance should be revised.
2. What if DES is used to reason over tasks for which no reward model exists?

EditLens Prediction: Fully human-written
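Regarding question 2 above (running DES without a reward model), one conceivable verifier-free fallback, not proposed in the paper, is to keep the expert count k as the diversity knob but replace PRM pruning with answer-level majority voting over complete trajectories. The sketch below uses hypothetical `generate_full` and `extract_answer` placeholders.

```python
"""Sketch of a PRM-free variant: sample full trajectories under different
expert counts and pick the final answer by majority vote. Placeholders only."""
from collections import Counter
from typing import Callable, List, Tuple


def verifier_free_des(
    prompt: str,
    generate_full: Callable[[str, int], str],   # full trajectory with k experts
    extract_answer: Callable[[str], str],       # parse the final answer string
    k_values: List[int],
    samples_per_k: int,
) -> Tuple[str, Counter]:
    answers = []
    for k in k_values:
        for _ in range(samples_per_k):
            answers.append(extract_answer(generate_full(prompt, k)))
    votes = Counter(answers)
    return votes.most_common(1)[0][0], votes


# Toy usage with placeholder stubs:
if __name__ == "__main__":
    gen = lambda prompt, k: f"... therefore the answer is {4 if k >= 8 else 5}"
    ans, votes = verifier_free_des("2+2=?", gen, lambda t: t.split()[-1],
                                   k_values=[4, 8, 12], samples_per_k=2)
    print(ans, votes)
```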
Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This work introduces a test-time scaling method for reasoning LLMs with the Mixture-of-Experts architecture. Different from traditional strategies that focus on generating different reasoning traces, the authors propose to unlock more flexibility in the reasoning model by considering different configurations of the number of experts activated during the reasoning process. This idea is well motivated in Figure 1: the correlation plots show that activating different experts can significantly impact the type of questions the model can correctly answer. In the experiments, the authors explore a simplified strategy to demonstrate the effectiveness of the method. Under a given computation budget, the search starts from a uniformly distributed set of expert counts; for each task, the policy model then helps search for the optimal number of experts, and each reasoning trace uses a fixed expert count. Experiments show that the proposed method has pros and cons: with a low computation budget, the method performs worse than standard beam search; with a higher computation budget, it achieves better performance than baselines because each num-of-expert setup can be better explored.

# Very Interesting Topic
- The idea of exploring different expert setups is very interesting and also well motivated by experiment results.
- The experiment setup is well defined and comparisons are pretty fair.
- Experiments show that under certain conditions the proposed method can outperform standard beam search and selected baselines.
- The ablation studies provide helpful insights about the proposed method.

# Claims / clarifications
- Section 4.2.1 claims that the proposed method "constantly outperform baselines on both accuracy and precision", but this claim only holds when the model is provided with a massive computation budget.
- When compared to thinking mode, it is unclear which sampling method is used, and there are no experimental results on DES + thinking.

# Effectiveness
- I think Figure 5 is the core observation of this work: when there is enough computation budget to explore different expert-count setups, the model achieves better performance. As a result, I think the proposed method relies heavily on larger budgets and will not be effective in applications that require efficient inference.

# Utilization / scaling [not weakness, just my opinion]
- I feel that as model size increases, the observation that different expert setups strongly differentiate which questions are answerable will become weaker. As a result, the proposed method might not scale to bigger models; in my opinion, the authors should highlight in the limitations that this method only works for smaller models and might not contribute to the R&D of larger models. I'm not asking the authors to experiment with frontier-level open-source models.

# Depth of the research
- I'm very excited to read the motivation and the problem statement, and I believe there are very interesting problems in this space. Fixing the number of experts across layers & tokens seems over-simplified compared to the motivation of this paper.
- I understand the search space for the expert number shouldn't be num_layer * num_token * num_expert, but the authors could explain or cite related studies to give readers better intuition about how to narrow down the search space.

Questions:
See weaknesses.

EditLens Prediction: Fully human-written
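The search-space point in the last bullet can be made concrete with a back-of-the-envelope count: if every MoE layer could choose its own expert count at every generation step, the number of configurations grows exponentially, whereas fixing one k per trajectory keeps it equal to the number of candidate values. The numbers below are illustrative, not taken from the paper.

```python
"""Back-of-the-envelope comparison of per-layer, per-step expert-count
choices vs. one fixed k per trajectory. All values are illustrative."""
num_candidate_k = 4      # e.g. k in {4, 6, 8, 10}; illustrative only
num_moe_layers = 24      # illustrative layer count
num_steps = 512          # illustrative generation length in steps

per_layer_per_step = num_candidate_k ** (num_moe_layers * num_steps)
per_trajectory = num_candidate_k

print(f"per-layer, per-step choices: {num_candidate_k}^{num_moe_layers * num_steps} "
      f"(~10^{len(str(per_layer_per_step)) - 1})")
print(f"one k per trajectory: {per_trajectory}")
```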