ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 3 (75%) 4.00 4.00 3416
Heavily AI-edited 0 (0%) N/A N/A N/A
Moderately AI-edited 0 (0%) N/A N/A N/A
Lightly AI-edited 0 (0%) N/A N/A N/A
Fully human-written 1 (25%) 6.00 4.00 928
Total 4 (100%) 4.50 4.00 2794
PT$^2$-LLM: Post-Training Ternarization for Large Language Models

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes PT2-LLM, a post-training quantization framework for ternary quantization. Model weights undergo iterative ternary fitting (ITF) to minimize the layerwise weight-level L2 error. Following this, a single round of activation-aware grid alignment (AGA) is run to minimize the layerwise activation L2 error. The paper also proposes column reordering to mitigate the effect of outliers in block-level quantization. PT2-LLM improves over several methods in both perplexity and downstream evals.

Strengths:
* The ternary quantized model gets a significant speedup over the 2-bit model.
* Quantization runtime remains comparable to GPTQ.
* The paper also evaluates frontier open-source models like Qwen3.

Weaknesses:
The paper does not compare with vector quantization methods like QuIP# or AQLM that allow for ultra-low precision.

Questions:
* When applied without ITF, does AGA still overfit with multiple steps?
* Has blockwise thresholding been tried for ITF?

EditLens Prediction: Fully human-written
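For readers unfamiliar with this style of ternarization, below is a minimal sketch of alternating least-squares fitting of one weight row as w ≈ αt + β with t ∈ {-1, 0, +1}: re-round on the current grid, then refit the scale and offset. This is only a generic illustration of the kind of iterative ternary fitting the reviews describe, not the paper's actual ITF or AGA procedure; the function name, the 0.75 threshold constant, and the iteration count are assumptions.

```python
import numpy as np

def ternarize_row(w, iters=5):
    """Fit w ~ alpha * t + beta with t in {-1, 0, +1} by alternating least
    squares. Generic illustration only, not the paper's ITF/AGA pipeline."""
    beta = w.mean()
    r = w - beta
    delta = 0.75 * np.abs(r).mean()            # common threshold heuristic for ternary quantizers
    t = np.sign(r) * (np.abs(r) > delta)       # initial ternary assignment
    alpha = np.abs(r[t != 0]).mean() if np.any(t) else 0.0

    for _ in range(iters):
        # (1) Round each weight to the nearest grid point {beta-alpha, beta, beta+alpha}.
        if alpha > 0:
            t = np.clip(np.round((w - beta) / alpha), -1, 1)
        # (2) Refit scale and offset by ordinary least squares of w on t.
        t_mean, w_mean = t.mean(), w.mean()
        var_t = ((t - t_mean) ** 2).sum()
        if var_t > 0:
            alpha = ((t - t_mean) * (w - w_mean)).sum() / var_t
        beta = w_mean - alpha * t_mean
    return t.astype(np.int8), alpha, beta

rng = np.random.default_rng(0)
w = rng.normal(loc=0.02, scale=0.1, size=4096)   # synthetic, slightly asymmetric row
t, alpha, beta = ternarize_row(w)
print("L2 error:", np.linalg.norm(w - (alpha * t + beta)))
```

Both alternating steps are exact minimizers of the row-wise L2 error given the other variables, so the error is non-increasing across iterations; an activation-aware calibration step of the kind the review mentions would instead weight this objective by calibration inputs.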
PT$^2$-LLM: Post-Training Ternarization for Large Language Models

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper tackles the challenge of performing ternary (3-value) quantization on large language models in a post-training (PTQ) setting, where weight distributions and outliers make naive ternarization highly error-prone. The authors propose PT2-LLM, combining an Asymmetric Ternary Quantizer (ATQ), with row-wise offset and scale, and a Structural Similarity-based Reordering (SSR) step that rearranges columns before quantization to reduce variance. Experiments on multiple LLMs (e.g., LLaMA variants) and benchmark tasks show that PT2-LLM outperforms existing quantization baselines, achieving lower perplexity and higher accuracy under the same compression constraints. The results suggest that careful adaptation of ternary quantization (non-symmetric, order-aware) can make extreme low-bit quantization feasible for LLMs in PTQ settings.

Strengths:
- The method explicitly models asymmetry in weight distributions and block-level structure, making the ternarization process more compatible with real LLM architectures.
- The combination of the Asymmetric Ternary Quantizer (ATQ) with a structural similarity-based reordering (SSR) step is clever and helps reduce quantization error.
- The authors evaluate on multiple large language models and a range of NLP tasks, showing consistent gains over state-of-the-art quantization baselines under comparable compression constraints.
- In addition to accuracy metrics, they analyze memory, latency, and inference speed, demonstrating PT2-LLM's utility in real-world deployment scenarios.

Weaknesses:
- Although the paper achieves strong final results, the three claimed innovations lack originality and are quite similar to prior methods such as AWQ [1], RPTQ [2], and ARB-LLM [3], appearing more as a combination of existing core techniques than as truly novel contributions.
- SSR and the calibration steps (ITF/AGA) add computation and implementation complexity to the PTQ pipeline, which could be nontrivial for very large models or constrained environments.
- Quantization accuracy degrades more noticeably on newer models (e.g., LLaMA3-8B and Qwen3-14B-Base), suggesting that the method may not scale robustly to newer models.
- The paper could provide a more comprehensive evaluation on challenging benchmarks such as MMLU, GPQA, and GSM8K to better demonstrate its effectiveness across diverse reasoning tasks.

References:
[1] Lin, Ji, et al. "AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration." Proceedings of Machine Learning and Systems 6 (2024): 87-100.
[2] Yuan, Zhihang, et al. "RPTQ: Reorder-based post-training quantization for large language models." arXiv preprint arXiv:2304.01089 (2023).
[3] Li, Zhiteng, et al. "ARB-LLM: Alternating refined binarizations for large language models." arXiv preprint arXiv:2410.03129 (2024).

Questions:
Please refer to the weaknesses above.

EditLens Prediction: Fully AI-generated
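For intuition about similarity-based column reordering of the kind SSR is described as performing, here is a minimal, hypothetical sketch: columns are greedily chained by cosine similarity so that columns falling into the same quantization block are alike. The greedy criterion, function name, and block handling are assumptions and need not match the paper's actual SSR algorithm.

```python
import numpy as np

def reorder_columns_by_similarity(W):
    """Greedy nearest-neighbour ordering of weight columns by cosine similarity,
    so that block-wise quantization groups contain similar columns.
    Hypothetical illustration; the paper's SSR may use a different criterion."""
    Wn = W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-12)  # unit-norm columns
    sim = Wn.T @ Wn                                              # pairwise cosine similarity
    n = W.shape[1]
    order, visited = [0], np.zeros(n, dtype=bool)
    visited[0] = True
    for _ in range(n - 1):
        last = order[-1]
        cand = np.where(~visited, sim[last], -np.inf)            # mask already-placed columns
        nxt = int(np.argmax(cand))                               # most similar remaining column
        order.append(nxt)
        visited[nxt] = True
    return np.array(order)

W = np.random.default_rng(1).normal(size=(512, 1024))
perm = reorder_columns_by_similarity(W)
W_reordered = W[:, perm]   # quantize block by block; the matching input permutation must be applied at inference
```

The O(n^2) similarity matrix is cheap relative to quantizing a full LLM layer, which is consistent with (though not evidence for) the "lightweight" framing discussed in the next review.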
PT$^2$-LLM: Post-Training Ternarization for Large Language Models

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper presents a novel post-training ternarization framework tailored for large language models (LLMs), aiming for efficient compression without retraining. It reduces the average bitwidth to 1.58 bits and achieves better performance than state-of-the-art (SOTA) 2-bit PTQ methods.

Strengths:
1. Focus on Extreme Low-Bit PTQ: The paper explores the challenging but important domain of sub-2-bit post-training quantization, an area with significant potential for deploying LLMs on resource-constrained devices.
2. Systematic Optimization Pipeline: The proposed two-stage optimization process (ITF for weights, AGA for outputs) is a structured approach to a complex problem, attempting to balance local reconstruction error with global functional preservation.
3. Favorable Reported Results on Standard Benchmarks: The reported results in Table 1 indicate that the proposed 1.58-bit method achieves lower perplexity and higher average accuracy on several standard zero-shot QA tasks than the selected 2-bit PTQ baselines across various model scales.

Weaknesses:
1. Weak Theoretical Grounding and Unclear Practicality of the SSR Strategy: The proposed Structural Similarity-based Reordering (SSR) strategy lacks a strong theoretical connection to the primary objective of minimizing quantization error. Unlike Hessian-based methods... its objective (maximizing cosine similarity) is decoupled from the actual error minimization process. This weak theoretical grounding makes it difficult to understand why it works or to guarantee its effectiveness. The ambiguity is compounded by a lack of implementation clarity and cost analysis: the paper claims SSR is a "lightweight strategy" but provides neither a detailed algorithm nor an analysis of its computational overhead, making it difficult to assess its practicality or to reproduce the results.
2. Unsubstantiated Claim of SOTA due to Lack of Direct Comparison: The paper's central claim of state-of-the-art performance at 1.58 bits is not adequately supported. The evaluation relies primarily on comparisons against higher-bitwidth (2-bit) methods. While outperforming them is a positive signal, a direct, apples-to-apples comparison against another 1.58-bit PTQ baseline is necessary to validate the true efficacy of the proposed techniques over other potential ternarization strategies. The absence of such a comparison is a significant gap in the evaluation, leaving the paper's primary claim unsubstantiated.
3. Insufficient Validation on Recent and Diverse Architectures: The paper's claims of general applicability are undermined by an incomplete evaluation on recent model families and a complete lack of testing on Mixture-of-Experts (MoE) architectures. While thorough on older LLaMA/LLaMA-2 models, the validation on the more recent LLaMA-3 and Qwen3 families is limited to a single small-scale model each. This raises questions about the method's robustness on larger variants like LLaMA-3-70B. Furthermore, the absence of any experiments on MoE models (e.g., Qwen3-MoE) is a significant omission, as these architectures present unique quantization challenges that are not addressed.
4. Mismatch Between Model Capabilities and Evaluation Benchmarks: The choice of evaluation benchmarks fails to adequately probe the impact of extreme quantization on the core competencies of the tested models. Specifically, while the paper includes modern, reasoning-capable models like Qwen3, it restricts evaluation to older, simpler zero-shot QA benchmarks (e.g., ARC, BoolQ). These benchmarks do not effectively measure the multi-step, complex reasoning abilities that are a hallmark of such models. Consequently, the evaluation may be masking significant performance degradation on these crucial capabilities. A comprehensive assessment requires reporting performance on advanced reasoning benchmarks (e.g., MATH, AIME).

Questions:
1. Can the authors provide a more formal justification for the SSR strategy, explaining its connection to quantization error minimization beyond the intuition of variance reduction? Furthermore, please clarify the algorithmic details of its integration and provide an analysis of its computational overhead.
2. Could the authors provide a direct comparison against another 1.58-bit PTQ method to substantiate their claim of state-of-the-art performance? If no such public method exists, could they construct a strong baseline (e.g., by adapting GPTQ or other frameworks to a ternary setting) for a more rigorous comparison?
3. Why were the evaluations on the LLaMA-3 and Qwen3 families limited to a single model each? To support the claim of generalizability, please provide results on larger models like LLaMA-3-70B and on MoE variants like Qwen3-MoE.
4. Given that Qwen3 is known for its reasoning abilities, can the authors provide evaluation results on more challenging reasoning benchmarks like MATH or AIME to give a more transparent assessment of how 1.58-bit quantization affects these critical capabilities?

EditLens Prediction: Fully AI-generated
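For reference, the "1.58-bit" figure used in this and the other reviews is simply the information content of a ternary symbol; the arithmetic below is standard, and the remark about overhead is an assumption rather than a statement about the paper's storage format.

$$
\log_2 3 \approx 1.585 \ \text{bits per weight},
\qquad
\frac{16\ \text{bits (FP16)}}{\log_2 3} \approx 10.1\times \ \text{(ideal)}.
$$

The 7.17x compression ratio reported in the next review presumably sits below this ideal, if measured against FP16 weights, because of per-block scales and offsets, packing granularity, and other metadata.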
PT$^2$-LLM: Post-Training Ternarization for Large Language Models

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces PT2-LLM, a novel post-training ternarization framework for large language models (LLMs) that compresses weights to ternary values $\{-1, 0, +1\}$ without retraining. The authors address two key challenges in post-training quantization (PTQ): training-free parameter optimization and handling outliers/dispersed weights. Their solution comprises an Asymmetric Ternary Quantizer (ATQ) with a two-stage pipeline (Iterative Ternary Fitting (ITF) for grid construction and rounding, and Activation-aware Grid Alignment (AGA) for output calibration), alongside a Structural Similarity-based Reordering (SSR) strategy.

Strengths:
The paper presents a technically innovative approach to an underexplored problem: ternarization of LLMs in PTQ settings. The ATQ framework with its ITF-AGA pipeline is well motivated, addressing asymmetric weight distributions through mathematically grounded grid optimization (Equations 9 and 13). The SSR strategy offers a clever solution to outlier handling by leveraging structural correlations, demonstrating significant error reduction in Figure 3. The empirical evaluation is comprehensive, spanning multiple model architectures (7B–65B parameters) and tasks (perplexity, zero-shot QA), with rigorous ablation studies validating component contributions (Table 2). The 7.17× compression ratio and end-to-end speedup highlight practical value, while the commitment to code/model release enhances reproducibility.

Weaknesses:
Block size sensitivity is inadequately explored: all experiments use a fixed block size of 128 (Section 4.1), but no analysis justifies this choice or examines its impact on quantization error. Model diversity is restricted to decoder-only transformers (LLaMA/Qwen3), excluding encoder-decoder or hybrid architectures. Scalability analysis is absent: no computational complexity bounds are provided for ATQ or SSR. Key baselines are missing: the paper compares to 2-bit PTQ methods (GPTQ, AWQ) but omits other ternarization approaches in PTQ settings (e.g., TernaryLLM [Chen et al., 2024] without training). Parameter sensitivity analysis is superficial: only the calibration set size is ablated (Table 2c), while critical hyperparameters (e.g., the threshold $\Delta \approx 0.75/m \sum |\tilde{W}_{:,j}|$ in Equation 3) lack sensitivity analysis.

Questions:
See the weaknesses above. This is interesting work, and I would be willing to raise my ratings if the above concerns are well addressed.

EditLens Prediction: Fully AI-generated
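Regarding the threshold $\Delta$ quoted from Equation 3: for symmetric threshold-based ternarization of a length-$m$ vector $w$ (offset ignored), the L2-optimal scale for a given threshold has a simple closed form, and the $\approx 0.75 \cdot \mathrm{mean}(|w|)$ rule is the usual surrogate for choosing $\Delta$. This is a standard identity from the ternary-weight-network literature, not a claim about the paper's exact derivation.

$$
I_\Delta = \{\, i : |w_i| > \Delta \,\}, \qquad
t_i = \operatorname{sign}(w_i)\,\mathbf{1}\!\left[\,|w_i| > \Delta\,\right], \qquad
\alpha^*(\Delta) = \frac{1}{|I_\Delta|} \sum_{i \in I_\Delta} |w_i|,
$$
$$
\bigl\| w - \alpha^*(\Delta)\, t \bigr\|_2^2 \;=\; \|w\|_2^2 \;-\; |I_\Delta|\,\alpha^*(\Delta)^2 ,
$$

so choosing $\Delta$ amounts to maximizing $|I_\Delta|\,\alpha^*(\Delta)^2$, for which $\Delta \approx 0.7\text{--}0.75 \cdot \frac{1}{m}\sum_i |w_i|$ is a common closed-form approximation under roughly Gaussian weights. A sensitivity analysis of the kind this review requests would essentially sweep $\Delta$ around this point and track the resulting error.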