ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (50%) | 4.00 | 3.00 | 2814 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 6.00 | 3.00 | 1668 |
| Fully human-written | 1 (25%) | 2.00 | 4.00 | 3300 |
| Total | 4 (100%) | 4.00 | 3.25 | 2649 |
Review 1: How Long Do Model Patches Last? A Temporal Perspective on PortLLM

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: This paper investigates the longevity and stability of model patches — small, targeted modifications applied to large language models (LLMs) to correct factual errors, adjust alignment, or steer preferences without full retraining. The authors propose an experimental framework for evaluating patch persistence under different settings: patch types, tasks, and metrics. The study highlights a crucial but underexplored aspect of LLM maintenance: how long model edits or behavioral fixes truly last.

Strengths:
- The paper addresses an important yet under-discussed topic in LLM reliability — the temporal persistence of behavioral interventions — which is highly relevant to real-world LLM maintenance, safety, and continual learning research, and is quite novel.
- The authors examine multiple patch types and task categories systematically.
- The observed rapid decay and patch interference patterns are convincing and empirically grounded.
- Figures (especially Fig. 5–7) effectively demonstrate non-linear patch degradation trends.

Weaknesses:
- The chosen tasks (mainly factual correction and alignment) are useful but narrow. No reasoning, multi-hop, or tool-use scenarios are explored, which might reveal more complex patch dynamics.
- Limited comparison to recent model editing robustness papers such as MEND [1], MEMIT [2], and ROME [3].

[1] Fast Model Editing at Scale.
[2] Mass-Editing Memory in a Transformer.
[3] Locating and Editing Factual Associations in GPT.

Questions:
- How do you define "decay" quantitatively? Is it based on performance drop relative to baseline, or absolute change in logits / accuracy? (A toy sketch of one possible definition appears after this review.)
- Have you tested patch persistence across different model scales (e.g., 7B → 13B → 70B)?
- Did you observe any cases where later fine-tuning strengthened rather than degraded patch effects?

EditLens Prediction: Fully AI-generated
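As a reference point for the first question above, here is a minimal sketch of one possible way to quantify decay: the fraction of the patch's initial benefit that has been lost at a later checkpoint. The function name and the numbers are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: quantify "decay" as the fraction of the patch's
# initial benefit (over the unpatched baseline) that has been lost by
# a later base-model checkpoint. Names and values are illustrative.

def patch_decay(perf_patched_t0: float, perf_patched_t: float,
                perf_base_t0: float) -> float:
    """Fraction of the original patch benefit lost by checkpoint t.

    perf_patched_t0 : patched-model metric at the patching checkpoint
    perf_patched_t  : patched-model metric at a later checkpoint
    perf_base_t0    : unpatched baseline metric at the patching checkpoint
    """
    initial_gain = perf_patched_t0 - perf_base_t0
    if initial_gain <= 0:
        return float("nan")  # patch never helped; decay is undefined
    lost_gain = perf_patched_t0 - perf_patched_t
    return lost_gain / initial_gain


# Example: the patch lifted accuracy from 0.70 to 0.82; two base-model
# updates later the patched model scores 0.76, so half the benefit is gone.
print(patch_decay(0.82, 0.76, 0.70))  # -> 0.5
```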
Review 2: How Long Do Model Patches Last? A Temporal Perspective on PortLLM

Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: The paper studies the temporal stability of PortLLM-style "training-free" patches: a patch trained once at $t_0$ is applied to successive, continually pretrained base-model checkpoints $\{\theta_t\}$. It (i) builds a longitudinal evaluation pipeline to measure performance drift, (ii) shows that linear/exponential trends fit the drift better than a no-trend baseline, and (iii) proposes two lightweight, data-free decision tools — failure-time estimation and a target-date hypothesis test — to decide when re-training is needed.

Strengths:
1. Clear problem framing with strong practical motivation (well-scoped RQs). The paper pinpoints a real deployment friction: upstream checkpoints arrive frequently, while downstream teams want to reuse a one-time patch without constantly re-tuning. The two research questions ("when will performance fall below tolerance?" and "will it still meet tolerance on a business-critical date?") are decision-oriented, tie directly to capacity planning, and are answered with tools that require only historical eval metrics (no access to upstream data or re-training). This tight alignment between phenomenon, RQs, and actionable outputs is a genuine plus for applied research.
2. A reusable longitudinal evaluation pipeline. The setup — a fixed patch from $t_0$, successive base checkpoints under continual pretraining, a uniform eval protocol, repeated runs — creates a controlled environment to isolate "patch–base misalignment over time." The pipeline surfaces trend shape (linear vs. log-linear), supports repeated-measure statistics, and includes sensible ablations (segment granularity, token density, multiple samples per time step). This is a useful template others can adopt to stress-test patch transfer under temporal drift.

Weaknesses:
1. External validity: single model lineage. The main longitudinal results focus on one family (e.g., 7B-scale within a single architecture). Without cross-architecture and cross-scale validation (e.g., Llama/Gemma/MoE; 7B → 13B → 70B), it remains unclear whether the observed drift rates and the proposed decision tools generalize beyond this lineage.
2. Upstream evolution is approximated with LoRA rather than full-parameter or heterogeneous vendor updates. Modeling base-model evolution as LoRA updates on attention/FFN layers risks baking in "low-rank additivity" assumptions that may not hold when vendors do full-parameter continual pretraining, change training recipes, or alter architectures. This gap limits confidence that the measured drift and the tool calibration transfer to real release cycles.
3. Temporal axis is short and partially synthetic. The five-month, evenly segmented timeline (two-week, equal-token slices) is convenient but not representative of actual release cadence (e.g., quarterly "big" releases plus intermittent patches). The auxiliary "composed" time series that stitches disparate corpora further departs from realistic evolution. Conclusions about trend linearity and predictability could change under longer, lumpier, or recipe-shifting timelines.

Questions:
1. Task coverage and metrics. Can you expand to production-relevant workloads (code, tool-use, long-context, safety), and replace/augment BLEU for math with verifiable-correctness metrics (programmatic verifiers, unit tests, GSM-style exact correctness)? This would test whether drift rates and the decision tools remain calibrated on harder-to-game metrics.
2. Mechanism: quantify "patch–base misalignment." Beyond the qualitative explanation, can you measure representational/optimization drift — e.g., CKA similarity across layers, low-rank subspace angles for patched vs. re-tuned adapters, curvature/Fisher changes, or gradient alignment — to link measurable misalignment to observed performance decay? This would strengthen causal plausibility and might reveal layers/subspaces where patches are most robust. (A minimal CKA sketch follows this review.)

EditLens Prediction: Fully AI-generated
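A minimal sketch of the linear CKA measurement proposed in question 2, assuming per-layer activation matrices of shape (n_examples, hidden_dim) collected from the patched model and a re-tuned reference; all variable names here are illustrative, not from the paper.

```python
# Hypothetical sketch: linear Centered Kernel Alignment (CKA) between two
# activation matrices from the same layer, one from the patched model and
# one from a re-tuned reference. A per-layer drop in CKA across checkpoints
# would suggest growing patch-base misalignment.

import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activation matrices X, Y of shape (n, d)."""
    X = X - X.mean(axis=0, keepdims=True)   # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))

# Illustrative usage with random stand-ins for layer activations.
rng = np.random.default_rng(0)
acts_patched = rng.normal(size=(256, 128))
acts_retuned = acts_patched + 0.1 * rng.normal(size=(256, 128))
print(linear_cka(acts_patched, acts_retuned))  # close to 1.0 here
```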
Review 3: How Long Do Model Patches Last? A Temporal Perspective on PortLLM

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: PortLLM is introduced as a training-free patching mechanism that enables patch reuse across consecutive LLM releases. This paper conducts large-scale experiments showing that PortLLM patches experience performance decline over time. The results demonstrate that performance degradation is a general and measurable risk when PortLLM is used over extended periods. To address this, the authors propose forecasting algorithms that estimate patch failure dates and test hypotheses about performance at future target points. Their framework allows downstream developers to anticipate degradation and make informed decisions about when retraining or re-patching becomes necessary.

Strengths:
- Important topic of studying when patching fails.
- Clean design that uses historical evals, with precision/AUC analyses and concrete usage guidance.
- Multi-checkpoint experiments consistently show measurable decay, motivating the need for forecasting rather than blind refresh schedules.

Weaknesses:
- The whole analysis relies on the PortLLM paradigm. It is unclear how broadly the results transfer to other patching approaches.
- The entire evolution study uses UpVoteWeb slices, and Appendix G argues that alternative slices don't change the trend, but a single-dataset study on UpVoteWeb may still not reflect vendor training distributions, evaluation policies, or update magnitudes.
- The linear model assumption fits PortLLM in this setup, but does it generalize beyond the linear setting?
- Forecasts are derived from a small number of checkpoints, which constrains the ability to model longer-term dynamics. It is unclear how well the estimators predict beyond the observed window. (A toy extrapolation sketch follows this review.)

Questions: See Weaknesses.

EditLens Prediction: Lightly AI-edited
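A toy illustration of the kind of failure-time extrapolation discussed above: fit a linear trend to the patched model's historical eval metric across checkpoints and solve for when it is predicted to cross a tolerance threshold. The checkpoint values and the tolerance are made up, and this is a sketch in the spirit of the paper's decision tools, not the authors' implementation.

```python
# Hypothetical sketch: least-squares linear trend over historical patched-model
# metrics, extrapolated to the first checkpoint at which the metric is
# predicted to fall below a tolerance. All numbers are illustrative.

import numpy as np

checkpoint_times = np.array([0, 1, 2, 3, 4, 5], dtype=float)     # e.g. two-week slices
patched_metric = np.array([0.82, 0.81, 0.79, 0.78, 0.76, 0.75])  # eval accuracy per checkpoint
tolerance = 0.70                                                  # minimum acceptable accuracy

# Ordinary least-squares fit: metric ~ slope * t + intercept
slope, intercept = np.polyfit(checkpoint_times, patched_metric, deg=1)

if slope >= 0:
    print("No downward trend detected; no failure time predicted.")
else:
    t_fail = (tolerance - intercept) / slope
    print(f"Predicted failure at t ~ {t_fail:.1f} (checkpoint index); "
          f"re-train the patch before then.")
```

The reviewer's concern applies directly to this kind of estimate: with only a handful of checkpoints, the fitted slope is noisy and extrapolation far beyond the observed window is unreliable.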
Review 4: How Long Do Model Patches Last? A Temporal Perspective on PortLLM

Soundness: 1: poor
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: Frequent updates to base LLMs lead to significant retraining costs for developers who adapt these LLMs to downstream tasks. To address this issue, PortLLM (Khan et al., 2025) proposed a data- and training-free patching method for portability of patches across temporally evolved LLMs. This paper focuses on this particular approach and studies its long-term effectiveness. In particular, the main goal is to answer the question: how long can PortLLM patching remain effective as the base model evolves? To answer this, the authors perform a statistical analysis of patching performance trends across model updates and develop a time-series modeling framework that characterizes patching performance as a structured temporal process. Based on this framework, the authors provide lightweight algorithms to determine when or whether to retrain, without requiring retraining at every base-model release. Experiments were conducted using the Mistral-7B model, UpVoteWeb as the continual pretraining corpus, and four downstream evaluation datasets.

Strengths:
- The paper focuses on a practical problem.
- The main claims of the paper and the experimental analysis are well presented.

Weaknesses:

**Unrealistic experimental setup**
* This work focuses mainly on base LLMs that are evolved via continual pretraining. However, in practice LLM training involves multiple stages/phases (pre-/mid-/post-training, SFT/preference alignment/RL), and practitioners often use post-trained models in real-world applications due to their superior instruction-following capabilities and human-aligned behaviors. Also, for most models out there today, the difference between model versions is rarely as simple as continual pretraining.
* In real-world settings, model developers work to ensure that future model releases are better than earlier ones. Indeed, in practice, most models get better (on a wide variety of tasks) with successive releases over time. The continual pretraining setup used in this paper (continually pretraining on a small Reddit corpus) is unrealistic, as it does not reflect the realistic setting of base models that improve with time. If we look at Fig. 2 and Fig. 27, the base models get significantly worse on BoolQ, WinoGrande, and ARC-Easy over time and barely improve on MathGenie. Model developers would rarely release such continually degrading models to downstream developers.

**Conclusions that do not extend beyond a particular pretraining corpus**
All the behaviors observed in this paper are strongly tied to the small pretraining corpus used. Continually pretraining on UpVoteWeb makes the base model consistently worse over time, which produces the observed (linear/exponential) trends in performance decline. If a different corpus is used (Fig. 28 in the Appendix), such nice, predictable trends may not exist.

**Fairly obvious conclusion**
* In my view, there is nothing surprising or significant in the following statement: "If a model goes through several updates over time, a downstream task patch obtained with the first checkpoint will start failing." The paper claims showing this experimentally as one of its main contributions.

Questions:
The authors should use a more realistic experimental setup in which continual training does not make the base model significantly worse.

EditLens Prediction: Fully human-written