ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated   | 1 (33%)  | 6.00       | 3.00           | 3460
Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A
Moderately AI-edited | 0 (0%)   | N/A        | N/A            | N/A
Lightly AI-edited    | 1 (33%)  | 6.00       | 4.00           | 3341
Fully human-written  | 1 (33%)  | 4.00       | 2.00           | 3479
Total                | 3 (100%) | 5.33       | 3.00           | 3427

UNDERSTANDING TRANSFORMERS FOR TIME SEIRES FORECASTING: A CASE STUDY ON MOIRAI

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This work provides a clean theory for why transformer-style time series foundation models work, focusing on MOIRAI. The authors show that a transformer can learn to perform least-squares autoregression on univariate series via in-context gradient descent. They then prove that MOIRAI's any-variate encoding and any-variate attention let the model infer an AR predictor with an arbitrary number of covariates, explaining its strong zero-shot and few-shot behavior across domains.

Strengths:
I found the originality, quality, clarity, and significance of the work to be very strong. The paper gives a fresh, concrete theory for time series transformers by showing that a transformer can do least-squares AR via in-context gradient descent and by explaining how MOIRAI's any-variate attention builds per-variate lag histories to support multivariate AR, extending prior ICL views to time series with unknown lags. MOIRAI is a perfect paper in my view, and I am happy that a team is working on it. The technical core is strong, and the theory is then stress-tested on real data. The paper also defines the attention variants and the MOIRAI transformer cleanly and keeps the flow from the setup onward easy to follow. Finally, the authors unify why MOIRAI-style FMs work in forecasting, give a generalization story for non-IID series, and show practical gains over tuned AR baselines on ETTh1/ETTm1, so the results matter both for theory and for building stronger pretrained TS models.

Weaknesses:
While I really enjoyed reading the methodology and approaches the authors followed, I'd like to raise some weaknesses that came to mind, and I'm looking forward to a nice discussion on these with the authors. (1) From my understanding, the theory and constructions use ReLU-based attention rather than softmax, so I'm curious whether the authors are willing to either extend the proofs or add tests showing that the key claims still hold with softmax MOIRAI. (2) Another point is that the pretraining bounds hinge on Dobrushin's condition, so I was wondering if the authors could report a practical check for this on ETTh1/ETTm1, or at least share their thoughts on it. (3) Additionally, I found the real-data study a bit narrow because only ETTh1/ETTm1 and comparisons only to AR models were discussed, so adding another domain and strong time series foundation model baselines would make the empirical case much stronger. Do the authors have any thoughts on this? It would be good to know. (4) Finally, the loss uses output clipping (ClipBx), and the theory and results depend on qmax and dmax, so a short sensitivity study and guidance on choosing these would improve usability.

Questions:
I'd encourage the authors to address the Weaknesses part first; here I have two more questions. (1) Your theory assumes d ≤ dmax and q ≤ qmax, with qmax·dmax tied to the hidden size D. How should we expect the model to behave when test-time d or q exceed these limits, and can we anticipate meaningful degradation versus sharp failure from the construction? It would be good to know the answer to this. (2) Any-variate attention uses a block matrix U with fixed block size T, so I was wondering how the framework handles irregular sampling or missing timestamps across covariates, where a clean block structure does not exist, and what time encoding is implied in that case? Thanks again for your hard work.

EditLens Prediction: Lightly AI-edited
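
For concreteness, here is a minimal sketch (illustrative only, not taken from the submission) of the least-squares AR regression that the review above describes the transformer as emulating in-context: build lagged features from a lookback window, solve ordinary least squares, and predict the next value. The lag order d, the toy series, and all names are assumptions for this sketch.

```python
# Minimal sketch of least-squares AR(d) regression on a lookback window.
# Not the paper's code; lag order and toy series are illustrative.
import numpy as np

def fit_ar_least_squares(x: np.ndarray, d: int) -> np.ndarray:
    """Fit AR(d) weights w minimizing sum_t (x_t - w . [x_{t-d}, ..., x_{t-1}])^2."""
    X = np.stack([x[t - d:t] for t in range(d, len(x))])  # lag features, oldest lag first
    y = x[d:]                                             # next-step targets
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_next(x: np.ndarray, w: np.ndarray) -> float:
    """One-step-ahead forecast from the last d observations."""
    d = len(w)
    return float(x[-d:] @ w)

rng = np.random.default_rng(0)
# Toy stationary AR(2) series standing in for the in-context "prompt".
true_w = np.array([0.5, -0.3])  # coefficients on [x_{t-2}, x_{t-1}]
x = np.zeros(200)
for t in range(2, len(x)):
    x[t] = true_w @ x[t - 2:t] + 0.1 * rng.standard_normal()

w_hat = fit_ar_least_squares(x, d=2)
print("estimated coefficients:", w_hat, "| next-step forecast:", predict_next(x, w_hat))
```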

UNDERSTANDING TRANSFORMERS FOR TIME SEIRES FORECASTING: A CASE STUDY ON MOIRAI

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper provides a theoretical analysis of how transformers can perform well for time series forecasting, focusing particularly on the set of time series modeled by AR processes. First, the authors show that transformers with linear attention can approximate AR regression in the univariate as well as the multivariate case (based on the any-variate attention in MOIRAI). Second, the authors use Dobrushin's condition to show a generalization bound explaining why pretraining on diverse datasets works. Finally, the authors run experiments on synthetic AR data by first pretraining on generated AR data and then comparing performance on in-distribution and out-of-distribution AR processes against least-squares regression.

Strengths:
The paper is mostly easy to follow. It extends the ideas in Bai et al. to show that a transformer can approximate any AR model, and it also shows the generalization ability of time series pretraining. It also conducts a few controlled experiments to validate the findings.

Weaknesses:
1. Writing. I can follow the paper, but there are quite a few typos here and there; even in the title, "Time Seires" should be "Time Series".
- I think the following description of iTransformer might be a bit misleading: "iTransformer (Liu et al., 2023) propose to use a pooling technique to reshape arbitrary number of covariates into a unified size."
- Lines 298-299: "." instead of ",": "For each time series, we encode it with any-variate encoding into an input matrix denoted as H ∈ R^{D×N}, We define each pretraining sample as z_j := (H_j, y_j), where y_j = x^j_{T+1}."
- This seems to be an overstatement; I believe there are other transformer models or TSFMs that can handle an arbitrary number of covariates: "Note that in the multi-variate case, we only focus on MOIRAI as it is the only transformer-based model that is compatible with arbitrary number of covariates."
- Move Figure 1 closer to the experiment section.
- Line 143: w* = (w_1, ..., w_j) ∈ R^{qd} should be (w_1, ..., w_d)?
2. The proof on approximating any AR model with a transformer seems to be a straightforward application of the proof in Bai et al., as the main difference is the construction of the feature and label pairs. However, I am not very familiar with the related literature; the authors can correct me if I am wrong.

Questions:
1. I don't think the setting in Appendix F6 (Evaluation on Real-World Datasets) is appropriate. The window size seems too small for ETTm1 and ETTh1 to cover a whole period of the time series. This might be why MSE increases with window size in Figure 3 (left), which is counter-intuitive.
2. Line 80: "We impose periodic boundary conditions for the negative index, i.e., x_{-1} = x_T." Why do we need this? It seems unconventional.
3. A recent paper, "Why Do Transformers Fail to Forecast Time Series In-Context?" (https://arxiv.org/pdf/2510.09776), argues that a simple linear model has direct access to the full history of the time series, while an LSA-based model must compress this entire history into the fixed dimensions of the Q/K/V matrices. Based on this observation, they show that for any AR(p) process, an optimally parameterized LSA model cannot achieve a lower expected MSE than the classical optimal linear predictor. Can you compare your analysis with theirs and share your thoughts on why your findings differ?
4. Section 4 is a bit difficult to follow. Consider adding or moving more insights and intuitions earlier in the paper.

EditLens Prediction: Fully human-written
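
To make the synthetic AR pretraining setup discussed in this review concrete, here is a minimal sketch (an assumed protocol, not the submission's data-generation code) of producing stationary AR(p) series by rejection-sampling coefficients whose companion matrix is stable; the lag order, series length, and noise scale are illustrative.

```python
# Minimal sketch of generating synthetic stationary AR(p) pretraining series:
# sample coefficients, keep them only if the companion matrix has spectral
# radius below 1 (with a margin), then simulate after a burn-in period.
# Illustrative only; not the submission's data-generation protocol.
import numpy as np

def sample_stable_ar_coeffs(p: int, rng: np.random.Generator, max_tries: int = 1000) -> np.ndarray:
    """Rejection-sample AR(p) coefficients (a_1, ..., a_p) with a stable companion matrix."""
    for _ in range(max_tries):
        a = rng.uniform(-1.0, 1.0, size=p)
        companion = np.zeros((p, p))
        companion[0, :] = a
        companion[1:, :-1] = np.eye(p - 1)
        if np.max(np.abs(np.linalg.eigvals(companion))) < 0.95:  # margin from the unit circle
            return a
    raise RuntimeError("no stable coefficient vector found")

def simulate_ar(a: np.ndarray, T: int, noise_std: float, rng: np.random.Generator) -> np.ndarray:
    """Simulate x_t = a_1 x_{t-1} + ... + a_p x_{t-p} + eps_t, discarding a burn-in prefix."""
    p = len(a)
    burn = 10 * p
    x = np.zeros(T + burn)
    for t in range(p, T + burn):
        x[t] = a @ x[t - p:t][::-1] + noise_std * rng.standard_normal()
    return x[burn:]

rng = np.random.default_rng(1)
pretraining_series = [
    simulate_ar(sample_stable_ar_coeffs(p=3, rng=rng), T=256, noise_std=0.1, rng=rng)
    for _ in range(4)
]
print([series.shape for series in pretraining_series])
```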

UNDERSTANDING TRANSFORMERS FOR TIME SEIRES FORECASTING: A CASE STUDY ON MOIRAI

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper develops a theoretical lens on transformer models for time-series forecasting, with a focus on MOIRAI. It claims three main contributions: (i) existence proofs that transformers can implement, via in-context learning (ICL), least-squares AR regression on univariate series and, via MOIRAI's "any-variate" mechanism, extend to multivariate AR with an arbitrary number of covariates; (ii) generalization bounds for pretraining under Dobrushin's condition, yielding error decaying on the order of $1/\sqrt{nT}$; and (iii) experiments (synthetic plus some real-world) showing that prediction error decreases with longer lookbacks, consistent with the theory. The motivation is to replace architectural heuristics with a principled understanding of why transformer time-series foundation models (particularly MOIRAI) work well and handle variable covariate counts.

Strengths:
In my understanding, this paper draws a new connection between transformer ICL and AR regression for time series, arguing that MOIRAI's any-variate attention and concatenation scheme enable automatic AR dimensionality selection across variable covariate sets. This is a useful theoretical complement to recent empirical TSFMs. The pretraining generalization analysis under non-IID dependence (via Dobrushin) is also timely, given TSFM pretraining on heterogeneous corpora. The positioning against prior ICL theory (largely fixed-dimensional, next-token settings) and time-series FM work is appropriate; the contribution is incremental on the ICL side but significant for time series, where clear theory is scarce.
- Bridges ICL theory to time-series AR regression; explains MOIRAI's ability to handle arbitrary covariate counts.
- The non-IID pretraining generalization bound under Dobrushin's condition is relevant to TSFM pretraining.
- Clear limitations section; transparent about ReLU vs. softmax attention and the AR-only scope.

Weaknesses:
The empirical section aims to validate theoretical trends rather than chase SOTA: synthetic AR data confirms that error decreases with longer context and that pretrained MOIRAI adapts across (d, q); there is a limited real-world section in the appendix. I would like to see stronger baselines beyond LSR (e.g., well-tuned ARIMA/ETS, a simple RNN or temporal ConvNet) and an ablation isolating the role of the any-variate bias terms in practice. Also, reporting statistical variability (multiple seeds) and calibration (since pretraining uses MSE) would strengthen the claims.
- Empirical findings could be extended: limited real data, modest baseline coverage, no variance/calibration analysis.
- The Dobrushin assumption is nontrivial; guidance on when it holds in common TS domains would help external validity.
- Several results rest on formatting assumptions and constructed sequences; practical robustness across typical TS preprocessing is less clear.

Questions:
- Can you characterize classes of TS (e.g., ARMA, VAR with certain stability) that satisfy Dobrushin's condition and provide diagnostics to check it in practice?
- How sensitive are your AR-via-ICL constructions to positional encoding choices and patching (MOIRAI uses patching by default)? Can you provide ablations?
- Is it possible to add stronger empirical baselines and calibration (e.g., CRPS or PIT histograms) on real datasets to complement the theory? (A sample-based CRPS estimator is sketched after this review.)
- Beyond AR, can your constructions cover state-space or seasonal/trend components? Even partial results would broaden scope.
- Where does the any-variate bias (u₁, u₂) matter most? An ablation isolating those terms would be informative.

EditLens Prediction: Fully AI-generated
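
On the calibration metric mentioned above, here is a minimal sketch of the standard sample-based CRPS estimator, CRPS(F, y) ≈ E|X − y| − 0.5·E|X − X′| with X, X′ drawn from the forecast distribution; the Gaussian forecast samples below are purely illustrative.

```python
# Minimal sketch of the sample-based CRPS estimator suggested in the review:
# CRPS(F, y) ≈ mean|X - y| - 0.5 * mean|X - X'| over forecast samples X, X'.
# The toy Gaussian forecast samples are illustrative, not the paper's output.
import numpy as np

def crps_from_samples(samples: np.ndarray, y: float) -> float:
    """Empirical CRPS for a single observation y given an ensemble of forecast samples."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term1 - term2)

rng = np.random.default_rng(2)
forecast_samples = rng.normal(loc=0.2, scale=0.5, size=500)  # toy probabilistic forecast
print("CRPS vs. observation y=0.0:", crps_from_samples(forecast_samples, y=0.0))
```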