ICLR 2026 - Reviews


Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 4.00 | 4.00 | 8008 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (75%) | 5.33 | 3.67 | 3206 |
| Total | 4 (100%) | 5.00 | 3.75 | 4406 |
Reviews
Zero-shot Forecasting by Simulation Alone

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

**Summary.** This paper introduces SarSim0, a novel, scalable, and computationally efficient pipeline for generating synthetic time series data for zero-shot forecasting. The core of the simulator is the classical SARIMA statistical model, but with a key innovation: instead of sampling unstable model coefficients, it samples stable poles directly from the characteristic polynomial's stability region, guaranteeing well-behaved series. The pipeline enriches these base signals through a compositional "SARIMA-2" scheme for multi-seasonality and adds realism with heavy-tailed noise models to capture burstiness. The authors demonstrate that training standard neural architectures (NBEATS, PatchTST) exclusively on this synthetic data, generated on-the-fly, achieves state-of-the-art zero-shot performance on established benchmarks. Notably, these relatively small models outperform massive, real-data pretrained foundation models and, in many cases, even surpass the forecasting accuracy of the "teacher" model (AutoARIMA) from which the data-generating process is derived.

**Strengths.**

* **Principled Solution to the Core Technical Problem:** The primary strength of this work is its novel and principled solution to the instability problem in autoregressive model simulation. By shifting from sampling coefficients to directly sampling the poles of the transfer function within the unit circle (Section 4.1), the authors guarantee the generation of stable, non-divergent time series. This is a technically sound and elegant idea that makes the powerful but notoriously fragile SARIMA framework a viable engine for large-scale data generation. The resulting speed and on-the-fly capability (Figure 4) represent a significant practical advantage over slower, more complex methods like kernel-based synthesis.
* **State-of-the-Art Zero-Shot Performance from Synthetic Data Alone:** The paper's empirical results are strong and challenge a key assumption in the field. Table 1 shows that a standard PatchTST model trained solely on SarSim0 data not only outperforms prior synthetic-only methods by a massive margin but also achieves the best overall scores on the GIFT-Eval benchmark, surpassing large foundation models like Chronos, MOIRAI, and TimesFM that were pre-trained on vast amounts of real data. This is a landmark result, demonstrating for the first time that a purely synthetic corpus can be a viable—and even superior—substitute for real-world data in zero-shot forecasting.
* **Demonstration of Emergent Generalization ("Student-Beats-Teacher"):** A deep and significant finding is that neural models trained on SarSim0 data consistently outperform their own data-generating "teacher," AutoARIMA, on the GIFT-Eval benchmark. This is not trivial. It suggests that by learning from a vast and diverse *distribution* of SARIMA processes, the neural network is not merely fitting the teacher's behavior but is learning a more fundamental and robust representation of time series dynamics. This emergent capability is a powerful argument for the synthetic-first pre-training paradigm.
* **Rigorous Methodology:** The authors demonstrate a commitment to rigorous, leakage-free evaluation. They operate under a strict zero-shot protocol and provide a detailed breakdown in Appendix C (Table 5) of how prior foundation models violate this protocol by pre-training on evaluation data. This meticulous approach adds significant credibility to their claims and serves as a valuable contribution to the community by highlighting pervasive issues in model evaluation. The ablation study (Table 2) is also effective in validating the importance of the SARIMA-2 and Noiser components.

**Weaknesses.** In my opinion, the paper's reliance on the SARIMA framework introduces inherent limitations, and the scope of its claims could be tempered by a more nuanced discussion of the benchmarks and the simulator's own complexity.

1. **Inherent Linearity of the SARIMA Core:** The SARIMA model, which forms the backbone of the simulator, is fundamentally a linear process model. While the paper adds complexity via superposition and non-Gaussian noise, it cannot natively generate time series with core non-linear dynamics, such as regime switching, threshold effects, or volatility clustering (GARCH-like effects). Many real-world series, particularly in finance and econometrics, are driven by such phenomena. The paper's success on the chosen benchmarks may indicate that these benchmarks are dominated by seasonal/trend components, but the simulator's applicability to more complex, non-linear domains remains an open question.
2. **Opaqueness of the Simulator's "Meta-Tuning":** The SarSim0 pipeline itself has a large number of hyperparameters that define the distribution of generated series: the ranges for AR/MA/seasonal orders, the maximum pole radii, the parameters for seasonality pairs, and the distributions for the Noiser module (Table 8). The paper presents a configuration that works exceptionally well, but it provides no insight into how this configuration was determined. The performance of the downstream "student" models is critically dependent on the quality and diversity of the "teacher's" curriculum, and this meta-level tuning process is a crucial, yet undiscussed, part of the methodology.
3. **Potential for Overgeneralization from Current Benchmarks:** The evaluation is performed on the M-Series and GIFT-Eval benchmarks, which are excellent but are heavily composed of business, economic, and demographic time series. These are domains where SARIMA-like models have historically excelled. It is unclear whether the remarkable performance would transfer to domains with fundamentally different characteristics, such as chaotic physical systems, high-frequency financial data, or irregular biological signals (e.g., ECG). The simulator's strong inductive bias is a feature, but it may also be a limitation.

**Questions.** Based on my concerns listed above, I pose the following questions to the authors:

* **Question 1:** The SarSim0 pipeline is built on a SARIMA core, which is fundamentally a linear process model. How do you see this limiting the simulator's ability to generate data with complex non-linear dynamics (e.g., regime shifts, GARCH effects), and could this explain why models pre-trained on it might struggle on certain real-world domains not covered by the benchmarks?
* **Question 2:** Table 8 shows a large number of hyperparameters for the SarSim0 pipeline. How were these settings chosen? Could you provide an ablation or sensitivity analysis on key parameters, such as the maximum pole radius (`rmax`) or the range of AR/MA orders, to demonstrate the robustness of the approach?
* **Question 3:** The "student-beats-teacher" finding is compelling. However, the effect appears mixed on the M-Series, where AutoARIMA's MASE is stronger than NBEATS-SarSim0. What properties of the M-Series datasets might explain why AutoARIMA remains more competitive there, and does this suggest limits to the emergent generalization?
* **Question 4:** The ablation study shows that removing the Noisers can sometimes *improve* performance on the M-Series. This suggests a mismatch between the noise model and the real-data characteristics. Does this finding point to a need for more adaptive or domain-specific noise generation, and how might that be incorporated into the pipeline?
* **Question 5:** The current work focuses on univariate time series. What are the primary conceptual and technical hurdles to extending the SarSim0 methodology, particularly the stable pole sampling approach, to the multivariate case where capturing cross-series dependencies (e.g., via a VARIMA framework) is essential?
* **Question 6:** The UMAP visualization in Figure 3 shows impressive overlap. However, are there any noticeable "holes" or regions in the real-data embedding space (red) that the synthetic data (blue) fails to cover? What kind of real-world patterns might these correspond to, and do they align with the known limitations of SARIMA models?

EditLens Prediction: Fully AI-generated
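The stable-pole sampling step highlighted in the first strength of the review above can be sketched in a few lines of NumPy. This is an editorial illustration rather than the authors' pipeline: it covers only the plain AR part (no MA, differencing, or seasonal terms), and names such as `sample_stable_ar_coeffs` and `r_max` are assumptions made for the sketch.

```python
import numpy as np

def sample_stable_ar_coeffs(order, r_max=0.95, rng=None):
    """Draw AR poles inside the unit circle and convert them to
    characteristic-polynomial coefficients, so the resulting AR(p)
    process is stationary by construction."""
    rng = np.random.default_rng(rng)
    poles = []
    # Complex-conjugate pairs keep the polynomial coefficients real.
    while len(poles) < order - (order % 2):
        radius = r_max * np.sqrt(rng.uniform())   # stay strictly inside |z| < r_max
        angle = rng.uniform(0.0, np.pi)
        pole = radius * np.exp(1j * angle)
        poles.extend([pole, np.conj(pole)])
    if order % 2:                                 # one real pole for odd orders
        poles.append(r_max * rng.uniform(-1.0, 1.0))
    # Roots -> monic polynomial z^p - phi_1 z^{p-1} - ... - phi_p
    phi = -np.poly(poles)[1:]
    return np.real(phi)

def simulate_ar(phi, n=512, sigma=1.0, rng=None):
    """Simulate x_t = phi_1 x_{t-1} + ... + phi_p x_{t-p} + sigma * eps_t."""
    rng = np.random.default_rng(rng)
    p = len(phi)
    x = np.zeros(n + p)                           # p zeros as initial burn-in state
    for t in range(p, n + p):
        x[t] = phi @ x[t - p:t][::-1] + sigma * rng.standard_normal()
    return x[p:]

series = simulate_ar(sample_stable_ar_coeffs(order=4, rng=0), rng=0)
```

Because every sampled pole lies strictly inside the unit circle, the generated series cannot diverge, which is the property the review credits with making large-scale on-the-fly generation feasible.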
Zero-shot Forecasting by Simulation Alone

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

**Summary.** This paper introduces SarSim0, a fast synthetic time-series generator. The authors train forecast models entirely on simulated data and then test their zero-shot performance on real datasets from the GIFT-Eval and M4 benchmarks. The SarSim0 pipeline first samples stable SARIMA processes by sampling AR/MA poles inside the unit circle, then composes bi-seasonal structure via an additive/multiplicative SARIMA-2 modulation, and finally overlays level-dependent heavy-tailed noise (Poisson, generalized-Gamma, lognormal) to capture burstiness and intermittency. Because the authors' approach generates data on-the-fly at scale, compact backbones (N-BEATS, PatchTST, MLP) trained solely on a large set of SarSim0 time series achieve strong cross-frequency zero-shot accuracy, matching or surpassing larger pretrained models like Chronos. The authors also show that pretraining with their synthetic data leads to stronger results on GIFT-Eval than pretraining with earlier synthetic data sources (PFN and KernelSynth). When comparing to fully-trained models, the authors find that models pretrained on SarSim0 data outperform direct forecasts using AutoARIMA, a phenomenon the authors refer to as "student beats teacher" generalization.

**Strengths.** The SARIMA-2 approach is an appropriate generating model that describes many practical time series. As noted by the authors, many demand or utility time series benchmarks consist of a slow process (like inflation in a financial time series) modulating a fast process (like seasonal demand). Within this generating model class, the authors make appropriate choices to enforce stability in the AR dynamics by sampling poles inside the unit circle. This likely helps preserve diversity in the generated time series. The evaluations and experiments are appropriate in scope. The authors use GIFT-Eval, which is the current leading benchmark for zero-shot evaluations. They also perform ablations of each component of their generator, allowing them to identify that SARIMA-2 (bi-seasonality) is a key driver of their approach. I also like the experiment design of comparing fully-trained NBEATS/PatchTST against variants pretrained on prior synthetic data generators (KernelSynth/PFN). The authors give a nice demonstration that their generated data co-clusters with real datasets from the M4 competition. Because nonlinear embeddings are neighbor-based, this helps show that their data convincingly "passes" as real time series data.

**Weaknesses.**

**Novelty.** This is not the first paper to train a zero-shot forecast model purely on simulation data and report strong results. ForecastPFN and TabPFN first used this approach, while Mamba4Cast uses an SSM on PFN synthetic data to achieve strong results on the original GluonTS datasets. While the idea itself is not the first of its kind, the arguments in favor of this paper could be (1) the particular choice of synthetic data generation, and (2) the empirical results. I do not currently feel that either is strong enough to recommend acceptance.

**Ad hoc data augmentation.** Regarding the data generation procedure, I consider the approach to be sufficiently ad hoc that it's hard to see why a SARIMA approach would be universally preferable. There are no theoretical guarantees or general bounds establishing that this approach will always produce more diverse or realistic data than prior methods. Prior approaches fall broadly into PFN and its variants (as in ForecastPFN/TabPFN), augmentations thereof (as in Mamba4Cast, which introduces spike injection and damping augmentation), or Gaussian-process-based methods like KernelSynth used by Chronos. There are also structured approaches like CauKer, which combine Gaussian processes with a structured causal model. I can see the authors' ARIMA-based procedure as helping cover data types not covered by other methods; for example, long history dependence when the ARIMA process has a long effective memory due to slow damping. However, it's hard to argue that any of these dataset generators are universally preferable.

**Experiments.** The experiments show that models like NBEATS or PatchTST become competitive with models like Chronos when pretrained on the SarSim0 dataset. I do not consider these results strong enough to justify selecting this data generator over others, because I think there is a bit of a "no free lunch" here: the authors pick a generator that handles multi-seasonality well, and it turns out this is useful for M4 and GIFT-Eval, which have many time series that fall into that class. But had the authors picked a multi-scale or trend-dominated test set, a different data generation heuristic could have been preferable. Since there are infinite choices that could be made during the data generation procedure, the overperformance of this particular set of choices mainly tells us about the properties of GIFT-Eval and M4.

**Emergence.** I do not agree with the "emergent effect" argument. A neural model pretrained on SARIMA datasets outperforming AutoARIMA does not imply emergent effects. Firstly, SARIMA-2 does not fall within the model class directly expressible by AutoARIMA. Secondly, initialization and hyperparameters can prevent a model like AutoARIMA from exactly fitting the true generator.

**Questions.**

1. The unit-circle pole requirement acts as a stability condition on the AR process generating the data. Are there any mathematical conditions or theoretical guarantees on the distributional properties of the resulting time series?
2. Can you directly quantify the verisimilitude and diversity of your generated data, particularly compared to prior PFN or structured causal model approaches? For example, since you have a UMAP, you could calculate a silhouette score. A statistical significance test would add depth to this comparison.
3. Are you fine-tuning Chronos, Moirai, etc. on the SarSim0 dataset? I could not tell whether these experiments are what is shown in the first rows of Table 1, or whether that shows the base performance of those models pretrained on their original pretraining splits.
4. I am confused by Table 1. Are the three models (lines 387-389) fully trained on the context? And why are Chronos-Base and others not marked as zero-shot models?

EditLens Prediction: Fully human-written
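Question 2 of this review suggests quantifying real-versus-synthetic overlap in the UMAP plot with a silhouette score. A minimal sketch of that diagnostic, assuming the 2-D embeddings are already available as NumPy arrays (`embed_real` and `embed_synth` are hypothetical names, not artifacts of the paper), could look like this:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def real_vs_synth_silhouette(embed_real, embed_synth):
    """Silhouette score over two clusters: real (label 0) vs synthetic (label 1).
    Scores near 0 mean the synthetic points are hard to separate from the real
    ones (good coverage); scores near 1 mean they occupy distinct regions."""
    points = np.vstack([embed_real, embed_synth])
    labels = np.concatenate([np.zeros(len(embed_real)), np.ones(len(embed_synth))])
    return silhouette_score(points, labels)
```

A permutation test on this score (recomputing it under shuffled real/synthetic labels) would provide the statistical significance the reviewer asks for.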
Zero-shot Forecasting by Simulation Alone

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

**Summary.** The authors propose a new pipeline to generate synthetic data for the pre-training of time-series foundation models. They show that models trained purely on their synthetic data can outperform leading time-series foundation models trained on real data.

**Strengths.** Being able to generate high-quality synthetic data that fully replaces real data is a big deal. This has the potential to overcome the scarcity of real-world time-series data and allow scaling time-series foundation models beyond the current billion-parameter regime.

**Weaknesses.** There are strong inductive biases built into the SarSim0 process, such as bi-seasonality, which are well suited to the tasks in the GIFT-Eval and M-Series benchmarks but may fail in other application domains, especially in scientific machine learning. See my questions below.

**Questions.**

* How well do models trained on your synthetic data perform at predicting dynamical processes not described by ARIMA processes? For example, deterministic chaos.
* I am asking this because SarSim0 has a very strong inductive bias towards time series with seasonality or bi-seasonality, while chaotic time series are aperiodic. Can the authors comment on how models trained with SarSim0 fare with aperiodic time series in general?
* Have you tried training a model with both SarSim0 data and real data? Will it outperform models trained on either synthetic data or real data alone?

EditLens Prediction: Fully human-written
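The deterministic-chaos question above can be probed with a purely deterministic, aperiodic test series; a logistic-map generator is one minimal example. The split sizes and the idea of feeding it to a SarSim0-pretrained forecaster are hypothetical, not an experiment reported in the paper.

```python
import numpy as np

def logistic_map(n=1024, r=3.9, x0=0.3):
    """Deterministic chaotic series x_{t+1} = r * x_t * (1 - x_t); aperiodic,
    with no seasonal structure for a SARIMA-style prior to latch onto."""
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        x[t] = r * x[t - 1] * (1.0 - x[t - 1])
    return x

series = logistic_map()
context, target = series[:-128], series[-128:]
# Hypothetical probe: give `context` to a SarSim0-pretrained forecaster and
# score its 128-step prediction against `target` (e.g. with MASE).
```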
Zero-shot Forecasting by Simulation Alone

Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

**Summary.** The authors propose a scalable synthetic time-series generation pipeline based on a seasonal autoregressive integrated moving average (SARIMA) model. Their proposed simulator follows a three-step procedure that samples traces from the "polynomial stability region", then combines multiple paths into rich multi-seasonality traces, then adds noise to capture burstiness and intermittency. They show that their simulator allows for strong zero-shot generalization.

**Strengths.**

- The work is well-motivated. The authors point out how training with real series can be limited by licensing barriers, data scale, domain, etc., and that training with synthetic data offers unique levers that can be controlled.

**Weaknesses.**

- There is very limited analysis conducted with the experiments. First, GIFT-Eval allows for easy analysis of performance stratified by domain, frequency, term length, variate type, etc. These are more important for understanding the limitations and strengths of the model than the aggregate score on GIFT-Eval.
- To further improve the evaluation, I would suggest adding one of the foundation model baselines (e.g. Chronos) trained from scratch on KernelSynth/ForecastPFN/SarSim0. This would allow for an apples-to-apples comparison with the pretrained Chronos.
- There is no analysis of how this dataset can be used to complement other datasets, which is important to understand. I suggest the authors combine GIFT-Eval train (or a different training set) with this training dataset.

**Final verdict:** I give a score of 6 as I see this as a paper that can be accepted. Conditional on addressing the weaknesses and answering my questions, I might consider upgrading my score to 8 (if I am satisfied with the answers).

**Questions.**

1. Regarding the second set of rows in Table 1: what are NBEATS, PatchTST, and AutoARIMA trained on here? This should be made clear in the table.
2. Suggestion: can the authors bold the best numbers in each column, say in Tables 1, 2, etc.? It is difficult to spot which models are the best.

EditLens Prediction: Fully human-written