TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary: The authors propose wrapping patch-based time series forecasting foundation models (focusing on Time-MoE) in a state-space model encoder-decoder architecture that reduces the number of patches passed to the inner model. The experiments provided suggest that this does not significantly affect predictive metrics, but speeds up pretraining.
Strengths:
- The idea of wrapping a transformer model in an SSM encoder-decoder structure to speed it up is well-motivated and could have applications in other areas that use transformers.
- The paper is clear and well-written.
- Thorough ablations are provided.
Weaknesses:
- The time series forecasting models cited and evaluated against are out of date. Baseline evaluations are taken from Time-MoE (ICLR 2025) and have not been updated to include more recent models, such as Sundial [1] and Moirai-MoE [2], that have advanced the state of the art in the meantime. As such, this paper's claim of state-of-the-art performance is not demonstrated. Posting models on the GIFT-Eval benchmark [3] has become common practice in this area, and many more recent performant models can be found there. (Ideally, the Chronos evaluations would also be updated to at least use Chronos-Bolt, released Nov. 2024.)
- The specific focus on Time-MoE in the method design and evaluations limits the impact of the paper, given the advances in the field since its publication. Without further discussion and evaluation, it's not necessarily clear that this method could be applied on top of more recent models and perform as well.
- Only five evaluation datasets are used, which is extremely limited and not in line with recent papers in this area. Even among earlier papers, MOMENT, TimesFM, Moirai, and Chronos all use dozens of datasets in their evaluations. Again, GIFT-Eval is an example of a large evaluation set that has become commonly used.
- A compelling explanation is lacking for placing patch boundaries based on the difference between neighbouring samples. For a seasonal input, this means focusing on the areas between peaks and troughs, and it is not clear why those regions should be treated as more relevant. Empirical justification may be all that's available, but the paper would be strengthened by some insight here (see the sketch after the references below).
- It's not clear to me that it's very impactful to speed up pretraining of time series foundation models without improving other aspects of performance, given that they tend to be relatively cheap to train as far as foundation models go, and the zero-shot capability means pretraining only has to be done once. One possible benefit would be allowing more scaling of model size, but the results suggest that doing so does not help performance.
[1] Liu et al. "Sundial: A family of highly capable time series foundation models" ICML 2025.
[2] Liu et al. "Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts" ICML 2025.
[3] https://huggingface.co/spaces/Salesforce/GIFT-Eval
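To make the seasonal-input point above concrete, here is a minimal numpy sketch (not the authors' code; an arbitrary percentile cut stands in for the paper's tuned threshold) showing that a first-difference boundary criterion on a pure sine fires at the zero crossings between peaks and troughs, where the one-step change is largest:

```python
import numpy as np

# A pure seasonal signal: |x_t - x_{t-1}| peaks mid-slope (between peaks
# and troughs) and vanishes at the extrema.
t = np.arange(200)
x = np.sin(2 * np.pi * t / 50)

diff = np.abs(np.diff(x))
# Place a patch boundary wherever the one-step change is large
# (an 80th-percentile cut stands in for a tuned threshold).
boundaries = np.where(diff > np.quantile(diff, 0.8))[0]

# Phase of the sine at each boundary: values cluster near 0, pi, and 2*pi
# (the zero crossings), not near pi/2 or 3*pi/2 (the peaks and troughs).
phase = (2 * np.pi * boundaries / 50) % (2 * np.pi)
print(np.round(np.sort(phase), 2))
```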
Questions:
- Ablations compare to using a fixed patch length of 4, but the dynamic patching seems to prefer patch length 2 in many cases. Have you evaluated a fixed patch length of 2? (Acknowledging that this reduces the speedup benefits.)
Fully human-written
---
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary: This paper proposes dynamic patching and a lightweight state-space encoder for time series forecasting. Experiments are conducted on five datasets under zero-shot and fine-tuning settings.
Strengths:
1. The paper is easy to follow.
2. Experiments are conducted on five commonly used datasets.
3. Both zero-shot and fine-tuning settings are evaluated.
Weaknesses:
1. The claimed main contribution is to "incorporate dynamic, content-aware patching". However, this has been well studied in time series foundation models; [1-3], for example, propose dynamic and/or content-aware patching. The claimed contribution and novelty are therefore limited.
[1] HDMixer: Hierarchical Dependency with Extendable Patch for Multivariate Time Series Forecasting. AAAI 2024.
[2] Irregular Multivariate Time Series Forecasting: A Transformable Patching Graph Neural Networks Approach. ICML 2024.
[3] LightGTS: A Lightweight General Time Series Forecasting Model. ICML 2025.
2. Key experimental comparisons against baselines [1-3] are missing. I briefly checked the fine-tuning setting, where this work performs worse than [3]; its efficiency is also worse than that of [3].
3. The authors "validate TimeSqueeze across diverse zero-shot forecasting benchmarks, achieving performance on par with state-of-the-art point embedding models while delivering up to 20× faster training and 10× faster inference". Why not also compare efficiency against patch-embedding models?
4. Does the improvement come from downsampling the pretraining data, which may reduce bias?
5. No code is provided for reproducibility.
Questions: See weaknesses.
Fully human-written
---
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary: TimeSqueeze presents a well-executed and impactful approach to efficient long-context time-series modeling via dynamic patching. Its hybrid design and strong empirical results make it a valuable contribution. However, the work could be strengthened by broader comparisons with adaptive compression methods and a more in-depth analysis of the learned representations.
Strengths:
1. The introduction of TimeSqueeze, which dynamically combines point-level fine-grained encoding with adaptive patch-level compression, is novel and well-motivated. The dynamic patching mechanism based on relative deviation effectively addresses the limitations of fixed-size patching and enables content-aware compression.
2. The paper demonstrates compelling efficiency gains (up to 20× faster training and 10× faster inference) while maintaining competitive forecasting performance with state-of-the-art point-embedding models like Time-MoE in both zero-shot and full-shot settings across multiple benchmarks.
3. The authors provide extensive experiments, including comparisons with strong baselines, detailed ablation studies, and analyses of the impact of pre-trained context length and compression rates, which convincingly validate the design choices and scalability of the proposed method.
Weaknesses:
1. While the paper compares with fixed-patching methods and point-embedding models, it does not include comparisons with other adaptive or learned compression strategies from recent literature (e.g., learned chunking or entropy-based methods), leaving the relative advantage of the proposed patching criterion less fully contextualized.
2. Although the paper shows that longer pre-trained contexts improve performance, the analysis is limited to performance curves without deeper investigation into what temporal structures or dependencies are better captured, or how the dynamic patching interacts with long-range modeling.
Questions: Please see weaknesses.
Fully AI-generated
---
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary: The paper proposes TimeSqueeze, a hybrid forecasting architecture that fuses point-wise and dynamic, content-aware patch-wise representations for efficient time series forecasting, particularly in long-context regimes. TimeSqueeze introduces a lightweight state-space encoder to extract fine-grained features, followed by an adaptive patching mechanism that groups time points into variable-sized patches based on local signal complexity. This yields a variable-resolution sequence processed by a Transformer (MoE) backbone. Experiments demonstrate substantial improvements in computational efficiency versus point-embedding baselines, while maintaining strong (often comparable) forecasting accuracy across zero-shot and full-shot scenarios on established long-range benchmarks.
Strengths:
1. TimeSqueeze innovatively combines state-space encoders with adaptive dynamic patching for time series, addressing a well-known bottleneck of fixed patching and inefficient context scaling in Transformer models. This hybridization enables granular feature preservation where needed and aggressive compression elsewhere, a nuanced but underexplored trade-off.
2. The dynamic patching strategy is clearly described and mathematically motivated (see the explicit formulation of the patch boundary condition on Page 4). The model architecture (Figure 1) is systematically illustrated, showing integration points for patching, unpatching, and multi-horizon prediction.
3. Numerous ablation studies probe each critical component (patching, encoder type, positional encoding), and visualizations (Figures 8-11) clarify how patch sizes adapt responsively to data domains.
Weaknesses:
1. While Section 3 briefly gestures at point-wise decomposability, the model's explicit capability for multivariate or exogenous-feature forecasting is only cursorily addressed, with no empirical analysis contrasting it with approaches like Crossformer or TimeXer. This reduces clarity on generality and limits the significance of the work, especially for real-world multivariate forecasting use cases.
2. Potential hyperparameter sensitivity: the choice of patching threshold $\tau$ is acknowledged as data-dependent and requiring tuning. However, the ramifications (e.g., stability, transferability across domains, optimal selection strategies) are not robustly quantified, raising concerns about practical usability and the risk of model brittleness. The only discussion occurs at a high level in the Conclusion; a more rigorous empirical treatment (e.g., full-sweep results for a range of $\tau$ on multiple datasets) is absent.
3. While the patching function is presented cleanly, some important aspects remain underspecified: for instance, how boundaries are handled when signals have sudden global changes, or how minimum/maximum patch sizes interact with highly nonstationary regions (see dynamic patching on Page 4 and the visualizations in Figures 8-10; a sketch of one plausible grouping scheme follows these weaknesses). Additionally, the role of variable-length unpatching in the SSM/Transformer composition warrants deeper theoretical and implementation clarity, and the claim of strict causality preservation could be accompanied by formal or simulation evidence to eliminate any ambiguity.
4. While Figure 5 (Appendix D) plots MSE versus compression rate, there is little theoretical guidance or modeling to explain these trends, nor any discussion of the limits of the dynamic patching regime in catastrophic or highly non-stationary settings.
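To make the underspecification in point 3 concrete, here is one plausible causal grouping rule, a minimal sketch under assumed semantics rather than the paper's algorithm; `p_min`, `p_max`, and the greedy close-on-boundary policy are illustrative:

```python
import numpy as np

def group_patches(boundary, p_min=1, p_max=8):
    """Greedily grow patches left to right; close a patch when a boundary
    flag is set (once the patch has at least p_min points) or when it
    reaches p_max points. Only positions <= t are used, so it is causal."""
    patches, start = [], 0
    for t in range(len(boundary)):
        size = t - start + 1
        if (boundary[t] and size >= p_min) or size == p_max:
            patches.append((start, t + 1))  # half-open interval [start, t+1)
            start = t + 1
    if start < len(boundary):               # flush the trailing remainder
        patches.append((start, len(boundary)))
    return patches

# Boundary flags fire around an abrupt change at t = 10: small patches
# form there, while the stable segments are clamped to p_max.
flags = np.zeros(20, dtype=bool)
flags[[9, 10, 11]] = True
print(group_patches(flags, p_min=2, p_max=8))
# -> [(0, 8), (8, 10), (10, 12), (12, 20)]
```

How patches are clamped and how the trailing remainder is handled are exactly the kinds of detail the weakness above asks the authors to specify.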
Questions:
1. How robust is the threshold $\tau$ selection across datasets with highly variable information density? Are there scenarios where patch boundary assignments lead to overcompression or undercompression? Please provide more empirical analysis, including out-of-domain or adversarial examples.
2. Does the architecture generalize robustly to multivariate or exogenous-variable tasks, as per the settings in Crossformer or TimeXer? What adjustments (if any) are required for such use cases?
3. Are there practical deployment scenarios (e.g., real-time forecasting in resource-constrained environments) where the patch boundary computation or unpatching steps impose bottlenecks? What is the end-to-end wall-clock speedup, including patching/merging steps?
Fully AI-generated
---
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary: This paper proposes TimeSqueeze, a dynamic patching architecture for efficient long-context time series forecasting. The model addresses the trade-off between fine-grained temporal resolution and computational scalability. TimeSqueeze introduces a two-stage hybrid representation:
1. A lightweight state-space encoder extracts fine-grained temporal features from the full-resolution time series.
2. An adaptive patching module dynamically adjusts patch sizes, assigning smaller patches to regions with complex temporal variations and larger ones to stable segments.
This variable-resolution representation allows the Transformer backbone to process fewer tokens without losing critical information, improving both efficiency and accuracy.
Strengths:
1. The methodology is well-motivated and clearly integrated into the forecasting framework.
2. The paper is clearly written and conceptually intuitive.
Weaknesses:
1. The experimental validation is limited, as the evaluations are conducted only on the Time-MoE architecture, which restricts the generality of the conclusions.
2. The overall architecture of TimeSqueeze largely builds upon the Time-MoE framework (equations 2-4 are directly inherited from the original Time-MoE paper), and the idea of dynamic patching has already been explored in several prior works.
3. The efficiency comparison with Time-MoE is not entirely fair, since Time-MoE is intentionally designed to maximize model capacity by using point-wise rather than patch-based embeddings. Therefore, a more appropriate efficiency analysis should include lightweight baselines such as SparseTSF, TimeBase, or DLinear.
4. The full-shot forecasting experiments lack strong state-of-the-art baselines such as CycleNet, TQNet, TimeBase, or DUET, which makes it difficult to assess the claimed superiority of the proposed model.
Questions: Have you considered extending the TimeSqueeze architecture to handle multi-dimensional time series data, such as those involving spatio-temporal correlations?
Fully AI-generated
---
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary: This paper proposes a hybrid forecasting architecture that combines the strengths of point-wise and patch-based embeddings through dynamic time series compression. It comprises a lightweight state-space encoder that uses point-wise embeddings to process full-resolution time series and extract fine-grained temporal features. An adaptive patching module then prunes these features using variable-sized patches, assigning smaller patches to information-rich regions and larger patches to redundant segments.
Strengths:
S1. This paper presents a hybrid forecasting architecture that incorporates dynamic, content-aware patching for adaptive compression in time series.
S2. The experimental findings validate the computational efficiency of the proposed method.
Weaknesses:
1. Time series data often exhibit periodic and trend patterns. Relying solely on single-step differences between adjacent samples to determine boundaries may be insufficient for placing boundaries that respect periodicity or trend changes.
2. The patching mechanism determines boundaries by comparing the absolute difference between adjacent samples with the local average power within a sliding window. Could the authors clarify how this criterion effectively distinguishes information-rich from redundant regions? (An illustrative sketch of the criterion follows this list.)
3. The boundary selection depends on the design of the sliding window, yet the paper does not clearly specify whether the window is overlapping or non-overlapping.
4. Experimental results show that model performance decreases as the compression ratio increases, which may be due to excessive information loss caused by over-compression. Intuitively, using a smaller compression ratio might improve performance, but the paper does not provide corresponding experiments.
5. The current experimental results show limited forecasting performance, and the paper does not include comparisons with recent Time Series Foundation Models (TSFMs), such as Sundial, LightGTS, and VisionTS.
6. The paper lacks an analysis of patch distribution across different datasets. It would be valuable to examine how patch length, density, or boundary frequency vary among datasets with different statistical characteristics.
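For reference, points 2 and 3 concern a criterion of roughly the following shape. This is a minimal sketch under stated assumptions (a trailing, overlapping, causal window, and an RMS normalization for unit consistency), not the paper's implementation:

```python
import numpy as np

def boundary_flags(x, tau=0.5, window=16):
    """Flag a boundary at t when the one-step change is large relative to
    the average signal power in a trailing window. The window here is
    overlapping and causal; the paper leaves this choice unspecified."""
    flags = np.zeros(len(x), dtype=bool)
    for t in range(1, len(x)):
        lo = max(0, t - window)
        power = np.mean(x[lo:t] ** 2) + 1e-8      # local average power
        flags[t] = abs(x[t] - x[t - 1]) > tau * np.sqrt(power)
    return flags

# A flat segment followed by a step change: only the jump is flagged.
x = np.concatenate([np.ones(30), 3 * np.ones(30)])
print(np.where(boundary_flags(x))[0])             # -> [30]
```

Under this reading, the rule flags changes that are large relative to the recent signal scale, making it roughly amplitude-invariant; whether that suffices to separate information-rich from redundant regions is what point 2 asks the authors to clarify.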
Questions: See weaknesses.
Moderately AI-edited
---
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary: The paper proposes TimeSqueeze, a method for efficient long-sequence time series forecasting using dynamic patch interpolation. It learns to compress long input sequences into a small number of informative patches by interpolating around learnable query points with soft Gaussian weights. This reduces computational cost while maintaining forecasting accuracy. TimeSqueeze achieves strong results on standard benchmarks and can be plugged into existing transformer models.
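For orientation, here is a minimal sketch of the compression step as this summary characterizes it; the query positions, bandwidth, and shapes are illustrative assumptions, not taken from the paper, and in the model the queries would be learnable parameters:

```python
import numpy as np

def gaussian_compress(x, queries, sigma=2.0):
    """Compress a length-T series to K tokens: token j is a normalized
    Gaussian-weighted average of the inputs around query position q_j."""
    t = np.arange(len(x))                                   # (T,)
    w = np.exp(-(t[None, :] - queries[:, None]) ** 2 / (2 * sigma**2))
    w /= w.sum(axis=1, keepdims=True)                       # (K, T) soft weights
    return w @ x                                            # (K,) compressed tokens

x = np.sin(np.linspace(0, 4 * np.pi, 64))
queries = np.linspace(0, 63, 8)    # 8 query points spanning T = 64
print(gaussian_compress(x, queries).round(3))
```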
Strengths:
- The paper introduces TimeSqueeze, a novel module for time-series forecasting that performs dynamic patch interpolation to reduce sequence length while preserving important temporal structure.
- This adaptive downsampling technique, which uses learnable interpolation kernels, significantly reduces computational cost while maintaining competitive or superior forecasting accuracy.
- The approach is modular and model-agnostic, allowing it to be integrated into various transformer-based backbones without architectural changes.
- The authors provide extensive experimental validation across standard benchmarks, demonstrating clear gains in efficiency (speed and memory) alongside strong predictive performance.
Weaknesses:
- While the method is effective at reducing computational overhead, the paper lacks discussion of how the dynamic patching and interpolation process affects the interpretability or transparency of the learned representations.
- The paper does not offer a formal theoretical framework for understanding the trade-off between compression rate and information loss, especially in rapidly changing signals.
- The exclusion of non-transformer baselines, such as state-space or statistical hybrid models, weakens the comparative rigor of the evaluation.
Questions:
- Could the authors provide a qualitative or quantitative analysis of the learned patching structure, and whether it consistently adapts to different temporal patterns such as trends, seasonality, or sudden shifts?
- Is the model capable of handling irregularly sampled or incomplete time series, or does it require preprocessed, evenly spaced inputs for effective interpolation?
- Is this method suitable for financial time series, which are influenced by many factors and are notoriously difficult to forecast?
Fully AI-generated |