|
Lossless Compression: A New Benchmark for Time Series Model Evaluation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors argue that existing benchmarks focus primarily on task-specific performance and fail to assess a model’s ability to capture the full generative distribution of the data. To address this, the paper introduces lossless compression as a novel benchmark for evaluating the general modeling capabilities of time series models. The authors also provide a theoretical justification for the equivalence between the compression objective and probabilistic modeling goals. Experimentally, the paper evaluates the compression performance of various models on real, synthetic time series and cross-modality data, demonstrating the validity of the proposed lossless compression benchmark.
1. The work is grounded in information theory and motivates lossless compression as a principled evaluation metric. The connection between compression length and KL divergence is clearly derived and well-explained.
2. The benchmark covers multiple datasets, models, and cross-modal compression scenarios. Experiments are thorough, and the evaluation framework has been made openly available.
3. The authors analyze the relationship between compression and four classic time series tasks, finding that compression performance maintains a high positive correlation with performance on these tasks.
1. Lossless compression is a more general and domain-agnostic task for evaluating the modeling capability of models, beyond time series models. Although it can be a useful supplementary metric for assessing time series models, it should not be considered a canonical task for evaluating their ability to model temporal dynamics — which remains the core objective of time series modeling.
2. In the lossless compression pipeline, each 32-bit floating-point value in the raw file is decomposed into a sequence of four 8-bit integers. This raises two concerns: (1) In the IEEE 754 standard, a 32-bit float consists of 1 sign bit, 8 exponent bits, and 23 fraction bits. Splitting a float into four 8-bit segments disrupts this meaningful internal structure. (2) After such decomposition, the resulting byte sequence no longer preserves the original temporal dynamics of the data, making it questionable whether lossless compression is an appropriate benchmark for evaluating time series modeling capabilities.
3. As a benchmark, the motivation for the selection of datasets and baselines should be further clarified. (1) Are there any characteristics of the datasets that may influence lossless compression performance? Some dataset analysis will be helpful to justify the rationality of the benchmarks. (2) What is the rationale for selecting the current set of baseline models? Furthermore, the baselines could be expanded to include a wider range of architectures, such as RNN-based models, statistical methods, machine learning approaches, and even foundation models.
4. Some experimental settings are missing; specific questions are listed in the Questions section.
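A minimal sketch of the concern in weakness 2, assuming big-endian byte order for readability (the paper's actual byte order and serialization code are not reproduced here, and the helper names `float32_fields` and `float32_bytes` are hypothetical). It splits a 32-bit float into its IEEE 754 fields and into four bytes, showing that the exponent straddles the first two bytes.

```python
import struct


def float32_fields(x: float):
    """Return (sign, exponent, fraction) of x encoded as an IEEE 754 float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction


def float32_bytes(x: float):
    """Return the four bytes of x as float32 (big-endian, an assumption of this sketch)."""
    return list(struct.pack(">f", x))


for value in (20.0, 20.5):
    # Byte 0 holds the sign and the top 7 exponent bits;
    # byte 1 holds the last exponent bit and the top 7 fraction bits.
    print(value, float32_fields(value), float32_bytes(value))
```

For 20.0 and 20.5 only the second byte differs, and that byte mixes one exponent bit with seven fraction bits, so a byte-level symbol does not correspond to any single IEEE 754 field.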
1. How does the framework handle the inherent structure of multivariate data? Does concatenating channels into a single byte stream risk obscuring inter-channel dependencies that the model might otherwise capture if it knew the channel boundaries?
2. The paper supports two training paradigms, including "density estimators trained on raw values." How are non-autoregressive models integrated into the compression pipeline? How are their probabilities mapped to discrete symbol-level distributions for arithmetic coding? Does the way the probability distribution is derived significantly influence lossless compression performance?
3. TimeXer and iTransformer include modules designed to process multivariate inputs. However, since lossless compression operates on a univariate symbol stream, neither model appears to be directly adaptable to this task. In particular, their mechanisms for handling multivariate inputs may not function as intended in this context. How do the authors adapt these models for lossless compression?
4. How is the experiment on the relationship between lossless compression and classical time series tasks conducted? It appears that some baselines do not support all tasks simultaneously, and certain adaptations are required to apply a time series model to different tasks. |
Lightly AI-edited |
|
Lossless Compression: A New Benchmark for Time Series Model Evaluation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper introduces lossless compression as a new benchmark for evaluating time series models based on Shannon’s source coding theorem. The proposed TSCom-Bench converts time series models into compressors via arithmetic coding. The experiments reveal distributional weaknesses overlooked by classic benchmarks.
S1. The paper provides a pluggable evaluation pipeline with a well-defined IEEE-754 encoding and evaluation metrics.
S2. The equivalence between fully capturing the data distribution and achieving optimal compression provides a theoretical justification for the evaluation.
S3. Experiments across six time series benchmarks and seven cross-modality compression datasets demonstrate the potential feasibility of the approach.
W1. Most time-series applications do not primarily require compression, as their main objectives are accurate forecasting and anomaly detection. Although the paper claims that lossless compression benefits data storage, there already exist many general-purpose lossless compression algorithms (e.g., XZ, Brotli, LZ4) and specialized time-series compression methods (e.g., ELF, Chimp, Camel). It is unclear what motivates the use of compression as an evaluation criterion for time-series models and what the practical application scenarios are.
W2. The paper only compares several time-series models across six benchmark datasets, but does not include comparisons with specialized time-series compression algorithms (e.g., ELF, Chimp, Camel). Without such baselines, it is difficult to assess the relative compression performance of these models.
W3. The paper encodes data using IEEE-754 32-bit, which may not capture high-precision time-series data. Although the paper mentions that “16/64-bit can be evaluated in ablations,” the ablation study does not present any comparison results for 16- or 64-bit settings.
W4. The paper states that “the CR values of PEMS08 are close to 1 due to the .npz storage file.” However, this claim lacks sufficient evidence. The authors could provide a comparison with .npz results from other datasets to support this explanation.
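The kind of evidence requested in W4 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the paper's pipeline: `zlib` stands in for a compressed container, the synthetic series and the helper names `make_series` and `compression_ratio` are hypothetical, and whether a given .npz file is actually compressed depends on how it was written (`np.savez` stores arrays uncompressed, `np.savez_compressed` uses DEFLATE).

```python
import math
import random
import struct
import zlib


def make_series(n=20000, seed=0):
    """Synthetic noisy periodic signal rounded to one decimal (illustrative only)."""
    rng = random.Random(seed)
    return [
        round(20 + 5 * math.sin(2 * math.pi * i / 96) + rng.gauss(0, 0.3), 1)
        for i in range(n)
    ]


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size."""
    return len(original) / len(compressed)


values = make_series()
raw = struct.pack(f"<{len(values)}f", *values)  # raw float32 bytes
once = zlib.compress(raw, 9)                    # stand-in for a compressed container
twice = zlib.compress(once, 9)                  # compressing that container again

cr_raw = compression_ratio(raw, once)
cr_container = compression_ratio(once, twice)
print(f"CR on raw float32 bytes:        {cr_raw:.2f}")
print(f"CR on already-compressed bytes: {cr_container:.2f}")
```

Reporting both numbers for PEMS08 and for at least one other dataset stored the same way would separate the effect of the container format from the properties of the data.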
Q1. If the poor performance on PEMS08 is due to issues with the .npz format, why not simply convert the data format during the preprocessing stage or use another time series dataset?
Q2. There are typographical errors in the section captions of Appendices A.6, A.7, A.8, and A.9. |
Lightly AI-edited |
|
Lossless Compression: A New Benchmark for Time Series Model Evaluation |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper identifies a gap in the standard evaluation paradigm for time series models. The authors argue that the four canonical tasks (forecasting, imputation, etc.) are theoretically limited because their objectives (e.g., MSE) primarily constrain the conditional mean, not the full generative distribution. The paper formally proposes lossless compression as a 'fifth canonical benchmark' for general TS models. This is grounded in Shannon's source coding theorem, where the optimal code length equals the negative log-likelihood; minimizing compression loss is thus equivalent to minimizing the KL divergence between the model's distribution and the true data distribution. The paper also introduces and open-sources TSCom-Bench, which allows any TS model to be used as a probabilistic backbone for a standard arithmetic encoder to evaluate its compression performance.
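The equivalence summarized here is the standard cross-entropy decomposition. With $p$ the true data distribution and $q$ the model (symbols chosen for this note, not necessarily the paper's notation), the expected ideal code length in bits is:

$$
\mathbb{E}_{x\sim p}\left[-\log_2 q(x)\right] = H(p) + D_{\mathrm{KL}}(p\,\|\,q) \;\ge\; H(p)
$$

The excess code length over the entropy $H(p)$ is exactly the KL term, which vanishes only when $q = p$.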
1 The proposal to use compression as a general evaluation benchmark for TS models is novel. Prior TS-related compression research (e.g., TRACE) focused on building new compressors, not on creating a benchmark to evaluate general-purpose models; this work is the first to formally transfer this evaluation concept from NLP/information theory to the time series domain.
2 The paper provides a rigorous and unified information-theoretic critique of the four existing canonical tasks. This synthesis, which formally justifies the need for a new benchmark, represents a specific contribution to the field.
3 The TSCom-Bench framework fills a gap in the existing open-source ecosystem. Major TS libraries (e.g., TSLib, tsai) do not include compression as an evaluation task or metric.
4 The framework's pipeline is a methodological innovation: it enables the re-purposing of models designed for forecasting (i.e., predicting a conditional mean) to be evaluated on a much stricter byte-level probabilistic prediction task.
5 The paper's benchmark of SOTA models on compression (Table 1) presents new empirical results for the field. The original papers for these models report only MSE/MAE. The resulting discovery of a 'metric mismatch' is a novel finding.
1 The paper frames the incompressibility of the PEMS08 dataset as an empirical discovery that 'serves as a crucial validation'. This is a misrepresentation. The .npz file format is widely known to be a compressed archive, as confirmed by public documentation. This fact is a-priori knowledge, not an empirical finding of this work. This weakness is minor and can be corrected by re-framing the claim.
2 While the specific discovery of a forecasting-vs-compression mismatch in TS is novel, the paper fails to cite the general concept of 'loss-metric mismatch,' which is established in the wider machine learning literature. Adding this context would better situate the specific empirical contribution.
1 Given that the .npz format is known a-priori to be a compressed archive, would the authors be willing to re-frame this point? We suggest moving this fact to the 'Datasets' section and presenting the CR close to 1.0 result in the 'Experiments' section as a confirmation of the benchmark's correctness, rather than as a discovery.
2 Could the authors add a brief discussion and citation to the general literature on 'loss-metric mismatch' to help contextualize the novelty of their specific empirical discovery in the TS domain?
3 The idea of compression as a foundational pre-training objective is a significant conceptual contribution. Have the authors considered elevating this point more prominently in the Abstract or Introduction, as it frames the work's long-term impact beyond just a new benchmark? |
Heavily AI-edited |
|
Lossless Compression: A New Benchmark for Time Series Model Evaluation |
Soundness: 3: good
Presentation: 1: poor
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a novel evaluation metric for time series modeling: lossless compression. Based on information theory, this approach posits that the ability to compress a time series can serve as a robust benchmark, as effective compression requires learning meaningful representations of the data. Compared with task-specific benchmarks like the MSE in forecasting, lossless compression provides a stricter evaluation, mitigating problems like modelling a shortcut. The workflow of lossless compression evaluation is composed of serialization, probabilistic modelling, and an arithmetic encoding process. Empirical studies not only verify the feasibility of using lossless compression as a benchmark for state-of-the-art time series models but also demonstrate its utility in capturing the characteristics and focus areas of different datasets.
(1) The theoretical analysis and supporting proofs are sound and well-presented. \
(2) The paper effectively connects the shortcomings of widely-used industrial benchmarks in various downstream tasks to the advantages of lossless compression, emphasizing its necessity and applicability. \
(3) The experimental results not only show its ability to benchmark time series models but also illustrate its effectiveness in capturing the emphases of different datasets.
(1) There are cases of misused LaTeX formatting, especially regarding citations. \
(2) The overall structure of the paper lacks clarity, which makes it difficult to follow. For example, the discussion around Eq. (1) feels more suitable for the preliminaries and motivation, yet it appears in Section 1. Additionally, the discussions in Section 4 about four canonical tasks are overly detailed; moving these details to an appendix would improve the coherence of the main text. \
(3) As a universal benchmark, apart from the end-to-end approaches, the baselines should include some representation learning models, e.g., TS2Vec, TS-TCC, and CPC.
(1) Although the theoretical foundation is solid, how can lossless compression be accepted as a new benchmark for practical usage? Compared to industry-standard benchmarks like MSE or MAPE, which measure forecasting error, what is the more intuitive physical meaning of lossless compression as a benchmarking metric? For example, when performing sales prediction, MAPE clearly shows the percentage deviation between predicted and actual values. When performing behavioral sequence classification, accuracy quantifies the correctness of model predictions for different classes. What is the corresponding intuitive physical interpretation for lossless compression when used as a benchmark? \
(2) All models are serialized before evaluation. Does this step potentially undermine the effectiveness of certain models, thereby causing unfairness in benchmarking?\
(3) In future work, it is mentioned that lossless compression may serve as a pretraining target before being fine-tuned for specific downstream tasks. If, for example, the task is forecasting, should MSE be used for fine-tuning, or should lossless compression continue to be used as the objective? |
Fully human-written |
|
Lossless Compression: A New Benchmark for Time Series Model Evaluation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes lossless compression as a new rigorous benchmark for evaluating the generative capacity of time series models. The key insight is to leverage Shannon’s source coding theorem, making the optimal compression length equivalent to the negative log-likelihood under the true data distribution and providing a principled, information-theoretic evaluation. An open-source evaluation framework, "TSCom-Bench", is introduced to facilitate standardized lossless compression assessment for time series models. Extensive empirical studies, including qualitative diagnostics, are conducted.
The paper argues convincingly for the equivalence between model compression performance and proper probabilistic sequence modeling, drawing on well-established results from information theory and elegantly formalizing these arguments.
The benchmarking framework is thoughtfully constructed, with standardized data encoding, bijective mapping, and rigorous, transparent protocols for model-to-coder interfaces and evaluation metrics.
The experiments cover a spectrum of real-world and synthetic time series, as well as cross-modality compression tasks.
The baselines focus on neural or general-purpose compressors, but top-performing, highly specialized time series compressors from the literature, such as Sprintz (Blalock 2018) and Chimp (Liakos 2022), are omitted. This omission undercuts the authors' claim that compression reveals weaknesses in learning-based models overlooked by classical compressors. These two methods may be more competitive in realistic and complex settings.
There is limited discussion of model adaptation for compression. The current process for adapting forecasting or classification models as compression backbones is only briefly addressed. It remains unclear whether architectural or training changes are required for strong compression performance, and whether these models are truly evaluated in their canonical forms or extensively tuned for the compression setting.
The experiments only focus on a selected suite of datasets that are (as the authors also note) already highly compressed. The generalizability to raw, heterogeneous, or challenging real-world time series is less well validated.
The derivations rest on strong assumptions: 1) bijective mappings, 2) negligible quantization loss. The practical impact of floating-point quantization or rounding error in real-world applications is insufficiently quantified. Claims that this is “negligible” demand more explicit empirical or analytical support, perhaps via ablation on quantization precision.
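As a rough indication of what such an ablation could start from, the sketch below measures the round-trip error of 16-, 32-, and 64-bit IEEE 754 encodings using only the standard library. The sample values and the helper names `roundtrip` and `max_abs_error` are hypothetical, and the sketch measures value error only; a full ablation would also report the compression ratio at each precision.

```python
import struct


def roundtrip(value: float, fmt: str) -> float:
    """Encode value with a struct float format ('e', 'f', or 'd') and decode it back."""
    return struct.unpack("<" + fmt, struct.pack("<" + fmt, value))[0]


def max_abs_error(values, fmt: str) -> float:
    """Largest absolute difference between each value and its round-tripped copy."""
    return max(abs(v - roundtrip(v, fmt)) for v in values)


# Illustrative values only; real data would come from the benchmark datasets.
values = [1234.5678, 0.001234, 98.7654321, 20.05]

for name, fmt in [("float16", "e"), ("float32", "f"), ("float64", "d")]:
    print(f"{name}: max abs round-trip error = {max_abs_error(values, fmt):.3e}")
```

On these values the 16-bit error is several orders of magnitude larger than the 32-bit error, which is why a blanket "negligible" claim needs to be tied to a specific precision and data range.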
The paper strongly advocates for lossless compression as a “pretraining” backbone but does not provide direct experiments in which models trained for compression are transferred or fine-tuned for other tasks (e.g., forecasting/classification), nor is there a clear discussion on transferability or domain shift.
The bijection and entropy claim is impossible. The paper states "Let $f:\mathbb{R}^d\to A^k$ be a bijective encoding function ... if $f$ is bijective, then the Shannon entropy is preserved: $H(X)=H(S)$". A bijection from the uncountable set $\mathbb{R}^d$ to the set of finite-alphabet sequences $A^k$ cannot exist, and even setting cardinality aside, using discrete Shannon entropy $H(\cdot)$ for a continuous $X$ is not well-posed. (The correct object would be differential entropy $h(\cdot)$, which is not invariant under general transforms.) The appendix later distinguishes an “ideal bijection” from practical quantization, but this does not change the fact that the statement above is **mathematically incorrect** as written.
Appendix A.1 proves MI invariance under a bijective differentiable transform and concludes that “the mutual information $I(x_t;x_{<t})$ is perfectly preserved.” However, the default benchmark uses quantization to bytes which is not a bijective differentiable map and, in general, reduces mutual information. So the A.1 conclusion does not apply to the default setting.
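For reference, the standard identities behind the two points above are given here in notation introduced for this note rather than the paper's: $g$ is a smooth invertible map with Jacobian $J_g$, $X^{\Delta}$ is $X \in \mathbb{R}^d$ quantized to cells of side $\Delta$, and $Q$ is a deterministic quantizer.

$$
h(g(X)) = h(X) + \mathbb{E}\left[\log\left|\det J_g(X)\right|\right], \qquad
H(X^{\Delta}) + d\log\Delta \to h(X) \ \text{ as } \Delta \to 0, \qquad
I\big(Q(x_t);\,Q(x_{<t})\big) \le I(x_t;\,x_{<t}).
$$

The first shows that differential entropy shifts under a change of variables, the second that the discrete entropy of a quantized variable grows without bound as $\Delta \to 0$ instead of equaling $h(X)$, and the third (data processing) that quantization can only preserve or reduce mutual information.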
Methods in section 5.1 prescribe a canonical pipeline that re‑serializes values into IEEE‑754 bytes in a fixed order (i.e., raw uncompressed float bytes). If followed, the original container format should be irrelevant.
The main text says “Encoding length equals the negative log-likelihood.” That’s only true up to a small overhead and any probability/CDF quantization used by the coder. The appendix (A.3) correctly adds $NLL(S)\leq L(S) < NLL(S) +c$, but the body text should echo that qualification to avoid overclaiming.
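The qualification can be checked numerically with a small Shannon-Fano-Elias style computation in exact arithmetic. This is a sketch under stated assumptions: an order-0 empirical model stands in for the learned model, the constant 2 is the classical bound for this idealized coder (the paper's $c$ may differ), and the helper name `interval` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import ceil, log2


def interval(seq, probs, order):
    """Return (low, width) of the arithmetic-coding interval for seq, in exact arithmetic."""
    low, width = Fraction(0), Fraction(1)
    for s in seq:
        # Cumulative probability of all symbols ordered before s.
        cum = sum((probs[t] for t in order[: order.index(s)]), Fraction(0))
        low += width * cum
        width *= probs[s]
    return low, width


seq = b"abracadabra"
counts = Counter(seq)
order = sorted(counts)
probs = {s: Fraction(c, len(seq)) for s, c in counts.items()}

low, width = interval(seq, probs, order)
nll_bits = -log2(width)            # ideal code length: negative log-likelihood in bits
code_len = ceil(nll_bits) + 1      # length of the emitted codeword

# Truncate the interval midpoint to code_len bits; it must stay inside the interval.
midpoint = low + width / 2
point = Fraction(int(midpoint * 2**code_len), 2**code_len)

print(f"NLL = {nll_bits:.2f} bits, code length = {code_len} bits")
print("codeword inside interval:", low <= point < low + width)
```

The emitted length exceeds the NLL by between one and two bits here, which is the overhead the body text should acknowledge; practical coders add further loss from finite-precision probabilities.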
**The theoretical novelty is low**. The equivalence between code length and NLL and the MDL view are classical. The paper applies well‑known results to time‑series.
**The engineering novelty is moderate**. Defining a standardized, pluggable time‑series compression evaluation with a model‑to‑coder interface and reporting cross‑dataset results is practically useful. **This is the main contribution.**
References
Blalock, Davis, Samuel Madden, and John Guttag. "Sprintz: Time series compression for the internet of things." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2.3 (2018): 1-23.
Liakos, Panagiotis, Katia Papakonstantinopoulou, and Yannis Kotidis. "Chimp: efficient lossless floating point compression for time series databases." Proceedings of the VLDB Endowment 15.11 (2022): 3058-3070.
Can the authors provide a direct, head-to-head comparison with top domain-specific lossless time series compressors (e.g., Sprintz, Chimp)? Would their inclusion change the interpretation of modern neural approaches’ strengths/weaknesses?
To what extent do models need to be adapted (e.g., via retraining or specific architecture changes) to be suited for lossless compression tasks, or can standard benchmark implementations be used directly in TSCom-Bench without risk of overfitting/hyperparameter leakage?
Could the authors quantify the empirical impact of floating-point quantization precision on the measured compression ratio, perhaps through concrete ablations or by varying encoding fidelity in TSCom-Bench? |
Heavily AI-edited |