ICLR 2026 - Reviews



Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
| --- | --- | --- | --- | --- |
| Fully AI-generated | 2 (50%) | 5.00 | 3.50 | 4324 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (25%) | 6.00 | 4.00 | 5834 |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 1 (25%) | 4.00 | 4.00 | 3481 |
| Total | 4 (100%) | 5.00 | 3.75 | 4490 |
Individual Reviews
FACT: Fine-grained Across-variable Convolution for Multivariate Time Series Forecasting

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
EditLens Prediction: Fully human-written

**Summary:** This paper proposes FACT, which addresses the limitation that inter-variable dependencies are not modelled dynamically over time. The core design enables: (1) dependencies leveraged across both the time and frequency domains; (2) dependencies leveraged at different receptive fields; and (3) similar to TimesNet, the transformation of the 1-dimensional signal into matrices to better capture the dependencies.

**Strengths:**

**S1:** The research question raised by the paper is both valid and timely for the community.

**S2:** The paper is easy to follow, with figures showing the model and module designs, and visualised results for straightforward comparison and analysis.

**Weaknesses:**

**W1:** Inappropriate averaging across horizons. Computing averages across horizons is misleading: forecasting difficulty varies with the horizon, so averaging the errors can easily hide poor performance on some horizons.

**W2:** The claim (or assumption) that there is "no inherent order among variables in multivariate time series" (line 290) does not always hold. For example, in weather forecasting, climate indicators at grid points closer to the target by nature have a greater influence than those farther away. Not considering this ordering would limit the work to applications where the covariates are i.i.d., which is acceptable as the scope of the paper, but the authors need to state this more clearly.

**W3:** Even if the covariates are i.i.d., there is still the concern that the receptive field largely decides which variables are taken into the modelling. Although the variables have no inherent order themselves, model performance would be highly dependent on their ordering, since the ordering determines which variables are and are not modelled for each target variable.

**W4:** While the paper claims to model the dependencies dynamically over time, it is not clear how, just by looking at the modelling design. See Q3.

**W5:** Table 5 shows wildly different configurations per dataset, which raises the concern that the reported performance might reflect overfitting to the benchmarks rather than a robust general solution.

**Questions:**

**Q1:** Modelling dependencies dynamically over time has been proposed before, either within the same channel [1] or across different channels [2,3], with the dynamics likewise captured by convolution layers and with coarse- and fine-grained dependencies captured at different temporal resolutions. Can the authors clarify the key novelty of the proposed method compared to this prior work, and why the advocated way of capturing dynamic dependencies is better?

[1] DeformableTST: Transformer for Time Series Forecasting without Over-reliance on Patching (NeurIPS 2024)
[2] DeformTime: Capturing Variable Dependencies with Deformable Attention for Time Series Forecasting (TMLR 2025)
[3] Adaptive Convolutional Forecasting Network Based on Time Series Feature-Driven

**Q2:** The results are averaged across the full forecasting sequence, i.e., [t+1, t+H]. However, this might over-credit models that perform well on time steps close to t+1 but poorly close to t+H, compared to models that do an average job across all time steps. How does FACT compare to the baselines when evaluated only on the target forecasting horizon t+H? (A per-horizon evaluation sketch follows this review.)

**Q3:** Could the authors clarify the mechanism, or provide more details, on how the temporal dynamics of the dependencies are captured?
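To make W1 and Q2 concrete, below is a minimal NumPy sketch of per-horizon evaluation. All shapes and array names are hypothetical and only illustrate the metric being asked for, not the paper's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shapes: (num_samples, H, num_variables) for both arrays.
preds = rng.standard_normal((256, 96, 7))   # forecasts over [t+1, t+H]
target = rng.standard_normal((256, 96, 7))  # ground truth

se = (preds - target) ** 2
mse_avg = se.mean()                  # the horizon-averaged score usually reported
mse_per_step = se.mean(axis=(0, 2))  # MSE at each step t+1, ..., t+H
mse_last = mse_per_step[-1]          # MSE at t+H only, as asked in Q2

print(f"averaged MSE: {mse_avg:.3f}  MSE@t+H: {mse_last:.3f}")
# A model can score well on the average while mse_per_step degrades toward t+H.
```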
---

FACT: Fine-grained Across-variable Convolution for Multivariate Time Series Forecasting

Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
EditLens Prediction: Fully AI-generated

**Summary:** This paper proposes FACT, a novel convolutional architecture for multivariate time-series forecasting. The model aims to capture fine-grained, time-varying interactions between variables. Its core contributions include: 1) a depth-wise convolution module named DConvBlock to capture dynamic interactions; 2) a strategy of reshaping the 1D variable list into a 2D grid to leverage efficient 2D convolutions; and 3) a dual-domain modeling approach combining the time and frequency domains. The authors conduct extensive experiments on 12 benchmark datasets, claiming state-of-the-art (SOTA) performance in both prediction accuracy and computational efficiency.

On the positive side: empirically, the model achieves SOTA results on 12 benchmarks, which is a strong contribution, and its significant improvement in computational efficiency (compared to the $O(N^2)$ complexity of Transformers) holds high practical value and appeal. On the debatable side: the theoretical depth of the core methodology (2D variable reconstruction) needs strengthening, and the long-term contribution of this strategy, which trades physical priors for efficiency, to the spatio-temporal data mining field is worthy of further discussion.

**Strengths:**

1. SOTA empirical performance: The most significant strength is that the paper comprehensively outperforms 12 recent models, including iTransformer, across 12 standard benchmarks. This is a very solid engineering achievement.
2. High computational efficiency: Compared to the $O(N^2)$ complexity of attention-based models (where N is the number of variables), the proposed convolutional architecture (especially after 2D reconstruction) offers a substantial advantage for datasets with many variables. The reported reductions in training time and memory consumption (up to 50%) are highly attractive practical features.
3. Clever design of DConvBlock: The use of depth-wise convolution to assign a dedicated kernel to each time/frequency granularity, as a means to capture "dynamic interactions," is a lightweight and effective engineering design.
4. Thorough ablation studies: The authors conduct detailed ablation studies that demonstrate the necessity of the model's various components (e.g., dual-domain modeling, multi-dilated convolution).

**Weaknesses:**

1. Questionable rationale of 2D variable reconstruction: A primary concern is the "2D variable reconstruction" strategy. This index-based reshaping appears to lack prior support: it can place physically unrelated variables adjacent to each other while distancing physically adjacent ones, and may therefore fail to leverage the valuable physical-topological priors inherent in spatio-temporal data.
2. Omission of GNN baselines: This is a noticeable omission. The paper claims SOTA on several datasets with clear spatio-temporal structure (e.g., PEMS, METR-LA, Traffic) but fails to compare against graph neural network (GNN) models specifically designed for such data (e.g., DCRNN, Graph WaveNet, MTGNN). GNNs are a standard baseline in spatio-temporal forecasting, and lacking this comparison weakens the convincingness of the SOTA claims on these specific datasets.
3. Insufficient theoretical discussion: As mentioned, the paper does not provide adequate theoretical justification for the "2D variable reconstruction." It relies heavily on empirical results for validation but lacks a deeper exploration of why this artificial pseudo-topology is effective, which is a slight pity for an ICLR paper.
4. Interpretability needs improvement: As the 2D grid is an artificially constructed "pseudo-space," the weights of the 2D convolution kernels may be difficult to map directly to real-world physical interactions, posing a challenge to the model's interpretability.

**Questions:**

1. Could the authors further elaborate on the rationale for the "2D variable reconstruction" from a theoretical perspective (beyond "it works empirically")? Why would an arbitrary, index-based grid be a better representation of variable relationships than the original 1D list or a true physical graph (as used by GNNs)?
2. Why were GNN models omitted from the baseline comparison? Given that PEMS and METR-LA are standard GNN benchmarks, we strongly suggest the authors add comparisons against mainstream GNN models to truly substantiate the SOTA claim.
3. Have the authors experimented with other reshaping methods, such as clustering-based reshaping, or projecting the sensors' true physical coordinates (latitude/longitude) onto a 2D space and then applying 2D convolution on this physically meaningful 2D image? This would at least preserve spatial locality.
4. Permutation sensitivity: The 2D reconstruction strategy seems dependent on the original index order of the variables in the dataset. If the order of variables is permuted (e.g., swapping variable 5 and variable 50), would the model's performance change significantly? (A minimal sketch of this probe follows this review.)
5. Generality of GNNs vs. necessity of topology: The authors mention GNNs in the related work but seem to avoid them in comparisons. Do the authors perceive GNN models (which require an adjacency matrix) as lacking "generality," whereas FACT (which does not) is more general? If so, does FACT's SOTA performance on these spatio-temporal datasets imply a deeper conclusion: that physical-topological priors are not necessary for forecasting on these specific benchmarks, or perhaps that their importance has been overestimated?
6. Physical interpretation of 2D reshaping in the frequency domain: For frequency-domain modeling, the variable dimension C is similarly reshaped into $\sqrt{C} \times \sqrt{C}$. This operation seems even more abstract. In the time domain, "variable 1" and "variable 11" becoming neighbors is debatable; in the frequency domain, what is the physical meaning of applying a 2D convolution to the "amplitude/phase of variable 1" and the "amplitude/phase of variable 11"? How does the model learn meaningful patterns in such a highly abstract 2D frequency space?
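A minimal PyTorch sketch of the index-based reshaping questioned in Weakness 1 and Question 4. The helper name, the zero-padding strategy, and all shapes are assumptions for illustration, not the authors' exact implementation:

```python
import math
import torch
import torch.nn.functional as F

def variables_to_grid(x: torch.Tensor) -> torch.Tensor:
    """(batch, C, D) -> (batch, D, side, side), side = ceil(sqrt(C)).

    Purely index-based: variable i lands at cell (i // side, i % side),
    so grid "adjacency" carries no physical meaning (the review's concern).
    """
    B, C, D = x.shape
    side = math.ceil(math.sqrt(C))
    x = F.pad(x, (0, 0, 0, side * side - C))   # assumed zero-padding of the variable axis
    return x.permute(0, 2, 1).reshape(B, D, side, side)

x = torch.randn(8, 50, 64)                     # 50 variables, 64-dim embedding
grid = variables_to_grid(x)                    # (8, 64, 8, 8)
dw_conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)  # depth-wise
out = dw_conv(grid)

# Q4's permutation probe: swap variables 5 and 50 (indices 4 and 49).
perm = torch.arange(50)
perm[[4, 49]] = perm[[49, 4]]
out_perm = dw_conv(variables_to_grid(x[:, perm]))
# The fixed kernel now mixes different variable pairs, so out and out_perm
# differ by more than a mere permutation of entries.
```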
---

FACT: Fine-grained Across-variable Convolution for Multivariate Time Series Forecasting

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
EditLens Prediction: Fully AI-generated

**Summary:** This paper proposes FACT, a CNN-based architecture for multivariate time-series forecasting that explicitly models fine-grained inter-variable interactions in both the time and frequency domains. The core module, DConvBlock, uses multi-dilated, depth-wise, Inception-style 2D convolutions to capture variable interactions at each granularity while keeping computation low. Variables are reshaped from 1D into a 2D grid to enlarge the receptive field with fewer layers; frequency modeling is performed in amplitude-phase form and fused with the time-domain pathway via a learnable weight. Across 12 benchmarks, FACT reports state-of-the-art accuracy and notably reduced training time and memory versus attention, especially on high-dimensional datasets.

**Strengths:**

1. The proposed method has a valid design: depth-wise, multi-dilated 2D convolutions per granularity plus an amplitude-phase frequency path, fused with a learnable weight. The 2D reshape argument is analytically motivated (layer-count reduction).
2. The method shows consistent gains on extensive benchmarks; the comparison with attention shows lower computational cost on large datasets; ablations isolate the contributions of the Inception and FFN components and the value of dual-domain modeling.
3. Contributions are explicitly listed; the pipeline and fusion formulae are precise. The efficiency improvements matter for high-dimensional multivariate forecasting, making the approach practical.

**Weaknesses:**

1. The DFT is applied to the embedding dimension (not clearly the temporal axis). The physical meaning and potential pitfalls (e.g., phase wrapping, noise sensitivity, spectral leakage) are not deeply analyzed. A comparison to performing frequency modeling strictly along time (per variable) would help.
2. Reshaping C variables to an H×W grid makes spatial adjacency arbitrary; while dilations broaden coverage, different permutations/layouts might change outcomes. A learned layout or a permutation-invariant design could mitigate the inductive-bias risk.
3. The components (depth-wise/dilated convolutions, Inception-style branching, frequency cues) are each established; the paper's novelty lies primarily in their combination and the dual-domain, fine-grained framing. A clearer positioning against ModernTCN and frequency-aware models would help.

**Questions:**

1. Is the DFT computed along the hidden/embedding dimension D or along time T? If along D, how should we interpret frequency content in that space, and did you compare to a time-axis DFT per variable? (A sketch contrasting the two choices follows this review.)
2. How are variables mapped to the H×W grid (fixed order, padding strategy)? Did you try learned permutations or multiple random layouts with ensembling? Is there any evidence that the adjacency choices matter?
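To illustrate Question 1, here is a short PyTorch sketch contrasting a DFT along the time axis with one along the embedding axis; all shapes are hypothetical:

```python
import torch

x = torch.randn(8, 96, 7)       # (batch, T=96 time steps, C=7 variables); assumed shapes
emb = torch.randn(8, 7, 64)     # (batch, C variables, D=64 embedding); assumed shapes

# Time-axis DFT (per variable): frequency bins map to temporal periodicities,
# e.g., bin k corresponds to a period of T/k time steps.
spec_time = torch.fft.rfft(x, dim=1)        # (8, 49, 7)

# Embedding-axis DFT, the reading of the paper questioned in Q1: the "frequencies"
# of a learned latent vector have no direct temporal interpretation.
spec_emb = torch.fft.rfft(emb, dim=-1)      # (8, 7, 33)

amplitude = spec_emb.abs()                  # A = |F|
phase = spec_emb.angle()                    # P = arg(F), wraps within (-pi, pi]
# Phase-wrapping pitfall from Weakness 1: values just above -pi and just below pi
# are numerically distant yet represent nearly identical angles.
```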
---

FACT: Fine-grained Across-variable Convolution for Multivariate Time Series Forecasting

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
EditLens Prediction: Moderately AI-edited

**Summary:** This paper proposes the FACT architecture for multivariate time series forecasting (MTSF), designed to capture fine-grained dynamic interactions between variables in both the time and frequency domains. Its core component, the DConvBlock, uses depth-wise convolution to model variable relationships individually at each time step or frequency component (termed 'granularity' in the paper). In the frequency domain, it notably processes the signal by separating it into amplitude (A) and phase (P), modeling them independently. Furthermore, it restructures the 1D variable dimension into a 2D space and applies multi-dilated 2D convolution, efficiently achieving a global receptive field across all variables with fewer layers. Experimental results show that FACT achieves SOTA accuracy on 12 benchmarks and significantly reduces training time and memory consumption compared to attention mechanisms.

**Strengths:**

1. Fine-grained interaction modeling: Unlike existing models that focus on macro-level relationships of the entire variable set, the idea of capturing dynamic interactions at a fine-grained 'granularity' level (i.e., at each time step and each frequency component) is highly compelling. The method of assigning independent kernels to each channel (granularity) via depth-wise convolution is technically novel and effective.
2. Robust dual-domain design: The dual-domain architecture, which considers both instantaneous interactions in the time domain and periodic resonances/phase relationships in the frequency domain, is very robust. The approach of modeling the frequency domain by separating it into amplitude and phase, which allows a more direct physical interpretation than real/imaginary parts, is impressive. The ablation study (Table 3) clearly demonstrates that the two domains are complementary.
3. Modularity and generalizability: The core module, MDInception (the convolutional part of DConvBlock), was shown to improve performance not only within FACT but also when integrated into other baseline models (TimesNet, PatchTST, iTransformer) (Appendix E.1). Notably, it significantly boosted the performance of the variable-independent model PatchTST. This suggests the proposed module can function as a general-purpose solution for capturing inter-variable interactions.

**Weaknesses:**

1. Limitation of the 2D reshaping assumption: The paper assumes "no inherent order among variables" and restructures the 1D variable list into a 2D grid. However, traffic datasets like PEMS and METR-LA possess an underlying spatial structure (road networks), where the variable (sensor) ID order might reflect adjacency information. This reshaping, which ignores potential spatial locality, could destroy critical information. There is insufficient discussion of whether the assumption holds for all 12 datasets.
2. Lack of interpretability for 'granularity-specific learning': This is a core claim of the paper, yet there is no qualitative analysis (e.g., visualization) to verify what the depth-wise convolution kernels actually learn. For example, it is not shown whether specific kernels (channels) in the frequency-domain DConvBlock truly specialize in 24-hour versus 7-day periodicity, or whether time-domain kernels learn to capture interactions between specific variable pairs (as in Fig. 1b) or at specific time lags.
3. Questionable efficacy of the A/P transformation: In Appendix D (Table 6), the paper compares the proposed amplitude/phase (A/P) separation with the real/imaginary (R/I) alternative. However, the results of the two methods are nearly identical (e.g., PEMS03 MSE 0.194 vs. 0.193, METR-LA MSE 0.704 vs. 0.706). The A/P method requires additional non-linear operations (arctan, cos, sin), whereas R/I is a simple split. Contrary to the paper's claim of being "consistently the best," the data does not show a clear advantage of A/P over R/I, which weakens the justification for choosing the more complex A/P representation.

**Questions:**

1. You assume no inherent order among variables for the 2D reshaping. For datasets like PEMS or METR-LA, where actual spatial adjacency between sensors is critical, did you verify whether the mapping from 1D variable IDs to the 2D grid (e.g., row-major, z-order curve) impacts performance?
2. To support the core claim of 'granularity-specific learning,' could you provide visualizations (e.g., kernel weights, activation maps) demonstrating that the learned DConvBlock kernels actually capture specific frequency bands or dynamic relationships between variable pairs?
3. In Appendix D (Table 6), the performance advantage of the amplitude/phase separation over real/imaginary or complex-convolution methods is not significant. Beyond 'physical interpretation,' were there any practical advantages (e.g., faster computation, better training stability) to adopting the amplitude/phase method over the simpler real/imaginary method? (A sketch contrasting the two decompositions follows this review.)
4. The paper uses a 5-layer DConvBlock. Is there a specific justification or ablation study for setting the number of layers (N) to 5? A sensitivity analysis of performance and receptive-field changes as a function of the number of layers seems necessary.
5. You use the Inception module as the convolution backbone, but Appendix C shows that standard 2D convolution ('+Conv') also achieves competitive performance. How does the multi-branch Inception structure compare to standard 2D convolution in parameter count and FLOPs? Is the performance gain sufficient to justify the additional cost?
6. You use a modified Inception as the backbone of DConvBlock. Did you consider combining the depth-wise convolution with a mechanism that explicitly models relationships between channels (i.e., 'granularity' in this paper), such as a Squeeze-and-Excitation (SENet) block?
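To make Weakness 3 and Question 3 concrete, a small PyTorch sketch of the two decompositions (shapes hypothetical). Both splits carry the same information; the difference lies only in the operations required:

```python
import torch

# Hypothetical complex spectrum of an embedding, e.g., from torch.fft.rfft.
spec = torch.fft.rfft(torch.randn(8, 7, 64), dim=-1)

# Real/Imaginary: a lossless linear split with no extra non-linearities.
real, imag = spec.real, spec.imag

# Amplitude/Phase: requires non-linear ops (abs, atan2) and cos/sin to invert.
amp, phase = spec.abs(), spec.angle()

# Both routes reconstruct the identical spectrum, which is why near-identical
# benchmark numbers (Table 6) alone cannot justify the costlier A/P route.
recon_ap = torch.polar(amp, phase)        # A * exp(iP)
recon_ri = torch.complex(real, imag)
assert torch.allclose(recon_ap, recon_ri, atol=1e-5)
```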