TopoFormer: Topology Meets Attention for Graph Learning
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces TOPOFORMER, a novel framework that integrates topological data analysis (TDA) with transformer-based graph learning. The central idea is to capture multi-scale structural information from graphs using a topological sequence representation that can be directly processed by a transformer encoder.
The key component, TopoScan, converts a graph into an ordered sequence of topological tokens. Each token represents information extracted from overlapping slices of a graph filtration induced by a scalar function such as degree or the heat kernel signature. This design enables the model to preserve both local and global topological properties while avoiding the computational cost of traditional persistent homology.
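For concreteness, a minimal sketch of the slicing step as I understand it from the paper's description; the function name, threshold spacing, window convention, and exact token contents are my assumptions, not the authors' implementation:

```python
# Minimal sketch of the Topo-Scan-style slicing step as described above.
# Names, threshold spacing, and token contents are assumptions for illustration.
import networkx as nx
import numpy as np

def topo_scan(G, f, n_thresholds=10, window=2):
    """Turn a graph G with node filtration values f (a dict) into an
    ordered sequence of topological tokens, one per overlapping slice."""
    values = np.array([f[v] for v in G.nodes])
    thresholds = np.linspace(values.min(), values.max(), n_thresholds)
    tokens = []
    for i in range(len(thresholds) - window):
        lo, hi = thresholds[i], thresholds[i + window]
        H = G.subgraph([v for v in G.nodes if lo <= f[v] <= hi])
        b0 = nx.number_connected_components(H)
        # For a graph, beta_1 = |E| - |V| + number of connected components.
        b1 = H.number_of_edges() - H.number_of_nodes() + b0
        tokens.append((b0, b1, H.number_of_nodes(), H.number_of_edges()))
    return tokens  # ordered sequence, ready for a sequence model

# Toy usage with a degree filtration.
G = nx.karate_club_graph()
print(topo_scan(G, dict(G.degree()), n_thresholds=8, window=2))
```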
Formally, the authors prove an ℓ1-stability theorem showing that small perturbations in the filtration function lead to bounded changes in the generated sequence, ensuring robustness of the representation.
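Schematically, the guarantee has the following flavour (the precise norms and the constant are assumptions on my part; the paper's statement may differ):

```latex
% Schematic form only; the constant C(N, m) and the choice of norms are assumed,
% not taken from the paper.
\[
\|f - g\|_{\infty} \le \varepsilon
\quad\Longrightarrow\quad
\sum_{i=1}^{N} \bigl\| t_i(f) - t_i(g) \bigr\|_{1} \;\le\; C(N, m)\,\varepsilon,
\]
where $t_i(f)$ is the $i$-th topological token produced from filtration $f$,
$N$ is the number of slices, and $m$ is the window width.
```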
The resulting TopoFormer model combines TopoScan with a transformer backbone to perform graph classification and molecular property prediction. It achieves state-of-the-art or near state-of-the-art results on more than fifteen datasets, including IMDB, REDDIT, MUTAG, PROTEINS, and several MoleculeNet benchmarks.
Originality:
The paper introduces an inventive way to integrate topological reasoning with transformer architectures for graph learning. The proposed TopoScan mechanism is a fresh formulation that replaces persistence diagrams with sequential topological tokens derived from overlapping filtration slices. This approach is conceptually novel because it allows topological information to be represented in a transformer-compatible format, overcoming long-standing barriers between topological data analysis and deep attention models. The use of interlevel filtrations and the demonstration that they capture multi-scale graph structure are creative contributions that extend beyond incremental improvement.
Quality:
The paper demonstrates high technical and experimental quality. The theoretical section provides a sound stability guarantee (the ℓ1-stability theorem) that ensures robustness to small perturbations, while the empirical results convincingly show strong and consistent performance across both graph classification and molecular property prediction tasks. The authors also perform comprehensive ablations that isolate the effect of filtration type, window size, and token sequence length. Comparisons with a wide set of baselines, including topological and transformer-based methods, confirm the reliability of the findings. The method is computationally efficient and well engineered.
Clarity:
The writing is clear, well organised, and pedagogical. The motivation is established early, mathematical definitions are presented cleanly, and diagrams effectively illustrate the construction of topological sequences and their flow into the transformer. The appendix provides sufficient detail for reproducibility. The balance between topological intuition and algorithmic implementation makes the paper accessible to both theoretical and applied audiences at ICLR.
Significance:
The work is significant in both theoretical and practical terms. It provides a general framework for incorporating topological information into neural architectures without requiring heavy persistent homology computation, making it scalable to large graphs. The approach has broad applicability beyond the tested datasets, offering a template for topology-aware transformers in other relational domains such as biological networks, materials science, and social graphs. By demonstrating that topological structure can be embedded as a sequence and effectively processed through self-attention, the paper establishes a promising direction for future research in structure-aware representation learning.
Overall:
A technically rigorous, clearly written, and conceptually innovative paper. It combines theoretical insight with practical relevance, resulting in a contribution that meaningfully advances the integration of topology and modern deep learning.
1. Limited theoretical depth beyond stability
While the inclusion of an ℓ1-stability theorem demonstrates that TopoScan is robust to small perturbations, the theoretical analysis does not go far enough to explain why the proposed representation preserves meaningful topological invariants or how it compares in expressive power to standard persistent homology (PH) based methods. The paper would be stronger with a formal comparison of representational capacity, such as bounding the information loss between TopoScan token sequences and PH barcodes, or connecting the proposed construction to existing frameworks like Graph Filtration Learning (Hofer et al., NeurIPS 2020) or Stable Rank Vectors (Chazal et al., 2021). A clearer theoretical bridge between TopoScan and persistent homology would make the contribution more foundational rather than heuristic.
2. Dependence on hand-crafted filtration functions
The approach still relies on predefined filtrations (for example degree, curvature, or heat kernel). The choice of filtration has a notable influence on performance, as shown in the ablation studies, yet no adaptive mechanism is proposed. This limits generality and introduces dataset-specific tuning. The authors could strengthen the work by introducing a learnable or task-aware filtration module, or at least by exploring gradient-based parameterisation of the filtration functions to allow end-to-end optimisation.
3. Limited interpretability analysis
Although the paper claims that TopoScan tokens are interpretable and capture multi-scale structures, there is no concrete analysis showing what the model learns. For instance, visualising the transformer's attention weights mapped back to graph substructures (for example motifs, cycles, or communities) would provide stronger evidence that the model captures meaningful topology. A few case studies or qualitative visualisations would greatly improve interpretability and help validate the topological claims.
4. Incomplete generalisation evaluation
The experiments focus primarily on molecule and social graph benchmarks, which are standard but relatively homogeneous. Testing on non-molecular heterogeneous graphs (for example citation networks or dynamic temporal graphs) would help confirm that the method generalises to other graph structures. Since TopoScan claims to be a general topological sequence generator, demonstrating this versatility would make the paper's impact broader.
5. Efficiency claims lack concrete quantitative support
The paper argues that TopoScan avoids the cubic computational complexity of persistent homology, yet runtime improvements are reported qualitatively rather than quantitatively. Providing explicit runtime tables or scaling plots, such as comparing training and inference times against PH-based baselines like PersLay or TopoGCL, would substantiate the scalability advantage and enhance the practical credibility of the method.
1. Theoretical clarification on representation power
Could the authors explain in more detail how the TopoScan representation compares in expressiveness to persistent homology? Specifically, does the sequential encoding preserve the same critical topological information that persistence diagrams capture, or does it approximate it? A small empirical or theoretical comparison could help clarify what kind of information may be lost or transformed in the conversion to token sequences.
2. Learnable or adaptive filtrations
Have the authors considered learning the filtration function directly rather than fixing it a priori? For example, one could parameterise the filtration with trainable weights that adapt during training, similar to Graph Filtration Learning (Hofer et al., 2020). If so, what are the main challenges in integrating such a module with TopoScan, and could it potentially improve generalisation across datasets?
3. Sensitivity to TopoScan parameters
The results depend on the number of slices N and the window width m used in TopoScan. Could the authors provide a sensitivity analysis or heuristic guideline for selecting these parameters? Understanding whether performance is robust to parameter variation would increase confidence in the stability of the framework.
4. Attention interpretability and visualisation
Can the authors show examples of which graph substructures the transformer attends to when processing the topological token sequences? For instance, does the attention mechanism focus on regions corresponding to high curvature, cycles, or clusters? Such visual evidence would strengthen the claim that TopoFormer is both interpretable and topology-aware.
5. Broader evaluation and generalisability
Would the authors consider evaluating TopoFormer on non-molecular heterogeneous datasets such as citation or temporal graphs? Since TopoScan is proposed as a general representation, results on more diverse graph types could demonstrate broader applicability and robustness.
6. Quantitative runtime and scalability analysis
The paper claims that TopoScan avoids the heavy computational cost of persistent homology. Could the authors provide explicit runtime benchmarks comparing TopoFormer with PH-based baselines like PersLay or TopoGCL on large datasets? This would substantiate the claim of improved efficiency.
7. Relation to existing topological and transformer models
How does TopoFormer conceptually differ from recent hybrid approaches such as TopoGCL (Zhao et al., 2023) or Graphormer (Ying et al., 2021)? A more explicit discussion of what new design principle TopoFormer introduces beyond these works would help clarify its unique contribution to the literature.
8. Empirical validation of the stability theorem
Can the authors provide an experiment demonstrating the stability property empirically, for example by perturbing the graph structure or filtration and measuring the variation in output embeddings or predictions? Such a demonstration would connect the theoretical result with observable robustness in practice.
Fully AI-generated

TopoFormer: Topology Meets Attention for Graph Learning
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces TOPOFORMER, a framework for graph-level representation learning. The core contribution is "Topo-Scan," a module that decomposes a graph into an ordered sequence of topological tokens.
The authors claim that standard Persistent Homology (PH) suffers from an "early saturation" problem on graphs. They propose Topo-Scan as a lightweight solution. Instead of the standard cumulative filtration, Topo-Scan uses a simple "sliding window" to "slice" the graph. For each slice, it computes basic invariants (Betti-0, Betti-1). This process creates an ordered sequence fed into a standard Transformer.
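To make the contrast concrete, here is a toy illustration (not the paper's experiment; the graph, filtration, and thresholds are arbitrary choices of mine) of cumulative sublevel sets versus overlapping slices, tracked via Betti-0:

```python
# Toy illustration (not the paper's experiment) of cumulative sublevel sets vs.
# overlapping slices; the graph, filtration, and thresholds are arbitrary choices.
import networkx as nx
import numpy as np

G = nx.random_geometric_graph(60, 0.18, seed=0)
f = dict(G.degree())                                      # degree filtration
a = np.quantile(list(f.values()), np.linspace(0, 1, 11))  # thresholds a_0..a_10

cumulative, windowed = [], []
for i in range(10):
    # Standard sublevel set {v : f(v) <= a_{i+1}}: Betti-0 tends to stabilize
    # early, once most nodes have joined a giant component ("early saturation").
    V_cum = [v for v in G.nodes if f[v] <= a[i + 1]]
    cumulative.append(nx.number_connected_components(G.subgraph(V_cum)))
    # Sliding window {v : a_i <= f(v) <= a_{i+1}}: the count keeps changing
    # across the whole range, so later slices remain informative.
    V_win = [v for v in G.nodes if a[i] <= f[v] <= a[i + 1]]
    windowed.append(nx.number_connected_components(G.subgraph(V_win)))

print("cumulative Betti-0:", cumulative)
print("windowed Betti-0:  ", windowed)
```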
The authors claim this simple method solves the "early saturation" problem and achieves SOTA results on graph classification and molecular property prediction.
The paper is well organized and easy to follow. It also provides extensive experimental results that help characterize the empirical behavior of the method. The figures and tables are carefully presented. Additionally, the paper is well motivated, clearly identifying the "early saturation" problem, and its core originality comes from combining topological slicing with a standard Transformer architecture.
The paper's central claims are undermined by a significant gap between its motivation and its empirical results. The core contribution (Topo-Scan) is a conceptually simple modification, but the experiments fail to provide statistically significant evidence that this modification offers a meaningful advantage over the standard methods it claims to improve upon.
1. The core idea is trivial and lacks justification: The paper's main technical contribution, Topo-Scan, replaces the standard cumulative sublevel filtration $V_i = \{v : f(v) < a_i\}$ with a slicing window $V_i = \{v : a_i < f(v) < a_{i+m}\}$. This is a conceptually simple modification.
2. Though the authors justify the contribution of solving early saturation by showing an empirical Betti-0 comparison over increasing thresholds, the trends only differ at a very late stage (the last 10 steps on IMDB-B and only from roughly the 72nd to the 82nd step on IMDB-M). The relationship between this phenomenon and the eventual performance on property prediction tasks is unclear. Table 5 is the key experiment meant to provide justification, comparing TOPOFORMER (Topo-Scan + Transformer) directly against "PH-TR" (standard PH + Transformer). However, the paper's own data shows no statistically significant improvement.
For example, on IMDB-B, where Figure 3 implies that early saturation is addressed by Topo-Scan, the performance with the HKS filtration function is 76.8 ± 3.97 for PH-TR and 77.9 ± 5.72 for TOPOFORMER. The second confidence interval (roughly [72.2, 83.6]) almost entirely contains the first (roughly [72.8, 80.8]), which gives no evidence that Topo-Scan is better than standard PH.
3. The comparison on molecular property prediction is misleading: The results in Table 4 are also based on an unfair comparison. The model used is "TOPOFORMER*", which is a hybrid model that fuses ECFP fingerprints. However, not all of the cited methods use this additional feature, so the comparison exaggerates the advantage. It would be natural to ask whether PH-TR + ECFP gives similar results, but this is not provided either.
4. The paper claims to be "lightweight" and "efficient". However, Appendix C.3 (Table 10) shows that the preprocessing for the SOTA results (using Ollivier-Ricci) takes 339 minutes (>5.5 hours) on OGBG-MOLHIV. This is still a significant cost and limits the possibility of applying the method to larger-scale datasets.
1. The SOTA claim in Table 4 relies on TOPOFORMER* (which includes ECFP features), while many baselines do not. This is an unfair comparison. To justify the method, can the authors provide results for a "PH-TR + ECFP" baseline?
2. The paper's core efficiency claim is that Topo-Scan achieves "multi-fold runtime... gains" by avoiding PH's "global boundary-matrix reductions". Can the authors provide experimental runtime data for this specific claim?
Fully human-written

TopoFormer: Topology Meets Attention for Graph Learning
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper proposes a sliding-window alternative to the standard sublevel-set filtration for the computation of persistent homology. The resulting sequence of simplicial complexes (graphs in this case) necessarily forms an *ordered* sequence, which can be processed with a transformer architecture for downstream tasks.
The sequence of graphs is summarized by a set of topological features (Betti numbers and node/edge counts) and used as the input tokens for a transformer architecture, which is a natural choice.
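For what it is worth, a minimal sketch of how such per-slice feature vectors could be fed to a transformer; the dimensions, pooling, and positional embedding are my assumptions, not the authors' exact architecture:

```python
# Minimal sketch, assuming 4 features per slice and mean pooling over slices;
# dimensions and positional handling are illustrative, not the authors' design.
import torch
import torch.nn as nn

class TopoTokenEncoder(nn.Module):
    def __init__(self, n_features=4, d_model=64, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)   # embed each topological token
        self.pos = nn.Embedding(max_len, d_model)    # position = index of the slice
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)            # e.g. binary graph classification

    def forward(self, tokens):                       # tokens: (batch, n_slices, n_features)
        idx = torch.arange(tokens.size(1), device=tokens.device)
        x = self.proj(tokens) + self.pos(idx)        # token embedding + slice position
        x = self.encoder(x)                          # self-attention over slices
        return self.head(x.mean(dim=1))              # mean-pool to a graph-level logit

# Toy usage: a batch of 8 graphs, each summarized by 10 slices of 4 features.
print(TopoTokenEncoder()(torch.randn(8, 10, 4)).shape)  # torch.Size([8, 2])
```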
Overall, the idea, though perhaps simple, is well thought out, natural, elegantly executed, and thoroughly evaluated with great care for detail. The code is provided, well documented, and easily accessible, which is great to see! The reviewer believes that the use of a sliding window for the computation of persistent homology (though no longer persistent per se) is interesting and novel, which is encouraging given the results on the benchmark datasets. This perspective is also interesting beyond the scope of the specific choices made in the paper.
Overall I am very positive about this work. The first of two minor weaknesses is the choice of vectorization scheme (a more expressive statistic could be chosen instead of $\beta_{0}$ or $\beta_1$); adding a discussion there would be good.
The second minor weak point is that a discussion of the current limitations is missing, and some remarks on future work and the current challenges would be beneficial to readers. For instance, most methods in TDA have been developed specifically for filtrations and would no longer hold in this scenario.
- Choice of vectorization scheme, as three hard-coded numbers. Have you considered using either learned filter functions or more elaborate vectorizations such as persistence images?
- What motivated you to use the Ollivier–Ricci curvature and Heat Kernel Signature? Have you considered other methods as well, such as learning the filtrations?
Fully human-written

TopoFormer: Topology Meets Attention for Graph Learning
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This work introduces TOPOFORMER, a graph representation framework that integrates topological structure into Transformer architectures. The core component, Topo-Scan, generates compact sequential topological tokens by sliding over node/edge filtrations and extracting slice-wise invariants (Betti numbers, node/edge counts). This avoids traditional persistent homology pipelines and costly persistence diagram computations. The authors provide stability guarantees, demonstrate scalability, and report state-of-the-art results on graph classification and molecular property prediction benchmarks.
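For reference, the slice-wise token presumably takes a form along the following lines (the exact ordering and any normalization of the entries are assumptions, not confirmed by the paper):

```latex
% Assumed token form, based on the invariants listed above (Betti numbers and
% node/edge counts); ordering and normalization are not confirmed by the paper.
\[
t_i \;=\; \bigl(\,\beta_0(G_i),\; \beta_1(G_i),\; |V_i|,\; |E_i|\,\bigr),
\qquad i = 1, \dots, N,
\]
where $G_i = (V_i, E_i)$ is the subgraph induced by the $i$-th window of the filtration.
```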
1. The paper proposes a framework for injecting TDA signals into Transformers without persistence diagrams; it clearly articulates the limitations of classical PH pipelines (saturation, vectorization design burden) and motivates a practical alternative.
2. The paper provides stability guarantees for the proposed topological sequences, linking back to interlevel persistence.
3. The proposed method avoids heavy PH computation and vectorization. It emphasizes parallelizable slice computation with predictable runtime.
1. Although the paper emphasizes advantages over classical persistent homology pipelines, the empirical comparison against methods that incorporate learnable filtrations remains limited (e.g., Horn et al., 2021; Immonen et al., 2024).
2. While the proposed Topo-Scan framework leverages fixed scalar functions to generate topological slices, the methodology may still exhibit sensitivity to the choice of these filtration signals. The current study provides limited analysis of this dependency, and a deeper investigation, particularly one involving learnable or data-driven filtration functions, could strengthen the robustness and interpretability claims.
3. The approach demonstrates competitive performance across multiple graph classification and molecular property prediction benchmarks. However, on large-scale datasets such as OGB-MOLHIV, the framework does not achieve the top reported results, suggesting potential room for improvement in scaling to very large graphs or addressing complex molecular tasks relative to specialized architectures.
4. A few relevant contributions to learning persistent-homology-based representations, e.g., [1, 2], are not included in the references.
[1] de Surrel et al. "RipsNet: a general architecture for fast and robust estimation of the persistent homology of point clouds." ToGL 2022.
[2] Yan et al. "Neural approximation of graph topological features." NeurIPS 2022.
1. Could Topo-Scan be extended to learn filtration functions end-to-end? How would stability guarantees extend to such a case?
2. For large-scale graphs (e.g., OGB-MOLHIV, REDDIT), what are memory/time trade-offs compared to graph Transformers with pooling?
3. Beyond Betti numbers, would incorporating additional topological descriptors (e.g., persistence landscapes) meaningfully improve results?
Fully AI-generated

TopoFormer: Topology Meets Attention for Graph Learning
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces TopoFormer, a framework for injecting topological information into graph Transformers by turning a graph plus a filtration into a short, fixed-length sequence of “topology tokens.” Instead of running standard persistent homology and then vectorizing diagrams, the authors propose Topo-Scan: they slide overlapping windows over a node/edge filtration and, for each window, record four inexpensive quantities. Multiple filtrations (degree, HKS, Ollivier–Ricci) produce multiple sequences, which are encoded independently and fused to create a final graph representation. A tailored stability result shows these interlevel Betti sequences change in a controlled way under small perturbations of the filtration, which helps justify the construction.
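As I read it, the fusion step amounts to something like the following sketch; the concatenation-based fusion and all dimensions are my assumptions, not necessarily the paper's design:

```python
# Sketch of "encode each filtration's token sequence independently, then fuse";
# the concatenation fusion and all dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class MultiFiltrationFusion(nn.Module):
    def __init__(self, n_filtrations=3, n_features=4, d_model=64):
        super().__init__()
        # One projection + Transformer encoder per filtration
        # (e.g. degree, HKS, Ollivier-Ricci).
        self.projs = nn.ModuleList(
            nn.Linear(n_features, d_model) for _ in range(n_filtrations))
        self.encoders = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2)
            for _ in range(n_filtrations))
        self.classifier = nn.Linear(n_filtrations * d_model, 2)

    def forward(self, sequences):                 # list of (batch, n_slices, n_features)
        pooled = [enc(proj(seq)).mean(dim=1)      # encode, then mean-pool each sequence
                  for proj, enc, seq in zip(self.projs, self.encoders, sequences)]
        return self.classifier(torch.cat(pooled, dim=-1))   # fuse by concatenation

# Toy usage: three filtrations, each yielding 10 four-dimensional tokens per graph.
out = MultiFiltrationFusion()([torch.randn(8, 10, 4) for _ in range(3)])
print(out.shape)  # torch.Size([8, 2])
```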
Empirically, the method is evaluated on 9 graph classification datasets and 7 molecular property prediction tasks. It achieves first or second place on most small/medium benchmarks and remains competitive on the larger OGBG-MOLHIV (refer to questions). The ablations are well designed: they hold filtrations fixed and swap in PH+MLP, PH+Transformer, and the proposed Topo-Scan, which cleanly attributes gains to the sequence-based topological view rather than to feature choice. The authors also report concrete extraction times and argue their approach avoids the “early saturation” commonly seen in PH.
I find the core idea clean and promising. The proposed method addresses real challenges in applying TDA to graphs (global PH computation and saturation), and it does so by simplifying in a well-motivated fashion. The resulting sequences naturally fit Transformer architectures, and I think this is a clean and reusable concept.
The comparisons to PH+MLP and PH+Transformer under the same filtrations isolate the benefit of the proposed representation rather than conflating it with different signal sources. This supports the central claim.
The empirical results are strong even though some common benchmarks are surprisingly missing. The model is tested on many graph classification datasets and several molecular property prediction tasks, and it is consistently at or near the top on the small/medium ones, showing the idea is not tuned to a single benchmark.
**Architecture**
The part of the paper that introduces the "new" architecture for using multiple filtrations and then combining them at the end is poorly motivated. It is also not clear whether you consider this just something you tried or an important part of the contribution of this submission.
Table 12 does not show any convincing evidence that using multiple filtrations in this fashion has any real benefit over using only one filtration. A discussion of this would be important and appropriate where the ablation is mentioned in the main body.
Moreover, the description of the architecture suffers from some over-the-top phrasing. For instance: "To enhance generalization and mitigate overfitting, we integrate regularization techniques, such as dropout and weight decay, ensuring robustness across diverse graph learning tasks."
For one, these techniques of course do not "ensure" anything; additionally, they are well established, and it should be made clear that they are standard.
**Experimental**
The experiments focus on small benchmarks, and none of the commonly studied benchmarks of [1] are considered. Especially in the graph transformer literature, these are among the most standard and widely analyzed benchmarks, so it is somewhat disappointing to see them missing.
On ogbg-molhiv, Table 3 does not reflect the state of the art on the dataset; cf. https://ogb.stanford.edu/docs/leader_graphprop/#ogbg-molhiv.
**Minor Notes:**
Please order the columns in Tables 2, 5, and 6 consistently.
[1] Dwivedi, Vijay Prakash, et al. "Benchmarking graph neural networks." Journal of Machine Learning Research 24.43 (2023)
At the end of Section 4.3 you mention that this architecture is particularly well suited for Graph Foundation Models. What makes you claim this? I do not see anything empirical to back this up. Simply working well on multiple scales (which is also only demonstrated in a limited way) is not a sufficient reason to make such a claim.
How did you decide on the list of models in Table 3? It seems to leave out many top-performing models from the OGB leaderboard (link above).
Are the results in Tables 2 and 3 all for the exact same filtration setup (OR curvature + HKS for Table 2, atomic weight + OC curvature for Table 3)?
Fully human-written |