|
FastVGGT: Fast Visual Geometry Transformer |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper aims to solve the computational bottleneck of the SOTA model VGGT when processing long image sequences. The authors identify the bottleneck as computational redundancy in the Global Attention layer, a phenomenon they term "token collapse." To address this, the paper proposes FastVGGT, a training-free token merging framework. Its core is a token partitioning strategy specifically tailored for 3D geometry tasks (preserving the reference frame, retaining salient tokens, and uniform sampling). Experiments show that this method achieves a 4x speedup when processing 1000-frame sequences while simultaneously improving reconstruction accuracy and reducing error accumulation.
The practical value of this paper is extremely high. It proposes a training-free, plug-and-play method that solves a critical performance bottleneck in the SOTA model VGGT. Achieving up to 4x inference speedup makes large-scale 3D reconstruction more feasible for practical applications.
Most impressively, the method not only accelerates inference but also significantly improves accuracy on long sequences. By mitigating error accumulation, it outperforms the baseline VGGT on both pose estimation and 3D reconstruction tasks, which is a very strong and non-trivial contribution.
The method demonstrates good originality. Its three-part token partitioning strategy (preserving the reference frame, retaining salient tokens, uniform sampling) is specifically designed for the characteristics of 3D geometry tasks (like cross-frame correspondence and a global coordinate system), rather than just being a simple application of standard 2D token merging.
W1: Insufficient discussion of the performance trade-off on short sequences. The paper's strategy and hyperparameters (e.g., 90% merge ratio) are clearly optimized for long sequences (e.g., 500-1000 frames). However, on short sequences (100 frames), the performance is mixed, even performing slightly worse than the baseline on some key metrics (e.g., RPE-rot and RPE-trans in Table 5). The authors acknowledge this in Appendix E, but a deeper analysis is missing from the main experimental section.
W2: Disconnect between motivation and implementation for "salient token" selection. The motivation for "salient tokens" is described as preserving unique, "keypoint-like" features, and the paper mentions that a "top-k based on token norms" strategy was considered. However, a simple "fixed-stride sampling" was ultimately used for efficiency. The claim that both are "comparably accurate" is unsubstantiated (no data in ablations), which creates a disconnect between the final design and its original motivation (to preserve the most unique features).
W3: Robustness to "reference frame" quality is questionable. The entire method relies heavily on the first frame as the "world coordinate system" and global anchor, designating all its tokens as high-priority dst tokens. The paper does not discuss at all how the model would perform if this first frame is of poor quality (e.g., blurry, occluded, or unrepresentative of the scene). This reliance on a single, arbitrarily chosen frame could be a robustness liability.
W4: Evidence for "token collapse" is primarily qualitative. The paper's core motivation, the "token collapse" phenomenon, is argued mainly via visualizations of attention maps in Figure 3. While these maps look similar, this is only a qualitative observation. The paper lacks a more rigorous quantitative analysis (e.g., calculating average cosine similarity or feature variance between tokens) to prove the degree of redundancy, which weakens the force of its core motivation (a minimal measurement sketch follows this list).
W5: Justification for the fixed-stride sampling strategy. Using "fixed-stride sampling" to select "salient tokens" seems like an arbitrary choice. "Keypoints" in a scene are unlikely to be uniformly distributed on an image grid. This sampling method runs counter to the motivation of preserving the "most unique" features and is potentially less effective than the top-k strategy mentioned in W2 (a sketch contrasting the two selection rules also follows this list).
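To make the measurement requested in W4 (and Q4 below) concrete, here is a minimal sketch of one way to quantify token redundancy per Global Attention block; the hook-based feature capture, tensor shapes, and names are this reviewer's assumptions, not the authors' pipeline.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(tokens: torch.Tensor, sample: int = 512) -> float:
    """Average off-diagonal cosine similarity over a random subset of tokens.

    tokens: (N, D) features captured (e.g. via a forward hook) at one Global
    Attention block. Values approaching 1.0 would quantify "token collapse".
    """
    if tokens.shape[0] > sample:
        tokens = tokens[torch.randperm(tokens.shape[0])[:sample]]
    x = F.normalize(tokens.float(), dim=-1)      # unit-norm rows
    sim = x @ x.T                                # (S, S) cosine matrix
    n = sim.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()  # drop self-similarity
    return (off_diag / (n * (n - 1))).item()

# Hypothetical usage: plot this value over block depth to check whether
# redundancy grows in deeper layers, e.g.
# curve = [mean_pairwise_cosine(block_features[b]) for b in range(num_blocks)]
```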
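Similarly, to pin down what W2/W5 and Q2 are contrasting, a minimal sketch of the two salient-token selection rules as this reviewer understands them; the function names and signatures are illustrative, not the paper's implementation.

```python
import torch

def salient_topk_by_norm(tokens: torch.Tensor, k: int) -> torch.Tensor:
    """The 'keypoint-like' variant mentioned in Section 3.2: keep the k tokens
    with the largest feature norm. tokens: (N, D) -> indices of shape (k,)."""
    return tokens.norm(dim=-1).topk(k).indices

def salient_fixed_stride(num_tokens: int, stride: int) -> torch.Tensor:
    """The variant the paper adopts for efficiency: keep every `stride`-th
    token on the flattened grid -> indices of shape (ceil(num_tokens/stride),)."""
    return torch.arange(0, num_tokens, stride)

# The ablation asked for in Q2 would run both selectors at the same effective
# keep ratio and report CD/ATE together with wall-clock inference time.
```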
Q1 : You determined an aggressive 90% merge ratio starting from Block 0, based on 500-frame sequences (Table 7). Is this fixed hyperparameter combination optimal for all tested sequence lengths (100, 300, 1000 frames)? Could the slight performance drop on 100-frame sequences (Table 5 RPE) be resolved by using a more conservative (lower) merge ratio? Does this suggest that an adaptive merge strategy (e.g., merge ratio $r$ as a function of sequence length $N$) would be a better choice?
Q2 : Could you please provide a specific quantitative comparison of accuracy (CD/ATE) and inference time between (a) the top-k based on token norms strategy and (b) the final fixed-stride sampling strategy for "salient token" selection? (You claimed they were "comparably accurate" in Section 3.2). This is crucial for justifying your final design choice.
Q3 : Your method is highly dependent on using the first frame as a global reference. Did you test the method's robustness to the quality of this first frame? For example, how much does FastVGGT's performance degrade if the first frame is blurry or heavily occluded? Have you considered alternative reference selection strategies, such as choosing the sharpest frame in the sequence (via some quality metric) as the reference, rather than blindly using the first one?
Q4 : For your "token collapse" motivation, did you perform any quantitative measurements beyond the qualitative visualizations in Figure 3 to confirm this phenomenon? For example, by calculating the average cosine similarity between different token feature vectors in the Global Attention layers and showing that it increases in deeper layers? This would provide stronger support for your motivation.
Q5 : The most exciting result in the paper is the improvement in accuracy on long sequences, which you attribute to "mitigating error accumulation." Could you please explain mechanistically, in more detail, why removing tokens (even redundant ones) helps prevent error propagation? My hypothesis is: by merging redundant information in early layers (starting from Block 0), the model is forced to form a more compact and robust global representation, which prevents "over-thinking" on noise and redundant features in later layers, ultimately reducing drift. Do you agree with this hypothesis, or do you have other insights? |
Fully AI-generated |
|
FastVGGT: Fast Visual Geometry Transformer |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper concerns an efficient extension to VGGT, targeting the attention bottleneck in the original design. The authors perform a thorough analysis and find redundancy in the token representations. The authors call this "token collapse", where the attention similarities for clusters of tokens become more or less indistinguishable, i.e., a clear redundancy. The authors exploit this by implementing structural merging via an explicit partitioning strategy. While the underlying ideas for reducing the complexity are not entirely new, the paper introduces specific mechanisms that seem well aligned with the objective. The authors highlight that their method achieves a $4\times$ speedup on 1000-image sequences, all while maintaining reconstruction quality.
1. The paper solves an important computational challenge with VGGT training and inference, which is of significant interest.
2. The methodology and experimental setup are well designed, clearly written, and come across as technically competent engineering.
3. The technical contribution is intuitive and well aligned with the goals the authors establish. Redefining the merging cycle for multi-view consistency, anchoring to the first frame, and protecting salient tokens come across as thoughtful and insightful modelling choices.
On the whole, the contributions are well motivated and the paper comes across as well presented, with clear applications for a prominent contemporary modelling framework.
1. While the work solves an important problem with VGGT, it is somewhat incremental, though in a technically competent and practically valuable sense. This is a minor, high-level weakness, but it has some implications, detailed below.
2. The proposed merging mechanism is presented as a novel architectural contribution, but has close parallels in existing literature. The authors mention the cosine-based assignment and replication in ToMeSD [1], an extension of ToMe [2] to diffusion models (a minimal sketch of this assignment is given after the reference list below). However, TokenLearner [3], SPiT [4], and ATC [5] are other works that apply similar mechanisms for perceptual grouping and exploiting redundancy, explicitly setting out to capture partitions with connectivity constraints in 2D images. More tangentially, the averaging and landmark selection procedure can be said to correspond to Nyström-like kernel compression and feature sketching [6; 7], and can also arguably be related to general low-rank approximation. This reviewer emphasises that while the paper's integration of these techniques into 3D feed-forward geometry transformers is valuable, the current phrasing makes the work come across as a somewhat *isolated discovery*. In this reviewer's opinion, the contribution would be clearer if the authors explicitly positioned it within the existing body of work on the topic.
3. Relatedly, an analysis of how the constraints used in the work (e.g. first-frame anchoring, salient-token protection) differ from or synergise with existing works would benefit the field of token merging. Discussing how and why "geometry-aware" settings align with or demand this specific adaptation would also be an interesting addition and strengthen the work, particularly if the work is contrasted with existing methods.
4. The paper is a little sparse in discussing the inevitable failure cases and practical limitations, as well as in providing intuition for hyperparameters (e.g. the merging ratio). Merging naturally induces a loss of information, and failure cases (if applicable) would be neither unnatural nor surprising given the improved efficiency of the approach.
5. Certain notation could be clarified slightly. L314 has examples of "src" and "dst" written out in plain text, which makes the notation hard to parse within the prose. This is a minor weakness, but addressing it would improve the readability of the manuscript.
---
[1] [Token Merging for Fast Stable Diffusion, Bolya & Hoffman, 2023](https://arxiv.org/abs/2303.17604)
[2] [Token Merging: Your ViT But Faster, Bolya et al., 2022](https://arxiv.org/abs/2210.09461)
[3] [TokenLearner: Adaptive Space-Time Tokenization for Videos, Ryoo et al., 2021](https://openreview.net/forum?id=z-l1kpDXs88)
[4] [A Spitting Image: Modular Superpixel Tokenization in Vision Transformers, Aasan et al., 2024](https://arxiv.org/abs/2408.07680)
[5] [Agglomerative Token Clustering, Haurum et al., 2024](https://arxiv.org/abs/2409.11923)
[6] [Using the Nyström Method to Speed Up Kernel Machines, Williams & Seeger, 2001](https://papers.nips.cc/paper_files/paper/2000/hash/19de10adbaa1b2ee13f77f679fa1483a-Abstract.html)
[7] [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention, Xiong et al., 2021](https://arxiv.org/abs/2102.03902)
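For reference, a minimal sketch of the ToMe/ToMeSD-style cosine-based bipartite assignment referred to in point 2 above; this follows the generic mechanism of [1, 2] (with a plain average rather than ToMe's size-weighted mean) and is not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def bipartite_merge(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Generic ToMe-style bipartite soft matching: merge the r most redundant
    src tokens into their most similar dst token. tokens: (N, D) -> (N - r, D)."""
    src, dst = tokens[0::2], tokens[1::2]                       # alternating split
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T
    best_sim, best_dst = sim.max(dim=-1)                        # closest dst per src
    merge_idx = best_sim.topk(r).indices                        # r most redundant src
    keep = torch.ones(src.shape[0], dtype=torch.bool)
    keep[merge_idx] = False
    dst = dst.clone()
    for i in merge_idx.tolist():                                # plain average here;
        j = best_dst[i]                                         # ToMe proper tracks
        dst[j] = (dst[j] + src[i]) / 2                          # merged-token sizes
    return torch.cat([src[keep], dst], dim=0)
```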
1. The authors use the term “geometry-aware” liberally, but the mechanism operates entirely in feature space. Anchoring to the first frame and preserving salient tokens is plausible, but not proven to maintain geometric invariants. This reviewer is left wondering what the authors mean when stating that the model is "geometry-aware", and whether the authors consider it worthwhile to define this more clearly in the manuscript.
2. The method’s efficiency stems from token merging, yet comparisons are made only to 3D reconstruction baselines, not to efficient-transformer methods like ToMe, Nyströmformer, or even Performer [8]. The question, then, is: why not compare against generic attention approximations? Of course, this may not be trivial, but a discussion around why such a comparison is not included would be prudent.
3. The merging ratio is treated as a fixed hyperparameter with minimal analysis. Do the authors have any insights to share with practitioners who would like to tune this in more specific applications?
4. The paper reports mean metrics but is missing a failure analysis. Are errors gradual or catastrophic? Do high-parallax / low-texture regions induce failure compared to the original? Does merging induce smoothing, or loss of information, as one would expect?
While not so much a question, this reviewer emphasises that the manuscript would be strengthened by extending the related work section to include previous works on partitioning for reducing token redundancy.
---
[8] [Rethinking Attention with Performers, Choromanski et al., 2020](https://arxiv.org/abs/2009.14794) |
Fully human-written |
|
FastVGGT: Fast Visual Geometry Transformer |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes FastVGGT, a method to speed up Visual Geometry Transformers (VGGT) when handling long image sequences.
The authors find that the Global Attention layer causes heavy computation. To solve this, they introduce a training-free token pruning framework that keeps only the most useful tokens.
FastVGGT uses three steps: keeping the first-frame tokens for stability, preserving important tokens for details, and using region-based random sampling for balanced coverage.
Experiments show that FastVGGT achieves up to 4× faster inference and reduces error accumulation, making it more efficient and robust for long-sequence 3D vision tasks.
1. The paper tackles an important and practical problem — improving the inference efficiency of VGGT on long image sequences.
2. The proposed training-free token pruning strategy is simple yet effective. It significantly reduces computational cost while maintaining nearly the same accuracy.
3. The identification of the “token collapse” phenomenon in VGGT’s Global Attention is insightful.
4. The paper is clearly written and well-structured, with intuitive motivation and clear methodology.
1. The improvement is very limited. As shown in Tables 2, 3, and 4, FastVGGT shows almost no accuracy gain compared to VGGT*, and in some cases even performs slightly worse. This weak improvement reduces the overall significance of the contribution.
2. A straightforward expectation is that FastVGGT should reduce memory usage during long-sequence inference by pruning redundant tokens. However, the paper does not provide any experimental evidence or quantitative analysis on memory consumption. As a result, it remains unclear whether FastVGGT actually achieves significant memory savings when performing global attention over long sequences (a minimal measurement sketch follows this list).
3. Comparisons are limited. Most baselines are just VGGT variants. In Table 4 the authors report Pi3 as OOM; it would be more informative to implement an optimized variant of Pi3 (analogous to VGGT*) so a direct, fair comparison is possible.
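A minimal sketch of the measurement point 2 asks for, assuming a standard PyTorch setup; `model` and `frames` are placeholders for the VGGT/FastVGGT variant and a preprocessed input sequence.

```python
import torch

def peak_memory_gb(model, frames):
    """Peak GPU memory (GB) for one forward pass. `model` and `frames` are
    placeholders for the VGGT/FastVGGT variant and a preprocessed sequence."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(frames)
    return torch.cuda.max_memory_allocated() / 1024**3

# Reporting this for VGGT* vs. FastVGGT at 100/300/500/1000 frames would
# directly show whether token merging also shrinks the memory footprint.
```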
1. In Tables 2, 4, and 5, the meaning of the bold numbers is unclear. If bold font is intended to indicate the best performance, there are several cases where FastVGGT performs worse than VGGT* but is still highlighted. This inconsistency creates confusion and raises concerns about the fairness and clarity of result presentation.
2. The claimed advantage in handling long sequences is not clearly demonstrated. For example, in Table 2, the reconstruction quality with 1000 frames is even worse than with 500 frames, which contradicts the paper’s stated goal of improving long-sequence performance. This raises doubts about whether FastVGGT truly provides benefits for very long input sequences. |
Moderately AI-edited |
|
FastVGGT: Fast Visual Geometry Transformer |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
FastVGGT accelerates VGGT's 3D reconstruction inference by applying training-free token merging to the Global Attention bottleneck. Through strategic token partitioning—preserving reference frames, retaining salient tokens, and uniform sampling—it achieves 4× speedup on 1000-image inputs while maintaining reconstruction quality and mitigating error accumulation in long sequences.
1. The method achieves a substantial 4× speedup on 1000-image inputs while maintaining or even improving reconstruction quality, and effectively mitigates error accumulation in long-sequence scenarios. This addresses a critical bottleneck in feed-forward 3D reconstruction models and demonstrates clear real-world applicability.
2. The approach requires no model retraining or fine-tuning, making it immediately deployable to existing VGGT systems. The comprehensive experiments across multiple benchmarks (ScanNet, 7 Scenes, NRGBD) validate its effectiveness and generalizability across diverse 3D reconstruction tasks.
1. This paper suffers from insufficient experimental details and lacks fair comparisons. Critical hyperparameters are inadequately documented: while the ablation study (Table 7) examines merging ratios of 0.3, 0.6, and 0.9, the main experiments (Tables 2-5) do not explicitly state which configuration was used.
2. More problematically, Table 1 presents baseline comparisons with naive merging strategies (random and fixed-stride sampling) but under inconsistent experimental conditions. The random sampling uses merging ratios (r=0.5, 0.8) while fixed-stride sampling uses stride values (s=5, 8); these settings do not align with any configuration of the proposed method. Furthermore, what is the testing data? The resolution? The number of input frames? The paper provides no explanation for why certain configurations were chosen or how they relate to the proposed method's settings. This makes it nearly impossible to assess whether the improvements stem from the proposed partitioning strategy or simply from different hyperparameter choices. A fair comparison would require evaluating all methods under equivalent compression rates with clearly documented settings.
3. The technical contribution of this paper is limited. The core proposal consists of three token partitioning strategies: (1) designating first-frame tokens as destinations, (2) retaining high-norm tokens as salient, and (3) applying region-based random sampling (a minimal sketch of this partitioning is given below). While these choices are reasonable for 3D reconstruction tasks, they represent relatively straightforward engineering adaptations rather than fundamental algorithmic innovations. The token merging mechanism itself is directly borrowed from existing 2D vision methods (ToMe, ToMeSD) without architectural modifications. The main contribution appears to be identifying where to apply existing techniques rather than developing novel methods.
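To make the engineering scope of point 3 concrete, a minimal sketch of the three partitioning rules as described above; the index bookkeeping and names are this reviewer's reading, not the authors' code, and per-region stratification of the random sampling is omitted for brevity.

```python
import torch

def partition_dst_src(num_frames: int, tokens_per_frame: int,
                      token_norms: torch.Tensor, k_salient: int, n_sampled: int):
    """Split flattened token indices into dst (kept as merge targets) and src
    (candidates to be merged): (1) all first-frame tokens, (2) the k_salient
    highest-norm tokens, (3) n_sampled randomly sampled remaining tokens.
    token_norms: (num_frames * tokens_per_frame,)."""
    total = num_frames * tokens_per_frame
    dst = set(range(tokens_per_frame))                         # (1) reference frame
    dst |= set(token_norms.topk(k_salient).indices.tolist())   # (2) salient tokens
    rest = torch.tensor([i for i in range(total) if i not in dst])
    picked = rest[torch.randperm(len(rest))[:n_sampled]]       # (3) random sampling
    dst |= set(picked.tolist())
    src = [i for i in range(total) if i not in dst]
    return sorted(dst), src
```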
See Weaknesses. |
Moderately AI-edited |