|
Generative Point Tracking with Flow Matching |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces GenPT, a generative point tracker that models multi-modal trajectories for long-range point tracking in videos. Unlike discriminative trackers that regress a single mean and thus struggle under occlusions or appearance changes, GenPT trains a likelihood-based model with flow matching and three key modifications: (i) iterative refinement within each step, (ii) a window-dependent prior, and (iii) a variance schedule tailored to point coordinates. GenPT achieves competitive visible-point accuracy and state-of-the-art occluded-point accuracy.
1. This paper introduces the first generative point tracker trained using a modified flow-matching objective for trajectories, extending generative modeling concepts to the task of point tracking.
2. The authors design three key modules: iterative refinement, window-dependent prior, and variance schedule. These components are well-motivated and thoroughly ablated.
1. Point tracking is inherently a deterministic problem, so a multi-modal approach may not be well-suited for this task.
2. The improvements of this model mainly target occluded points. However, the objective used in CoTracker3 and similar models is typically $L = \text{Huber\_loss}(\text{predicted point}, \text{ground truth point}) \times \text{is\_visible\_gt}(\text{this point})$. In other words, these models are not explicitly trained to predict occluded points, so the comparison on occluded points may favor GenPT by construction.
3. The greedy search strategy requires running the algorithm five times, which makes it computationally expensive and time-consuming.
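To make the distinction in weakness 2 concrete, here is a minimal NumPy sketch of a visibility-masked Huber loss next to its unmasked variant. The function names, the array shapes, and the normalization by the number of contributing points are assumptions made for illustration; this is not CoTracker3's actual training code.

```python
import numpy as np

def huber(err, delta=1.0):
    """Elementwise Huber penalty on non-negative error magnitudes."""
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quadratic, linear)

def tracking_loss(pred, gt, visible, mask_occluded=True, delta=1.0):
    """Mean Huber loss over points (hypothetical helper).

    pred, gt: float arrays of shape (T, N, 2)
    visible:  boolean array of shape (T, N), ground-truth visibility
    """
    # Euclidean distance between predicted and ground-truth positions.
    err = np.linalg.norm(pred - gt, axis=-1)
    per_point = huber(err, delta)
    if mask_occluded:
        # Masked objective: occluded points contribute nothing.
        weights = visible.astype(per_point.dtype)
    else:
        # Unmasked objective: every point is supervised.
        weights = np.ones_like(per_point)
    # Assumed normalization: average over contributing points.
    return float((per_point * weights).sum() / max(weights.sum(), 1.0))

# One frame, two points: the first is visible, the second is occluded.
pred = np.zeros((1, 2, 2))
gt = np.array([[[0.5, 0.0], [3.0, 0.0]]])
visible = np.array([[True, False]])
masked = tracking_loss(pred, gt, visible, mask_occluded=True)
unmasked = tracking_loss(pred, gt, visible, mask_occluded=False)
```

Under the masked variant the large error on the occluded point never reaches the loss, which is the reviewer's reason for calling the occluded-point comparison uneven.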
1. Could you train CoTracker3 with the unmasked objective $L = \text{Huber\_loss}(\text{predicted point}, \text{ground truth point})$, i.e., without the visibility mask, and evaluate how much it improves on occluded points?
2. Do the failure cases tend to cluster around homogeneous textures or repetitive patterns? |
Lightly AI-edited |
|
Generative Point Tracking with Flow Matching |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces the Generative Point Tracker (GenPT), which reinterprets the iterative optimization paradigm of many modern point trackers (such as PIPs, CoTracker3, and LocoTrack) as a form of flow matching. The authors claim to bridge point tracking and generative modeling by formulating correspondence estimation as learning a continuous denoising process that maps perturbed query coordinates to target positions. The framework adds Gaussian perturbations to query points, defines an auxiliary velocity field trained with a flow-matching objective, and evaluates both single-sample and Best-of-N inference strategies. The paper also includes a multi-template tracking extension that performs patch-wise correspondence aggregation inspired by LocoTrack.
- The paper tackles a genuine limitation of current discriminative point trackers: their inability to represent uncertainty and multimodal hypotheses in ambiguous or occluded regions.
- The authors provide comprehensive comparisons across several datasets.
### Lack of generative insight
Although the paper positions itself as a generative reformulation of tracking, the actual mechanism remains deterministic iterative optimization under Gaussian perturbation, not a generative process.
- In generative models (diffusion or rectified flow), the model learns to map **pure noise --> data samples**, learning meaningful dynamics along a linear trajectory in data space.
- In GenPT, the model learns **query + noise --> correspondence**, where the starting point already encodes the spatial identity of the tracked feature. The added noise does not represent a generative latent, only a small random offset to an already meaningful input.
- Thus, the flow is effectively a regularized refinement of supervised training, not a learned stochastic trajectory from noise to data.
- Equation 6 changes the standard CoTracker initialization when $l=0$; increasing $l$ simply reduces supervision strength rather than adding new semantics.
- In essence, GenPT = CoTracker3 + Gaussian perturbation + renaming of loss, rather than a true flow-matching model.
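The contrast drawn in the bullets above can be sketched in a few lines of NumPy. The first helper builds a standard linear flow-matching training pair; the two starting-point functions differ only in where $x_0$ comes from. The function names and the `sigma` value are hypothetical, and the "perturbed query" start reflects this review's reading of the method, not the paper's exact formulation.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Linear interpolant x_t and its constant target velocity x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def noise_start(shape, rng):
    """Standard generative setting: x0 is pure Gaussian noise."""
    return rng.standard_normal(shape)

def perturbed_query_start(query, sigma, rng):
    """Setting described in this review: x0 is the query plus a small offset."""
    return query + sigma * rng.standard_normal(query.shape)

rng = np.random.default_rng(0)
query = np.array([120.0, 64.0])    # query point coordinates (illustrative)
target = np.array([131.0, 70.0])   # ground-truth correspondence (illustrative)
t = 0.3

x0_noise = noise_start(query.shape, rng)
x0_query = perturbed_query_start(query, sigma=2.0, rng=rng)

xt_noise, v_noise = flow_matching_pair(x0_noise, target, t)
xt_query, v_query = flow_matching_pair(x0_query, target, t)
```

In the second case the target velocity is roughly the displacement from query to correspondence plus a small noise term, which is why the review describes the objective as a regularized regression rather than a map from noise to data.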
### Evaluation issues
- The Best-of-N performance gains could stem entirely from multiple inference-time noise injections rather than from learned generative diversity. No comparison to a simple "CoTracker3 + random perturbation at inference" baseline is provided.
- The empirical improvements are small and inconsistent, and the method fails to demonstrate meaningful benefits in standard single-sample evaluation.
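The missing baseline could look roughly like the sketch below: jitter the query N times, run a deterministic tracker on each jittered query, and keep the lowest-error track. Here `toy_tracker` is a hypothetical stand-in that simply repeats the query over time; in a real experiment it would be replaced by a call to a pretrained deterministic tracker. All names and parameter values are assumptions.

```python
import numpy as np

def toy_tracker(query, num_frames=4):
    """Hypothetical deterministic tracker: repeats the query in every frame."""
    return np.tile(query, (num_frames, 1))

def best_of_n_perturbed(track_fn, query, gt_track, n, sigma, rng):
    """Oracle Best-of-N over n jittered queries for a deterministic tracker."""
    best_err, best_track = np.inf, None
    for _ in range(n):
        # Inference-time perturbation of the query location.
        jittered = query + sigma * rng.standard_normal(query.shape)
        track = track_fn(jittered)
        # Mean per-frame Euclidean error against the ground-truth track.
        err = float(np.linalg.norm(track - gt_track, axis=-1).mean())
        if err < best_err:
            best_err, best_track = err, track
    return best_track, best_err

query = np.array([1.0, 2.0])
gt_track = np.tile(np.array([1.5, 2.5]), (4, 1))

# Same seed for both runs, so the first sample is shared.
_, err_1 = best_of_n_perturbed(toy_tracker, query, gt_track, 1, 0.5,
                               np.random.default_rng(0))
_, err_8 = best_of_n_perturbed(toy_tracker, query, gt_track, 8, 0.5,
                               np.random.default_rng(0))
```

Because oracle selection takes a minimum over samples, the error cannot increase with N even for a tracker with no learned diversity, which is why this control is needed to interpret the reported Best-of-N gains.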
### Presentation and clarity
- The notation is excessive, making the method difficult to follow.
### Overall
While the paper explores a creative framing of point tracking via flow matching, it does not deliver genuine generative insight or methodological novelty. The proposed approach is functionally equivalent to noisy supervised fine-tuning of existing trackers, with only minor differences in objective formulation. The results and framing overstate the impact relative to the simplicity of the actual change.
See weaknesses |
Fully AI-generated |
|
Generative Point Tracking with Flow Matching |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces the Generative Point Tracker (GenPT), the first framework to address the point tracking problem using a generative model based on flow matching. Existing discriminative models struggle with uncertainty (e.g., occlusion) because they regress to a single mean estimate. GenPT overcomes this by modeling multi-modal trajectories, enabling it to sample several plausible paths in ambiguous situations.
- GenPT can model and sample from multiple plausible trajectory candidates, particularly when tracking uncertainty is high due to occlusion. This translates directly to state-of-the-art tracking accuracy on occluded points.
- The model effectively transitions between probabilistic and quasi-deterministic behavior. While always generative, its prediction variance tightly contracts (becoming nearly deterministic) when the tracked point is clearly visible and uniquely identifiable.
- There is a substantial and recurring performance gap between the Oracle scores (the model's maximum potential) and the Greedy scores (the model's actual performance when relying on its confidence). This fundamental disconnect means the model is poor at judging the quality of the trajectories it generates, limiting the real-world utility of its multi-modality.
- The advertised speed advantage (2x faster than CoTracker3) is strictly limited to generating a single sample. To achieve the demonstrated improvements in accuracy, the 'Best-of-N' sampling method must be used. This process rapidly increases the runtime, often making GenPT slower than its discriminative counterparts, thus sacrificing one of its key efficiency claims for practical performance.
- A significant portion of GenPT's SOTA claim relies on the custom TAP-Vid Sliding Occluder Benchmark introduced by the authors, which is specifically designed to highlight its strength in occlusion handling. While useful, the novelty of the benchmark means the competitive results require independent verification across established, universally adopted benchmarks.
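The Oracle-versus-Greedy gap in the first weakness above can be illustrated with a small NumPy sketch. Oracle selection picks the candidate with the lowest error against ground truth; greedy selection picks the candidate with the highest predicted confidence. The function names, array shapes, and the use of a single scalar confidence per candidate are assumptions for illustration and may differ from the paper's selection procedure.

```python
import numpy as np

def track_errors(candidates, gt):
    """Mean per-frame Euclidean error for each candidate track.

    candidates: (N, T, 2) sampled trajectories; gt: (T, 2) ground truth.
    """
    return np.linalg.norm(candidates - gt[None], axis=-1).mean(axis=-1)

def select_oracle(candidates, gt):
    """Upper bound: choose the candidate closest to ground truth."""
    errs = track_errors(candidates, gt)
    i = int(np.argmin(errs))
    return i, float(errs[i])

def select_greedy(candidates, confidences, gt):
    """Deployable rule: choose the candidate the model is most confident in."""
    i = int(np.argmax(confidences))
    return i, float(track_errors(candidates, gt)[i])

# Three candidate tracks over two frames with different constant offsets.
gt = np.zeros((2, 2))
candidates = np.zeros((3, 2, 2))
candidates[0] += 1.0
candidates[1] += 0.1
candidates[2] += 2.0
# Miscalibrated confidences: the worst-ranked candidate is actually the best.
confidences = np.array([0.9, 0.2, 0.1])

i_oracle, err_oracle = select_oracle(candidates, gt)
i_greedy, err_greedy = select_greedy(candidates, confidences, gt)
gap = err_greedy - err_oracle
```

The gap is zero only when the confidence ranking agrees with the error ranking, so a persistent gap points to miscalibrated confidence rather than to a lack of good samples.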
Have the authors explored an adaptive sampling strategy where multiple samples ('Best-of-N') are only generated in windows where the model's initial predicted uncertainty (variance/confidence) is above a certain threshold, rather than sampling N times in every window? |
Lightly AI-edited |