ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction   | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|-----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated    | 3 (75%)  | 5.33       | 3.67           | 2228               |
| Heavily AI-edited     | 0 (0%)   | N/A        | N/A            | N/A                |
| Moderately AI-edited  | 0 (0%)   | N/A        | N/A            | N/A                |
| Lightly AI-edited     | 0 (0%)   | N/A        | N/A            | N/A                |
| Fully human-written   | 1 (25%)  | 2.00       | 3.00           | 3570               |
| Total                 | 4 (100%) | 4.50       | 3.50           | 2564               |
Review 1: Exploring Cross-Modal Flows for Few-Shot Learning

Soundness: 3 (good) · Presentation: 3 (good) · Contribution: 3 (good)
Rating: 6 (marginally above the acceptance threshold) · Confidence: 4 (confident, but not absolutely certain)
EditLens Prediction: Fully AI-generated

Summary: The paper reframes PEFT as a one-step adjustment problem that fails on difficult datasets and proposes Flow Matching Alignment (FMA), which learns a velocity field to iteratively transport image features toward their ground-truth text features. It introduces coupling enforcement to preserve class correspondence, noise augmentation to combat data sparsity and manifold collapse, and an early-stopping solver (ESS) that classifies from intermediate states to avoid late-stage drift. FMA is plug-and-play across CLIP and multiple PEFT backbones and shows consistent gains on 11 benchmarks, especially on difficult datasets, with ablations supporting each component.

Strengths: The diagnosis of one-step PEFT limitations is convincing; the method is simple, modular, and effective across backbones; the experiments are comprehensive; and the early-stopping insight is well supported by empirical phenomena.

Weaknesses: Lack of formal guarantees for the coupling assumptions, reliance on a validation set to choose the number of inference steps, missing comparisons with higher-order ODE solvers, a coarse difficulty metric, and limited analysis of compute trade-offs and failure modes.

Questions:
1. Can ESS be made adaptive without validation tuning (e.g., stop on a sufficient logit margin, a small velocity norm, or diminishing logit gains)? (A sketch of one such criterion follows this review.)
2. How is σ(x_t) chosen in practice, and how sensitive are results to the noise magnitude/schedule? Is there any benefit from uncertainty- or density-adaptive noise?
3. Did you try higher-order or adaptive ODE solvers (Heun/RK) to reduce truncation error and improve margins at the same step budget?
4. Is fixed pairing too rigid for multi-modal classes? Would transporting toward multiple positive prototypes or a class subspace help?
5. Have you considered adding a discriminative loss on intermediate states (e.g., contrastive/margin) to align transport with classification throughout the trajectory?
6. Can you report detailed overhead (velocity-network size, train/inference time, average ESS steps) and scaling with the number of classes?
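Question 1 above asks for a validation-free stopping rule. Purely as an illustration, a minimal sketch of a margin-based adaptive Euler solver, assuming a learned velocity network `v_theta(x, t)` and class text prototypes `text_protos`; all names and thresholds are hypothetical, not from the paper:

```python
import torch
import torch.nn.functional as F

def adaptive_euler_transport(x, v_theta, text_protos, max_steps=10,
                             margin_thresh=0.2, vel_thresh=1e-3):
    """Transport image features along the learned flow, stopping early once
    the worst top-1 vs. top-2 logit margin is large enough or the predicted
    velocity has effectively vanished (illustrative criteria only)."""
    dt = 1.0 / max_steps
    logits = F.normalize(x, dim=-1) @ text_protos.T
    for step in range(max_steps):
        t = torch.full((x.shape[0],), step * dt, device=x.device)
        v = v_theta(x, t)                     # predicted cross-modal velocity
        x = x + dt * v                        # one explicit Euler step
        logits = F.normalize(x, dim=-1) @ text_protos.T
        top2 = logits.topk(2, dim=-1).values
        margin = (top2[:, 0] - top2[:, 1]).min()
        if margin > margin_thresh or v.norm(dim=-1).max() < vel_thresh:
            break                             # stop before late-stage drift
    return x, logits
```

Either criterion would remove the validation-tuned step count, at the cost of one extra similarity computation per step.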
Review 2: Exploring Cross-Modal Flows for Few-Shot Learning

Soundness: 3 (good) · Presentation: 3 (good) · Contribution: 3 (good)
Rating: 6 (marginally above the acceptance threshold) · Confidence: 3 (fairly confident; math/other details were not carefully checked)
EditLens Prediction: Fully AI-generated

Summary: This paper introduces a new framework called Flow Matching Alignment (FMA) to improve feature alignment between visual and textual modalities in few-shot learning. The authors observe that current parameter-efficient fine-tuning (PEFT) methods, such as prompt tuning, adapter tuning, and LoRA, perform only one-step feature adjustments, which are insufficient for complex datasets where image-text features are highly entangled. FMA leverages the multi-step rectification ability of flow matching by learning a cross-modal velocity field that iteratively transforms image features toward corresponding text embeddings, achieving finer alignment. To ensure stability and correctness, the method incorporates three key designs: coupling enforcement (to maintain class correspondence), noise augmentation (to mitigate data scarcity), and an early-stopping solver (to prevent over-transformation during inference); a sketch of what such a coupled, noise-augmented objective might look like follows this review. Experiments across 11 benchmarks and multiple backbones show that FMA consistently outperforms existing PEFT methods.

Strengths:
1. FMA introduces flow matching to few-shot learning. By formulating traditional PEFT methods as one-step updates, FMA enables more precise iterative alignment between visual and textual features. As argued by the authors, FMA better handles entangled multimodal distributions, especially on challenging datasets.
2. The framework is architecture-independent and can be integrated with various pre-trained vision-language models (e.g., CLIP, CoOp, LoRA) without altering their internal structures.

Weaknesses:
1. The multi-step flow-matching process requires iterative training and inference, which increases computational cost compared to traditional one-step PEFT methods, potentially limiting scalability for large datasets or real-time applications.
2. The method relies on carefully chosen parameters such as the number of inference steps, the step size, and the noise schedule. Suboptimal tuning can lead to degraded performance or instability during alignment. This is especially concerning because flow matching originates from generative modeling, and its adaptation to supervised classification tasks lacks rigorous theoretical grounding in terms of convergence and optimal stopping criteria.

Questions:
1. As discussed under Weaknesses, is there any theoretical guarantee of the convergence and stability of FMA? Given that flow matching originates from generative modeling, under what theoretical conditions does FMA ensure convergence to the correct class-aligned distribution in supervised learning settings?
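The summary above describes coupling enforcement and noise augmentation only at a high level. A minimal sketch of what one training step could look like under a conditional flow-matching objective with fixed class coupling; the linear interpolation path, the noise scale `sigma`, and all names are assumptions for illustration, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def fma_training_step(v_theta, img_feats, labels, text_protos, sigma=0.1):
    """One illustrative flow-matching step: couple each image feature x0 to
    the text prototype x1 of its own class (coupling enforcement), sample a
    time t, interpolate along a straight line, perturb with Gaussian noise
    (noise augmentation), and regress the predicted velocity toward the
    straight-line target x1 - x0."""
    x0 = img_feats                               # source: image features
    x1 = text_protos[labels]                     # target: same-class text feature
    t = torch.rand(x0.shape[0], 1, device=x0.device)
    xt = (1 - t) * x0 + t * x1                   # linear probability path
    xt = xt + sigma * torch.randn_like(xt)       # noise augmentation
    v_pred = v_theta(xt, t.squeeze(-1))
    return F.mse_loss(v_pred, x1 - x0)           # conditional FM regression
```

Fixing the pairing (x0, x1) by class label, rather than sampling an independent coupling, is what keeps transport class-consistent; the Gaussian perturbation widens the support of the scarce few-shot training states.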
Review 3: Exploring Cross-Modal Flows for Few-Shot Learning

Soundness: 3 (good) · Presentation: 2 (fair) · Contribution: 2 (fair)
Rating: 4 (marginally below the acceptance threshold) · Confidence: 4 (confident, but not absolutely certain)
EditLens Prediction: Fully AI-generated

Summary: This paper addresses the challenge of achieving precise cross-modal alignment in vision-language models (VLMs) for few-shot learning. The authors argue that existing parameter-efficient fine-tuning (PEFT) methods, such as prompt tuning, adapter-based, and LoRA-based approaches, perform only a "one-step" adjustment of features, which is insufficient for complex datasets where modalities are highly entangled. To overcome this limitation, the authors propose Flow Matching Alignment (FMA), a model-agnostic framework that leverages flow-matching theory to enable multi-step feature transformation. FMA incorporates three key designs: coupling enforcement to preserve class correspondence, noise augmentation to mitigate data scarcity, and an early-stopping solver for efficient and accurate inference. Extensive experiments on 11 benchmarks show that FMA consistently improves performance, especially on challenging datasets, and integrates seamlessly with various backbones and PEFT methods.

Strengths:
1. Novel application of flow matching to cross-modal alignment in few-shot learning, moving beyond generative tasks.
2. Effective design choices (e.g., the early-stopping solver and noise augmentation) that address practical challenges in training and inference.

Weaknesses:
1. No analysis of the computational overhead or inference latency introduced by the multi-step transformation.
2. The ablation studies do not explore the sensitivity of performance to hyperparameters such as the number of inference steps (a solver sketch relevant to the step budget follows this review).
3. The early-stopping strategy uses a fixed step count rather than a sample-adaptive criterion, which may limit optimality.

Questions:
1. Could the authors provide per-dataset results in Section 4.2 (Generalization Ability), where only average performance is given?
2. Was any exploration done into adaptive early-stopping criteria (e.g., based on feature discriminability) rather than a fixed step count?
3. How does FMA perform in cross-modal retrieval or other downstream tasks beyond classification, given its alignment-focused design?
4. Could the authors provide more intuition or theoretical insight into why coupling enforcement preserves class-level correspondence in high-dimensional feature spaces?
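This review's step-budget concerns and Review 1's Question 3 both touch on the choice of ODE solver. For reference, a Heun (explicit trapezoidal, second-order) step spends two velocity evaluations per step but has lower truncation error than plain Euler, so a smaller step count may reach the same transport accuracy. A sketch, reusing the hypothetical `v_theta` interface from the earlier examples:

```python
import torch

def heun_step(x, t, dt, v_theta):
    """One Heun step for dx/dt = v_theta(x, t): an Euler predictor followed
    by a trapezoidal corrector. Second-order accurate in dt."""
    v1 = v_theta(x, t)
    x_pred = x + dt * v1                  # Euler predictor
    v2 = v_theta(x_pred, t + dt)
    return x + 0.5 * dt * (v1 + v2)       # trapezoidal corrector
```

Whether the extra velocity evaluation pays off depends on how smooth the learned field is, which is exactly the ablation these reviews ask for.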
Review 4: Exploring Cross-Modal Flows for Few-Shot Learning

Soundness: 2 (fair) · Presentation: 3 (good) · Contribution: 2 (fair)
Rating: 2 (reject) · Confidence: 3 (fairly confident; math/other details were not carefully checked)
EditLens Prediction: Fully human-written

Summary: This paper proposes a method to improve alignment between modalities in cross-modal models. It claims that existing methods fail to align well on challenging datasets because they attempt one-step alignment, and it proposes a multi-step approach to align the embedding vectors of the two modalities. Specifically, it performs flow matching to transport image embeddings toward the distribution of text embeddings.

Strengths:
- The paper is well written and well structured.
- Alignment of embedding vectors across multiple modalities is an important research topic.
- Flow-matching alignment seems novel. However, its necessity is questionable, and it may be merely a combination of existing techniques.
- In the experiments, the proposed method outperforms the baselines on classification tasks. However, as noted under Weaknesses, it is unclear whether the evaluation is well designed to confirm the claims.

Weaknesses:
- The motivation for multi-step adjustment is unclear. First, the definition of the one-step adjustment blamed for the poor performance of existing methods is ambiguous. For example, is the claim that PEFT's optimization objective is inappropriate, or that optimization is insufficient because learning is difficult? Figure 2 discusses PEFT's characteristics relative to LP, but the validity of using LP as the baseline for this discussion is unclear. It is also unclear how this connects to the statement "these methods try to adjust their general aligned multi-modal distribution towards the golden distribution by one rectification step."
- The experimental setup is insufficiently described, which hurts reproducibility. For example, the paper states that velocity networks are learned, but I could not find a description of their specific structure (a plausible stand-in is sketched after this review). There is no definition of $\sigma^2(\cdot)$, and the number of inference steps M does not appear to be reported for each dataset. There is also no evaluation of statistical significance.
- The baselines vary across evaluations. Table 1 compares against 8 baselines, while Table 2 has one and Table 3 has five. In particular, the baseline compared in Table 2 is one of the weaker baselines from Table 1. Although there are practical limits on the number of experiments, comparing against the strongest baseline would yield more convincing results.
- The proposed method seems computationally expensive. It requires training velocity networks and performing multiple updates during inference (Algorithm 2). How does the computational cost compare to CLIP-Adapter with two linear layers? How does it compare to PEFT? Since the performance improvement over CLIP-LoRA is only 0-2%, the heavy inference cost makes the proposed method less useful.
- Minor issues: the pages are overfull and hard to read, and the absence of a blank line before and after figures and tables, such as the caption of Figure 4, violates the template.

Questions:
- What is the definition of one-step adjustment? Does it mean that the objective function is set once, or that optimization runs for only one step? Are Figs. 1(b)-(d) optimal embeddings in some sense?
- In Fig. 2, isn't it a bit simplistic to conclude that PEFT is weak on challenging datasets based on LP? Couldn't one equally conclude that LP is strong on more challenging datasets?
- What happens if stronger methods are used as baselines in Table 2? Also, did you check the standard deviation of the results and their statistical significance?
- How do the computational costs compare?
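This review's reproducibility point notes that the velocity network architecture is never specified. Purely as an assumed stand-in for discussion, a minimal time-conditioned MLP of the kind commonly used for feature-space flow matching; the widths, activation, and time conditioning are guesses, not the paper's design:

```python
import torch
import torch.nn as nn

class VelocityMLP(nn.Module):
    """A hypothetical minimal velocity network: concatenate the feature with
    a scalar time value and regress the velocity with a small MLP."""
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        t = t.view(-1, 1).expand(x.shape[0], 1)   # broadcast time over batch
        return self.net(torch.cat([x, t], dim=-1))
```

Even a network this small adds M forward passes per test image at inference, which is the cost/benefit trade-off the review asks the authors to quantify.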