ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 4 (100%) | 5.00 | 4.00 | 3633 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 0 (0%) | N/A | N/A | N/A |
| Total | 4 (100%) | 5.00 | 4.00 | 3633 |
Individual Reviews

Each entry below lists the paper title, the reviewer's ratings, the review text (summary, strengths, weaknesses, questions), and the EditLens prediction.
Title: Rethinking Scale: How Multi-Agent Collaboration Enables Smaller Models to Rival GPT-4 in Video Understanding

Ratings: Soundness: 3: good | Presentation: 2: fair | Contribution: 2: fair | Rating: 6: marginally above the acceptance threshold | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes RIVAL, a training-free framework showing that *multi-agent collaboration* can enable smaller open-source LLMs (≤72B) to approach or surpass GPT-4–based systems on long-video understanding. RIVAL has two core components:
- Multi-Stage ReAct Planner (MSRP): decomposes reasoning into *OBSERVE → THINK → ACT* stages, with explicit state transitions and a fixed toolset (stop search, delete/add by frame ID, add by text via CLIP), iterating until a quality threshold is met (an illustrative sketch of this loop follows this review).
- Multi-Agent Debate Refinement (MADR): after MSRP forms an initial answer, affirmative and opposing agents debate once per turn (with one tool call each), and a judge selects or revises the final answer; debate stops on agreement, a win, or max rounds.

On EgoSchema, RIVAL with Qwen-2.5-72B reaches 66.8% on the subset (SOTA in their comparison; +6.6 over GPT-4 baselines) and 56.4% on the full set. With Qwen-3-32B, it reaches 65.0/57.2 (subset/full). On NExT-QA, RIVAL attains 74.4% (72B) and 73.2% (32B) on validation and 66.5% / 63.7% on ATP-Hard, surpassing prior GPT-4–based agent methods in the reported comparisons. The system also processes a 28-hour concatenated long video under limited compute (≤15k token context; dual A100s for 72B), arguing for privacy-preserving, resource-constrained deployment.

Strengths:
- Shows that *careful orchestration* (MSRP) plus *adversarial verification* (MADR) can reduce dependence on very large proprietary LLMs for long-video QA.
- Explicit stage transitions, fixed tool APIs, and stopping rules make the agent loop auditable and easier to reproduce conceptually.
- SOTA subset performance on EgoSchema (66.8% with 72B), and NExT-QA gains over VideoAgent/LLoVi in the reported tables.
- Maintains accuracy on a 28-hour concatenated video where a single-agent baseline degrades substantially.
- Operates within a 15k token window and on commodity accelerators (72B split over 2×A100), aligning with realistic deployment; the privacy angle is well-motivated.
- Improves upon GPT-4–centric VideoAgent and text-only aggregation like LLoVi, while aligning with the trend toward streaming arbitrary-length video.

Weaknesses:
- Tables focus on GPT-4 baselines circa 2024; it would help to benchmark against the most recent proprietary/open VLMs that handle arbitrary-length streams (e.g., streaming VLLMs) to solidify the "rivals GPT-4" claim.
- The pipeline's quality hinges on CLIP retrieval and the image/video captioners; retrieval bias or caption hallucinations could mislead the debate, and ablations on retrieval quality (e.g., different CLIP backbones, top-k) are limited in the main text.
- MSRP/MADR rely on an internal evaluator score (60/40 criteria) and a threshold to trigger debate; while there is some analysis, deeper calibration/robustness checks (e.g., agreement with human judgments, sensitivity to α) would strengthen soundness.
- Claims of efficiency would benefit from a cost breakdown: number of tool calls, frames read, average tokens per step, and wall-clock latency vs. baselines. Hardware details are provided, but *end-to-end* throughput comparisons are sparse.
- Source code is not yet released (pending security review); although pseudocode and prompts are promised, this limits verification and adoption pre-camera-ready.
- Results are strong on EgoSchema/NExT-QA; adding diverse long-video tasks (e.g., instruction following, temporal localization) would clarify generality.

Questions:
1. How sensitive are results to the accuracy/completeness weights (60/40) and the debate threshold α? Can you report Kendall/Spearman correlation of evaluator scores with correctness, and success rates per score bin?
2. You set 3 rounds based on a peak at 66.8%. What is the marginal accuracy gain vs. added latency per round on both datasets? Provide a Pareto curve (accuracy vs. seconds/$).
3. How do different CLIP variants and top-k selections affect accuracy and runtime? Can you quantify failure modes where retrieval misses key evidence?
4. For EgoSchema you use LaViLa with overlap removal; for NExT-QA, CogAgent. Could you provide cross-captioner results and any leakage checks?
5. Can you add a head-to-head vs. recent streaming long-video VLLMs or updated GPT-4-class systems to contextualize "rival GPT-4" beyond 2024-era baselines?
6. Please report average #tool calls, frames retrieved, tokens consumed, and end-to-end latency per query, and contrast with VideoAgent and LLoVi at similar compute.

EditLens Prediction: Fully AI-generated
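The OBSERVE → THINK → ACT loop summarized in the review above lends itself to a short illustration. Below is a minimal, hypothetical sketch of such a staged planner with a fixed toolset and an evaluator-gated stop. All names (call_llm, tools.fetch_captions, tools.clip_text_search, evaluator.score) are stand-ins; since the authors' code is not released, this is not RIVAL's actual implementation.

```python
# Hypothetical sketch of an OBSERVE -> THINK -> ACT planner loop with a fixed
# toolset and an evaluator-gated stopping rule. Not the paper's code.
from dataclasses import dataclass

@dataclass
class PlannerState:
    frames: dict              # frame_id -> caption currently kept in context
    question: str
    options: list
    step: int = 0
    answer: str | None = None

def summarize_context(state):
    # OBSERVE: concatenate the captions of the currently retained frames.
    return "\n".join(f"[{fid}] {cap}" for fid, cap in sorted(state.frames.items()))

def msrp_loop(state, call_llm, tools, evaluator, score_threshold=5, max_steps=8):
    """Iterate OBSERVE -> THINK -> ACT until the evaluator score clears the
    threshold or the step budget runs out (all callables are injected stand-ins)."""
    while state.step < max_steps:
        obs = summarize_context(state)                                  # OBSERVE
        plan = call_llm(stage="THINK", obs=obs,                         # THINK
                        question=state.question, options=state.options)
        action = plan["action"]                                         # ACT
        if action == "stop_search":
            break
        if action == "delete_by_frame_id":
            for fid in plan["frame_ids"]:
                state.frames.pop(fid, None)
        elif action == "add_by_frame_id":
            state.frames.update(tools.fetch_captions(plan["frame_ids"]))
        elif action == "add_by_text":
            state.frames.update(tools.clip_text_search(plan["query"], top_k=5))
        state.step += 1
        state.answer = call_llm(stage="ANSWER", obs=summarize_context(state),
                                question=state.question, options=state.options)
        if evaluator.score(state) >= score_threshold:                   # quality gate
            break
    return state.answer
```

The point of the sketch is only that the loop is auditable: every context change goes through one of the named tool actions, and the stop condition is an explicit score threshold rather than free-form LLM judgment.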
Title: Rethinking Scale: How Multi-Agent Collaboration Enables Smaller Models to Rival GPT-4 in Video Understanding

Ratings: Soundness: 2: fair | Presentation: 2: fair | Contribution: 2: fair | Rating: 4: marginally below the acceptance threshold | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes RIVAL, a video understanding framework built on small open-source LLMs (≤72B), aiming to rival GPT-4-level proprietary methods. RIVAL consists of (1) MSRP (Multi-Stage ReAct Planner) for structured reasoning with explicit sub-states (OBSERVE → THINK → ACT) and tool-calling, and (2) MADR (Multi-Agent Debate Refinement) for adversarial multi-role answer refinement. The system retrieves key frames via CLIP (see the retrieval sketch after this review) and performs iterative information augmentation plus debate-based correction. Experiments on EgoSchema and NExT-QA show strong results, surpassing GPT-4 baselines on subsets, and demonstrating robustness on extremely long (28h) concatenated video.

Strengths:
1. Strong empirical results. RIVAL achieves substantial gains over prior GPT-4-based VideoAgent/LLoVi on the EgoSchema subset (+6.6%) and competitive NExT-QA performance.
2. Multi-agent debate refinement is effective and well-motivated. MADR empirically corrects initial errors and is demonstrated clearly with a case study.
3. The very long video case study is interesting. Handling 28h of concatenated input with minimal degradation is a good stress test.

Weaknesses:
1. Clarity issues and missing ablations. The current writing does not sufficiently quantify how much performance comes from CLIP retrieval, MSRP decomposition, and MADR debate individually. Ablations would greatly strengthen causal attribution.
2. Significant engineering heuristics. Many parts of MSRP are manually structured and rely on prompt templates / tool definitions; robustness to domain shift, or to tasks that do not fit stepwise logic, is unclear.
3. Scalability beyond QA is not validated. RIVAL is only evaluated on video QA benchmarks; it is unclear whether this paradigm generalizes to open-ended summarization, event boundary detection, or reasoning beyond MCQ.
4. Some baselines may not be strictly comparable. For NExT-QA, several older entries are pre-CLIP/2024-era; more recent strong open models could be added for fairness.

Questions:
1. Can the authors include ablations isolating (a) no MSRP, (b) no MADR, (c) no CLIP key-frame retrieval, to quantify the contribution of each component?
2. Could the authors report results on free-form open-ended summarization tasks to illustrate generality beyond MCQ-style QA?
3. Given that the video is often reduced to textual descriptions, does RIVAL degrade on videos with cues that are not linguistically describable (e.g., spatial geometry, implicit physics)?

EditLens Prediction: Fully AI-generated
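The CLIP key-frame retrieval this review refers to reduces to a cosine-similarity Top-k over frame embeddings. A minimal sketch is shown below, assuming frame and query embeddings have already been produced by a CLIP-style encoder and L2-normalized; the encoder call itself and all variable names are illustrative, not the paper's pipeline.

```python
# Top-k key-frame selection over precomputed, L2-normalized embeddings.
import torch

def topk_frames(frame_embs: torch.Tensor,   # (N, d) normalized frame embeddings
                query_emb: torch.Tensor,    # (d,)   normalized text-query embedding
                k: int = 5):
    """Return indices and cosine similarities of the k frames closest to the query."""
    sims = frame_embs @ query_emb           # cosine similarity, since inputs are normalized
    scores, idx = torch.topk(sims, k=min(k, frame_embs.shape[0]))
    return idx.tolist(), scores.tolist()

# Example with random stand-in embeddings:
frames = torch.nn.functional.normalize(torch.randn(1000, 512), dim=-1)
query = torch.nn.functional.normalize(torch.randn(512), dim=-1)
print(topk_frames(frames, query, k=5))
```

This also makes the reviewer's ablation concrete: swapping the CLIP backbone changes only how frame_embs and query_emb are produced, while k controls how much evidence enters the LLM context.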
Title: Rethinking Scale: How Multi-Agent Collaboration Enables Smaller Models to Rival GPT-4 in Video Understanding

Ratings: Soundness: 3: good | Presentation: 2: fair | Contribution: 2: fair | Rating: 6: marginally above the acceptance threshold | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper introduces RIVAL, a *training-free* agentic framework for long-video question answering that aims to "rethink scale": instead of relying on very large proprietary models, it orchestrates smaller open LLMs via two modules:
- Multi-Stage ReAct Planner (MSRP): enforces explicit OBSERVE → THINK → ACT stages, produces a structured tool-usage plan (Stop Searching; Delete/Add by Frame ID; Add by Text via CLIP), and stops when a score threshold or max steps is reached. This reduces reasoning/action drift and keeps context within ~15k tokens.
- Multi-Agent Debate Refinement (MADR): after an initial answer, *affirmative* and *opposition* agents debate with limited tool calls; a judge either declares agreement, declares a winner, or halts at max rounds (a control-flow sketch follows this review).

On EgoSchema, RIVAL with Qwen-2.5-72B/3-32B reports 66.8/65.0 on the subset and 56.4/57.2 on the full set, surpassing GPT-4–based baselines in their table. On NExT-QA, it reaches 74.4/73.2 on val and 66.5/63.7 on ATP-Hard. Under an ≈28-hour concatenated-video stress test, RIVAL degrades less than VideoAgent (e.g., on a 1.5B model: 33.8 vs 23.4). Compared to VideoAgent (an LLM-tool agent with proprietary backends) and LLoVi (dense captioning + LLM reasoning), RIVAL argues for better *privacy* (local open models) and *resource* efficiency; against streaming VLMs it offers a systems alternative grounded in retrieval + agent debate.

Strengths:
- Competitive long-video QA with open models under a 15k token budget and modest GPUs.
- Enforced OBSERVE/THINK/ACT stages plus fixed tools and stop criteria reduce "free-form" LLM drift and make loops auditable.
- MADR's affirmative/opposition/judge structure with Frame-ID/Text queries is a neat twist on multi-agent debate for video evidence-seeking.
- Consistent gains over VideoAgent from 0.6B–72B (incl. large margins on the EgoSchema subset) and solid NExT-QA results.
- Smaller degradation on the 28-hour test relative to VideoAgent supports the scalability story.

Weaknesses:
- Results would be more conclusive with head-to-head comparisons against *arbitrary-length* streaming VLMs (e.g., VideoStreaming, StreamingVLM) at comparable compute, not only GPT-4–centric agents.
- End-to-end quality hinges on EVA-CLIP-8B+ retrieval and the LaViLa/CogAgent captioners; failure modes (missed key frames, caption hallucination) are not deeply dissected.
- The evaluator's 60/40 criteria and α=5 gate are plausible, but more human-agreement/calibration plots (e.g., ROC/AUC vs. correctness) and sensitivity to α would strengthen soundness.
- Hardware is stated, but per-query metrics (#tool calls, frames retrieved, tokens, wall-clock) and *compute-normalized* comparisons vs. VideoAgent/LLoVi are sparse.
- The method is validated only on EgoSchema/NExT-QA; extensions to open-ended grounding, temporal localization, or instruction following would clarify generality.
- The paper is reproducible at the concept level, but full code/prompts would be needed for wider adoption; release timelines aren't specified.

Questions:
1. Could you report accuracy vs. *seconds/tokens/tool-calls* per query (and per MADR round), and compare against VideoAgent/LLoVi at matched budgets?
2. How stable are results under different α thresholds or 60/40 weight splits? Any human-agreement stats (e.g., Kendall τ between evaluator scores and correctness)?
3. What is the impact of swapping EVA-CLIP-8B+ for a lighter/heavier CLIP, or changing Top-k and similarity thresholds? Is there a fail-case taxonomy?
4. Could you provide cross-captioner results (LaViLa ↔ CogAgent) on both datasets, plus leakage checks for LaViLa (you mention overlap removal)?
5. Can you add direct comparisons to VideoStreaming/StreamingVLM (or other 2025 long-context VLMs) to contextualize where RIVAL wins/loses?
6. Any early results on open-ended QA or temporal localization tasks to probe beyond multiple-choice settings?

EditLens Prediction: Fully AI-generated
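The MADR control flow this review summarizes (affirmative and opposition agents, a judge, and stopping on agreement, a win, or max rounds) can be sketched as a short loop. The agent callables and verdict labels below are assumptions for illustration and do not reproduce the paper's prompts or judging criteria.

```python
# Illustrative control flow for an affirmative/opposition/judge debate.
# All callables and verdict labels are hypothetical stand-ins.
def madr_debate(initial_answer, affirmative, opposition, judge, max_rounds=3):
    answer = initial_answer
    for _ in range(max_rounds):
        # Each side may issue at most one tool call per turn (enforced inside the agents).
        defense = affirmative(answer)
        attack = opposition(answer, defense)
        verdict = judge(answer, defense, attack)
        if verdict["status"] == "agreement":          # both sides accept the answer
            break
        if verdict["status"] == "opposition_wins":    # judge adopts the challenger's revision
            answer = verdict["revised_answer"]
        # otherwise ("affirmative_wins" or undecided): keep the current answer and continue
    return answer
```

A per-round cost breakdown of this loop (tool calls, tokens, latency) is exactly what the reviewer's first question asks the authors to report.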
Title: Rethinking Scale: How Multi-Agent Collaboration Enables Smaller Models to Rival GPT-4 in Video Understanding

Ratings: Soundness: 3: good | Presentation: 2: fair | Contribution: 3: good | Rating: 4: marginally below the acceptance threshold | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes U-CSA, an unsupervised cross-modal semantic anchoring framework to match aerial imagery with vector maps. Instead of aligning image↔image, U-CSA first asks a multimodal LLM (Qwen2.5-VL) to produce structured "semantic anchors" (JSON over 11 attributes + a ≤40-word summary) for each image–map pair; then (i) a dual-branch visual encoder is trained with anchored contrastive learning against the text anchors; and (ii) an adversarial matching head with a prototype library refines the decision boundary. The authors also introduce MSTcons, an 18,907-pair benchmark built from WHU (Christchurch) and Inria (Austin, Chicago, Kitsap, Vienna, West Tyrol), with 256×256 tiles and explicit splits. On MSTcons, U-CSA beats unsupervised SAM-MCD and several adapted change-detection baselines in ROC-AUC/F1, with ablations supporting the contribution of anchors and prototypes.

Strengths:
- Originality. A clean combination of staged ReAct planning with adversarial multi-agent debate targeted at long-video QA; explicit tool APIs (stop, add/delete by frame ID, CLIP text query) make the control flow concrete.
- Quality. Strong headline numbers on EgoSchema and NExT-QA (incl. per-subset breakdowns) and a long-video stress test (28h). Comparisons include GPT-4/VideoAgent/LLoVi families and same-model re-implementations.
- Clarity. The pipeline and roles are well illustrated; termination conditions and planner/debate prompts are spelled out; implementation details (captioner/CLIP/serving) are given.
- Significance. If the claims hold under controlled settings, the result that smaller open LMs with orchestration can match or beat prior GPT-4-based agents is practically meaningful for privacy/cost-sensitive deployments.

Weaknesses:
- Potential option-conditioning bias. Frame retrieval uses both the question and the answer options (Ia = Top-k Sim(I, A)), which risks label-peeking and unfairly advantaging multiple-choice setups. Please add ablations that retrieve only from Q (no options), or retrieve before reading the options, and report accuracy deltas.
- Fair-comparison controls. Several baselines differ in LLM scale, context, and tools. While you re-implement VideoAgent with Qwen-2.5 (Appendix C), the tables still mix methods with non-comparable compute/budgets. Please provide same-LLM apples-to-apples runs (VideoAgent/LLoVi/RIVAL all on Qwen-3-32B with matched token budgets, retrieval limits, and tool calls) and report token and wall-clock costs.
- Ablation depth on MSRP/MADR. The contribution attribution is under-specified. Add: MSRP → single-stage ReAct; MSRP without enforced state transitions; MADR → self-consistency / majority vote; debate with no tools; and variance across #rounds and threshold α. Report accuracy, calls per round, tokens, and failures.
- Reliance on captioner/CLIP and leakage audit. Results hinge on LaViLa/CogAgent (captioner) and EVA-CLIP-8B+ (retriever). You remove EgoSchema overlaps in LaViLa, which is good, but please add captioner/CLIP swaps and a leakage audit across both benchmarks (e.g., retrieve-then-blind the captioner to the options; test different CLIP checkpoints).

Questions:
1. No-options retrieval. What happens if frame retrieval is conditioned only on Q (not the options), or performed before seeing the options? Please quantify (see the sketch after this review).
2. Budget and efficiency. Can you report average #tool calls / debate rounds / prompt tokens / latency per question, and compare to VideoAgent/LLoVi under matched settings?
3. Ablation breadth. Could you add the MSRP/MADR ablations described above and release the prompt templates and judge criteria used for win/consensus decisions?
4. Captioner/CLIP sensitivity. How sensitive are results to swapping LaViLa ↔ CogAgent (cross-benchmark) and EVA-CLIP ↔ other CLIPs? Is there any drop if the captioner context is limited?
5. 28-hour scenario. How do you partition the 28h stream internally (sliding windows? chunked debates)? Please report failure modes and variance for the long-video setting.

EditLens Prediction: Fully AI-generated
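The no-options retrieval control requested in this review is cheap to set up. Below is a hedged sketch of the comparison, reusing a Top-k retrieval helper like the one shown after the second review; encode_text, topk_frames, and frame_embs are hypothetical helpers, not part of the paper's released code.

```python
# Compare question-only retrieval against option-conditioned retrieval.
# All helpers (encode_text, topk_frames, frame_embs) are illustrative stand-ins.
def build_query(question: str, options=None, use_options: bool = False) -> str:
    if use_options and options:
        return question + " " + " ".join(options)   # option-conditioned (risks label peeking)
    return question                                  # question-only control

def retrieval_overlap(question, options, encode_text, frame_embs, topk_frames, k=5):
    """Fraction of Top-k frames shared by the question-only and option-conditioned queries."""
    q_only, _ = topk_frames(frame_embs, encode_text(build_query(question)), k)
    q_opts, _ = topk_frames(frame_embs, encode_text(build_query(question, options, True)), k)
    return len(set(q_only) & set(q_opts)) / k
```

Reporting downstream accuracy for both query variants, alongside this overlap statistic, would directly address the option-conditioning concern raised above.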