ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 0 (0%) N/A N/A N/A
Heavily AI-edited 1 (25%) 4.00 4.00 4456
Moderately AI-edited 0 (0%) N/A N/A N/A
Lightly AI-edited 0 (0%) N/A N/A N/A
Fully human-written 3 (75%) 4.00 3.00 2874
Total 4 (100%) 4.00 3.25 3270
DiagVuln: A Holistic Conversational Benchmark for Evaluating LLMs on Vulnerability Assessment

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper introduces DiagVuln, a novel, large-scale conversational benchmark dataset for evaluating SOTA LLMs on vulnerability assessment. The authors aim to address gaps in existing vulnerability detection datasets by curating data from 13 structured and unstructured sources and developing a "Vulnerability Portfolio" with rich contextual information on Linux kernel CVEs. The dataset contains 2,000 CVEs across 23 question-answer categories, including detection, localisation, classification, root cause analysis, exploit reasoning, impact assessment, and patch analysis. The key contribution of the work is an automated framework that generates QA pairs from the vulnerability portfolio using RAG. The authors use an LLM-as-a-judge framework with conformal prediction to validate the quality of the automatically generated data. Finally, the paper benchmarks five SOTA LLMs on a subset of the dataset, and the results show that LLMs struggle with complex, reasoning-heavy vulnerability analysis tasks.

Strengths:
- Collection of holistic vulnerability data from diverse sources (structured and unstructured) and curation of a large-scale, context-rich "Vulnerability Portfolio", which might also be used for future fine-tuning tasks.
- The data collection pipeline is scalable, using RAG for QA generation and LLM-as-a-judge for validation.
- The results demonstrate the limitations of current LLMs and provide a roadmap for future research in software security.

Weaknesses:
- The CVE attribution method for unstructured data relies solely on regular expressions to extract CVE IDs. This assumes that unstructured developer discussions and code comments are correctly tagged with a single, complete CVE ID, which may not hold in real-world developer settings: a discussion may not mention any CVE ID at all, may reference one only partially, or may reference multiple CVEs (see the regex sketch after this review). How do the authors address this issue?
- The number of samples in the held-out calibration set for conformal prediction is not reported. How did the authors ensure the calibration set was large enough to be representative of the whole dataset?
- The authors describe the human assessment process in detail, including the time required for 100 QA pairs, but no quantitative results of this assessment are provided. How does the LLM-as-a-judge perform compared with the human annotators?
- In the evaluation section, the authors state that the 200-CVE subset was selected based on the "richest contextual information". Do the authors consider the potential data leakage issue in this case? How would the models perform on less documented vulnerabilities?
- LLMs have poor capabilities for reasoning about vulnerabilities; using LLMs to curate vulnerability questions/answers may therefore produce low-quality data.

Questions:
- What steps were taken, beyond regular expressions, to handle missing or ambiguous CVE IDs in the unstructured sources? How much does this affect portfolio curation for less documented vulnerabilities?
- The "Correctness" of the RAG-generated answers is validated against the source portfolio. How does the pipeline handle scenarios where the source data itself is incorrect or incomplete?
- Do the evaluation results also generalize to less documented vulnerabilities?
- Of the 6,342 CVEs, how many are used to build the RAG database, and how is the RAG database built?

EditLens Prediction: Fully human-written
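Sketch referenced in the review above: a minimal, hypothetical illustration of how a purely regex-based CVE extractor behaves when a developer discussion mentions zero, one, or several CVE IDs. The pattern and the example snippets are illustrative assumptions, not taken from the paper.

```python
import re

# Assumed pattern for well-formed identifiers such as CVE-2022-32250.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE)

def attribute_cves(text: str) -> list[str]:
    """Return all distinct CVE IDs mentioned in an unstructured snippet."""
    return sorted({m.upper() for m in CVE_PATTERN.findall(text)})

# The single, complete tag the attribution assumption relies on:
print(attribute_cves("Fix UAF in nf_tables, see CVE-2022-32250"))     # ['CVE-2022-32250']

# Real developer threads often break that assumption:
print(attribute_cves("backport of the 2022-32250 fix"))               # []  (partial reference, missed)
print(attribute_cves("relates to CVE-2021-22555 and CVE-2022-0185"))  # two IDs, attribution ambiguous
```
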
DiagVuln: A Holistic Conversational Benchmark for Evaluating LLMs on Vulnerability Assessment

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper introduces DIAGVULN, a new multi-turn, conversational benchmark for evaluating the vulnerability assessment capabilities of LLMs. The authors identify critical limitations in existing benchmarks, such as a narrow focus on single data sources, a lack of contextual information (e.g., root cause, exploit details), and a preference for single-turn tasks. To address this, DIAGVULN is constructed by aggregating data from 13 diverse sources into a comprehensive "Vulnerability Portfolio". A Retrieval-Augmented Generation (RAG) system is then used to generate 46,000 QA pairs across 23 categories for 2,000 CVEs. A key part of the methodology is a novel validation pipeline that uses an "LLM-as-a-Judge" calibrated with Conformal Prediction based on human annotations for 100 CVEs, which provides a scalable way to ensure the quality of the RAG-generated answers. The authors use DIAGVULN to benchmark five SOTA LLMs, finding that while they perform moderately on surface-level information extraction, they substantially fail at tasks requiring deeper reasoning, such as mitigation, exploit explanation, and patch analysis.

Strengths:
1. The paper addresses a critical and timely need. As security teams are overwhelmed and LLMs are being integrated into security workflows, a benchmark to measure their true assessment capabilities is urgently required. The paper correctly identifies the flaws of existing benchmarks (single-source, single-turn).
2. The paper's strongest methodological contribution is the validation pipeline. Manually labeling 46,000 QA pairs is infeasible. The proposed solution, using an LLM-as-a-Judge and then statistically calibrating this judge with Conformal Prediction (CP) from a small set of human annotations, is an elegant and sound approach to building a high-confidence synthetic dataset (see the calibration sketch after this review). It provides a robust guarantee on the quality of the RAG generation process.
3. The "Vulnerability Portfolio" concept, which aggregates 13 diverse structured and unstructured sources (from NVD to mailing lists), is a major strength. It provides the "holistic" context that is missing from other benchmarks and necessary for real-world analysis.
4. The 3-step evaluation procedure (Detection $\rightarrow$ Identification $\rightarrow$ QA) effectively mimics a practical analyst workflow, making the benchmark far more realistic than simple, single-shot QA datasets.

Weaknesses:
1. The overclaiming of "ground truth" taints the main experiment in Section 5. The benchmark evaluates SOTA LLMs (using web search) by comparing their answers against the RAG-generated answers (from the static portfolio). A low "Correctness" score (Table 6) implies the SOTA LLM is wrong. However, it could simply be a disagreement between the SOTA LLM's retrieval (web search) and the paper's RAG system. The live web search might even be more correct or up-to-date. The paper fails to address this ambiguity, fundamentally weakening the evaluation claims. The benchmark measures deviation from the DIAGVULN-RAG system, not necessarily deviation from objective truth.
2. The paper claims to be the "first conversational benchmark", yet Ruan et al. (VulZoo, ASE'24) already pair CVEs with summaries, exploits, and patches; the only novelty is turning these into 23 QA templates.
3. The evaluation pits LLMs against each other, not against (i) classical vulnerability detectors (CodeQL, Clang SA, CPAChecker) or (ii) smaller BERT-style models fine-tuned on existing corpora.
4. Limited scope of validation: the human validation (used to calibrate the CP) checked 100 CVEs. While the CP framework provides statistical guarantees, the paper is not fully transparent about the diversity of these 100 CVEs or how representative they are of the 2,000 CVEs in the final QA set.

Questions:
1. My primary concern is the "ground truth" claim. Would you be willing to re-frame this? The RAG-generated answers are a high-quality synthetic baseline, not "ground truth." How do you account for the possibility, in your Section 5 evaluation, that a SOTA LLM's answer (from a web search) is correct and your RAG-generated answer (from a static portfolio) is incorrect or outdated? A disagreement does not automatically mean the SOTA LLM failed.
2. The paper aggregates scores across 23 attributes but never shows concrete failure cases. Could you provide a concrete failure analysis, for example, listing 10 CVEs where all five LLMs score <3 on Mitigation and explaining the common pattern (missing context, ambiguous patch, etc.)?

EditLens Prediction: Heavily AI-edited
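Sketch referenced in the review above: a minimal split-conformal calibration example of the kind the validation pipeline is described as using. The nonconformity measure (absolute judge-human score gap), the coverage level, and the simulated scores are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def conformal_threshold(cal_scores: np.ndarray, alpha: float = 0.1) -> float:
    """Split-conformal quantile of nonconformity scores from a held-out
    calibration set, using the finite-sample correction ceil((n+1)(1-alpha))/n."""
    n = len(cal_scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(cal_scores, q_level, method="higher"))

# Illustrative nonconformity: |LLM-judge score - human score| on a
# hypothetical calibration set of human-annotated QA pairs (scores in [1, 5]).
rng = np.random.default_rng(0)
human = rng.integers(1, 6, size=200)
judge = np.clip(human + rng.normal(0.0, 0.6, size=200), 1, 5)
tau = conformal_threshold(np.abs(judge - human), alpha=0.1)

# For an unlabeled QA pair, the unseen human score lies in
# [judge_score - tau, judge_score + tau] with probability >= 0.9,
# assuming calibration and test pairs are exchangeable.
print(f"calibration threshold tau = {tau:.2f}")
```
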
DiagVuln: A Holistic Conversational Benchmark for Evaluating LLMs on Vulnerability Assessment

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes a new vulnerability dataset covering CVEs in the Linux kernel. The raw information was collected from NVD as well as public websites and forums, and RAG was used to create question-answer pairs about the vulnerabilities. The quality of the generated QA pairs is validated using human effort. Finally, the paper benchmarks existing SOTA commercial and open-source LLMs on the proposed dataset.

Strengths:
- The ability of the RAG method to generate QA pairs is sufficiently validated by the human assessment.
- The collected dataset includes useful information that could boost the performance of LLMs, either through RAG or fine-tuning.

Weaknesses:
- Overall, the paper does not sufficiently motivate the need for the proposed dataset. The results section does not show any comparison with previous datasets (e.g., in base LLM performance, or in how training on the dataset improves performance).
- It seems natural to evaluate to what extent augmenting an LLM with the knowledge in the proposed dataset (through RAG or fine-tuning) would improve the performance of LLMs on the task. The authors mention this as potential future work, but in my opinion it should have been included in the paper.
- The paper only benchmarks LLMs, without considering other vulnerability detection/identification tools. Both the detection and identification tasks could be carried out by static analyzers or by DNN-based methods.
- The related work section does not sufficiently cover existing vulnerability detection and assessment datasets and how the proposed dataset is different/better. This is currently mentioned in the introduction, but only very briefly.
- The separate evaluation of vulnerability detection and identification might be misleading. I think all false positives from the detection step should also be included in the input to the identification step, which should significantly degrade the identification results.

Questions:
- How does your proposed dataset help push the state of the art in vulnerability detection and assessment?
- Did you only focus on vulnerabilities in the Linux kernel project, or do you include any vulnerabilities that affect Linux systems?
- Related to the above point, why did you focus only on Linux kernel vulnerabilities? How does this affect the applicability of your dataset?
- Why did you not benchmark other vulnerability detection/identification methods (e.g., static analyzers, GNNs, RNNs)?
- Could you elaborate more on the RAG system developed to create the dataset? What were the knowledge-base items, and what queries were used (see the retrieval sketch after this review)?
- What was the source of benign code in your dataset? How would a more balanced (benign vs. vulnerable) dataset affect the results in Section 5?
- How would unifying the detection and identification pipeline affect the results in Section 5?

EditLens Prediction: Fully human-written
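Sketch referenced in the review above: one hypothetical shape the RAG question asks the authors to pin down, with per-CVE portfolio chunks as knowledge-base items and a per-category question as the query. The chunks, the query, and the TF-IDF retriever are illustrative assumptions, not the paper's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical knowledge-base items: one text chunk per portfolio field per CVE.
kb = [
    "CVE-2022-0185 description: heap overflow in the fsconfig legacy parameter handling ...",
    "CVE-2022-0185 patch notes: bound the parameter length before copying ...",
    "CVE-2021-22555 description: out-of-bounds write in x_tables compat setsockopt ...",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k knowledge-base chunks most similar to the query."""
    vec = TfidfVectorizer().fit(kb + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(kb))[0]
    return [kb[i] for i in sims.argsort()[::-1][:top_k]]

# A hypothetical generation query for one of the 23 QA categories.
print(retrieve("What is the root cause of CVE-2022-0185?"))
```
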
DiagVuln: A Holistic Conversational Benchmark for Evaluating LLMs on Vulnerability Assessment

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper introduces a conversational benchmark about CVEs drawn from multiple data sources to evaluate the capabilities of LLMs as cybersecurity assistants in three key areas (vulnerability detection, CVE identification, and Q&A about the identified vulnerability). It also introduces a dataset curation and validation pipeline to extend the benchmark. The evaluation on the benchmark, using a combination of LLM-as-a-judge and conformal prediction, shows significant gaps in current LLMs.

Strengths:
• It is a valuable benchmark for evaluating the extent to which LLMs can be used in critical cybersecurity situations, and the fine-grained questions provide more insight than previous benchmarks.
• In addition to LLM-as-a-judge, the authors use conformal prediction techniques to establish high-confidence bounds on the acceptable difference between human and LLM labels.

Weaknesses:
• In Section 5 (Prompting Strategy), the authors mention a model-agnostic system prompt, but recent research (https://aclanthology.org/2025.naacl-long.73/, https://arxiv.org/abs/2408.11865) has shown that LLMs are very sensitive to prompt construction. As a result, the assumption that a single prompt will work equally well across all the tested models might not hold. It would be interesting to see how optimized prompts work for each LLM, especially the open-weight ones.
• The authors do not use structured generation for the model responses, even though all the models are capable of it, either through vLLM or their own APIs. This could reduce error rates significantly (see the parsing sketch after this review).
• While the authors mention that the dataset can be used to fine-tune domain-specific LLMs, doing so introduces benchmark contamination, and any results on the benchmark from a fine-tuned LLM would not be admissible. Doing this successfully requires a privately held-out test set.

Questions:
1. Why was structured generation skipped? It makes for a more structured evaluation as well as improving the capabilities of less sophisticated models.
2. Do you have any plans for a test set that can be used to evaluate fine-tuned models in the future? Specifically, it would be helpful to have a test set drawn from sources other than the training set.
3. Were any other models evaluated for RAG besides GPT-4o? Since this model is used for both the RAG pipeline and the LLM-as-a-judge, the scores may not be comparable across other LLMs.

EditLens Prediction: Fully human-written
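Sketch referenced in the review above: a minimal example of the schema-constrained parsing that structured generation would enable, here validated post hoc with pydantic v2; decode-time enforcement (e.g., vLLM guided decoding or a provider's JSON mode) is the reviewer's suggestion rather than the paper's setup. The schema fields are hypothetical, not the benchmark's actual response format.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError

class DetectionAnswer(BaseModel):
    """Hypothetical response schema for the benchmark's detection step."""
    is_vulnerable: bool
    cve_id: Optional[str] = None
    rationale: str

def parse_response(raw: str) -> Optional[DetectionAnswer]:
    """Validate a model's raw output against the schema; None if it is not valid JSON of this shape."""
    try:
        return DetectionAnswer.model_validate_json(raw)
    except ValidationError:
        return None  # in practice: retry, or enforce the schema at decode time

print(parse_response('{"is_vulnerable": true, "cve_id": "CVE-2022-0185", "rationale": "unbounded copy"}'))
print(parse_response("The code looks vulnerable to a heap overflow."))  # None: free-form text fails
```
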