ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction   | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|-----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated    | 2 (50%)  | 4.00       | 4.50           | 5354               |
| Heavily AI-edited     | 1 (25%)  | 6.00       | 2.00           | 1058               |
| Moderately AI-edited  | 0 (0%)   | N/A        | N/A            | N/A                |
| Lightly AI-edited     | 0 (0%)   | N/A        | N/A            | N/A                |
| Fully human-written   | 1 (25%)  | 2.00       | 4.00           | 2944               |
| Total                 | 4 (100%) | 4.00       | 3.75           | 3677               |
Review 1: Protein as a Second Language for LLMs

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The Protein-as-Second-Language (PSL) framework is a method for protein function understanding with large language models (LLMs) that requires no fine-tuning. The approach reformulates amino-acid sequences as a symbolic language and uses adaptive, bilingual context construction (sequence + natural language) to enable zero-shot reasoning (a prompt-assembly sketch follows this review). A curated dataset of ~80k protein–QA pairs spanning functional, descriptive, and reasoning tasks supports the method.

Strengths:
- Introduces the idea of treating protein sequences as a "second language" for LLMs, bridging symbolic biological and natural-language reasoning.
- The bilingual corpus is diverse (79k QA pairs across 4 question types), functionally rich, and biologically balanced.
- Works across frozen LLMs (3B–14B) and improves both open-source and proprietary models in zero-shot settings.

Weaknesses:
- No wet-lab or structure-level validation is presented; success is measured only by text-based QA metrics (e.g., ROUGE-L).

Questions:
- How does the method perform on out-of-distribution or rare protein families, especially those absent from the QA corpus?

EditLens Prediction: Heavily AI-edited
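As a rough illustration of what the "bilingual context construction" described in the summary above could look like in code, the sketch below assembles retrieved sequence–description exemplars into a prompt for a frozen LLM. It is a minimal sketch under assumed interfaces; the names (`Exemplar`, `build_psl_prompt`) and the prompt wording are hypothetical and not taken from the paper.

```python
# Minimal sketch of bilingual in-context prompt assembly. All names and the
# prompt template are hypothetical illustrations, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Exemplar:
    sequence: str      # amino-acid sequence of a retrieved homolog
    description: str   # its natural-language functional annotation

def build_psl_prompt(query_sequence: str, question: str, exemplars: list[Exemplar]) -> str:
    """Pair retrieved sequences with their descriptions so a frozen LLM can read
    the protein 'language' alongside natural language."""
    context_blocks = []
    for i, ex in enumerate(exemplars, start=1):
        context_blocks.append(
            f"Example {i}\nSequence: {ex.sequence}\nDescription: {ex.description}"
        )
    context = "\n\n".join(context_blocks)
    return (
        "You are given protein sequences paired with functional descriptions.\n\n"
        f"{context}\n\n"
        "Now answer the question for this new sequence.\n"
        f"Sequence: {query_sequence}\nQuestion: {question}\nAnswer:"
    )
```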
Review 2: Protein as a Second Language for LLMs

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This work introduces a novel question-answering (QA) dataset focused on protein expression, localization, mechanism, and interaction. The authors also propose a retrieval-based framework that enables pretrained, generic large language models (LLMs) to analyze unknown protein sequences via in-context learning. By including similar proteins and their corresponding descriptions in the prompt, the paper reports an average 7% improvement in ROUGE-L score on QA tasks for the target unknown protein.

Strengths:
* This work proposes a framework that eliminates the need to train or fine-tune a task-specific LLM for protein sequence analysis.
* The paper employs dual criteria to retrieve similar proteins for augmenting the prompt, considering both sequence and text similarity.

Weaknesses:
* The overall novelty of the paper appears limited. Prior work, particularly ProtEx from Google DeepMind, already explored in-context learning for biological entity analysis. This submission, however, neither mentions nor compares with it.
* The similarity-based retrieval likely limits the model's generalization to unseen protein sequences that are very different from existing ones in the database. A 70% sequence-similarity threshold is applied when constructing the dataset and retrieving exemplars for the prompt, but novel or orphan proteins often share <40% identity with the existing database.

Questions:
Please address my two major concerns in the "Weaknesses" section first. I will reassess after the rebuttal. Other miscellaneous questions are as follows:
* Were the quality and correctness of the augmented QA from DeepSeek-R1 verified somehow? Teacher LLMs are known to hallucinate, especially on complex scientific topics like biology.
* Are GO annotations an effective criterion for grouping and redundancy reduction? As far as I understand, proteins within a group might be very different from each other even though they share similar GO annotations. This is especially the case when some proteins are not well annotated and have very few GO annotations.
* Automatic evaluation relying solely on the ROUGE-L score is most likely not sufficient. Other metrics such as accuracy (particularly for true/false QA), BLEU, and BERTScore might provide a better picture of the improvement in performance (a sketch of such an evaluation follows this review).
* A human evaluation is conducted in the paper, but the domain knowledge or expertise of the evaluators is not mentioned (maybe I missed it), making the reliability of this evaluation questionable.
* The term "zero-shot", used multiple times in the paper, is a bit misleading, because the retrieved similar proteins in the prompt may provide contextual information. The framework described in the paper is closer to a "few-shot" one.
* Numbers in Figure 4 (left) are not consistent with the bars. I would recommend a thorough double-check of all the numerical results presented in the paper.

EditLens Prediction: Fully human-written
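The broader automatic evaluation suggested above can be assembled from standard packages. The sketch below assumes the rouge-score, nltk, and bert-score libraries are installed; it is an illustrative evaluation helper, not the paper's pipeline.

```python
# Sketch of a multi-metric evaluation of free-text QA answers:
# ROUGE-L (as in the paper) plus BLEU and BERTScore, as the review suggests.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

def evaluate_answers(predictions: list[str], references: list[str]) -> dict:
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = [rouge.score(ref, pred)["rougeL"].fmeasure
               for pred, ref in zip(predictions, references)]
    smooth = SmoothingFunction().method1
    bleu = [sentence_bleu([ref.split()], pred.split(), smoothing_function=smooth)
            for pred, ref in zip(predictions, references)]
    # BERTScore downloads a pretrained model on first use.
    _, _, f1 = bert_score(predictions, references, lang="en")
    return {
        "rougeL": sum(rouge_l) / len(rouge_l),
        "bleu": sum(bleu) / len(bleu),
        "bertscore_f1": f1.mean().item(),
    }
```

For true/false questions, a simple exact-match accuracy over normalized answers would complement these string-overlap and embedding-based scores.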
Review 3: Protein as a Second Language for LLMs

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces the "Protein-as-Second-Language" framework, which aims to enable large language models (LLMs) to interpret protein (amino-acid) sequences as if they were acquiring a second symbolic language. By curating a bilingual dataset of almost 80k protein-question-answer triples and implementing an adaptive context construction mechanism, the approach supplies protein sequence–question–answer exemplars as in-context learning cues for frozen LLMs. Extensive experiments on multiple protein-text QA benchmarks demonstrate that LLMs guided by this framework substantially outperform their zero-shot counterparts and, in some cases, even domain-specialized, fine-tuned protein LLMs.

Strengths:
1. The paper cleverly reframes protein sequences as a "second language," allowing general-purpose LLMs to build protein-function mappings in a zero-shot regime. This paradigm bridges symbolic biological and natural languages, bypasses the need for task-specific fine-tuning or retraining, and creatively leverages LLMs' in-context learning strengths.
2. A substantial and well-constructed bilingual (protein-natural language) dataset is curated, addressing redundancy at both the sequence and annotation levels. Figures 1 and 3 illustrate a careful, step-wise reduction of redundancy with diverse coverage across species, protein families, superfamilies, and ontology categories.
3. The method is evaluated on multiple datasets (ProtDescribe, Protein2Text-QA, Mol-Instructions) with frozen, general-purpose LLMs (Qwen, GPT-4o, Mistral, Kimi) and compared against strong protein-LLM baselines (BioT5-plus, ProLLaMA). Main results (Table 1) and multiple figures demonstrate robust performance gains in both automatic (ROUGE-L) and human evaluations, frequently surpassing or matching fine-tuned specialized models.

Weaknesses:
1. The methodology's underpinnings, particularly the bilingual context construction mechanism (Section 3.2), lack precise mathematical formalization. While equation-based thresholds (for sequence and annotation deduplication, Sections 3.1.1/3.1.2) are provided, the procedure for query-to-context matching (how candidate exemplars are scored and integrated, and any ranking or aggregation formulae) is only described at a high level. For instance, does selection use hard thresholds or similarity-weighted composition? What is the precise mathematical form of the query-context similarity metric for the textual and sequence components? Without a formal definition of, for example, the joint scoring function or aggregation rule (a sketch of one possible form follows this review), it is difficult for others to re-implement the method or to affirm the reproducibility and generalizability of the context construction process. Explicit equations or algorithms are expected for a submission at this technical level.
2. While the empirical evaluation is extensive by the standards of current protein-language model research, critical baselines are missing:
   - Direct comparison with leading analogy/reasoning-augmented LLM paradigms, such as those leveraging knowledge graphs (e.g., ANALOGYKB [1]), hierarchical retrieval (e.g., BeamAggR [2]), or reinforcement-learning-based self-correction (e.g., SeRL [3], AlphaEdit [4], Self-Correct RL [5]), is not present.
   - There is also a missed opportunity to benchmark efficiency: the computational cost of in-context "second language" understanding (which requires substantial prompt assembly with many exemplars) is never compared directly to the one-time fine-tuning cost of protein LLMs, or to parameter-efficient adaptation approaches (e.g., S-LoRA [6], MoDeGPT [7]). Without this, claims of scalability remain qualitative.
3. The central claim, that LLMs can generalize protein understanding more efficiently through contextual exemplars, is only validated empirically under a restricted set of QA regimes. There is no attempt to analyze theoretical properties: for example, under what assumptions does the contextual-analogy paradigm guarantee generalization or compositional reasoning for proteins? What are the failure modes under distributional shift (out-of-distribution proteins or annotations not covered by exemplars)? The paper would benefit from at least some basic analysis or a discussion of the limitations.

References:
[1] Yuan, S., Chen, J., Sun, C. (2024). "ANALOGYKB: Unlocking Analogical Reasoning of Language Models with A Million-scale Knowledge Base."
[2] Chu, Z., Chen, J., Chen, Q. (2024). "BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering."
[3] Fang, W., Liu, S., Zhou, Y. (2025). "SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data."
[4] Fang, J., Jiang, H., Wang, K. (2025). "AlphaEdit: Null-Space Constrained Model Editing for Language Models."
[5] Kumar, A., Zhuang, V., Agarwal, R. (2025). "Training Language Models to Self-Correct via Reinforcement Learning."
[6] Wu, Y., Piao, H., Huang, L. (2025). "S-LoRA: Scalable Low-Rank Adaptation for Class Incremental Learning."
[7] Lin, C., Gao, S., Smith, J. S. (2025). "MoDeGPT: Modular Decomposition for Large Language Model Compression."

Questions:
1. Can the authors formalize the context construction mechanism, especially the mathematical definitions of exemplar ranking, weighting, or aggregation (e.g., is there a score function for context selection, or are choices made heuristically)? Please clarify with explicit pseudocode or equations.
2. What is the full breakdown of annotation QA pass/fail in the 5% discarded portion of the dataset? Are there any systematic biases or edge cases in the rejected data?
3. Did the authors attempt to benchmark model inference/prompt-assembly time versus fine-tuned protein LLMs, or estimate resource requirements (memory/latency) for large context-window assembly in practical use?
4. Could the authors compare directly with analogy-driven or multi-hop retrieval LLM frameworks (such as ANALOGYKB, BeamAggR) and efficient adaptation methods (S-LoRA, MoDeGPT)?
5. Can the authors clarify the apparent error in the human-in-the-loop Krippendorff's $\alpha$ (is 0.72% a typo?), and provide more detail on the distribution of human evaluations by task/model?

EditLens Prediction: Fully AI-generated
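As an illustration of the kind of formalization requested in Weakness 1 and Question 1, the sketch below shows one plausible similarity-weighted joint score over sequence homology and annotation-embedding similarity, with a hard threshold and top-k ranking. The weighting scheme, threshold, and function names are hypothetical, not the paper's actual mechanism.

```python
# Hypothetical joint exemplar-scoring function: a lambda-weighted combination of
# sequence identity (e.g., from MMseqs2, in [0, 1]) and cosine similarity between
# text-annotation embeddings. Illustrative only, not the paper's method.
import numpy as np

def joint_score(seq_identity: float,
                query_text_emb: np.ndarray,
                exemplar_text_emb: np.ndarray,
                lam: float = 0.5) -> float:
    cos = float(np.dot(query_text_emb, exemplar_text_emb) /
                (np.linalg.norm(query_text_emb) * np.linalg.norm(exemplar_text_emb)))
    return lam * seq_identity + (1.0 - lam) * cos

def select_exemplars(candidates, k: int = 5, threshold: float = 0.0):
    """Rank candidate tuples (id, seq_identity, query_emb, exemplar_emb) by the
    joint score and keep the top-k above a hard threshold."""
    scored = [(cid, joint_score(s, q, e)) for cid, s, q, e in candidates]
    scored = [item for item in scored if item[1] >= threshold]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

Whether selection is better expressed as a hard cutoff, a weighted sum as above, or a learned ranker is exactly the ambiguity the question asks the authors to resolve.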
Review 4: Protein as a Second Language for LLMs

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper proposes Protein-as-Second-Language (PSL), a training-free framework that enables large language models to interpret protein sequences as a "second language." Instead of fine-tuning, PSL performs retrieval-based in-context learning by constructing bilingual contexts that pair amino-acid sequences with natural-language descriptions. The authors build a 79K protein–QA corpus via Gene Ontology-based functional grouping, MMseqs2 clustering with semantic deduplication, and automatic QA generation using DeepSeek-R1 across four question types. During inference, PSL selects relevant examples based on sequence homology and semantic similarity, forming adaptive prompts for frozen LLMs (GPT-4o, Qwen, Mistral). Across three benchmarks (ProtDescribe, Protein2Text-QA, Mol-Instructions), PSL achieves up to a 17.2% ROUGE-L improvement, outperforming domain-specific models such as ProLLaMA-7B and BioT5+, and reframes protein understanding as retrieval-driven bilingual reasoning rather than supervised fine-tuning.

Strengths:
This paper introduces a conceptually novel and computationally efficient framework that enables large language models to understand protein sequences through bilingual contextual reasoning without any fine-tuning. In addition to the framework, the authors construct a large-scale bilingual protein–text corpus containing 79,926 sequence–question–answer pairs, which serves as the foundation for retrieval-based in-context learning and systematic evaluation.

Weaknesses:
1. The bilingual corpus is constructed using Swiss-Prot as the primary data source, while the evaluation datasets are also derived from or highly overlap with Swiss-Prot. The paper does not provide sufficient detail on how potential data leakage or overlap was prevented, which raises concerns about the fairness of the evaluation.
2. Each inference involves a retrieval step to construct query-specific contexts, but the computational overhead and latency introduced by this process are not analyzed. The practical efficiency of the framework therefore remains unclear.
3. The method assumes that proteins with high MMseqs2 similarity share similar functional or semantic contexts. However, this assumption may not always hold, especially for multi-domain proteins. A more critical discussion or an ablation on this assumption would strengthen the justification.
4. The experimental comparison includes only two domain-specific baselines, ProLLaMA-7B and BioT5+, which may not be sufficient to establish broad effectiveness. Including more diverse or fine-tuned protein LLMs would improve the reliability of the conclusions.
5. The framework appears to treat protein sequences and text jointly as input without a dedicated modality projector or alignment module. While this simplifies the design, it may not fully exploit cross-modal complementarities, and more structured feature integration could further enhance performance.

Questions:
1. Could the authors clarify whether the constructed bilingual corpus overlaps with the evaluation datasets? Since both the corpus and the benchmarks seem to originate from Swiss-Prot or related sources, it is important to specify how potential data leakage was prevented to ensure fair evaluation.
2. The paper transforms Swiss-Prot annotations into multiple QA formats rather than using the full annotations directly. What is the motivation for this choice, and would incorporating broader and more complete biological knowledge lead to more stable contextual enhancement?
3. The proposed framework involves a retrieval step for each query to build adaptive bilingual contexts. Could the authors discuss the computational overhead introduced by this process and its impact on inference time and scalability compared with fine-tuned models?
4. The exemplar selection process is described as combining sequence homology (via MMseqs2) and semantic similarity between QA pairs. Could the authors elaborate on how these two signals are integrated into the final retrieval score or ranking criterion?
5. The method assumes that proteins with high MMseqs2 similarity share similar contexts. Have the authors considered alternative similarity measures, such as embedding-based similarity from ESM (a sketch follows this review) or structure-based similarity from ProTrek, and could they provide related ablation results?
6. In the comparison with ProLLaMA and BioT5+, were these models fine-tuned on any part of the proposed corpus (as in RAFT), or were their publicly released parameters used directly for inference? Please clarify whether additional training or adaptation was performed to ensure a fair comparison.

EditLens Prediction: Fully AI-generated
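To make Question 5 concrete, the sketch below computes an embedding-based sequence similarity with ESM-2 via the fair-esm package, as one possible alternative to MMseqs2 identity. It assumes the esm and torch packages are installed; the model choice and mean-pooling strategy are assumptions, and this is not the paper's retrieval implementation.

```python
# Sketch: cosine similarity between mean-pooled ESM-2 embeddings of two proteins,
# an embedding-based alternative to MMseqs2 sequence identity for retrieval.
import torch
import esm

def esm_embedding_similarity(seq_a: str, seq_b: str) -> float:
    # Load a pretrained ESM-2 model (650M parameters) and its alphabet.
    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    model.eval()
    batch_converter = alphabet.get_batch_converter()
    _, _, tokens = batch_converter([("a", seq_a), ("b", seq_b)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]
    # Mean-pool per-residue embeddings, skipping the BOS token and any padding.
    emb_a = reps[0, 1:len(seq_a) + 1].mean(0)
    emb_b = reps[1, 1:len(seq_b) + 1].mean(0)
    return torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0).item()
```

An ablation swapping this similarity (or a structure-aware one such as ProTrek's) into the retrieval step would directly test the MMseqs2-homology assumption raised in Weakness 3.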