ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated | 1 (25%) | 4.00 | 4.00 | 3282
Heavily AI-edited | 0 (0%) | N/A | N/A | N/A
Moderately AI-edited | 0 (0%) | N/A | N/A | N/A
Lightly AI-edited | 0 (0%) | N/A | N/A | N/A
Fully human-written | 3 (75%) | 4.00 | 3.33 | 2989
Total | 4 (100%) | 4.00 | 3.50 | 3062
Title: Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper proposes a prompt engineering strategy for improving safety alignment. The LLM is prompted with instructions modeled on Asimov's three laws of robotics. In addition, the reasoning output is monitored for misalignment. If misalignment is detected by another LLM, the generation is rolled back to a point identified by inspecting attention scores, additional guidance is injected into the chain of thought, and generation is resumed. Experiments show that this approach gives a modest boost on safety and social reasoning benchmarks.

Strengths:
1. The paper's attempts to incorporate lessons from other fields like psychology are interesting and creative, even if I don't think they are successful.
2. The proposed technique is simple to implement.

Weaknesses:
1. A major concern is the reliance on language from psychology that seems misleading about what is really happening. The method is essentially a particular prompt with an LLM-as-judge that monitors the results. I don't think it is reasonable to call this "cognition". The paper has many instances of this provocative terminology, which seems out of place and overstates the capabilities of LLMs. Another example is "cognitive states," which are just vectors in a subset of {-1, 0, 1}^3. The first claimed contribution says that the paper "formalizes cognitive perception." I respectfully disagree that this is a reasonable way to describe writing prompts that describe a model of human cognition.
2. The paper does not engage much with whether the prompting strategy is having the desired effect, in the sense that the model is actually following the intended schema. A few case studies are presented in the appendix without synthesizing the findings into an overall evaluation. This paper would be much stronger if it evaluated the ability of LLMs to follow the intended rules and reason in the described state space. Currently the presentation focuses on small improvements on existing benchmarks rather than on investigating the limits of LLMs' ability to reason about these topics. I would be much more enthusiastic about such a paper and I encourage the authors to consider such a direction.
3. The paper does not measure the variance across generations. The results are presented without something like standard errors, yet the differences in scores between methods seem small on some metrics.

Questions:
1. Does CooT incur any penalty in accuracy on unharmful requests? How often does the LLM judge incorrectly flag unharmful situations?
2. In what sense is the "causal rollback" causal? The position to roll back to is determined by larger attention scores. How do attention scores prove that a particular place in the CoT caused the result to be harmful (or caused any other aspect of the result)?

EditLens Prediction: Fully human-written

Title: Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: This paper introduces Cognition-of-Thought (CooT), an inference-time alignment framework that couples a Generator with a cognitive Perceiver to monitor and correct LLM outputs during decoding. The Perceiver uses Asimov's Three Laws (Safety > Altruism > Egoism) to detect violations, triggers rollback to error origins, and injects universal and contextual guidance to steer generation toward safer outputs. The paper compares the method with a range of baselines on AIR-Bench (which measures safety) and SocialEval (which measures social intelligence), showing that CooT achieves higher or competitive performance.

Strengths:
- The decoding-time cognitive architecture is an interesting and creative idea; this is also the first work to propose inference-time alignment as an explicit Generator-Perceiver loop with structured state monitoring.
- The proposed framework is compared thoroughly with baselines across multiple model families, and the results show consistent improvements.
- The ablation studies are informative and validate each component (rollback, guidelines, Perceiver size, precedence). All components contribute meaningfully, which supports the design of the framework.
- The paper is well written and easy to follow.

Weaknesses:
1. The state cognition model (Section 3.1) lacks theoretical validation.
   - The use of Asimov's laws seems under-justified. Why did you choose these three laws specifically? They were designed for science-fiction robots, not real AI safety.
   - Why not use established moral psychology frameworks (e.g., Haidt's Moral Foundations Theory) or empirically grounded safety taxonomies?
2. The use of "cognition" seems inaccurate, as you are merely describing a pattern-matching mechanism.
   - You claim the Perceiver provides "cognitive self-monitoring," but isn't it just doing classification with a specially prompted LLM? I do not find justification that this is genuinely cognitive rather than sophisticated pattern matching.
   - The term "cognition" should be used more sparingly or with more evidence (e.g., evidence of genuine reasoning and understanding beyond behavioral output).
3. The rollback mechanism seems heuristic.
   - The attention-based sharpness score (the equation combining a max-norm and an entropy term) lacks principled justification. Why should peaked attention indicate causal error origins? There is no analysis of failure modes: what if attention is diffuse, or the error spans multiple positions?
   - The threshold τ requires tuning (Appendix A.4), but how should it be set for new domains?

Questions:
1. My understanding is that your state space is {-1, 0, 1}³, giving only 27 possible states (and you restrict it to a "feasible" subset F). But real-world ethical dilemmas are much more nuanced. How can this state space capture the complexity of social alignment? For example, how would your system handle cases requiring trade-offs between different types of harm?
2. You run two models in tandem (Generator + Perceiver), with potential rollback and regeneration. What is the computational overhead? CooT can be very computationally expensive on models beyond the ones you tested (> 32B).
3. The Generator and Perceiver share weights; how can the same biased parameters reliably audit themselves?

EditLens Prediction: Fully human-written

Title: Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper introduces Cognition-of-Thought (CooT), a novel decoding-time framework that equips LLMs with an explicit cognitive self-monitoring loop to improve safety and social reasoning. CooT couples a standard Generator with a cognitive Perceiver that continuously monitors generation using a precedence-based hierarchy. When violations are detected, the system performs causal rollback and regenerates with injected guidance combining universal social priors (the BESSI framework) and context-specific warnings. Experiments on AIR-Bench 2024 and SocialEval show improvements over baselines, with comprehensive ablations demonstrating each component's contribution.

Strengths: The paper addresses the important problem of making alignment explicit and dynamic rather than baked into model weights. The core idea of coupling a Generator with a Perceiver for real-time cognitive monitoring is interesting and well motivated by psychological research on metacognition. The experimental methodology is thorough, with evaluations across multiple benchmarks (AIR-Bench, SocialEval, HarmBench) and model families (Gemma, Llama, Qwen, GPT), demonstrating generalizability. The ablation studies (Table 3) systematically validate each component's contribution, showing that rollback, guideline injection, and precedence-aware states are all necessary. The qualitative case studies in Appendix D provide valuable insight into when and how the system intervenes.

Weaknesses: I think the primary weaknesses concern practical deployment of the framework. For instance, the "universal social schema" (BESSI) may have cultural limitations: the BESSI framework is derived primarily from Western psychological research and may not generalize well across cultures with different social norms and values. The paper evaluates on English and Chinese tasks but doesn't discuss whether the social skills taxonomy (e.g., "Social Warmth," "Ethical Competence") translates appropriately across these contexts or whether culture-specific adaptations are needed. Furthermore, the paper doesn't report inference latency or throughput. Given that CooT requires (1) running the Perceiver at each step, (2) potentially multiple rollback-and-regenerate cycles, and (3) encoding contextual guidelines, the computational cost could be substantial; this is critical for practical deployment. Also, the cognitive state representation is quite limited. Using just three ternary values (Safety, Altruism, Egoism) ∈ {-1, 0, 1}³ to represent the model's "cognitive state" seems overly simplistic for capturing the complexity of ethical reasoning. Real ethical dilemmas often involve trade-offs that don't fit neatly into such a scoring hierarchy (e.g., the trolley problem). I would rank this weakness as relatively minor, though, since I couldn't think of a benchmark for stress-testing CooT.

Questions:
- What happens when the Perceiver itself makes errors in state classification? Is there any confidence calibration or uncertainty quantification in the cognitive state predictions?
- How does the Perceiver handle ambiguous cases where safety, altruism, and egoism are genuinely in tension? For instance, in a scenario where refusing to help (preserving safety) and helping (altruism) both seem reasonable, how does the system make the judgment call?

EditLens Prediction: Fully AI-generated

Title: Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: This paper proposes a cognitive alignment framework for LLMs that enables them to self-monitor their own outputs. Specifically, the authors introduce an additional cognitive Perceiver that continuously monitors the generation process. The Perceiver uses specialized prompts to determine whether the generated text complies with Asimov's Three Laws of Robotics. When it detects a violation, the model rolls back the generation to the offending position, which is located by aggregating the Generator's attention maps to identify the positions most strongly affecting the current prediction. The authors then introduce corrective guidelines (sentence priors) to guide re-generation from that position. Experiments show that the framework improves safety and social reasoning performance across multiple models.

Strengths:
- The paper presents a self-monitoring safety alignment framework that incorporates human cognition and validates its effectiveness across different model families and safety scenarios. The framework appears conceptually simple and practical to implement.
- Introducing guidelines and priors is a welcome idea. The cognitive Perceiver assesses whether the generated text satisfies Asimov's Three Laws, i.e., Safety, Altruism, and Egoism, and uses normative corpora such as the Behavioral, Emotional, and Social Skills Inventory to guide regeneration.

Weaknesses:
- The overall framework of the paper is clear and intuitive: detect inappropriate generation, localize it, and re-generate the text. However, some specific technical choices seem unclear or admit alternatives. For example, why does the Perceiver use an LLM to generate classification results instead of a standard supervised classifier (the latter seemingly being faster)? In the localization step, why can the maximum value of the aggregated attention map (lines 251-255) be taken as the localization result? A more thorough discussion comparing common localization techniques would help justify this choice.
- The paper should report the computational overhead introduced by the framework. Methods that embed safety mechanisms into model weights typically do not introduce additional inference overhead, whereas using a Perceiver (along with the other steps) appears to introduce extra inference cost. The authors should compare the average inference time between the base model and the model using the proposed framework (as in Figure 2) on the same tasks. Additionally, they should compare the runtime increase incurred by common safety methods when performing the same inference tasks (as shown in Table 2).
- The paper should include several examples of outputs before and after applying the framework, both in the main text and in the appendix. It would also be valuable to show failed correction cases and analyze the underlying reasons.

Minor: In Table 2, in the Cooperation column, the AGRS method (54.26) appears to outperform the proposed method (54.12), yet the proposed method is highlighted in bold.

Questions: See weaknesses.

EditLens Prediction: Fully human-written