ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 3 (75%) | 3.33 | 3.67 | 2956 |
| Fully human-written | 1 (25%) | 8.00 | 4.00 | 897 |
| Total | 4 (100%) | 4.50 | 3.75 | 2442 |
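As a quick arithmetic check, the Total row is consistent with count-weighted averages over the two non-empty categories; a minimal sketch (the computed length average is 2441.25, so the table's 2442 presumably reflects rounding in the per-category means):

```python
# Count-weighted averages over the two non-empty categories:
# (count, avg rating, avg confidence, avg length in chars).
rows = [(3, 3.33, 3.67, 2956), (1, 8.00, 4.00, 897)]
total = sum(n for n, _, _, _ in rows)
rating = sum(n * r for n, r, _, _ in rows) / total   # 4.4975 -> 4.50
conf = sum(n * c for n, _, c, _ in rows) / total     # 3.7525 -> 3.75
length = sum(n * l for n, _, _, l in rows) / total   # 2441.25 (table: 2442)
print(f"{rating:.2f} {conf:.2f} {length:.0f}")
```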
Individual Reviews
---

**From Traits to Circuits: Toward Mechanistic Interpretability of Personality in Large Language Models**

Soundness: 1 (poor) · Presentation: 2 (fair) · Contribution: 2 (fair) · Rating: 2 (reject) · Confidence: 5 (absolutely certain)

Summary: This paper investigates whether the personality of large language models can be traced to identifiable transformer circuits. Using tools from the mechanistic interpretability community, the authors identify a small set of sparse nodes in a small pretrained model (LLaMA-2-7B-Chat) that are responsible for generating answers on the proposed Trait-Trace dataset. Ablation studies and causal-intervention analyses show that certain nodes within these circuits can substantially influence LLM performance on the Trait-Trace task.

Strengths:
- The perspective of identifying interpretable circuits in Transformer models that are causally responsible for personality-like behaviors is interesting and could have important implications for safety, alignment, and the development of better chatbots.
- The paper is generally well written and easy to read.

Weaknesses:
- A major weakness lies in the evaluation. The study relies on a newly proposed Trait-Trace dataset generated by GPT-4o that focuses on single-word reactions to vignettes/trait prompts, and all circuit-discovery and causal-intervention experiments depend on this fragile single-word reaction task. It is unclear what the task actually measures: the discovered circuits may merely capture distributional shifts in certain personality-related words rather than any higher-level notion of personality in LLMs. Generalization tests are essential. For example, under causal interventions/steering, do circuits discovered on Trait-Trace transfer to more complex settings (e.g., dialogue generation, storytelling, or psychometric evaluation items)? Given the authors' access to trained psychology graduate students, such evaluations seem feasible. Demonstrating this would better justify the claim that the identified circuits reflect personality rather than confounding word-distribution shifts.
- The Trait-Trace task design is too simple. The template "I'm {p}, regarding {s}, I feel very {r}" biases lexical and affective choices, making the discovered circuits specific to particular word choices rather than to general personality constructs. It remains unclear what construct this task is evaluating.
- Limited conceptual insight. As framed, one could likely find circuits or sparse subgraphs for almost any language-model behavior. The authors should better demonstrate, or at least discuss, why these circuits matter and why the discovered early-layer MLP features align with human intuitions. If the prompt were replaced with random fillers unrelated to personality, would the intervened circuits still induce a similar shift in the output-logit distribution? (See the sketch after this review.)

EditLens Prediction: Lightly AI-edited
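The random-filler control suggested in the last weakness could be prototyped cheaply with activation patching. Below is a minimal sketch using TransformerLens, with GPT-2 standing in for LLaMA-2-7B-Chat and a made-up set of early MLP layers; neither the model nor the layer set comes from the paper.

```python
import torch
from transformer_lens import HookedTransformer

# GPT-2 stands in for LLaMA-2-7B-Chat; circuit_layers is hypothetical,
# not the circuit identified in the paper.
model = HookedTransformer.from_pretrained("gpt2")
circuit_layers = [0, 1, 2]  # early MLP layers to zero-ablate

def zero_mlp(value, hook):
    # Replace this layer's MLP output with zeros.
    return torch.zeros_like(value)

hooks = [(f"blocks.{l}.hook_mlp_out", zero_mlp) for l in circuit_layers]

def logit_shift(prompt: str) -> float:
    """KL(clean || ablated) over the next-token distribution."""
    tokens = model.to_tokens(prompt)
    with torch.no_grad():
        clean = model(tokens)[0, -1].log_softmax(-1)
        ablated = model.run_with_hooks(tokens, fwd_hooks=hooks)[0, -1].log_softmax(-1)
    return torch.sum(clean.exp() * (clean - ablated)).item()

trait = "I'm extraverted. Regarding a party invitation, I feel very"
filler = "The weather report said that tomorrow afternoon it will be very"
print(logit_shift(trait), logit_shift(filler))
# If the circuit is trait-specific, the shift should be much larger for
# the trait prompt than for the unrelated filler.
```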
---

**From Traits to Circuits: Toward Mechanistic Interpretability of Personality in Large Language Models**

Soundness: 1 (poor) · Presentation: 2 (fair) · Contribution: 1 (poor) · Rating: 2 (reject) · Confidence: 4 (confident, but not absolutely certain)

Summary: This paper explores whether personality traits (based on the Big Five model) can be localized as identifiable "circuits" within large language models (LLMs). The authors construct a synthetic dataset, TRAITTRACE, containing prompts that express high or low levels of each trait and corresponding trait-consistent reactions. Using Edge Attribution Patching with Integrated Gradients (EAP-IG), they identify minimal subgraphs within the model that preserve performance on trait-consistent response prediction. Results suggest that small, sparse circuits can reproduce the full model's behavior, that high and low levels of traits share many nodes but differ in edge directions, and that early MLP layers serve as bottlenecks for trait information.

Strengths: This paper tries to move beyond behavioral probing toward the mechanistic interpretability of social/psychological constructs.

## Weaknesses and Suggestions

### 1. The motivation is weak.

The authors justify this work through an analogy with neuroscience, arguing that personality traits in humans arise from neural circuits and therefore may also emerge as "trait circuits" in LLMs. However, this analogy is conceptually flawed. Human traits are latent psychological dimensions, not localized neural entities, and the connection to artificial circuits is purely metaphorical.

### 2. The ethics statement is missing.

Because the paper draws direct analogies between human brain circuits and model activations, it risks **anthropomorphism**, suggesting that LLMs "possess" personality traits or human-like psychology. Such framing requires careful ethical consideration and a clear disclaimer, but no ethics statement is provided. The authors should explicitly acknowledge these limitations and clarify that their findings do not imply genuine human-like cognition.

### 3. Experimental rigor is low.

Only two small instruction-tuned models (LLaMA-2-7B-Chat and Phi-2) are tested, without base or larger models. This makes it difficult to assess whether the findings generalize across training phases or scales of LLMs. Including additional models, or verifying whether similar trait circuits emerge in non-chat variants, would significantly strengthen the paper. I also understand that the paper's goal is to discover the circuits lying within LLMs; but to establish the quality and validity of the dataset, I recommend that the authors also report evaluations with other methods, such as pure prompting.

### 4. Prompt and task design are conceptually flawed.

The Big Five traits are continuous spectra, but the dataset reduces them to binary self-descriptions (e.g., "I am high in openness" vs. "I am low in openness"). This introduces strong lexical cues and risks capturing superficial associations between trait names and responses rather than genuine trait inference. A more realistic approach would involve inferring traits from open-ended essays or autobiographical texts, which is one of the canonical methods for evaluating human personality [1]. For methodological reference, see [2].

### 5. Novelty is limited.

The technical approach, combining edge-attribution patching with pruning, is a direct application of existing interpretability methods. The main novelty lies in dataset curation, but the curation is not rigorous enough.

### 6. The causal interpretation is overstated.

In Section 6.3, the authors conduct a causal intervention analysis. However, the key question, whether these circuits truly represent personality traits rather than lexical correlations, remains unresolved. Without ruling out such confounders, it is premature to claim that the identified subgraphs mechanistically encode traits.

### Questions

1. **Evaluation details.** The details of the evaluation are not fully provided. Words or phrases like "procrastinating" are divided into 5 tokens, and the LLM response can be a full sentence, yet how the authors check the overlap between the references and the LLM-generated tokens is not fully explained. (A sketch of two plausible scoring options follows this review.)
2. **Dataset details.** Please provide more details of the curated dataset so that its quality can be evaluated.

References:
[1] McAdams, Dan P. "Narrative identity." Handbook of Identity Theory and Research. New York, NY: Springer, 2011. 99-115.
[2] Suh, Joseph, et al. "Rediscovering the latent dimensions of personality with large language models as trait descriptors." arXiv preprint arXiv:2409.09905 (2024).

EditLens Prediction: Lightly AI-edited
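On question 1: a reference reaction like "procrastinating" spans several subword tokens, so "overlap" could plausibly mean either a first-subtoken match or the likelihood of the full continuation. A minimal sketch of both options with Hugging Face transformers, using GPT-2 as a stand-in; the paper's actual metric is not specified here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for the models used in the paper.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "I'm low in conscientiousness. Regarding deadlines, I feel very"
reference = " procrastinating"  # multi-token reference reaction

prompt_ids = tok(prompt, return_tensors="pt").input_ids
ref_ids = tok(reference, return_tensors="pt").input_ids[0]

with torch.no_grad():
    # Option 1: probability mass on only the first subtoken of the reference.
    next_logits = model(prompt_ids).logits[0, -1]
    first_tok_prob = next_logits.softmax(-1)[ref_ids[0]].item()

    # Option 2: mean log-likelihood of the full multi-token continuation.
    full_ids = torch.cat([prompt_ids[0], ref_ids]).unsqueeze(0)
    logps = model(full_ids).logits[0].log_softmax(-1)
    n = prompt_ids.shape[1]  # position n-1 predicts the first reference token
    ref_logp = sum(logps[n - 1 + i, ref_ids[i]].item() for i in range(len(ref_ids)))

print(first_tok_prob, ref_logp / len(ref_ids))
```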
---

**From Traits to Circuits: Toward Mechanistic Interpretability of Personality in Large Language Models**

Soundness: 3 (good) · Presentation: 2 (fair) · Contribution: 3 (good) · Rating: 6 (marginally above the acceptance threshold) · Confidence: 2 (willing to defend, but central parts may have been misunderstood)

Summary: This paper discusses personality in LLMs, pioneering the application of mechanistic interpretability to analyzing the models themselves. The authors find that the identified circuits are functionally complete, structurally sparse, and heavily dependent on early MLP layers, which act as causal bottlenecks.

Strengths: This is a thoroughly analyzed paper. While it does not introduce a novel methodology, it rigorously analyzes and investigates personality circuits. This approach uncovers many phenomena previously unknown to the community and refutes the conjecture that "personality is a globally diffuse property."

Weaknesses:
- The paper's definition of personality is oversimplified. The Big Five model is not sufficient to encapsulate personality, which could also include, for example, dark personality traits (e.g., the Dark Triad) or aspects such as values, beliefs, and motives. The "personality circuits" identified in this paper actually correspond to "Big Five trait circuits," rather than the broader "personality circuits" as claimed. The authors should analyze these aspects further to reach a more definitive conclusion.
- It is difficult for the paper to prove that these personality circuits are exclusive; they are very likely key components that are also involved in executing other semantic tasks. (A sketch of a simple overlap check follows this review.)
- The study was conducted on two relatively small and older LLMs. It remains unknown whether the conclusions generalize to state-of-the-art models, especially MoE-based architectures. An analysis of models like Qwen3-30B-A3B and Qwen3-235B-A22B would significantly enhance the paper's contribution.
- The template used to identify personality is highly structured. It is unclear whether similar phenomena persist in more realistic, user-focused tasks such as free-form conversation, extended dialogues, or long-form writing.

EditLens Prediction: Lightly AI-edited
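The exclusivity concern could be tested directly by discovering a circuit for an unrelated semantic task and measuring node overlap with the trait circuit. A minimal sketch over hypothetical node sets; the components shown are illustrative, not taken from the paper.

```python
# Hypothetical circuit node sets, written as (layer, component) pairs;
# in practice these would come from separate EAP-IG runs on the two tasks.
trait_circuit = {(0, "mlp"), (1, "mlp"), (5, "head_3"), (11, "head_7")}
other_task_circuit = {(0, "mlp"), (2, "mlp"), (5, "head_3"), (9, "head_1")}

shared = trait_circuit & other_task_circuit
jaccard = len(shared) / len(trait_circuit | other_task_circuit)
print(f"shared nodes: {sorted(shared)}, Jaccard = {jaccard:.2f}")
# A high Jaccard index would suggest the "personality circuit" is largely
# general-purpose machinery rather than trait-exclusive.
```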
---

**From Traits to Circuits: Toward Mechanistic Interpretability of Personality in Large Language Models**

Soundness: 3 (good) · Presentation: 3 (good) · Contribution: 4 (excellent) · Rating: 8 (accept, good paper) · Confidence: 4 (confident, but not absolutely certain)

Summary: The paper studies whether personality in LLMs may similarly be realized through structured internal computation paths. The authors conclude that "only a compact set of attention heads and MLP units appears necessary for encoding and expressing each trait across different models."

Strengths: The research is on a very timely and interesting topic. The results are convincing and should be of interest to a broad community of researchers.

Weaknesses: Although the authors correctly note that LLMs merely simulate certain behaviors or personalities, they anthropomorphize LLMs in other parts of the text. This is unfortunate, but a minor problem in my opinion.

Questions: How general is the approach, and would it be applicable to bigger models? In particular, how can we be sure that the conclusions will hold for models that are an order of magnitude larger and go through a significantly longer post-training phase for alignment?

EditLens Prediction: Fully human-written