ICLR 2026 - Reviews

SubmissionsReviews

Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 1 (25%) 2.00 3.00 4289
Heavily AI-edited 0 (0%) N/A N/A N/A
Moderately AI-edited 0 (0%) N/A N/A N/A
Lightly AI-edited 0 (0%) N/A N/A N/A
Fully human-written 3 (75%) 6.00 3.67 3601
Total 4 (100%) 5.00 3.50 3773
Title Ratings Review Text EditLens Prediction
Liars' Bench: Evaluating Deception Detectors for AI Assistants Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces LIARS’ BENCH, a benchmark (~58k examples) for evaluating deception detection in AI assistants across six settings that vary by (i) object of belief (world vs. self-knowledge) and (ii) reason for deception (pressure from context vs. inherent). Data are generated with three open-weight models (and fine-tuned variants). The authors evaluate three detector families: LLM-as-a-judge, self-evaluation (ask the model to reflect on its prior answer), and white-box linear probes. They report that LLM-as-a-judge performs best on average, while all methods struggle when deception hinges on the model’s latent knowledge (HP-KR). * Addresses a timely and important problem with a broad, multi-slice benchmark that goes beyond simple factual lies. * Clear articulation of detector design axes (model access, scope, interactivity, semantic verification). * Inclusion of challenging scenarios (e.g., insider-trading concealment; soft-trigger deception) likely to matter in practice. * Open-weight focus enables reproducibility and white-box baselines. ### Overall Weaknesses * Conceptual ambiguities and inconsistencies. Figure 1’s ID example/caption conflict (the caption does not reflect the figure); definitions blur deception with persuasion. The text claims alignment with MASK, but MASK targets belief consistency rather than intent to change another’s belief [2]. However, the authors write on line 161 > The assistant behaves deceptively if it shows apparent intention to change a belief of their conversational partner, in a way that is unfaithful to the evidence held by the assistant regarding * On-policy vs. pre-filled motivation is undercut by setups where deception is induced by hidden system prompts that detectors cannot see; realism claims should be tempered. * Baselines are incomplete. The widely used unrelated-question detector [1] is a natural, strong baseline and should be included. White-box probe results are weaker than recent reports (e.g., [3]) without a clear replication/diagnostic. * HP-KR slice is effectively unobservable for black-box methods under the paper’s own definition (depends on what the model knows). Provide belief evidence/oracles or mark it as white-box-only; otherwise it serves more as a stressor than a fair benchmark. ### Clarity Writing and figures need tightening. Resolve Figure 1, clearly distinguish deception vs. persuasion, and state how your definition relates (or not) to MASK’s honesty definition. Reduce “see appendix” dependencies by surfacing key prompt templates and sampling decisions in the main text. ### Relation to Prior Work * Positioning is incomplete. Please compare against Pacchiardi et al. (unrelated questions) [1] and discuss differences with MASK [2] (honesty vs. accuracy disentanglement). For white-box, reconcile your probe performance with Goldowsky-Dill et al. [3]. * Heavy reliance on appendix for core details; multiple cross-references and naming are hard to follow. [1] Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner. How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions. ICLR 2024. [2] Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, Brad Kenstler, Mick Yang, Isabelle Barrass, Alice Gatti, Xuwang Yin, Eduardo Treviño, Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks. The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems. arXiv:2503.03750, 2025. [3] Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn. Detecting Strategic Deception with Linear Probes. ICML 2025 (PMLR v267), 2025. 1. Fix Figure 1: which ID configuration is correct? 2. Clarify definition: is deception labeled by contradiction to the model’s own beliefs (MASK-style) or by intent to change the user’s belief? If different from MASK, please say so plainly. 3. For HP-KR, what information can a black-box detector access to infer the model’s belief? Can you release per-item belief evidence (e.g., neutral elicitation answers/consistency stats)? 4. Include the unrelated-question detector [1] or justify its exclusion. 5. White-box probes: Can you replicate the strongest probe configurations reported in [3]? Fully AI-generated
Liars' Bench: Evaluating Deception Detectors for AI Assistants Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces Liar's Bench, a benchmark for deception detection methods. The benchmark consists of 6 datasets and contains responses from 3 open-weights LLMs. An evaluation of several black-box and white-box detection methods is performed on the benchmark. Overall, the paper tackles the very timely problem of evaluating methods for detecting deceptive behaviour in LLMs. I believe a benchmark for deception detection is of interest for the community. I found the paper to be mostly well written and easy to follow. In particular: * The presentation is clear, the taxonomy of deception as well as the individual datasets are explained well. * Both white- and black-box methods are evaluated, which covers a good amount of methods. * The evaluation metrics are rigorous with the balanced accuracy being calibrated for at most 1% false positive rate on the control dataset Alpaca. * The prompts are clearly outlined in the Appendices and datasets are also described in detail. While the topic is highly relevant I found the evaluation and overall contribution of the paper limited. Specifically, I think it would be very meaningful to identify some broader trends rather than benchmark and model specific observations. Below are my main concerns. **Marginal novelty of the overall dataset:** The main contribution of the paper seems to be the dataset as a benchmark. However most datasets are slight adaptations and combinations of existing ones, with newly generated responses. In other words, only limited new data or methodology is introduced. Overall, I believe a benchmark for deception detection is needed, so it would be great if Liar's bench could be extended to include more novel and realistic scenarios (see also below re artificial deception setups). **Model scope is limited:** The scope of models is limited for a large-scale deception benchmark. The paper only uses 3 open-weights models that are smaller and less capable than current frontier models. This may not capture more complex deceptive behaviour by larger and more recent current models such as GPT5, Claude Sonnet 4.5,... While I appreciate the constraints regarding cost and also the requirement of activations for the white-box methods, I do think the paper would be significantly stronger if it included a larger range of models (the white-box methods would not need to be evaluated on all models). This would also allow for some interesting results / trends that are generalisable beyond the specific benchmark and models: for example, how does the performance of deception detectors scale with the deceptive model size and performance, and similarly when scaling the strength of the detector. **Artificial deception setups:** While the datasets are mostly adapted from prior work, I find the setups rather artificial, which makes me doubt the real-world applicability of the benchmark. For example, several datasets explicitly instruct the model to lie. **Lack of human/external evaluation:** A more in-depth analysis of the policy compliance, i.e., do the models actually respond deceptively when instructed to do so would be beneficial. As far as I understood, the labels of deceptive and truthful are purely based on the prompts / intent? I checked Appendix E which was quite useful for understanding why Llama and Qwen were removed from HP-C, do you have a hypotheses or justification for why these differences happened? **Overall analysis and discussion of results could be improved:** It would be great to see a more in depth discussion of why the results are as they are and going beyond the main metrics shown in Figure 2. In particular, the discussion of the LLM-as-a-Judge method is quite speculative. I believe a more detailed analysis with quantitative results regarding the hypotheses could significantly strengthen the discussion, e.g., can the fact checking of relevant claims be analysed (is there a correlation between the judge being able to answer the questions correctly / know the facts and being able to detect deception), and how often did the judge explicitly state that lying to prevent harm is responsible? **Additional points:** * The standard errors seem to be missing on HP-C and CG in Figure 2? * What are the "given models" in step 2 of the probe training pipeline (line 345)? This seems like an important detail. * While conceptually straight forward, I think the description of the follow-up probe could be improved for clarity. * I think discussing MASK in 2.2 is distracting, removing it would improve the flow in my opinion. * line 421 "this method performs depends with how..." -> this method depends on how... Please see my main concerns above. Here are some additional questions that would be great if the authors could clarify: * How is the pooled subset for the training of the upper-bound probe selected? * Does the strong performance of the upper-bound probe (in comparison to the weaker performance of the mean probe) suggest that features of deception are very dataset dependent? I.e. are the differences between deceptive and truthful behaviour not generalisable? * The reason for excluding ST is unclear to me based on lines 357-359. What do you mean by "each variation"? Fully human-written
Liars' Bench: Evaluating Deception Detectors for AI Assistants Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces a benchmark for evaluating deception detection methods. The benchmark consists of deceptive and honest outputs from LMs. The paper evaluates a number of different detection methods on their benchmark. Overall I think the paper contributes a solid benchmark and methodology with somewhat limited evaluation of different detection methods. The paper is well written and presented. Figure 1 provides clear examples from the benchmark. Overall good awareness of related literature. Solid dataset created of LM responses curated. Principled and informative metrics used. Use of the alpaca control dataset to set the FPR is very intuitive and sound. It would be helpful to add a random classifier as a baseline to the plots. I think the main weakness is that there are not that many detection methods tested (especially as the two BB methods are essentially the same with different models). Could you try some other methods? Like the black box lie detector from https://arxiv.org/abs/2309.15840 Minor Especially as you consider the reasons for deception, you should related this to the fact that deception is intentional cf the refs below https://arxiv.org/abs/2312.01350 https://arxiv.org/abs/2402.07221 I guess it's unsurprising that the Claude model outperforms self-evaluation, because the openweiht models are very small in comparison. For models with the same capability, it would be interesting to evaluate whether they are better at self evaluation. Are there other detection methods you can try? Fully human-written
Liars' Bench: Evaluating Deception Detectors for AI Assistants Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents Liars' Bench, a benchmark of deceptive and honest responses from 6 distinct categories of deception, generated by three open-weights frontier models. Datasets are categorised based on two axes: 'object of belief' and 'reason for deception'. Various standard deception-evaluation metrics, including LLM-as-a-Judge and linear probes, are used to validate the dataset, and find differences in performance across datasets and models. This is a needed, broad addition to the field of LLM deception benchmarks. Typology of the datasets is cool & seems meaningful and reasonable. Great to identify the limitations of existing datasets e.g. true/false factual claims, deceptive cover-ups. Great to have fine-tuned some generative models, so as to not always depend on instruction via prompt (whether implicit or explicit). 58k samples is substantial Language is exceptionally clear; terms are precise ("functional definition of deception"); citations are clean, abundant and relevant; flow is easy to follow. It's good to mention that existing datasets are off-policy. The detailed comparison to MASK (and TQA) is welcome. The principled exclusion of MASK from Liars' Bench is appreciated. The identification of two axes, 'object of belief' and 'reason for deception', is novel and interesting. For knowledge-report section: great to take WMDP results and filter for knowledge. The identification of Knowledge-Report as a form of deception with particular challenges is a valid insight. The benchmark is helpfully compiled in a standard format (ChatML) with relevant info (boolean flag, model used to generate the transcript). It shows thorough thinking and forthrightness to recognise that prompt information may still be indirectly present in the assistant's response, even after the system prompt is masked. The metrics are carefully chosen to account for potential balance issues, 1% FPR on control set is a reasonable rate, and care is taken to mitigate methods which trigger on presence of mere mention of deception. Recap of linear probes is succinct. The caution against over-reliance on LaaJ is appreciated. The variety of deception-detection methods is sufficiently broad, and the methods sufficiently standard, to validate the dataset. The Future Work section benefits from discussion about how response behaviour will likely change as situational awareness and model introspection improves at larger model size and capability. Existing datasets are off-policy, true: but doesn't your dataset become off-policy as soon as it's tested on a new model? (FYI, I've seen a pre-print of a paper showing that probes trained off-policy do well on-policy, though performance is slightly harmed (~90% efficacy)) The chats are all model-generated, based on fictitious scenarios, rather than instances of deception in the wild. The Sonnet-3-as-a-Judge was the best-performing method on most datasets, yet still falls foul of judges interpreting harm-preventing deception as responsible rather than deceptive. typos: L52, L73 (should "compiles" be "comprises"?) L122: "evaluate" is justified but it seems cheeky to "propose" LLM-as-a-Judge as a black-box approach - implies (at least to me) some novelty - there's got to be public prior art here. Same with self-evaluation. What's the inspiration for the typology? Is it principled, or intuitive? You say (L79) that these settings are potentially more realistic - how do you define, measure or quantify this? Same for "challenging". I'm a little confused by the Harm-Pressure work: you filter based on models answering WMDP correctly: but I'm surprised that models wouldn't be HHH anyway, to avoid answering correctly. So could imagine these selected questions being the 25% which the model randomly succeed on? Do you determine model knowledge some other way, e.g. checking resistance to prompt perturbation? 1% FPR on control is reasonable when doing experimentation, but is surely too high for deployment. How would tighter tolerances affect the experiments? Fully human-written
PreviousPage 1 of 1 (4 total rows)Next