|
Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes an evaluation method to assess whether LLMs faithfully rely on intermediate reasoning structures (e.g., rubrics, checklists, proof graphs) during inference. Specifically, the authors modify either the intermediate structures or the input text and observe whether the model's output changes, thereby testing whether model decisions are actually driven by these structures. Experiments show that while LLMs rely on intermediate structures, this reliance is only weakly causal. Furthermore, models appear to be more faithful to their self-generated structures than to gold-standard ones.
The paper explores the faithfulness of LLM-generated content to intermediate reasoning structures, which has practical significance for understanding the causal reasoning capabilities of LLMs.
$\bullet$ The main limitation lies in the lack of a clear research objective and insufficient consideration of the complexity of natural language.
(1) For autoregressive language models, what exactly does the modified intermediate reasoning representation $M^* \neq M$ refer to? What level of discrepancy is being considered: token-level or semantic-level? If token-level, $M^* \neq M$ does not necessarily imply a change in the final decision $Y^* \neq Y$, so Eq. (2) may not hold. If semantic-level, how is semantic inconsistency defined and quantified? Semantic variation ranges from minor paraphrases to contradictory meanings. Even when $M^* \neq M$, both $Y^* = Y$ and $Y^* \neq Y$ (as in adversarial examples) are possible. Which case do the authors intend to capture?
(2) How is unfaithfulness defined? In most natural-language tasks, “unfaithful reasoning” is not well defined. Even in mathematical or programming tasks, multiple reasoning paths $M$ may yield the same outcome. Why should obtaining the same $Y$ from the same $X$ but a different $M$ necessarily imply unfaithfulness? The transition from $p_\theta(Y \mid X, M)$ to $p_\theta(Y^* \mid X, M^*)$ with $M^* \neq M$ does not strictly guarantee that $Y \neq Y^*$.
(3) Measuring causality through output change is valid only in restricted settings. For many tasks (e.g., open-domain dialogue), what constitutes causality is unclear. Even in tasks with well-defined causal structure, modifying $M$ does not necessarily change the output $Y$.
$\bullet$ The paper should provide explicit mathematical definitions and quantitative measures for its key concepts (one possible formalization is sketched at the end of this list of weaknesses). What is LLM unfaithfulness, and how is it measured? What qualifies as a *logically significant edit*, and how are modifications to $M$ quantified? How is *causality* defined across tasks, and how can it be measured?
$\bullet$ The authors argue that “If an LLM is faithful to its intermediate structures… then logically significant edits should change its final decision.” This statement is an assumption, not a fact. Modifications to intermediate structures may alter the probability distribution of the model's output, but they do not necessarily change the final decision. For instance, the top-probability token might remain the same even if its confidence decreases sharply. Should this be considered faithful or unfaithful?
$\bullet$ The experimental conclusion that “current LLMs treat intermediate structures as context rather than true mediators of decision making” has already been supported by prior work. Similar findings have shown inconsistency between generated chains-of-thought and the model’s actual reasoning processes [1–6].
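To make the above concrete, here is the kind of definition I have in mind (this is only my own sketch and notation, not the paper's): let $\equiv_{\mathrm{sem}}$ denote task-level semantic equivalence and let $e$ be a *logically significant* edit operator, i.e., one for which the task's ground-truth mapping $f$ satisfies $f(X, e(M)) \neq f(X, M)$. Faithfulness could then be measured as the fraction of such edits for which $\arg\max_{Y} p_\theta(Y \mid X, e(M)) \neq \arg\max_{Y} p_\theta(Y \mid X, M)$, rather than being assumed to hold for every $M^* \neq M$. Note that an arg-max criterion of this kind also exposes the issue raised above: if $p_\theta(y_1 \mid X, M) = 0.9$ drops to $p_\theta(y_1 \mid X, M^*) = 0.55$, the top decision is unchanged and the edit would be counted as ignored, even though the distribution shifted substantially.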
[1] J. Chua et al. Are deepseek r1 and other reasoning models more faithful? arXiv preprint arXiv:2501.08156, 2025.
[2] B. Baker et al. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926, 2025.
[3] Y. Chen et al. Reasoning models don’t always say what they think. arXiv preprint arXiv:2505.05410, 2025.
[4] A. Zhang et al. Reasoning models know when they’re right: Probing hidden states for self-verification. arXiv preprint arXiv:2504.05419, 2025.
[5] M. Turpin et al. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. NeurIPS, 2023.
[6] I. Arcuschin et al. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025.
See weaknesses. |
Lightly AI-edited |
|
Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper explores the role of structured CoT in LRM response generation. The authors find that while models rely on the intermediate structure generated during CoT, changes to that structure do not always lead the model to produce the logically sound conclusion implied by the edited structure. Additionally, models respond even less to changes in ground-truth intermediate structures than to ones they generate themselves, even when their own structure is wrong.
The paper is well-written and clear, and addresses an interesting problem: the logical consistency of LRMs and their faithfulness to their own CoT explanations. The focus on structured CoT makes the evaluation cleaner, since the ground-truth answer is usually unambiguous.
My main issue is the following:
If the generated structure (e.g., the rubric) is edited but the prompt (e.g., the student's answer) is not, doesn't this create a contradiction between the prompt and the CoT? If so, could the authors clarify what the expected output of the model should be? If the edits introduce a logical contradiction then, to me, it isn't clear that the model should be expected to answer according to the content of the rubric (this would require it to actively ignore the content of the prompt). I feel that the significance of the results in this work rests heavily on whether the observed behavior is surprising/unexpected, and this is the main reason for my low score.
Other thoughts:
- In the HSVT setting (Fig 5), it is not clear which (if any) of the prompt-level interventions lead to a logical contradiction with the structure. e.g., simple paraphrasing probably does not create a contradiction, but global edits (replacing the original student's answer with a different student's answer) may or may not contradict the structure (maybe the replaced answer would generate the same rubric/score). This seems like the crux of the analysis in my opinion and does not appear to be clearly evaluated by the work currently.
- One of the main outcomes of the work is that when the structure is changed, the model often does not change its final answer to be congruent with these changes. When the intermediate structure is changed but the prompt is not, and the structure would imply a different final answer than the prompt (e.g., the edited rubric implies the score for this question is 4 but a correct rubric for the answer in the prompt would score the question 2), does the model answer 2 or 4? Or neither? i.e., is the model completely wrong or is it just ignoring a rubric that is logically inconsistent with the answer?
- More general question (won't affect my score, just curious what the authors think): Is there a way to formalize the distinction between the types of prompt edits that should cause a change in the model's downstream answer vs. edits that a good model should be robust to? e.g., we know that paraphrasing or rewriting keeps the same information content/logical conclusion, while the global edit of replacing the whole answer likely changes the information content substantially. A rough attempt at such a formalization is sketched below.
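One way to make this precise (my own notation, just to clarify what I mean): let $f$ be the task's ground-truth mapping from inputs to final answers. An edit $e$ could be called *answer-preserving* if $f(e(X)) = f(X)$ (paraphrases, rewordings) and *answer-changing* if $f(e(X)) \neq f(X)$ (e.g., substituting a different student's answer that merits a different score). A good model should be invariant to the former and sensitive to the latter; as noted above, the global edits in the current setup appear to mix the two cases.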
Overall, I think the premise of this work is interesting and I would be amenable to raising my score if my concerns are addressed, especially the key concern at the top.
Most of my major questions are in 'weaknesses'.
Minor questions:
- Could the authors provide more detail about the different benchmark datasets? Only 2 example prompts were given.
- Another conclusion from the paper is about the models' faithfulness to gold vs predicted structures. This is an interesting result, but isn't there usually a single way to populate the structure correctly (e.g., thinking about the rubric example)? If so, isn't this more about the predicted structure being wrong? How often is the predicted structure wrong? |
Fully human-written |
|
Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper investigates the causal role of intermediate structures, such as rubrics and checklists, that LLMs generate as part of their decision-making process. It proposes an evaluation protocol designed to test whether these structures act as true causal mediators of the final decision or merely as decorative by-products. The protocol introduces three metrics: HSVT, LEC, and GEC. Applying this protocol across various models and benchmarks, the paper reports a systematic gap: models appear consistent with their structures when the input is perturbed, but frequently fail to update their output when the structure itself is directly edited.
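To restate the protocol logic as I understand it, here is a minimal toy sketch; the code and names are my own placeholders (a stub scorer stands in for the LLM), not the authors' implementation, which compares the model's new answer against what the edited structure logically implies.
```python
# Toy illustration of the three interventions (HSVT, LEC, GEC) on a
# rubric-scoring task. `model_answer` is a stub that is perfectly faithful
# to the structure; in the paper an actual LLM call takes its place.

def model_answer(answer_text: str, rubric: list) -> int:
    """Stub 'model': awards one point per satisfied rubric item."""
    return sum(rubric)

def hsvt(answer_text, paraphrased_text, rubric):
    # Hold-Structure-Vary-Text: rubric fixed, input text perturbed;
    # a structure-faithful model should return the same score.
    return model_answer(answer_text, rubric) == model_answer(paraphrased_text, rubric)

def lec(answer_text, rubric, i):
    # Local Edit Consistency: flip one rubric item; the score should shift by one.
    edited = list(rubric)
    edited[i] = not edited[i]
    return abs(model_answer(answer_text, edited) - model_answer(answer_text, rubric)) == 1

def gec(answer_text, rubric):
    # Global Edit Consistency: flip every rubric item; the score should flip as well.
    edited = [not item for item in rubric]
    return model_answer(answer_text, edited) == len(rubric) - model_answer(answer_text, rubric)

rubric = [True, True, False, True]
print(hsvt("answer A", "answer A, reworded", rubric))   # True for the faithful stub
print(lec("answer A", rubric, i=2), gec("answer A", rubric))
```
With a real LLM in place of the stub, consistency rates below 1.0 on LEC/GEC are what the paper interprets as the structure being treated as context rather than as a causal mediator.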
1. The paper is well-written and easy to follow.
2. The findings are clearly stated and demonstrated, with broader implications for understanding reasoning models.
3. The evaluation is comprehensive, testing 9 LLMs across 4 diverse benchmarks.
1. The experimental results should be reported with statistical significance (e.g., confidence intervals or paired t-tests), not only the point-estimate averages in Figures 3 and 4; a minimal sketch of what I have in mind is given after this list.
2. The metrics appear to be binary, which could be a limitation for quantitative tasks.
3. The faithfulness metrics partly conflate invariance with mediation; the authors should specify a precise definition of 'faithfulness'.
4. The paper should evaluate additional prompting strategies beyond few-shot, since improved edit sensitivity under stronger prompts would suggest that the reported causal-mediation gaps partly stem from prompt under-specification.
5. The paper needs to validate the core assumption that the interventions are not out-of-distribution (OOD).
6. Although the paper compares various LLMs, it omits leading proprietary models such as GPT or Claude, limiting the generality of the findings.
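For example, a paired bootstrap over evaluation items would already address point 1; a self-contained sketch is below (the arrays are illustrative toy data standing in for per-item binary consistency outcomes).
```python
# Paired bootstrap CI for the difference in per-item consistency between two
# conditions (e.g., predicted vs. gold structures). The arrays are toy data;
# in practice they would hold one binary outcome per test item.
import numpy as np

pred = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # consistency with predicted M
gold = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])  # consistency with gold M

rng = np.random.default_rng(0)
diffs = []
for _ in range(10_000):
    idx = rng.integers(0, len(pred), size=len(pred))  # resample items with replacement
    diffs.append(pred[idx].mean() - gold[idx].mean())
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"mean difference = {pred.mean() - gold.mean():.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```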
See the weaknesses above. |
Fully human-written |
|
Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies whether an LLM's answer faithfully follows from its reasoning by perturbing the reasoning and checking whether the answer remains consistent with the perturbed version. While previous studies have found that LLM reasoning is not always faithful to the final answer, they have not studied this with a causal analysis. This paper uses four datasets with structured intermediate reasoning, such as rubrics, changes the rubric scores, and checks whether the answer follows from the new score or the original one. The interventions are: HSVT, where the intermediate structure M is held fixed and the input is changed, so that relying on the structure should give the same answer despite the change in input; LEC, where individual elements of M are flipped, expecting the answer to change accordingly; and GEC, where all valid support is flipped. The experiments cover both the model's own generated reasoning and the ground-truth one. On three datasets the ground-truth structure gives higher accuracy than the model-generated one, but surprisingly the opposite holds for EntailmentBank, likely because the gold intermediate structure was out of distribution and thus hurt performance. The authors also find a correlation between structure-prediction accuracy and final-answer accuracy, and that GEC is more consistent than LEC. The findings suggest that if we were to build AI-human collaboration by having humans intervene on the intermediate steps, LLMs would currently be unreliable, because they often do not follow the new reasoning steps.
1. The paper is well-written and easy to follow.
2. The question being studied is important: to causally analyze whether the model can follow any reasoning. If not, then we can't even expect it to have improved performance when intervened upon.
3. The experiments are comprehensive. The adaptation of the datasets for use in this paper is well done.
1. No error bars in the bar charts.
1. For Figure 6, it seems that HSVT is quite high for most models. Could we say that, in general, if the intermediate structure is in-distribution, then the answer might be more consistent with the structure? In HSVT, the intermediate structure didn’t change, and changing the input X is farther away and thus maybe has less impact on the final prediction. Meanwhile, in the other cases (global/local edits), the intermediate structure is changed and thus more out-of-distribution, and it’s closer to the answer, which induces more uncertainties. What is your intuition on this?
2. For EntailmentBank, it seems that if a model is more faithful on gold, then it is less faithful on predicted intermediate structure, and if it's more faithful on predicted, then less faithful on gold (Llama-3.2-1B is reversed from the other models but the trend is the same). Do you have any intuitions on this? |
Fully human-written |