ACT: AGENTIC CLASSIFICATION TREE
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper introduces ACT, a decision-tree-like classification framework for unstructured text inputs. Each split in the tree is expressed as a binary natural-language question that is answered by an LLM. These questions are iteratively refined with TextGrad to reduce impurity, and the optimized questions are assembled into a full decision tree for inference (see Algorithm 1). Experiments on three binary text-classification tasks (medical diagnosis, spam detection, and jailbreak detection) with four small LLMs show that ACT often matches or outperforms several baselines, including CoT prompting, DSPy (8-shot), TextGrad, and TF-IDF + CART. Ablation studies further show how tree depth and the number of refinement steps trade off accuracy against overfitting (Table 2, Fig. 2).
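To make the reviewed procedure concrete, the following is a minimal sketch of one node split as I understand it from the summary above; `llm_answer` and `llm_refine_question` are hypothetical stand-ins for the paper's LLM query and TextGrad refinement steps, not the authors' actual implementation.

```python
import random

def llm_answer(question: str, text: str) -> bool:
    """Hypothetical stand-in for an LLM answering a binary question about a text."""
    return random.random() < 0.5  # a real implementation would query the model

def llm_refine_question(question: str, feedback_examples: list) -> str:
    """Hypothetical stand-in for TextGrad-style refinement from misrouted examples."""
    return question  # a real implementation would return an improved question

def gini(labels: list) -> float:
    """Gini impurity of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(question: str, texts: list, labels: list) -> float:
    """Weighted Gini impurity after routing each example by the LLM's yes/no answer."""
    if not labels:
        return 0.0
    answers = [llm_answer(question, t) for t in texts]
    yes = [y for a, y in zip(answers, labels) if a]
    no = [y for a, y in zip(answers, labels) if not a]
    n = len(labels)
    return (len(yes) / n) * gini(yes) + (len(no) / n) * gini(no)

def optimize_node(question: str, texts: list, labels: list, steps: int = 3) -> str:
    """Iteratively refine a node question, keeping the candidate with the lowest impurity."""
    best_q, best_imp = question, split_impurity(question, texts, labels)
    for _ in range(steps):
        # Simplification: examples whose routed branch disagrees with their label form the feedback set.
        answers = [llm_answer(best_q, t) for t in texts]
        feedback = [(t, y) for t, a, y in zip(texts, answers, labels) if a != bool(y)]
        candidate = llm_refine_question(best_q, feedback)
        imp = split_impurity(candidate, texts, labels)
        if imp < best_imp:
            best_q, best_imp = candidate, imp
    return best_q
```

At inference time, the optimized questions would then be traversed like an ordinary decision tree, with the LLM answering each node's question for the input text.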
1. The paper sets a clear and well-motivated goal: to develop transparent and auditable decision procedures for unstructured text, where each decision path can be directly inspected and interpreted by humans.
2. The method is a technically elegant integration of decision-tree learning and prompt optimization: each node is discovered through impurity-based selection combined with semantic feedback from the LLM, and once optimized, the nodes become fixed binary natural-language questions that define the decision structure.
3. The qualitative decision trees produced for the SPAM and JAILBREAK datasets are clear and easily interpretable, providing tangible evidence for the paper’s claim of model transparency and interpretability. As shown in Appendices B.3–B.4, the resulting trees present concise, human-readable question-answer paths that make the classification rationale explicit and auditable.
1. The datasets used in the study are small and largely balanced, functioning more as toy text-classification tasks than as realistic, high-stakes applications. For example, the DIAGNO and SPAM datasets contain only about 600 training samples each, which limits the generalizability of the results. Moreover, the experimental comparisons are incomplete: they exclude strong fine-tuned text classifiers such as BERT and RoBERTa, as well as interpretable rule-based systems and modern concept-bottleneck models, all of which are directly relevant baselines for the paper's stated goal of interpretable text reasoning (see Appendix B.1).
2. The paper provides no cost, latency, or resource accounting for the proposed method. Each node in the decision tree involves multiple LLM queries for impurity evaluation and semantic feedback, plus iterative TextGrad refinement steps, all of which likely require a substantial token and compute budget. However, no quantitative measurements of training or inference cost and latency are reported anywhere in the paper (see Sections 3 and 4), making it difficult to assess the method's practical efficiency and scalability.
3. There is a data-exposure imbalance between ACT and its baselines. While DSPy is restricted to only 8 demonstration examples, ACT exposes the LLM to many more labeled samples during both node construction and refinement. Specifically, the impurity function $f_{purity}$ uses up to m = 50 correct and 50 incorrect examples per group for feedback, so each refinement round involves substantial additional supervision. This makes the comparison not strictly apples-to-apples, and the reported accuracy gains in Table 2 and Section 4 ("m = 50") may be partly attributable to the greater effective data access rather than to the algorithmic innovation itself; a rough accounting is sketched after this list.
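To illustrate the scale of this imbalance, here is a back-of-the-envelope accounting under hypothetical hyperparameters; the per-node values beyond m = 50 are my assumptions, not figures reported in the paper.

```python
# Hypothetical accounting of labeled-example exposure during ACT training.
m = 50            # correct / incorrect examples per group used for feedback (from the paper)
groups = 2        # assumed: one group per branch of the binary split
rounds = 5        # assumed number of refinement rounds per node
nodes = 7         # assumed number of internal nodes for a depth-3 tree

examples_per_round = groups * 2 * m          # correct + incorrect examples per group
act_exposure = nodes * rounds * examples_per_round
dspy_exposure = 8                            # the DSPy baseline sees 8 demonstrations

print(f"ACT (assumed settings): up to {act_exposure} labeled example presentations")
print(f"DSPy 8-shot baseline:   {dspy_exposure} labeled examples")
```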
I suggest that the authors report F1-score, ROC-AUC, and calibration measures, as well as interpretability indicators such as average path length, node coherence, and results from a user study on decision contestability. Moreover, the paper should report and control data and compute exposure across methods by detailing the number of tokens, examples, and LLM calls for ACT, DSPy, and TextGrad, and should include ablations with smaller $m$ and fewer refinement steps to match the 8-shot baseline. Finally, it should report cost-normalized results, including tokens, runtime, and dollar cost per training run and per inference example, to clarify ACT's efficiency and scalability.
Heavily AI-edited
ACT: AGENTIC CLASSIFICATION TREE
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes Agentic Classification Trees: interpretable classifiers whose nodes are natural-language binary questions answered by an LLM. Splits combine Gini impurity with LLM semantic feedback via prompt optimization. On three binary text tasks, the method matches or exceeds prompting baselines while yielding readable decision paths.
Introduces tree-structured classification where nodes are optimized natural-language questions. Combines classical impurity metrics with LLM semantic feedback in a single framework.
Offers a path toward auditable, interpretable-by-design classifiers for unstructured data.
Computational cost is not analyzed.
Training involves nested iterations with multiple model calls per node. Can you report call counts and token usage? How would performance change with stronger frontier models?
Fully AI-generated
ACT: AGENTIC CLASSIFICATION TREE
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes the Agentic Classification Tree (ACT), a method that exploits the natural-language reasoning abilities of LLMs to build decision-tree classifiers whose splits are phrased in natural language for NLP domains. The method takes inspiration from algorithms like CART and progressively splits the data on rules that are optimized over time to maximize accuracy before reaching the leaf nodes. Experiments show it is more accurate than the baselines, including TF-IDF with CART. The final decision tree is also argued to be more interpretable.
The general idea of the paper is nice and intuitive; having user-friendly, rule-based splits is quite appealing.
The quantitative results of the paper also look quite good; I am genuinely surprised that this approach would outperform standard CoT prompting or few-shot prompting.
The method appears to be quite computationally expensive, which may drastically limit its practical usefulness.
As far as I understand, there is no guarantee that the method will produce interpretable rules for the splits, or at the very least, the logic behind the final tree may not be readily accessible to users: for example, why split on rule x at layer y?
The experiments lack appropriate baselines in my view. I would like to see a comparison against, e.g., a BERT classifier, a deeper TF-IDF model, or XGBoost, to understand the performance gap with the proposed method (a minimal sketch of such a baseline is given after these points). I am also not sure why the authors limited the CART baseline to depth 3, since their method often uses more than this; it does not seem a fair comparison.
There are some sweeping claims in the paper about decision-tree interpretability being easily understood and regulatable (e.g., line 141). I would want to see these tempered, as I am not familiar with any evidence that this is true; besides, it obviously depends on the tree depth and on whether or not the model is an ensemble. These claims need to be tempered and/or reworded, and accompanied by better citations.
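For reference, here is a minimal sketch of the kind of depth-matched TF-IDF baseline I have in mind, using scikit-learn; the data variable names and the depth value are illustrative assumptions, and a gradient-boosted model such as XGBoost could be dropped in via the same fit/predict interface.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# train_texts, train_labels, test_texts, test_labels are assumed to be the
# same splits used for ACT (e.g., the DIAGNO or SPAM data).
def tfidf_cart_baseline(train_texts, train_labels, test_texts, test_labels, depth=4):
    """TF-IDF + CART with depth matched to the deepest ACT tree."""
    model = make_pipeline(
        TfidfVectorizer(),
        DecisionTreeClassifier(max_depth=depth, random_state=0),
    )
    model.fit(train_texts, train_labels)
    return accuracy_score(test_labels, model.predict(test_texts))
```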
**Minor**
Typos on lines 112 and 127.
Why did you limit the TF-IDF CART baseline to depth 3 when your own method goes to depth 4?
Can you include other baselines such as a simple text classifier?
Why did you use these datasets? There are other, more standard text-classification domains; I would like to see more, e.g., IMDB, to demonstrate generality.
Did you ever consider simply using the LLM to label concepts in the text and then training a decision tree classifier on those labels? There is a large literature on this, and it strikes me as an obvious baseline (a minimal sketch of what I mean is given below).
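To be concrete, here is a rough sketch of the concept-labeling baseline I am suggesting; the concept list and the `llm_has_concept` helper are hypothetical placeholders for whatever LLM prompt the authors would use.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical concepts an LLM could be asked to detect in each text.
CONCEPTS = ["mentions a prize or reward", "asks for personal information", "contains a URL"]

def llm_has_concept(text: str, concept: str) -> bool:
    """Placeholder for an LLM call such as: 'Does this text {concept}? Answer yes or no.'"""
    return concept.split()[-1].lower() in text.lower()  # trivial stand-in, not a real model

def concept_features(texts):
    """Binary concept matrix: one column per LLM-labeled concept."""
    return [[int(llm_has_concept(t, c)) for c in CONCEPTS] for t in texts]

def concept_tree_baseline(train_texts, train_labels, depth=3):
    """Train a plain CART on LLM-labeled concepts; splits are then human-readable by construction."""
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(concept_features(train_texts), train_labels)
    return tree
```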
Fully human-written |