|
TruncProof: LL(1)-Constrained Generation in Large Language Models with Maximum Token Limitations |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper augments constrained decoding for LL(1) grammars to take into account a length constraint on the number of tokens. Given a prefix and a next token, one wants to continue with that token only if there exists a valid grammatical completion that uses no more tokens than the length constraint allows. In general this problem is decidable, but one has to trade off precision for efficiency to avoid computations that would make masking expensive. The key observation of the paper is that for LL(1) grammars, computing a bound on how many tokens are needed to complete a sequence is easy: the stack of an LL(1) parser tells us exactly how to do it (we just take the sum of the shortest completions of each nonterminal/terminal on the stack). The approach is evaluated on very restrictive grammars to illustrate how standard constrained decoding would (unsurprisingly) fail to abide by the allowed number of tokens.
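To make the key observation concrete, here is a minimal sketch of the idea (my own illustration in Python, not the authors' implementation; the toy grammar and the assumption that every grammar terminal costs exactly one LLM token are mine):

```python
# Illustrative sketch (not the paper's code): lower-bounding how many tokens
# are still needed to finish a derivation, read off the LL(1) parser stack.
GRAMMAR = {  # S -> '{' PAIR '}', PAIR -> STR ':' VAL, VAL -> STR | NUM
    "S":    [["{", "PAIR", "}"]],
    "PAIR": [["STR", ":", "VAL"]],
    "VAL":  [["STR"], ["NUM"]],
}
TERMINALS = {"{", "}", ":", "STR", "NUM"}

def shortest_yields(grammar):
    """Fixed-point computation of the cheapest completion of each nonterminal
    (each terminal is assumed to cost one LLM token)."""
    cost = {nt: float("inf") for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                c = sum(1 if sym in TERMINALS else cost[sym] for sym in rhs)
                if c < cost[nt]:
                    cost[nt], changed = c, True
    return cost

def min_tokens_to_finish(stack, cost):
    """Sum of the shortest completions of every symbol still on the stack."""
    return sum(1 if sym in TERMINALS else cost[sym] for sym in stack)

COST = shortest_yields(GRAMMAR)
# Parser stack after consuming '{ STR :', with VAL and '}' still pending:
print(min_tokens_to_finish(["VAL", "}"], COST))  # 2 -> mask tokens that overshoot the budget
```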
- I think the problem of dealing with a bounded number of tokens is a very good one to study, as this is the main failure mode of constrained decoding approaches; they tend to go on and on generating nonsense just to follow the constraints.
- I like the observation that for LL(1) grammars it is easier to compute whether one can stay within the bound.
- The results are encouraging in showing that this approach provides some flexibility in setting bounds on the number of tokens.
- LL(1) is a very limited class of grammars. Most of the grammars used in existing GCD papers (e.g., https://icml.cc/virtual/2025/poster/45613) are not LL(1).
- It is a bit disappointing that the paper, despite targeting a small fragment (LL(1)), still computes an over-approximation of how many tokens a completion will require. Specifically, the work does not consider that LLM tokens may span several PL tokens: e.g., if an LLM uses the token *b")*, which spans a literal, a quote, and a parenthesis, Eq. (5) will count this token at least twice, once for completing the literal and once for the tree completion (a short numeric illustration follows after this list).
- Setting the token budget to 1.1X the ground truth is a somewhat arbitrary evaluation choice.
- The paper does not report the preprocessing cost of the tool (other papers that have reported this had very high preprocessing costs).
- The tokens/sec reported in Table 1 are not from a state-of-the-art tool; the authors should consider, for example, llguidance.
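To illustrate the over-approximation point above with toy numbers of my own (not taken from the paper): a per-terminal sum counts at least one LLM token for each remaining grammar terminal, even when a single LLM token covers several of them.

```python
# Toy illustration (my own numbers, not the paper's) of the over-approximation
# when one LLM token spans several grammar terminals.
remaining_terminals = ['b', '"', ')']                 # literal tail, closing quote, parenthesis
per_terminal_lower_bound = len(remaining_terminals)   # Eq. (5)-style sum: 3
actual_llm_tokens_needed = 1                          # the single LLM token  b")
print(per_terminal_lower_bound - actual_llm_tokens_needed)  # overestimate of 2
```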
Other comments:
The paper is missing many of the more recent works on constrained decoding. For SOTA methods, they should really cite LLGuidance (https://github.com/guidance-ai/llguidance), since it clearly outperforms XGrammar and other tools across all metrics, and GreatGramma (https://icml.cc/virtual/2025/poster/45613), since it supports the most complex grammars while guaranteeing reasonable processing times.
I don't understand the point at lines 121-122 distinguishing top-down and bottom-up parsers. I feel the problem would not be much harder for an LR(1) parser: one can use dynamic programming to compute the shortest possible completion.
- What is the preprocessing cost of the tool?
- Line 374 mentions "As discussed in... TruncProof with greedy decoding does not fully account for semantic correctness". What does this mean? This aspect is not discussed.
- Table 1 reports 100% syntax correctness for Greedy/Ours, but not 100% for schema/exact match. Shouldn't greedy decoding always generate the same output?
- The fact that TruncProof produces an exact match so many times is a bit suspicious and makes one think the search space is designed for the tool to do so. What happens with e=2? |
Fully human-written |
|
TruncProof: LL(1)-Constrained Generation in Large Language Models with Maximum Token Limitations |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses a problem in grammar-constrained generation under token limits. When prior grammar-constrained generation techniques, such as SynCode and XGrammar, reach the maximum token count, they simply stop generating, which often results in incomplete or invalid outputs. The paper proposes TruncProof, which uses an LL(1) parser to calculate at each generation step how many tokens are minimally required to complete a grammatically valid output. This allows the method to block token selections that would either break the grammar or exceed the token budget. The approach is implemented as a logit modifier, making it compatible with different models and decoding methods. Experiments on JSON generation and C code tasks demonstrate that TruncProof maintains grammatical validity under strict token constraints, while baseline methods largely fail in these scenarios.
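As a rough sketch of what "implemented as a logit modifier" could look like in practice (my own illustration using HuggingFace's LogitsProcessor interface; the `oracle` object and its two methods are hypothetical placeholders standing in for the paper's LL(1) machinery):

```python
import torch
from transformers import LogitsProcessor

class BudgetAwareGrammarProcessor(LogitsProcessor):
    """Illustrative sketch, not the authors' implementation: mask any next token
    that either breaks the grammar or makes it impossible to finish a valid
    output within the remaining token budget."""

    def __init__(self, oracle, max_new_tokens, prompt_len):
        self.oracle = oracle                  # hypothetical LL(1) oracle object
        self.max_new_tokens = max_new_tokens
        self.prompt_len = prompt_len

    def __call__(self, input_ids, scores):
        generated = input_ids[0, self.prompt_len:].tolist()
        remaining = self.max_new_tokens - len(generated)
        mask = torch.full_like(scores, float("-inf"))
        for tok in self.oracle.grammatical_next_tokens(generated):
            # keep tok only if some valid completion still fits into the budget
            if 1 + self.oracle.min_tokens_to_complete(generated + [tok]) <= remaining:
                mask[0, tok] = 0.0
        return scores + mask
```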
* The paper is well-written and easy to read
* I appreciate the time complexity and space complexity analysis. It would be better if the authors could compare these complexities with prior works.
* The related work section is comprehensive and considers most SOTA CFG constrained generation works.
* The evaluation considers various sampling techniques (beam search, MCTS) during inference. To my knowledge, this has not been studied extensively in prior CFG-constrained decoding work.
The major issue in the paper is the limited empirical evidence of practical applicability beyond JSON generation under a token-limit restriction. I would encourage the authors to perform additional experiments on at least one of the other grammars, such as SQL, Python, Java, or Go, that have been considered in prior works, and to use end-to-end benchmarks such as HumanEval.
* The technique uses LL(1) grammars, which are inherently weaker than the LR and Earley-style grammars supported by some of the prior works.
* The evaluation on JSON considers only 100 examples on 2 models. Both numbers should ideally be larger.
* The evaluation for C code generation is almost non-existent. Is it just evaluated on a single example in Figure 4? Am I missing something?
* What is the motivation for the token limit restriction?
* In the case of JSON generation, the LLM generates some extra whitespace/newlines that are avoidable, but is this also realistic in other programming languages? |
Fully human-written |
|
TruncProof: LL(1)-Constrained Generation in Large Language Models with Maximum Token Limitations |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces TruncProof, a grammar-constrained decoding method that leverages LL(1) parsers to estimate the minimum tokens needed to complete a valid output at each step, enabling LLMs to generate syntactically valid outputs within a fixed token limit.
The RQ is clear, and the paper is relatively easy to follow.
Conceptual clarity and motivation: The paper would benefit from clearer motivation and explanation of why enforcing token limits is necessary and how it improves decoding quality, e.g., does it improve overall accuracy or just ensure early termination? It is not fully clear why adding an LL(1)-based token constraint offers an advantage over simply applying standard grammar-constrained decoding (e.g., CFG-based methods) with an explicit token or context budget specified in the prompt. The practical benefit, especially when existing constrained decoding already guarantees syntactic validity, remains ambiguous.
Writing: Providing a concrete, step-by-step example walkthrough of each phase of TruncProof (rather than only one abstract output example in Fig. 4) would greatly improve clarity and help readers understand how the approach operates and why it matters.
Experimental limitations:
The choice of baselines is limited and may not fully isolate TruncProof’s contribution. A stronger comparison would include variants that apply existing grammar constraints or prompt-level token limits under the same decoding setup.
Additionally, the overhead introduced by TruncProof, both in preprocessing and at runtime, should be quantified and discussed, in particular whether scaling up the grammar size affects decoding efficiency.
Finally, the evaluation focuses primarily on syntactic validity rather than overall task performance; adding metrics for semantic correctness or downstream accuracy (e.g., in code completion, data-to-JSON generation, or semantic parsing) would help clarify whether the proposed constraint mechanism provides real benefits beyond ensuring grammaticality.
See weakness. |
Heavily AI-edited |
|
TruncProof: LL(1)-Constrained Generation in Large Language Models with Maximum Token Limitations |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents TruncProof, which enforces strict adherence of LLM generation to a token budget while still producing syntactically correct outputs. The constrained generation approach is based on LL(1) parsing of context-free grammars. The authors performed experiments on text-to-JSON instruction and code generation tasks, improving over existing approaches such as Outlines, SynCode, and XGrammar. The approach enhances the semantic robustness of the JSON and C outputs by leveraging decoding strategies such as beam search and Monte Carlo Tree Search.
* Interesting and potentially useful technique for controlling LLM output under a strict token budget.
* Technical approach is intuitive and effective.
* The experimental results show that TruncProof is effective on challenging function-calling/coding tasks under a strict token budget.
* The paper is well written.
* The experiments are performed for only two budget thresholds. A better understanding of the tradeoff between the number of tokens and syntactic and semantic correctness would be welcome.
* The approach may over-constrain the generation semantically.
The paper presents an interesting practical technique for controlling the output of LLMs. While existing constrained decoding algorithms have shown how to ensure that LLMs only generate legal tokens, they do not have a means to ensure that the output will be properly generated within a fixed amount of time (tokens). Thus, many LLM generation failures in JSON/coding tasks have occurred because the model was not able to close all scopes, etc.
The approach is intuitive, yet demonstrated to be effective. The algorithm computes the approximate shortest token length derivable from each nonterminal ahead of time. It further tracks the number of tokens consumed and can adjust the rest of the generation to follow a path with guaranteed completion. The authors show that this approach has reasonable worst-case time and space complexity, although I would have appreciated estimates of the expected time/space consumption as well (and a relation of the algorithm's properties to the grammars used in the evaluation).
The experiments compare TruncProof with several state-of-the-art tools for constrained decoding. The results show a significant improvement in syntactically correct code under a tight token budget. However, the paper presents only two limits (the aggressive one in the main paper; the permissive one in the appendix). It is important to understand the tradeoffs for multiple token-budget thresholds between the two extremes presented in the paper and the appendix. Showing these additional results with even one selected decoding strategy would be valuable.
Further, it would be interesting to see the behavior of the approach on reasoning LLMs and how token budgets impact their performance. Overall, with strict constraining, it is possible that the model yields suboptimal results regarding semantic correctness.
Finally, I enjoyed reading the paper and found it well-written. |
Fully human-written |