DIVER: Large Language Model Decoding with Span-Level Mutual Information Verification
Soundness: 1: poor
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes DIVER, a novel decoding method for large language models. During decoding, DIVER first identifies a divergence point and then selects the span that scores highly on both mutual information and likelihood. The effectiveness of the proposed method is demonstrated on diverse datasets.
- Strong empirical performance
- The slower inference speed can be mitigated by using a smaller verifier, which is a very interesting observation.
- Extensive experiments are provided.
The biggest limitation of this manuscript is its writing and lack of clarity. I believe the manuscript requires significant rewriting and may need another round of peer review.
- The use of Point-wise Mutual Information (PMI) is poorly motivated. I am not convinced why or how Equation 2, which adds the PMI score to the logits, would improve the decoding process.
- PMI is not properly defined. It is currently defined implicitly in Equation 6. However, since PMI plays a central role in this paper, it deserves paragraphs dedicated to its definition and discussion of its characteristics.
- In computing PMI (Equations 6 and 7), the probability p(x|y) needs to be obtained, but it is unclear how this quantity is computed. Since an LLM only models p(y|x), this probability is difficult to compute. Although footnote 2 comments on this issue, it does not clarify how the probability is calculated. Additionally, the term “backward teacher-forcing decoding” is undefined.
- The symbol “~” is used in an unusual way (Equations 1–6). Typically, “A ~ B” denotes that the random variable A is sampled from a distribution B; however, in Equations 1–6 the right-hand side is not a proper distribution.
- The description of DIVER in Section 3.2 is convoluted, and I do not think a practitioner could reproduce the method from this section alone. Providing an explicit algorithm, for example, would help.
See weaknesses.
Fully human-written
---
DIVER: Large Language Model Decoding with Span-Level Mutual Information Verification
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces DIVER, an inference-time method designed to mitigate model hallucination. By identifying divergence points during decoding and concurrently computing PMI scores for the next k dynamic spans, the method selects tokens that are more faithful to the input, achieving better results across multiple tasks.
1. The writing is clear and the experiments are comprehensive.
2. The method is novel and achieves better performance than other decoding-time approaches.
1. Practicality concerns. DIVER decodes several candidate spans in parallel and invokes the LLM for every PMI computation, incurring substantial latency. As Table 4 shows, decoding speed drops to roughly 60% of vanilla decoding, yet Table 3 reveals only marginal gains on many benchmarks, raising questions about the method's cost-effectiveness in real-world deployments.
2. The choice of span length requires further investigation. Table 3 shows that DIVER_R outperforms DIVER_L, indicating that richer information matters more than covering more divergence points. However, the authors do not justify defining the dynamic span based on the first occurrence of a risk point (either left or right). The impact of span length and of the number of included divergence points on performance remains to be explored.
1. Could you report performance under equal computation (FLOPs)? For example, compare best-of-N over vanilla random samples with DIVER.
2. Could you demonstrate the effectiveness of your dynamic span method? For instance, plot how performance changes as the span length increases and as the number of skipped risk points grows.
Lightly AI-edited
---
DIVER: Large Language Model Decoding with Span-Level Mutual Information Verification
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper proposes DIVER, a decoding method based on span-level pointwise mutual information (PMI) verification. DIVER uses token-probability information to identify divergence steps, generates candidate spans, and computes PMI scores by assessing the log-likelihood gain of the input when each candidate span is generated. The optimal span is then selected from the PMI-re-ranked output distributions. DIVER also uses an adaptive method for obtaining token spans with dynamic lengths during generation. The paper presents experimental results across several task domains to demonstrate DIVER's performance improvements.
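For concreteness, my reading of the selection rule (likelihood plus span-level PMI re-ranking) is roughly the following sketch. All scoring functions here are toy placeholders of my own, not the authors' implementation; in the paper they would be log-likelihoods from the generator and a (possibly smaller) verifier model.

```python
# Toy stand-ins for LLM scoring calls (hypothetical, for illustration only).
def log_p_span_given_input(span, x):
    # generator log-likelihood of the candidate span given input x
    return -0.1 * len(span)

def log_p_input_given_span(x, span):
    # verifier log-likelihood of the input given the candidate span
    return -0.05 * len(x) + 0.01 * len(span)

def log_p_input(x):
    # unconditional log-likelihood of the input
    return -0.06 * len(x)

def diver_rerank(x, candidate_spans, lam=1.0):
    """Pick the candidate span maximizing likelihood + lam * PMI(x; span),
    where PMI(x; span) = log p(x | span) - log p(x)."""
    def score(span):
        pmi = log_p_input_given_span(x, span) - log_p_input(x)
        return log_p_span_given_input(span, x) + lam * pmi
    return max(candidate_spans, key=score)
```

If this reading is correct, an explicit algorithm of this form in the paper would make the number of candidate spans and the role of the verifier much clearer.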
1. The proposed algorithm in the paper is well explained, effectively using visual examples.
2. The experiments were conducted across several task domains to demonstrate DIVER's effectiveness.
3. The paper also includes analyses regarding the potential limitations of the proposed algorithm.
1. Some crucial details of the algorithm seem to be missing, such as how many candidate spans are generated when the algorithm encounters a divergence point.
2. If DIVER generates several candidate spans per divergence point, the experimental results would be more convincing if the comparison also included baseline algorithms that generate several candidate responses or partial sequences.
3. The experimental results do not include information such as standard deviations, which is crucial for the credibility of the results.
## Questions
1. Could you let us know how many example spans were generated during the experiments?
## Suggestions
1. I think the caption of Figure 1 should be polished.
2. Figure 3 caption: I think `Bleu` should be `Blue`.
3. I think the images in Figures 4, 5, and 6 should be bigger. The paper could save a few lines by tightening the main text.
4. Table 12: `Give you codes that start with a ...` does not seem to be a correct sentence.
5. Table 12: Could you check whether the given prompt was copied correctly into the appendix?
Fully human-written |