StoryAlign: Evaluating and Training Reward Models for Story Generation
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper addresses the problem of reward modeling for story generation and makes three main contributions. First, it introduces STORYRMB, a human-curated benchmark with 1,133 instances, each containing a premise, one chosen story, and three rejected stories. This benchmark is designed to evaluate how well reward models can capture story preferences. Each instance is created by feeding a premise (from a collected pool) to both LLMs and humans to generate full stories. These story candidates are then evaluated by another set of LLMs for preference judgments, with human annotators resolving disagreements.
Second, the paper presents an automatic method to obtain large-scale story preference pairs.
Finally, it introduces STORYREWARD, a reward model trained on the automatically collected data that achieves state-of-the-art performance on the STORYRMB benchmark.
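To make the instance structure concrete, here is a minimal sketch of what one benchmark instance and its evaluation could look like; the field names and the scoring convention are my own illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class StoryRMBInstance:
    # One benchmark instance as described above: a premise, one chosen
    # (preferred) story, and three rejected stories. Field names are
    # illustrative; the paper's actual schema may differ.
    premise: str
    chosen: str
    rejected: List[str] = field(default_factory=list)

def instance_correct(score_fn: Callable[[str, str], float],
                     inst: StoryRMBInstance) -> bool:
    # A common convention: the reward model gets the instance right only if
    # it scores the chosen story above all three rejected candidates.
    # The paper's exact metric may differ.
    chosen_score = score_fn(inst.premise, inst.chosen)
    return all(chosen_score > score_fn(inst.premise, r) for r in inst.rejected)
```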
Overall, the paper is rich in content, backed by a substantial amount of work and a solid discussion. The methodology discussions are particularly strong, and the evaluations cover the contributions thoroughly and provide convincing support for the findings. That said, there are a few smaller issues that, if addressed, would definitely strengthen the paper.
1. There are several design decisions that could use more justification through ablation studies, for example the *Selection with Dimensional Categorization* section (starting around line 169) and the various methods for collecting preference pairs mentioned in Section 3.1. It’s not entirely clear how much these different approaches contribute to the final dataset or the reward model’s quality.
2. Some additional discussion in the results section would also be helpful:
- There’s a large gap between StoryReward-Llama and StoryReward-Qwen in Table 1 that isn’t really explained.
- The paper also misses an opportunity to use the preference reasons mentioned around lines 173–175 in the evaluation.
3. A few methodological details are missing:
- What are the sources for the initial set of stories mentioned in line 210? Are they the same sources discussed in ETHICAL CONSIDERATIONS?
- The algorithm described in Appendix F feels vague and could use more clarification.
- How much overlap exists between the data used to train StoryReward and the STORYRMB benchmark?
Q1: I wasn’t entirely sure which part of the paper deals with the notion of “strong disagreement” mentioned around line 69. Could the authors clarify that?
Lightly AI-edited

StoryAlign: Evaluating and Training Reward Models for Story Generation
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The authors of this paper highlight a critical issue with reward models for creative tasks: they often struggle to align accurately with human preferences. The authors focus on one such creative task, story generation, and develop a first-of-its-kind benchmark to evaluate reward models for story generation (StoryRMB). StoryRMB comprises >1,000 datapoints, each including one gold-standard story and three rejected stories for a given prompt. The authors then develop a reward model for story generation, called STORYREWARD, trained on a custom-curated dataset of 100,000 examples. They compare STORYREWARD to relevant baselines to highlight two key findings: (1) STORYREWARD better distinguishes between human-written and LLM-generated stories, and (2) STORYREWARD outperforms existing approaches at Best-of-N ranking, a frequent task performed by reward models.
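For context on the Best-of-N setting mentioned above, the following is a generic sketch of how a scalar reward model is typically used for Best-of-N selection; `generate_story` and `reward_score` are hypothetical placeholders, and this is not the paper's implementation.

```python
from typing import Callable, List

def best_of_n(premise: str,
              generate_story: Callable[[str], str],
              reward_score: Callable[[str, str], float],
              n: int = 16) -> str:
    # Sample n candidate stories for the premise and return the one that the
    # reward model scores highest. This is the generic Best-of-N recipe; the
    # paper's exact sampler, n, and scoring setup may differ.
    candidates: List[str] = [generate_story(premise) for _ in range(n)]
    scores = [reward_score(premise, story) for story in candidates]
    return candidates[scores.index(max(scores))]
```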
- The paper is well motivated. The authors clearly define their research questions and effectively set up the problem with reward models for creative tasks, specifically storytelling. This was an enjoyable paper to read.
- This paper provides a first-of-its-kind benchmark for evaluating reward models for storytelling. An important research question in current LLM research is the performance degradation of reasoning-based models on creative tasks, and StoryRMB provides a helpful dataset to test the creative-writing competencies of a model.
- The authors made intelligent and targeted design choices in their data generation procedure to collect data for STORYREWARD. Through their experiments, the authors validated that these choices enabled their reward model to better distinguish human-written stories from LLM-generated ones.
- I appreciated the experimental rigor in validating the assumptions made in this paper.
- I noticed that more than half of the examples (54,000 examples) in the dataset were generated through a procedure proposed in prior work (Cui et al. 2024). I’m curious about the impact of this methodology compared to the novel procedures proposed by the authors. Specifically, since premise back-translation and human-guided continuation comprise only 10% and 6% of the final dataset, I wonder how critical they are to the performance of STORYREWARD. I would have been interested in an ablation study where each data collection method was individually excluded or replaced during training to evaluate its contribution to the reward model’s performance.
- The authors state that existing reward models often prefer LLM-generated stories over human-written stories. However, during the “human-guided continuation” data collection procedure, they choose to use the human continuation to direct LLM generations rather than include the human stories themselves. Isn’t this a counter-intuitive design choice, given the initial assertion about existing reward models? Can the authors justify this choice? Did the authors run any experiments where the human-written completions themselves were included in the dataset?
- The authors state that R is the set of rejected stories, but it is unclear to me how this set is computed. Is R the entire candidate set except for the chosen datapoint, or is R the set of datapoints that fall under a given threshold on the combined score? If it is the latter, why does Equation 1 compute a difference with respect to only the mean of the rejected samples rather than of all samples? (See the sketch after this list for the two formulations I have in mind.)
- In addition to consistency across models, why did the authors not compute self-consistency within the generations of each individual model?
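To make the question about Equation 1 concrete, these are the two margin formulations I have in mind, assuming a scalar score $s(\cdot)$ per story; this is my reconstruction of the two readings, not the paper's actual equation:

$$
\Delta_{\text{rej}} = s(y_{\text{chosen}}) - \frac{1}{|R|}\sum_{y \in R} s(y)
\qquad \text{vs.} \qquad
\Delta_{\text{all}} = s(y_{\text{chosen}}) - \frac{1}{|C|}\sum_{y \in C} s(y),
$$

where $C$ is the full candidate set for a premise and $R \subseteq C$ is the rejected subset.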
Fully human-written

StoryAlign: Evaluating and Training Reward Models for Story Generation
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper addresses the challenge that current reward models fail to capture human story preferences. The authors introduce STORYRMB, a benchmark of 1,133 human-verified preference cases comparing stories across coherence, creativity, characterization, fluency, and relevance. They show that existing reward models and LLM-as-judge systems perform poorly on this benchmark. To improve alignment, they construct STORYREWARD, a reward model trained on ~100k story preference pairs derived from human-written stories and controlled rewrites. STORYREWARD outperforms baselines and improves Best-of-N story selection, demonstrating the need for story-specific preference modeling.
1. The data creation process is rigorous.
2. This work addresses a core gap in story generation by directly modeling human narrative preference, and contributes valuable resources to the community through a high-quality human-verified benchmark and large-scale preference data, which were previously lacking.
1. Training details for the reward models require clarification. It is currently unclear whether STORYREWARD is simply obtained by fine-tuning Qwen and LLaMA, or whether additional architectural or training modifications are applied.
2. The paper would benefit from including straightforward fine-tuning baselines for the classification tasks, in order to contextualize the improvements introduced by STORYREWARD.
3. While STORYREWARD performs well on human–LLM preference classification, its performance notably declines on LLM–LLM preference pairs. The paper does not provide a clear explanation for this discrepancy. Further analysis is needed to understand why the model fails to generalize to LLM-generated comparisons, and what this implies about the nature of the learned preference signal.
4. Given the degraded performance on LLM–LLM comparisons, the reliability of STORYREWARD for best-of-N sampling warrants further scrutiny. If the model struggles to distinguish higher-quality continuations among LLM outputs, it may not consistently guide the selection toward genuinely better generations.
See weaknesses.
Lightly AI-edited

StoryAlign: Evaluating and Training Reward Models for Story Generation
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper presents a benchmark for human story preference, consisting of 1,133 human-verified data instances, and builds a reward model for story generation based on large-scale training data obtained with the proposed automated data collection method. The contributions are threefold: a benchmark showing that existing reward models fail to capture human story preferences well, an automated method for collecting large-scale preference data for training, and a reward model trained on the collected data that achieves impressive performance.
1. The task is interesting and well justified. This paper develops a benchmark as well as training data to build a reward model for story generation.
2. The performance even with a small model is promising, especially the improvement on Human-LLM pairs. Experiments on best-of-n sampling show that the developed reward model improves test-time scaling.
3. A lot of effort seems to have been spent on human evaluations and on collecting a fairly large dataset (although how much of it is LLM-written is slightly unclear).
1. While the authors argue that human story preferences are inherently subjective, this is not well aligned with the reward model's goal of producing concrete scores for stories.
2. The WQRM approach from https://arxiv.org/abs/2504.07532 should be compared against. WQRM is very relevant: it collects pairwise human story data and also trains a reward model.
3. To create the data, humans are only asked to annotate stories where the LLM judges disagree with each other; but since, as the authors mention, there are biases/flaws in existing LLM writing judgments, it is unclear whether the LLM judges might all like or dislike the same story incorrectly, in which case such errors would never be caught by human annotation.
4. The size of the training dataset, the size of the evaluation dataset, the sizes of their respective sources, the length of the stories, etc. are missing.
5. 'Premise back-translation' doesn't really mean that; the authors create another premise from two similar stories and rank them by engagement, so it could be called something else. Also, this method yields human-human story data, but that data is not in Figure 2, so the reward model's accuracy on this set is unclear.
6. Relatedly, for the experiments, it is unclear why other methods do better on LLM-LLM comparisons than the developed model does. Some analysis would be nice.
7. For the human evaluation, 16 stories seems like a lot to evaluate, and I am not sure how meaningful any resulting ranking is.
8. It would be good to see how the developed reward model performs on another writing benchmark, or to use it to train a model and show improvement; basically, any real downstream application.
1. The dimensional categorization section is confusing. Is its purpose just to figure out which preference group is the most important? Maybe just compute correlations?
2. A clarification question on Figure 3: are the 'tie' sections cases where the models selected the same story?
Fully human-written |