Efficient Generative Models Personalization via Optimal Experimental Design
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces an approach to personalizing text-to-image generative models with an RL-based preference learning framework. The motivation is to use optimal experimental design to efficiently search the space of prompts that match user preferences. The authors present qualitative and quantitative results based on the Stable Diffusion architecture.
- The paper overall is generally well communicated and theoretically grounded.
- The motivation of efficiently personalizing a model from user feedback is a pressing one.
- The authors do not clearly report information about the user study, such as how the participants were recruited and how many there were; this should be presented more prominently.
- Opinion: I am not convinced that searching in the space of prompts is the best use of this kind of method. Preference learning seems like a great potential tool for identifying user preference information that is complementary to a prompt. Why couldn't the user just write their own prompt rather than answering more than 50 queries? If the information from the search procedure were complementary to the text prompt, the approach would be better motivated.
- Minor weakness: the method builds on an older text-to-image generation architecture.
- Writing Opinion: the application setting should be presented earlier in the paper and more clearly. It is not until the last page that any qualitative results relevant to the application are presented, and there is only a single example in the main manuscript. I would advise adding a higher-level qualitative figure that presents the method in the context of the application much earlier.
- Did the authors consider values of K other than 4? Are there practical limitations to increasing K? How does this affect the convergence of the method?
- How many users did the authors use in their study? How were they recruited?
- Do the collected preferences generalize to new base prompts? Or would a user need to answer queries like this for every prompt? I.e., does this capture a general sense of a user’s “style”?
Fully human-written |
Efficient Generative Models Personalization via Optimal Experimental Design
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes ED-PBRL, a theoretically grounded framework for reward modeling that uses Optimal Experimental Design (OED) to select the most informative human preference queries. It formulates preference query selection as a convex optimization problem that maximizes the information gained about the latent reward function underlying user preferences, as quantified by the Fisher Information Matrix.
The authors provide theoretical guarantees, including a self-concordant bound on the mean squared error via the Fisher information matrix and global convergence results for discrete generative models. Empirical experiments—on both synthetic data and human-in-the-loop text-to-image personalization—demonstrate that ED-PBRL achieves the same alignment performance with fewer preference queries than random selection.
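For context, a rough sketch of the kind of design objective the summary describes; the notation (candidate query set $\mathcal{Q}$, design $\lambda$, per-query information $I_q(\theta)$) and the log-det scalarization are my own shorthand and one common choice, not necessarily the paper's exact formulation:

$$
F(\lambda;\theta) \;=\; \sum_{q \in \mathcal{Q}} \lambda(q)\, I_q(\theta),
\qquad
\max_{\lambda \in \Delta(\mathcal{Q})} \; \log\det F(\lambda;\theta),
$$

i.e., choose a distribution $\lambda$ over candidate preference queries so that the aggregated Fisher Information Matrix about the latent reward parameters is as large as possible.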
1. This work proposes a new upper bound on the MSE between the estimated latent reward model and the ground truth, expressed through the Fisher Information Matrix via a self-concordant analysis.
2. The authors reformulate the intractable FIM-maximization objective into a tractable one and use the Frank-Wolfe algorithm to obtain exploration policies with guaranteed optimality and convergence rate (a generic sketch of such a Frank-Wolfe step follows this list).
3. Both the synthetic and real-world experiments demonstrate a significant performance boost with the proposed ED-PBRL-guided optimal human query selection.
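To illustrate the Frank-Wolfe component mentioned in strength 2, here is a minimal sketch of Frank-Wolfe ascent on a D-optimal-style objective over the query simplex; this is my own illustrative code, not the authors' implementation, and the feature vectors `Z` are placeholders:

```python
import numpy as np

def frank_wolfe_design(Z, n_iters=200, ridge=1e-6):
    """Maximize logdet(sum_i lambda_i z_i z_i^T) over the probability simplex.

    Z: (n_queries, d) array of per-query feature (difference) vectors.
    """
    n, d = Z.shape
    lam = np.full(n, 1.0 / n)                             # start from the uniform design
    for t in range(n_iters):
        M = (Z * lam[:, None]).T @ Z + ridge * np.eye(d)  # weighted information matrix
        M_inv = np.linalg.inv(M)
        grads = np.einsum("ij,jk,ik->i", Z, M_inv, Z)     # d logdet / d lambda_i = z_i^T M^{-1} z_i
        i_star = int(np.argmax(grads))                    # linear maximization over the simplex
        step = 2.0 / (t + 2.0)                            # standard Frank-Wolfe step size
        lam = (1.0 - step) * lam
        lam[i_star] += step
    return lam

# Toy usage: 50 candidate queries with 5-dimensional features.
rng = np.random.default_rng(0)
lam = frank_wolfe_design(rng.normal(size=(50, 5)))
print("most heavily weighted queries:", np.argsort(lam)[-5:])
```

Each iteration only needs a single argmax over candidate queries, which is what makes Frank-Wolfe attractive for design problems of this form.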
1. Policy extraction from state visitation measures is computationally inefficient. In the tabular setting this is straightforward and cheap (see the formula after this list), but for token-level, long-trajectory generative settings (which is the practical case for LLMs), the proposed methodology may be computationally intensive and impractical.
2. The scale of the experiment is limited.
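For reference, the standard tabular recovery of a policy from a state-action visitation measure $d(s,a)$ (my notation) is

$$
\pi(a \mid s) \;=\; \frac{d(s,a)}{\sum_{a'} d(s,a')},
$$

which requires a visitation value for every reachable state-action pair; with token-level states this table grows with the number of token prefixes, which is the scalability concern in weakness 1.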
1. What is the time complexity of the proposed ED-PBRL algorithm using Convex-RL?
Fully human-written |
Efficient Generative Models Personalization via Optimal Experimental Design
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces ED-PBRL, a framework for efficient personalization of generative models with minimal user feedback. The key idea is to model user preferences as an unknown linear reward function and to use Optimal Experimental Design principles to select the most informative queries for learning this reward. Experiments show that ED-PBRL personalizes text-to-image generation via prompt construction with fewer feedback rounds than random exploration.
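Based on this summary, I assume the preference model is essentially the standard Bradley-Terry model with a linear reward $r_\theta(x) = \theta^\top \phi(x)$, i.e.

$$
P(x \succ x' \mid \theta) \;=\; \sigma\!\big(\theta^\top(\phi(x) - \phi(x'))\big),
\qquad \sigma(u) = \frac{1}{1 + e^{-u}},
$$

so learning the reward reduces to estimating $\theta$ from pairwise comparisons, and query selection amounts to choosing which pairs $(x, x')$ to show the user.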
The paper aims to address an important point about designing efficient and tractable experiments to learn preferences.
- **Clarity and organization could be improved.** Some aspects of the method and experimental setup are difficult to follow from the main text. The setup in Appendix A.1 provides a clearer picture of the trajectory structure, and bringing some of that material (perhaps with a small illustrative example or figure) into the main paper could be helpful. As currently written, one might initially think the user provides feedback after every token of a narrative description, which seems impractical; an explicit example trajectory could address this.
- **Simple baseline.** It is not obvious why an MDP is necessary to learn the preferred attributes. For example, a simple baseline could be to learn a distribution over the vocabulary: when a certain image is preferred, increase the weights of all the attributes used to generate that image (a rough sketch of this baseline follows the list).
- **Independence assumptions.** The formulation appears to assume that design attributes (e.g., ambience, style, etc.) contribute independently to the final trajectory. In general, such attributes can be correlated; for instance, ambience and lighting or style choices often co-vary in preferences. Diffusion models and LLMs may already encode such biases, so it might be useful to discuss whether the method could take advantage of these correlations.
- **Human evaluation improvements appear modest.** The held-out accuracy in the human preference experiments seems close to that of random exploration. Some further discussion on potential reasons (e.g., noise in human feedback, reward parameterization) could help clarify how the method can be improved.
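To make the baseline suggested in the second bullet concrete, here is a minimal sketch assuming prompts are assembled from a small attribute vocabulary; all names (`ATTRIBUTE_VOCAB`, the learning rate, etc.) are hypothetical and not taken from the paper:

```python
import random
from collections import defaultdict

ATTRIBUTE_VOCAB = ["warm lighting", "cool lighting", "oil painting",
                   "photorealistic", "minimalist", "baroque"]

weights = defaultdict(lambda: 1.0)  # uniform prior over attributes

def sample_attributes(k=2):
    """Sample k attributes (with replacement) proportionally to their current weights."""
    probs = [weights[a] for a in ATTRIBUTE_VOCAB]
    return random.choices(ATTRIBUTE_VOCAB, weights=probs, k=k)

def update_from_preference(preferred_attrs, rejected_attrs, lr=0.5):
    """Upweight attributes of the preferred image; downweight those of the rejected one."""
    for a in preferred_attrs:
        weights[a] *= (1.0 + lr)
    for a in rejected_attrs:
        weights[a] *= max(1.0 - lr, 1e-3)

# One feedback round: build two candidate attribute sets, ask the user, update.
attrs_a, attrs_b = sample_attributes(), sample_attributes()
user_prefers_a = True  # placeholder for real user feedback
update_from_preference(attrs_a if user_prefers_a else attrs_b,
                       attrs_b if user_prefers_a else attrs_a)
```

The point of such a baseline is not that it would outperform the proposed method, but that it would isolate how much of the gain actually comes from the MDP formulation.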
- What does a single generation trajectory look like? Is the generation grounded in any way, so that adding new design tokens makes minimal changes to the image apart from the required new “design”?
- Were other forms of reward considered, e.g., representing the reward with a neural network?
- What is the size of the vocabulary? How were the image attributes determined?
Lightly AI-edited |
Efficient Generative Models Personalization via Optimal Experimental Design
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper looks at the problem of query selection for preference learning. A novel approach rooted in optimal experimental design is presented. The main contributions are (1) an upper bound on the MSE as a function of the Fisher Information Matrix, converting the query design problem into an optimization problem, (2) operationalizing this optimization problem by reformulating the objective into a tractable form and proposing a novel algorithm to optimize it, and (3) experiments using a text-to-image generative model with both synthetic and human-specified reward models.
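As background for contribution (1): the classical Cramér-Rao relation already ties estimation error to the Fisher Information Matrix, e.g. for an unbiased estimator $\hat\theta$ of $\theta^{\star}$,

$$
\mathbb{E}\big[\|\hat\theta - \theta^{\star}\|_2^2\big] \;\ge\; \operatorname{tr}\!\big(F(\theta^{\star})^{-1}\big);
$$

my reading is that the paper's self-concordant analysis yields an upper bound of a similar FIM-dependent flavor, which is what makes maximizing the FIM a principled query-selection criterion.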
Query selection is an important problem, so the problem motivation is strong. The idea of using optimal experimental design for preference learning also seems interesting. The theoretical bounds for the regularized Bradley-Terry model with a generalized linear reward model, stated in terms of the Fisher Information Matrix, could be of more general interest. I also appreciate the effort the authors put into conducting a study with human participants.
My main concerns are regarding the practicality of the proposed approach. The objective function requires computing state visitation measures, which is very data-hungry, especially for the high-dimensional state spaces that are the norm for generative models like LLMs; hence the statistical guarantees do not scale well with the state space. In addition, the objective function requires inverting a matrix, which is computationally expensive for larger reward models (a back-of-the-envelope illustration follows below). The experiments are somewhat simplistic, with simple reward models, and do not provide strong evidence for the practicality of the algorithm. I recommend the authors either address these issues or add a section on the practical limitations of the approach.
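To make the scaling concern concrete (a back-of-the-envelope estimate, not figures from the paper): for a reward model with $d$ parameters and token-level trajectories of length $T$ over a vocabulary $\mathcal{V}$,

$$
\underbrace{O(d^{3})}_{\text{FIM inversion per objective evaluation}}
\qquad\text{and}\qquad
\underbrace{O(|\mathcal{V}|^{T})}_{\text{entries of the visitation measure, worst case}} ,
$$

both of which grow quickly beyond the tabular regime studied in the experiments.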
I'm not super familiar with optimal experimental design, but a quick search for preference learning with optimal experimental design shows the following papers:
1. Mukherjee, Subhojyoti, et al. "Optimal design for human preference elicitation." Advances in Neural Information Processing Systems 37 (2024): 90132-90159.
2. Schlaginhaufen, Andreas, Reda Ouhamma, and Maryam Kamgarpour. "Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental Design." arXiv preprint arXiv:2506.09508 (2025).
Please include these and any other omitted citations if they are relevant to this paper.
Fully human-written |