ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 2.00 | 3.00 | 2581 |
| Fully human-written | 3 (75%) | 2.67 | 4.33 | 2433 |
| Total | 4 (100%) | 2.50 | 4.00 | 2470 |
Title: Value Drifts: Tracing Value Alignment During LLM Post-Training

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper analyzes value drifts in LLMs during training, covering both SFT and preference learning. It constructs a set of value-laden prompts and uses an LLM-as-judge to assess the stance of model responses across training. The paper finds, for its setup (i.e., the specific datasets considered in each stage of training), that value distributions change the most at the beginning of SFT and minimally thereafter, except when training with DPO on preference data in which the responses are forced to represent different stances.

Strengths:
1. The paper tackles an important problem (value alignment) from a novel / underexplored perspective (training dynamics).
2. The paper is well-written, with clear figures, tables, and experimental setups.
3. The paper showcases differences in value alignment throughout training between preference learning algorithms, which could be an interesting avenue for further study.

Weaknesses:
The main missing piece in this analysis, in my mind, is an understanding of the datasets themselves:
1. To say anything meaningful about the claim that models learn values during SFT, it seems important to disentangle the stance distribution of the SFT datasets in the analysis. For instance, how closely do the models match the distributions of the datasets used?
2. The SFT vs. preference learning comparison does not disentangle the fact that the datasets use different query distributions. This seems like a very important aspect of the analysis to consider. For instance, what happens when the datasets used for SFT vs. preference learning are switched? (This general question could be asked with controlled testing using synthetic data.)

Questions:
See the points under weaknesses above. Additional question: can the authors provide some additional analysis / explanation of the differences between DPO and SimPO seen in Figure 4?

EditLens Prediction: Fully human-written
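The first weakness above asks how closely a fine-tuned model's stance distribution matches that of its SFT data. The sketch below shows one way such a comparison could be run; the stance label set, the total-variation metric, and all inputs are illustrative assumptions, not the paper's evaluation pipeline.

```python
# Minimal sketch: compare the stance distribution of SFT training responses with
# the stance distribution of the fine-tuned model's responses on value-laden
# prompts. Label set and metric are assumptions, not the paper's method.
from collections import Counter

STANCES = ("support", "neutral", "oppose")  # assumed label set

def stance_distribution(labels):
    """Normalized frequency of each stance label."""
    counts = Counter(labels)
    total = sum(counts[s] for s in STANCES) or 1
    return {s: counts[s] / total for s in STANCES}

def total_variation(p, q):
    """Total variation distance between two stance distributions."""
    return 0.5 * sum(abs(p[s] - q[s]) for s in STANCES)

# Hypothetical inputs: stance labels for SFT dataset responses on a topic, and
# stance labels for the post-SFT model's responses to evaluation prompts.
sft_data_labels = ["support", "support", "neutral", "oppose"]
model_labels = ["support", "neutral", "neutral", "oppose"]

gap = total_variation(stance_distribution(sft_data_labels),
                      stance_distribution(model_labels))
print(f"dataset-vs-model stance gap (TV distance): {gap:.2f}")
```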
Title: Value Drifts: Tracing Value Alignment During LLM Post-Training

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper traces value alignment throughout the different parts of an LLM's lifecycle, from pretraining to supervised fine-tuning to post-training. First, the authors measure value drift from the base models after SFT and show that this stage shifts values significantly. Then, they explore the effect of PPO, DPO, and SimPO and see very little or no value drift. Finally, they construct a dataset that either supports or opposes values, increasing the value gap, and find that DPO drifts considerably as a response.

Strengths:
- The paper covers different datasets, methods, and models.
- It is relevant to study what happens to the underlying values of these widespread models during different training phases.
- The writing of the paper is very clear.
- There are a lot of details in the paper and appendix, which supports its reproducibility.

Weaknesses:
I think studying the value drifts of these models at different moments of training is important. However, in this paper, you approach this problem mainly from the perspective of the methods used for value alignment. To me, what matters most is the interaction with the data, and you only cover a bit of that at the end, in Section 6.

In Section 4, you show the effects of SFT with two different datasets and models. Using WildChat and Alpaca leads to different values, but you do not explore the prior distribution of these datasets, even though you mention it as a possible explanation. Aren't you just showing that SFT works? That does not seem surprising to me, and you also acknowledge that this is known (line 300).

In Section 5, you show the value drifts after SFT with three algorithms (DPO, PPO, SimPO) and show that there are no significant shifts. Once again, you do not show the prior distribution of the datasets used, even though later you mention this might be an issue.

Finally, Section 6 deals with data biases and the effects of the above-mentioned algorithms. You find very little change from PPO, but, as you say, there is a penalty for deviating from the SFT model. SimPO likewise shows no shifts, but DPO shifts values considerably. Once again, I feel you are mostly showing whether these algorithms work or not, and without saying much about the data, that does not show anything we should not already know.

From my perspective, it would be better to study this problem mostly from a data x algorithm point of view. For example, you could generate different datasets with different distributions, such as bimodal ones, and see how those affect post-training. Do models simply align with the mode that skews them more? Does it result in a neutral-value model? Does the model overfit to support or oppose for each specific mode? I see many more open questions in this approach.

Questions:
- You mention 4 models (Llama3 3B and 8B + Qwen3 4B and 8B) but show results only for the 3B and 4B models. Am I missing something? I see Tables 10 and 12 in the Appendix, but those are just SFT results.

EditLens Prediction: Fully human-written
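The data x algorithm experiment proposed in this review hinges on constructing preference sets with a controlled stance mixture (e.g., bimodal support/oppose). Below is a minimal sketch of how such a synthetic set could be assembled; the sampling scheme, field names, and example prompts are assumptions for illustration, not anything from the paper.

```python
# Minimal sketch: build a synthetic preference dataset whose chosen responses
# follow a controlled stance mixture (here bimodal support/oppose), so that the
# effect of the data distribution on DPO/PPO/SimPO can be isolated.
import random

def make_preference_pairs(prompts, stance_mixture, responses_by_stance, seed=0):
    """Sample a target stance per prompt from the mixture; pair a response with
    that stance (chosen) against one with the opposite stance (rejected)."""
    rng = random.Random(seed)
    stances, weights = zip(*stance_mixture.items())
    opposite = {"support": "oppose", "oppose": "support", "neutral": "neutral"}
    pairs = []
    for prompt in prompts:
        target = rng.choices(stances, weights=weights, k=1)[0]
        pairs.append({
            "prompt": prompt,
            "chosen": rng.choice(responses_by_stance[prompt][target]),
            "rejected": rng.choice(responses_by_stance[prompt][opposite[target]]),
        })
    return pairs

# Hypothetical bimodal mixture: half the chosen responses support the value,
# half oppose it, none are neutral.
bimodal = {"support": 0.5, "oppose": 0.5, "neutral": 0.0}
prompts = ["Should the government expand immigration quotas?"]
responses = {prompts[0]: {
    "support": ["Yes; expanding quotas has clear economic benefits."],
    "oppose": ["No; quotas should stay at their current level."],
    "neutral": ["There are reasonable arguments on both sides."],
}}
pairs = make_preference_pairs(prompts, bimodal, responses)
print(pairs[0]["chosen"])
```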
Title: Value Drifts: Tracing Value Alignment During LLM Post-Training

Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The work investigates how value alignment drifts at each stage of the model's post-training. The findings indicate that different preference optimization algorithms lead to different value outcomes. To measure these drifts, the work constructs V-PRISM, which consists of a total of 550 prompts. The drifts are then calculated by classifying the stance of the model at regular timestamps. The model's post-training drift is measured using two metrics: drift magnitude and drift time.

Strengths:
- The study evaluates several post-training methods, including preference optimization, to measure value drifts.
- The paper explores whether the low value gap in standard preference datasets results in low value drift by using two distinct scenarios, support-aligned and oppose-aligned, where the preferred labels are switched.

Weaknesses:
- There is no explanation of why models adhere, during preference optimization, to the values acquired in the SFT phase. All results are discussed only empirically, and they are also limited. Additionally, there is limited discussion of how different datasets used during SFT result in varying magnitudes of drift during preference optimization.
- The study does not offer practical recommendations for mitigating value drifts during preference optimization for algorithms such as DPO, which shows increased drift compared to PPO and SimPO, or for when an oppositely aligned dataset is employed to maliciously manipulate the alignment of LLMs.
- I am not sure about the reasoning behind studying the drift for PPO, as its formulation includes a KL-divergence term to mitigate exactly this issue; the paper itself attributes PPO's lowest drift to this term. The contribution would be much greater if algorithms that do not include the KL-divergence term, such as DAPO, had also been studied.
- The stance is limited to only three options and does not capture how strongly the model supports or opposes a topic. Furthermore, measuring the stance of a language model with a single LLM judge (GPT-4o) is not robust, as the judge has its own biases. A better approach would be to collect stances from multiple LLMs or to incorporate human evaluation.
- The rationale behind employing the oppose-aligned dataset remains unclear. Preference optimization aims to align the model's behavior with the desired preferences, so utilizing the oppose-aligned data to widen the gap will undoubtedly lead to a shift in values.

Questions:
What is the reason behind selecting Llama3 and Qwen3? Why specifically choose Llama3 and not other variants?

EditLens Prediction: Lightly AI-edited
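One weakness above argues that relying on a single GPT-4o judge is fragile and suggests aggregating several LLM judges or adding human evaluation. The sketch below illustrates a simple majority-vote aggregation with an agreement score; the judge pool, the `judge_stance` stub, and the routing rule are hypothetical and not part of the paper.

```python
# Minimal sketch: collect a stance label from several judge models, keep the
# majority vote, and flag low-agreement cases for human review.
from collections import Counter

JUDGES = ("gpt-4o", "claude", "gemini")  # assumed judge pool

def judge_stance(judge_name, prompt, response):
    """Stub: replace with a real call to the given judge model.
    Must return one of 'support', 'neutral', or 'oppose'."""
    return "neutral"  # placeholder so the sketch runs end to end

def majority_stance(prompt, response, judges=JUDGES):
    """Majority-vote stance with an agreement score; low agreement can be
    routed to human evaluation instead of being trusted blindly."""
    votes = Counter(judge_stance(j, prompt, response) for j in judges)
    label, count = votes.most_common(1)[0]
    return label, count / len(judges)

label, agreement = majority_stance("Should voting be mandatory?", "I think ...")
print(label, agreement)  # e.g. ('neutral', 1.0) with the placeholder stub
```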
Title: Value Drifts: Tracing Value Alignment During LLM Post-Training

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
In the paper, the authors focus on an important topic: how models acquire value preferences through post-training. In particular, they propose a method to quantify the shift of model values during the post-training phases of 1) SFT and 2) RL. They empirically examine the proposed method for 4 models on 4 datasets across these 2 post-training phases and present their findings.

Strengths:
The paper is well written and easy to follow and understand, and the authors study a very interesting and important topic: the value preferences expressed in model outputs and how models acquire these preferences through post-training.

Weaknesses:
I list a few weaknesses below:
1. Overall, the paper reads more like an empirical study of model value alignment. It does not propose a novel algorithm or framework for better aligning models with human values, nor does it construct a new dataset or benchmark for evaluating model values, so the contribution of the paper is somewhat limited. It seems to be preliminary work that needs deeper study of this topic, for example: How can a model be aligned with human values efficiently (with less compute / fewer data points)? Do different values affect each other during post-training? Can one alter the values in a model's outputs by performing another round of SFT/RL, and if so, how can this be achieved more effectively?
2. It would be interesting to check whether any of the datasets used in the SFT and RL stages actually contain data points related to the topics/values examined in the evaluation phase. If they do contain such data points, say related to immigration, do any of them indeed represent a negative/neutral/positive viewpoint that is later picked up by the model? This is not clear to me. I think the proportion of related data in the training datasets is more important than the algorithm used in the post-training phases; however, this is not examined thoroughly by the authors.
3. The coverage of datasets, models, and post-training algorithms is not comprehensive, which makes the conclusions/observations in the paper less convincing. Even with compute constraints, there should be an explanation of why examining only two model families with small model sizes on limited datasets is enough to draw the conclusions.

Questions:
1. Can someone simply inject data points with strong opinions into the training dataset used for SFT or RL and completely alter a model's values? Is it more effective to do so during post-training than during pre-training? What about PEFT algorithms for fine-tuning? This is not studied in the paper.

EditLens Prediction: Fully human-written
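Weakness 2 above asks whether the SFT/RL training sets actually contain topic-related, value-laden examples and in what proportion. The sketch below shows one simple way such an audit could look; the keyword lists, the `classify_stance` stub, and the example records are illustrative assumptions rather than anything from the paper.

```python
# Minimal sketch: estimate what fraction of an SFT dataset touches each
# evaluated topic via a keyword filter, and tally the stance of the matches.
from collections import Counter, defaultdict

TOPIC_KEYWORDS = {  # hypothetical topic -> trigger terms
    "immigration": ["immigration", "immigrant", "border policy"],
    "gun_control": ["gun control", "firearm", "second amendment"],
}

def classify_stance(text):
    """Stub: replace with a stance classifier (LLM judge or fine-tuned model)."""
    return "neutral"  # placeholder so the sketch runs end to end

def audit_sft_dataset(examples):
    """examples: iterable of dicts with 'prompt' and 'response' strings."""
    stance_by_topic = defaultdict(Counter)
    topic_counts, total = Counter(), 0
    for ex in examples:
        total += 1
        text = (ex["prompt"] + " " + ex["response"]).lower()
        for topic, keywords in TOPIC_KEYWORDS.items():
            if any(k in text for k in keywords):
                topic_counts[topic] += 1
                stance_by_topic[topic][classify_stance(ex["response"])] += 1
    return {t: {"share": topic_counts[t] / max(total, 1),
                "stances": dict(stance_by_topic[t])}
            for t in TOPIC_KEYWORDS}

report = audit_sft_dataset([
    {"prompt": "What do you think about immigration reform?",
     "response": "Expanding legal immigration pathways tends to help the economy."},
    {"prompt": "Write a haiku about autumn.",
     "response": "Leaves drift to the ground..."},
])
print(report)  # e.g. {'immigration': {'share': 0.5, 'stances': {'neutral': 1}}, ...}
```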