ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated   | 0 (0%)   | N/A        | N/A            | N/A                |
| Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Moderately AI-edited | 0 (0%)   | N/A        | N/A            | N/A                |
| Lightly AI-edited    | 1 (25%)  | 6.00       | 3.00           | 1894               |
| Fully human-written  | 3 (75%)  | 5.33       | 3.67           | 4375               |
| Total                | 4 (100%) | 5.50       | 3.50           | 3755               |
Title: Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces NH-Fair, a benchmark for "Fairness Without Harm" (FWH). Its core contribution is demonstrating that a rigorously tuned ERM baseline (using a novel DTO/FWH selection) often outperforms complex bias mitigation methods. The benchmark is extended to LVLMs, finding that they still suffer from bias and that scaling does not automatically fix it.

Strengths:
1. Practical Problem: The "Fairness Without Harm" (FWH) principle is highly relevant for real-world deployment (e.g., healthcare).
2. Strong Baseline: The paper's key finding, that a well-tuned ERM can beat specialized algorithms, is a crucial finding for the field, backed by extensive HPO.
3. Rigorous Protocol: The DTO/FWH model selection strategy provides a novel and fair method for comparing models.
4. Timely LVLM Analysis: The paper correctly identifies that LVLMs are not a panacea for fairness, a critical and timely observation.

Weaknesses:
1. Limited LVLM Scope: The LVLM evaluation is zero-shot only. This is a major gap, as models are typically fine-tuned, a process which could significantly alter fairness outcomes.
2. No Intersectional Analysis: The benchmark only considers single sensitive attributes, ignoring intersectional biases (e.g., race and gender), which can be more severe.
3. Lacks Deep Insight: The paper is excellent at observing (e.g., "RandAug works well", "optimizers matter"), but provides little analysis as to why these phenomena occur.

Questions:
1. LVLM Fine-tuning: How do you expect your zero-shot LVLM findings to generalize to the more common fine-tuning setting?
2. HPO Cost: Your finding relies on 10k+ GPU-hours. What is the "budget-friendly" HPO recommendation for a practitioner who cannot afford this?
3. Waterbirds: You note Waterbirds is a spurious-correlation dataset, not a social-bias one. Why include it in a social fairness benchmark if it is "easier to resolve" and may dilute the main message?

EditLens Prediction: Lightly AI-edited
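For readers unfamiliar with the selection protocol this review refers to, the following is a minimal sketch of what a DTO-style (distance-to-optimum) rule combined with a fairness-without-harm constraint could look like. The metrics entering the distance, the utopia point, and the harm tolerance `harm_tol` are illustrative assumptions, not the paper's exact protocol.

```python
import math

def dto_score(utility, fairness, best_utility, best_fairness):
    """Distance to the utopia point (best utility, best fairness).

    Smaller is better; both metrics are assumed to be 'higher is better'
    and on comparable scales (e.g., overall and worst-group accuracy).
    """
    return math.sqrt((best_utility - utility) ** 2 + (best_fairness - fairness) ** 2)

def select_model(candidates, erm_utility, harm_tol=0.0):
    """Pick the candidate closest to the utopia point, restricted to models
    whose utility does not fall below the ERM reference by more than harm_tol
    (the 'fairness without harm' constraint)."""
    admissible = [c for c in candidates if c["utility"] >= erm_utility - harm_tol]
    pool = admissible or candidates  # fall back if nothing satisfies the FWH constraint
    best_u = max(c["utility"] for c in pool)
    best_f = max(c["fairness"] for c in pool)
    return min(pool, key=lambda c: dto_score(c["utility"], c["fairness"], best_u, best_f))

# Hypothetical validation results for three hyper-parameter configurations.
candidates = [
    {"name": "cfg_a", "utility": 0.91, "fairness": 0.82},
    {"name": "cfg_b", "utility": 0.89, "fairness": 0.90},
    {"name": "cfg_c", "utility": 0.93, "fairness": 0.78},
]
print(select_model(candidates, erm_utility=0.90, harm_tol=0.01)["name"])  # -> cfg_b
```

Here the utopia point is taken over the admissible pool; the paper may define it differently (e.g., per dataset or over all runs).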
Title: Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper presents a comprehensive study of the fairness-utility trade-off in modern vision and multimodal foundation models. The authors create a new benchmark consisting of various domains and tasks, a systematic model selection protocol, and a set of group fairness and utility metrics. This benchmark is used to evaluate and compare SoTA models, training strategies, and debiasing methods to explore the optimal recipe for accurate and fair models. Experiments show that a well-optimized baseline with large-scale pretraining and data augmentation matches or beats sophisticated mitigation methods in fairness while maintaining performance on utility tasks.

Strengths:
- This work addresses several shortcomings in fairness evaluation in prior work, such as inconsistent hyperparameter selection, overlooked utility performance, and the omission of pre-trained foundation models.
- The evaluation framework is systematic, consisting of multiple datasets and domains, principled training and model selection, and numerous utility and bias metrics. This is crucial to ensure fair comparison between methods as well as reproducibility.
- Extensive ablation experiments on training practices such as model size, batch size, and weight decay, as well as evaluation of VLMs, putting them on the same playing field as dedicated classification models.

Weaknesses:
- The proposed benchmark is limited to classification tasks. Despite being reformulated for image-text matching (CLIP) and generative (LVLM) models, it does not truly address utility and fairness beyond closed-set predictions. I would have liked to see open-set tasks like free-form image-text retrieval and open-ended VQA, captioning, or reasoning, as these are the tasks where vision(-language) foundation models truly overtake task-specific vision models. This would also enable the holistic evaluation of VLM debiasing methods, which also tend to overlook utility performance while optimizing for fairness.
- Even in classification, I am slightly concerned that performance is already saturating for most models and datasets, with group accuracy gaps under 5% and DP, EqOdd over 95%. I am not sure how perceivable these disparities are in practice. It might also be helpful to construct a "fairness-hard" subset with more challenging datasets/subtasks (e.g., more attributes in CelebA correlated with either gender; only "wavy hair" is used in the current version).
- In general, I feel that the conclusions from the experiments confirm anecdotal observations in bias mitigation (e.g., that hyperparameter tuning and model selection make a big difference), but are not groundbreaking in the context of existing fairness benchmarks (e.g., FFB, https://arxiv.org/abs/2306.09468).
- Certain conclusions on hyperparameter choice are qualitative (e.g., optimizer choice and learning rate are more important than other hyperparameters) and could have benefited from a quantified sensitivity metric.

Questions:
See weakness section. In addition, I wonder whether combining data curation and algorithmic mitigation can produce the best trade-off between utility and fairness. Should future research explore the intersection between both, or prioritize data-centric methods as suggested in the paper?

EditLens Prediction: Fully human-written
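For reference, below is a minimal sketch of the group metrics this review discusses (group accuracy gap, plus ratio-style demographic parity and equalized odds, where values closer to 1 are fairer), assuming binary labels and a binary sensitive attribute. The exact metric conventions used in the benchmark may differ.

```python
import numpy as np

def group_metrics(y_true, y_pred, group):
    """Group accuracy gap plus ratio-style demographic parity and equalized
    odds (closer to 1 is fairer) for binary labels and a binary attribute."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    acc, pos, tpr, fpr = {}, {}, {}, {}
    for g in (0, 1):
        m = group == g
        acc[g] = (y_pred[m] == y_true[m]).mean()        # per-group accuracy
        pos[g] = y_pred[m].mean()                       # positive prediction rate
        tpr[g] = y_pred[m & (y_true == 1)].mean()       # true positive rate
        fpr[g] = y_pred[m & (y_true == 0)].mean()       # false positive rate
    ratio = lambda d: min(d.values()) / max(d.values())
    return {
        "accuracy_gap": abs(acc[0] - acc[1]),
        "demographic_parity": ratio(pos),
        "equalized_odds": min(ratio(tpr), ratio(fpr)),
    }

# Synthetic example: group 1 receives noisier predictions than group 0.
rng = np.random.default_rng(0)
g = rng.integers(0, 2, size=5000)
y = rng.integers(0, 2, size=5000)
p = np.where(rng.random(5000) < 0.05 + 0.10 * g, 1 - y, y)
print(group_metrics(y, p, g))
```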
Title: Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs
Soundness: 4: excellent
Presentation: 3: good
Contribution: 2: fair
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
In this paper, the authors created a large standardized benchmark called NH-Fair. The main idea is to be able to compare debiasing methods against an ERM baseline. They first tuned the base model (ERM) exhaustively using a DTO selection strategy, spending over 10,000 A100 GPU hours, which showed that optimizer choice and pretrained weights are crucial; fixing these choices leads to unfair comparisons. The results show that most specialized debiasing algorithms do not reliably outperform this ERM baseline. Instead, simple data augmentation (like RandAug) often achieves "Fairness Without Harm" (FWH), improving fairness without sacrificing utility. Their analysis reveals that large LVLMs still exhibit significant subgroup disparities and that simply scaling model size does not guarantee fairness.

Strengths:
I think the paper has good originality, as NH-Fair unifies bias mitigation evaluation across traditional supervised vision models as well as LVLMs. It is significant in that it introduces the FWH methodology. The execution is substantial, addressing the insufficiently trained baseline problem by spending over 10,000 A100 GPU hours on exhaustive hyperparameter optimization.

Weaknesses:
The paper skips theoretical fairness concepts: it focuses only on group parity metrics, omitting individual fairness and counterfactual fairness. Future work should incorporate these for a wider view of bias. Most of the data concerns human faces; future work needs to include more visual domains. The authors did not fully use all sensitive attributes to evaluate intersectional groups (like gender vs. race). LVLMs were evaluated only in zero-shot prediction mode (classification), even though the paper suggests fine-tuning.

Questions:
Data augmentation is claimed to emerge as a promising practical strategy that "most often achieves fairness without utility loss". Given that the FWH methodology allows for model selection in either the Optimal Zone or the Sub-Optimal Zone, could you quantify, across all seven datasets, the success rate of a method like RandAugment when classifying its best-found model into the Optimal Zone versus the Sub-Optimal Zone?

EditLens Prediction: Fully human-written
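This review highlights that simple data augmentation such as RandAugment on top of a well-tuned ERM baseline often achieves FWH. As an illustration only, here is a minimal torchvision sketch of such an ERM-plus-RandAugment recipe; the backbone, augmentation strength, and optimizer settings are placeholders rather than the paper's configuration.

```python
import torch
from torchvision import transforms, models

# Illustrative training transform: standard ERM preprocessing plus RandAugment.
# The num_ops/magnitude values here are placeholders, not the paper's settings.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Plain ERM: a pretrained backbone with a task-specific head, trained with
# standard cross-entropy (no fairness-aware loss term).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 2)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
```

The only change relative to a vanilla ERM pipeline is the single `RandAugment` line in the transform, which is what makes the "augmentation as mitigation" finding notable.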
Title: Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper presents a new benchmark for bias mitigation methods across 7 image classification tasks. It benchmarks traditional mitigation methods with a ResNet18 as well as vision-language models like CLIP and LLMs. It also includes analyses on the impact of hyper-parameters on model fairness.

Strengths:
S1: The authors conduct a systematic hyper-parameter optimisation and model selection protocol, a step which is often overlooked in bias mitigation benchmarks.
S2: The authors present some unique analyses which are not usually discussed in bias mitigation papers, for instance on the impact of hyper-parameter choice on fairness metrics (e.g., choice of optimiser appears more impactful than choice of weight decay), whether pre-training is done (I liked the plots in Appendix Fig. 5), and the size of LLMs.
S3: They conduct extensive experiments and include a range of model types.
S4: I appreciated this insightful point: "Over-reliance on datasets like Waterbirds may therefore underestimate the difficulty of fairness challenges and overstate algorithmic effectiveness."

Weaknesses:

**Major weaknesses**

W1: My primary criticism of the paper is that I am not sure that what the field needs is another benchmark suggesting that, overall, ERM performs better/more reliably than existing mitigation methods (I would say there is already a broad consensus on this). I would argue that, instead of having an aggregate benchmark where many methods/models are compared across different datasets and metrics, it would make more sense to do a tailored analysis looking at which mitigation methods work when. This is, I believe, in line with current research in the field [1,2,3,4], which suggests that it is important to understand what the cause of bias is, what type of bias it is, and what fairness metric you want to optimise for, in order to determine whether you should conduct mitigation and with what method. Similarly, I think looking at so many fairness metrics at once is also counter-productive, as you will only really care about optimising one metric for a given setting (for instance, there are very few settings where it would make sense to optimise for demographic parity) [5].
W2: I think the statement that "the scaling law" does not apply to AI fairness is over-stated, given that the authors have only conducted a couple of experiments on classification.
W3: The authors say they "suggest not confusing fairness datasets with domain generalization datasets (e.g., Waterbirds, Colored MNIST). While DG datasets probe robustness to distribution shifts, they do not contain socially meaningful sensitive attributes". While I agree (as mentioned in S4) that it is important not to just evaluate mitigation methods on over-simplistic spurious correlation benchmarks like Waterbirds, I disagree that there is an inherent difference between domain generalisation and fairness datasets. I think what matters for bias mitigation methods is just to have some coherent grouping of data (and that can be based on demographic attributes or on other groupings, like where the image was taken, what device was used, etc.). You can have super "simple" socially meaningful groups (e.g., skin tone in skin lesions) the same way you could have very complex "domains" (e.g., geographic location of a sample), so I don't believe there is a hard difference for the mitigation algorithms. Furthermore, I don't think the authors can argue about using socially meaningful groups when they are using very un-meaningful classification targets for some of their experiments (e.g., whether someone has wavy hair in CelebA)!
W4: I generally think there are a lot of tables that are hard to interpret. Many also lack bolding or standard deviations and have inconsistent numbers of decimal places. This gives the manuscript an unfinished aspect. I would recommend that the authors try to turn some more of the tables into plots (for instance, Table 15 on LLM size). I would also recommend that they vary the interpretations; for example, when you say that optimizer choice matters, instead of just listing all the different metric values for different datasets and optimisers, it would be helpful to include some summary stats, e.g., overall accuracy varies by a max of this much, or accuracy gap can increase by 20%.
W5: It would be interesting for the authors to check other LLM tasks like fairness of generated outputs.
W6: In Fig. 3 it would be helpful to show overall accuracy as well, because we don't know whether the models are harming overall performance or not. The authors could also include some kind of statistical testing to see if there are any significant differences relative to ERM.
W7: One of your datasets, FairFace, has no gap (0.86)! What's the point of including it in the fairness analysis?
W8: Small concern that gender and race classification tasks may be problematic.
W9: There are lots of missing details on the way the mitigation methods work, even in the appendix. This is particularly true for bias mimicking, FIS, DFR (you should specify that you are retraining the last layer on a balanced distribution, right?), the CLIP/BLIP models, and the related post-training debiasing methods!
W10: It would be helpful to show the variance of results for each method across hyper-parameters.

**Minor weaknesses**

W11: The min-max fairness definition shouldn't have the min, as you are presenting the metrics, not the way they're optimised (to keep it consistent with the other fairness definitions).
W12: All the left quotes are wrong. Please use `` for left quotes and '' for right quotes in LaTeX!
W13: Table 1 columns are a bit confusing. I would just put the proportion of the targets.
W14: Colouring the cells in Table 3 to give an indication of better/worse metrics would be helpful for the reader.
W15: The wording in the conclusion, "the fairness issue remains unresolved", is overly simplified, and referring to "the fairness issue" feels imprecise, as fairness is so multi-faceted.
W16: What does this wording mean in Section B1: "Without additional illustration"?
W17: It would be helpful to add a note on which methods use demographic information.
W18: Undefined reference, line 1114.
W19: Table 15: you can also get an SD for the average metrics.

[1] Rethinking Fair Representation Learning for Performance-Sensitive Tasks, ICLR 2024.
[2] Change is Hard: A Closer Look at Subpopulation Shift, ICML 2023.
[3] Mind the Graph When Balancing Data for Fairness or Robustness, NeurIPS 2024.
[4] Automatic dataset shift identification to support safe deployment of medical imaging AI, MICCAI 2025.
[5] Critical Appraisal of Fairness Metrics in Clinical Predictive AI, arXiv 2025.

Questions:
Q1: Why did the authors select the datasets they did? Why did they include two medical datasets of the same modality (skin images) instead of considering other medical modalities?
Q2: Are you comparing two model selection methods, or just applying DTO to ERM and then doing fairness without harm for the rest? Later you say "best model selected under DTO or FWH criteria".
Q3: Why did the authors not include post-processing methods? These are often suggested to be the most useful in practice, as suggested here [1].
Q4: Why do you use ResNet18 as a backbone?
Q5: Why do you only look at group x class grouping for resampling?
Q6: Is there any literature on optimisers and fairness? Can you explain your results? Similarly, do you have a hypothesis for the effect of batch size?
Q7: Why do you think there is so much prompt sensitivity? This seems like a significant weakness. Also, for Table 17, it would be helpful to summarise the variation in results as an SD or mean difference in outputs.
Q8: How do you obtain SDs? Average over random seeds?

[1] OxonFair, NeurIPS 2024.

EditLens Prediction: Fully human-written
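W9 asks for more implementation detail on methods such as DFR. For context, the reading the reviewer suggests (retraining only the last layer on a group-balanced held-out split) could be sketched as below; the balancing scheme (subsampling to the smallest label-group cell) and the logistic-regression head are illustrative assumptions, not necessarily the benchmarked implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dfr_last_layer(features, labels, groups, c=1.0, seed=0):
    """Fit a new linear head on a (label, group)-balanced subsample of held-out
    features extracted from the frozen ERM backbone; the backbone is untouched."""
    features = np.asarray(features)
    keys = list(zip(np.asarray(labels).tolist(), np.asarray(groups).tolist()))
    rng = np.random.default_rng(seed)
    cells = sorted(set(keys))
    n_min = min(keys.count(k) for k in cells)  # size of the smallest label-group cell
    idx = np.concatenate([
        rng.choice([i for i, k in enumerate(keys) if k == cell], n_min, replace=False)
        for cell in cells
    ])
    head = LogisticRegression(C=c, max_iter=1000)
    head.fit(features[idx], [keys[i][0] for i in idx])
    return head
```

The published DFR recipe reportedly averages several such balanced subsamples with an L1-regularised head; the sketch keeps only the core idea of a frozen backbone plus a rebalanced linear classifier.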