Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper studies the biases of text-to-image diffusion models when generating images of historical eras. The evaluation is threefold: 1) stylistic bias, 2) historical consistency, and 3) demographic representation. The authors conduct an in-depth evaluation and analysis of each of these aspects. They contribute the HistVis dataset, which contains 30K synthetic images generated by three T2I models: SDXL, SD3, and FLUX.1-schnell.
The paper is a great read, easy to follow, with interesting findings and extensive evaluations. The authors have designed careful and sound evaluation schemes for each of the three aspects they study in the paper. I especially liked that they used multiple VLMs for evaluation rather than relying on a single model as judge. All details of the study are laid out transparently. The multiple qualitative examples were very helpful in getting the point across for each aspect of the study. The authors also discuss and analyze their findings in detail, which gives readers valuable insights.
A comparison with related studies in this direction, in terms of the number of samples and the evaluation strategies used, would help to better position the paper.
Are the HistVis dataset prompts all manually designed, or were LLMs involved in aiding ideation or prompt design? It would be interesting to know how the authors designed these prompts.
Fully human-written

Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces a benchmark for evaluating how text-to-image diffusion models represent historical contexts, addressing a gap in current research which has focused primarily on contemporary demographic and cultural biases. The authors created HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art models (SDXL, SD3, and FLUX.1) using neutral prompts describing universal human activities across five centuries and five decades of the 20th century. They evaluate these images along three dimensions: (1) Implicit Stylistic Associations, finding that models strongly default to specific visual styles for certain eras (e.g., engravings for the 17th-18th centuries, monochrome photography for early 20th century decades) even without explicit stylistic instructions; (2) Historical Consistency, using an automated LLM+VLM pipeline to detect anachronisms; and (3) Demographic Representation, comparing generated gender and racial distributions against LLM-derived historical baselines. The findings demonstrate that T2I models systematically struggle with historically accurate representations, relying on learned stylistic conventions and failing to properly condition on temporal context.
* First systematic evaluation of historical representation in T2I models, articulating why this matters beyond factual accuracy—historical imagery shapes cultural memory, collective identity, and public understanding of the past, with real consequences as these systems increasingly generate educational and cultural content.
* Dataset contribution: HistVis dataset with 30,000 images across 3 state-of-the-art models (SDXL, SD3, FLUX.1), using 100 universal, temporally-agnostic activities paired with 10 time periods—this design cleverly isolates models' internal historical representations by avoiding historically-specific prompts that could encode external assumptions.
* The authors check that prompt engineering fails to mitigate the biases: the mitigation experiments demonstrate that explicit instructions (adding "photorealistic" to prompts, using negative prompts to discourage monochrome output) largely fail to override the models' stylistic defaults (see the sketch after this list for how I understood this setup).
* The authors compare multiple state-of-the-art models and find systematic differences. The comparative analysis reveals model-specific failure modes (SD3 exhibits the highest anachronism rates at 20-25%, SDXL is the most historically accurate, and FLUX.1 is intermediate).
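For reference, the mitigation setup as I understood it looks roughly like the following (an illustrative sketch using the diffusers library; the checkpoint, example prompt, and negative-prompt wording are my assumptions, not the authors' exact configuration):

```python
# Illustrative reconstruction of the prompt-based mitigation the paper tests
# (my own sketch, not the authors' code).
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # assumed SDXL checkpoint
    torch_dtype=torch.float16,
).to("cuda")

base_prompt = "a person drinking coffee in the 19th century"  # hypothetical HistVis-style prompt

# Mitigation 1: explicitly request a photorealistic rendering.
img_photo = pipe(prompt=base_prompt + ", photorealistic photograph").images[0]

# Mitigation 2: discourage the default era style with a negative prompt.
img_neg = pipe(
    prompt=base_prompt,
    negative_prompt="monochrome, black and white, engraving, illustration",
).images[0]
```

If even these explicit interventions leave the era-specific stylistic defaults largely intact, as the paper reports, that strengthens the claim that the associations are learned defaults rather than artifacts of prompt wording.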
- I think this is a good paper, but I am worried about the fact that the demographic baseline uses an LLM as "ground truth". The third metric relies entirely on GPT-4o to estimate historically plausible demographics, meaning any biases the LLM has will be encoded into the benchmark and treated as "correct" historical representation. This is particularly dangerous because: (1) LLMs are trained on internet data that reflects contemporary biases and incomplete historical records, not peer-reviewed historical scholarship; (2) the validation against Our World in Data only covers 3 out of 20 activity categories, leaving 85% of the benchmark unvalidated against any formal historical source; (3) even the validated categories use continent-level distributions while the actual evaluation uses race categories, introducing an additional unsupported mapping; (4) future work citing this benchmark may treat these LLM estimates as authoritative baselines, perpetuating and legitimizing whatever biases GPT-4o encoded. The authors acknowledge this is a "coarse approximation" that cannot replace expert historians, yet they publish quantitative over/under-representation scores without consulting historical demographers or using primary historical sources. This creates a circular validation problem in which one AI system (GPT-4o) judges another (SDXL/SD3/FLUX), with no external ground truth. A benchmark critiquing historical accuracy should itself be grounded in rigorous historical methodology, not LLM outputs. The stylistic and anachronism metrics are well validated, but the demographic analysis risks doing more harm than good by establishing flawed baselines as reference standards.
- Why not use actual historical sources for demographic baselines? You validate 3 categories against Our World in Data with reasonable results (MAE=4.64 for GPT-4o). Why not extend this approach to the remaining 17 categories by consulting historical demographers, census data, labor statistics, or peer-reviewed historical scholarship? Even if comprehensive data doesn't exist for all activity-period pairs, wouldn't partial coverage with real historical data be more valuable than complete coverage with LLM estimates? What were the practical constraints (time, cost, access to expertise) that prevented this? (A minimal sketch of the kind of comparison I have in mind follows these questions.)
- Did you consult any historians or historical demographers during this work? If so, what was their feedback on using LLM-generated baselines? If not, would you consider adding expert validation in a revision, at least for a subset of high-impact categories (e.g., education, work, agriculture) that future users might cite most frequently?
- How do you recommend that future work use the demographic metric? Given that you acknowledge the LLM baselines are "coarse approximations" that cannot replace expert knowledge, should future papers cite the demographic over/under-representation scores as evidence of bias? Or should this metric be treated as exploratory/preliminary pending validation with real historical data? Would you consider adding stronger cautionary language in the camera-ready version?
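To make the first question above concrete, the kind of check I have in mind is a direct comparison of LLM-estimated demographic shares against shares drawn from historical sources, per activity and period. A minimal sketch follows; the numbers and period labels are placeholders, not real OWID or census values:

```python
# Minimal sketch of validating LLM demographic estimates against external
# historical data (placeholder values, not real OWID or census numbers).
import numpy as np

# Hypothetical share of women in one activity across periods (fractions).
llm_estimate = {"17th century": 0.10, "18th century": 0.12, "19th century": 0.20, "1950s": 0.35}
historical   = {"17th century": 0.08, "18th century": 0.11, "19th century": 0.25, "1950s": 0.30}

periods = sorted(llm_estimate)
errors = np.array([llm_estimate[p] - historical[p] for p in periods])

mae = np.abs(errors).mean() * 100  # mean absolute error in percentage points
print(f"MAE = {mae:.2f} percentage points")
```

Even partial coverage of this kind, with the sources documented, would make the over/under-representation scores far easier to interpret than complete coverage based on LLM estimates alone.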
Fully AI-generated

Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper presents a framework for evaluating text-to-image models in terms of their ability to accurately represent historical context. It introduces HistVis, a dataset of 30,000 images generated by three open-source diffusion models that were prompted to depict people performing generic activities across different centuries and decades. The authors then propose an evaluation protocol that examines stylistic associations, historical consistency, and demographic distribution in the generated images, comparing the results with historical data. The study reveals interesting patterns in how models capture or distort aspects of history, offering insights into biases related to style, anachronism, and representation. Overall, this work highlights the lack of historical accuracy in generative models and provides a concrete methodology and dataset to study it.
- The paper highlights an important problem of evaluating historical representation in text-to-image models when depicting generic, everyday activities and provides a clear motivation for addressing it.
- The paper provides and evaluates an interesting dataset consisting of images that depict a comprehensive set of timeless activities spanning approximately five and a half centuries, offering strong coverage across diverse historical periods.
- The findings, particularly the observation of anachronistic objects in images depicting historical time periods, are very interesting and effectively highlight an important issue in the temporal consistency of these models.
- Although the VSG score is supported by a robust methodology, the motivation for evaluating biases in style associations and the choice of the distinct style classes are not sufficiently explained.
- The anachronism detection uses an LLM to generate a list of objects that could be anachronistic for a given activity. How can we ensure that this list is exhaustive? Were other methods, such as object detection, considered for this task? (A rough sketch of the detector-based check I have in mind is given after this list.)
- Evaluating demographic representation in the generated images is an interesting direction. However, the use of LLMs to predict demographic distributions raises concerns about reliability. The paper uses the public OWID dataset to demonstrate the robustness of GPT-4o as an estimator for a small set of activities, but it does not account for potential data contamination, which could explain the model's high accuracy in these limited cases.
- Figures 4 and 6 could be significantly improved to enhance clarity and effectively communicate the key takeaways. In their current form, they present a large amount of information in a single view, which makes it difficult for readers to interpret the results and understand the main insights.
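To clarify the object-detection alternative mentioned in the anachronism point above, the kind of check I have in mind is sketched below. `detect_objects` is a stand-in for any open-vocabulary detector (e.g., OWL-ViT or Grounding DINO), and the candidate list, dates, and threshold are purely illustrative:

```python
# Sketch of a detector-based complement to the LLM-generated anachronism lists.
# `detect_objects` is a hypothetical wrapper around an open-vocabulary detector;
# the candidate objects, dates, and threshold are illustrative only.
from typing import Callable, Dict, List

CANDIDATE_ANACHRONISMS: Dict[str, int] = {
    # object -> rough earliest year of plausible appearance (illustrative)
    "headphones": 1910,
    "wristwatch": 1900,
    "plastic bottle": 1950,
    "smartphone": 2007,
}

def flag_anachronisms(image_path: str, depicted_year: int,
                      detect_objects: Callable[..., List[str]]) -> List[str]:
    """Return detected candidate objects that postdate the depicted period."""
    detections = detect_objects(image_path,
                                labels=list(CANDIDATE_ANACHRONISMS),
                                score_threshold=0.3)
    return [obj for obj in detections if CANDIDATE_ANACHRONISMS[obj] > depicted_year]
```

A fixed list like this would not be exhaustive either, but running it alongside the LLM-generated candidates would give at least a rough sense of how much the LLM lists miss.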
It might be interesting to link the style association results to the training data and look for correlations with, or causes of, such biases (although this might be out of scope for this paper).
Minor typos (not really a weakness, sharing so that authors can fix them):
- Line 275 refers to Appendix 5 instead of Table 5
- Line 346 uses the word “currentumptions”.
- Line 439 says Section Q, instead of Appendix Q
- Is the stylistic predictor (used to predict the style class) also trained to recognize images that do not belong to any of the 5 categories in the training dataset? Since a mitigation prompt was used, what was its goal?
- The paper notes that clothing appears in 2–5% of the anachronistic images (line 322). It is unclear why clothing is considered anachronistic. Does this refer to attire that does not match the depicted time period? If so, it would be helpful to clarify how such inconsistencies were detected.
- Was the prompt used to obtain the racial breakdown for a particular time period specified by continent (as the OWID comparison was based on continent)? If so, which continent were the proportions of the generated images compared against? If not, was it assumed that GPT-4o would estimate global demographic proportions for the corresponding historical period?
Lightly AI-edited

Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper evaluates the historical knowledge embedded in text-to-image (T2I) models. The authors generate images using multiple T2I models based on predefined actions and time periods, ranging from the 17th century to the late 20th century. The analysis focuses on three key aspects: implicit stylistic associations, anachronism detection, and demographic representation. Using carefully defined metrics for each aspect, the study reveals several noteworthy findings: a) different T2I models exhibit distinct stylistic associations across time periods, and these associations persist even under explicit prompting; b) the models frequently produce anachronistic elements, reflecting a lack of temporal awareness; and c) the generations display gender and racial over- or under-representation, highlighting underlying demographic biases. While the intent behind each of these directions is appreciated, I have some issues with the methods used to evaluate them, which I have summarized in the weaknesses.
1. The paper explores the world knowledge embedded in text-to-image (T2I) models, knowledge that parallels that of modern Large Language Models (LLMs). By examining how these models represent historical contexts, the authors go beyond the typical focus on creativity or imagination to probe their practical understanding of reality. This represents an important and relatively underexplored research direction.
2. The findings reveal previously unexamined layers of bias within T2I models. For instance, the recurring depiction of modern artifacts such as headphones in images set in the 17th or 18th century underscores the models' limited grasp of historical realism and their tendency to fill knowledge gaps with contemporary concepts. Addressing such issues is crucial for improving the reliability and historical awareness of future model releases.
3. The paper is clearly structured and well written, making complex ideas accessible and easy to follow.
1. The paper's positioning could be clearer. The provided dataset, being composed of the outputs of T2I models applied to a set of prompts, offers limited standalone value to the community, apart from ensuring reproducibility. The true contribution appears to lie in the methodological framework for analyzing historical biases in generative models. It would therefore strengthen the paper to explicitly present the work as proposing a benchmark for estimating T2I model biases in representing historical contexts, with the accompanying dataset serving as an illustrative application of this benchmark to three specific models.
2. In the anachronism detection evaluation, the reported 72% alignment with human judgment is relatively low, which casts some doubt on the robustness of the anachronism detection component. The concern here is less about the existence of anachronisms and more about the quantitative reliability of the reported metrics (a sketch of the kind of agreement analysis I would find more informative follows this list).
3. The use of Large Language Models (LLMs) to measure gender and racial representation raises validity concerns. While the paper attempts to verify these measures through domains such as education and agriculture, the scope of the evaluation appears insufficient to justify LLMs as reliable tools for assessing historical demographic biases. A more cautious approach would be to limit this analysis to tasks with concrete supporting data and to avoid the use of LLM-based estimations where empirical grounding is weak.
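Regarding point 2, raw percent agreement is hard to interpret on its own; a chance-corrected statistic computed on the same human-annotated subset would say more about reliability. A minimal sketch, assuming binary anachronism labels (the label arrays below are placeholders, not the paper's data):

```python
# Minimal sketch: chance-corrected agreement between the automated pipeline
# and human annotators on a shared subset (placeholder labels).
from sklearn.metrics import cohen_kappa_score

human_labels    = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = anachronism present
pipeline_labels = [1, 0, 0, 1, 0, 1, 1, 0]

kappa = cohen_kappa_score(human_labels, pipeline_labels)
print(f"Cohen's kappa = {kappa:.2f}")
```

Reporting kappa, or the pipeline's precision and recall against the human labels, would make it clearer whether the reported anachronism rates are likely over- or under-estimates.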
The paper uncovers stylistic biases across time periods; however, these are to some extent expected given the data available for those periods (portraits and illustrations for older eras, black-and-white photography for the early 1900s, and more varied imagery today). Maybe the issue could be avoided simply by prompting the models to generate only real-life (photorealistic) images. What is the authors' take on this?
Lightly AI-edited