|
What happens when generative AI models train recursively on each others' outputs? |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper studies how large language models (LLMs) may interact recursively through training data that include other models’ outputs.
1. It introduces a formal framework with two parameters — the synthetic data ratio ($\alpha$) and the initial data weight ($\beta$) — to describe cross-model data mixing.
2. Theoretical analysis (Sec. 3) under a generalized linear model shows convergence and bias–variance behavior depending on $\alpha$ and $\beta$.
3. Empirical results (Sec. 4; Fig. 3–5) using OPT-350M and LLaMA-1B demonstrate that moderate mixing ($\alpha=\beta=0.5$) improves both models’ performance but leads to representation homogenization.
The paper also concludes with a discussion of the implications for model diversity and long-term ecosystem dynamics (Sec. 5).
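For my own reference, here is a minimal sketch of how I understood the roles of $\alpha$ and $\beta$ in the data-construction process (the function names and sampling details are my reconstruction, not the paper's actual pipeline):

```python
import random

def initial_dataset(public_pool, private_pool, beta, n_samples):
    """beta controls how much of the *initial* training data is public
    versus private (my reading of the 'initial data weight');
    assumes the pools are large enough to sample from."""
    n_public = int(beta * n_samples)
    return (random.sample(public_pool, n_public)
            + random.sample(private_pool, n_samples - n_public))

def round_dataset(original_data, synthetic_pool, alpha, n_samples):
    """alpha controls how much of each later round's data consists of
    other models' outputs versus the original human-written data."""
    n_synth = int(alpha * n_samples)
    return (random.sample(synthetic_pool, n_synth)
            + random.sample(original_data, n_samples - n_synth))
```

Under this reading, $\alpha=\beta=0.5$ means each later round is half model-generated data, with the initial human data split evenly between public and private sources.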
1. Clear and meaningful problem setting: The paper addresses a timely and practically significant question, namely how recursive data interactions among generative models affect learning stability and diversity. The motivation and background are well-articulated (Sec. 1–2), making the research goal both relevant and understandable.
2. Comprehensive and interpretable theoretical framework: The proposed formalism based on the parameters $\alpha$ and $\beta$ (Sec. 3) systematically captures cross-model data mixing. The accompanying bias–variance and convergence analysis provides solid conceptual grounding for the empirical findings (Fig. 2–3). Even without verifying every derivation, the overall reasoning is coherent and accessible.
3. Exceptional clarity and readability: The writing is well-structured and accessible to readers beyond the immediate subfield. Explanations, figures, and notation are consistently clear, enabling a broad audience to grasp the motivation, methodology, and conclusions (Sec. 1–5).
1. Limited novelty in the modeling of cross-model interaction: The description of data-mediated interactions between models (Sec. 3) is clear and well-structured but largely descriptive. While it helps readers understand the setup, this section mainly formalizes an intuitive process rather than introducing a new mechanism or theoretical insight. As a result, the contribution of this part feels limited in terms of originality.
2. Gap between theoretical modeling and practical relevance: Most of the paper focuses on theoretical modeling and proofs (Sec. 4–5). Although the derivations appear sound, the connection to real-world large-scale training scenarios remains weak. The introduction of the parameters $\alpha$ and $\beta$ is conceptually useful, yet in practice their exact values or ratios are difficult to estimate or control during continuous training. The conclusions drawn from the linear or generalized linear setting may not easily transfer to nonlinear or high-dimensional models. In essence, while the problem definition is good and the $\alpha$–$\beta$ reasoning is meaningful, it is unclear how the theory can concretely guide actual large-model training.
3. Experimental validation is narrow and idealized: The experiments (Sec. 4–5) mainly serve to verify the theory, but they do not provide further insights into realistic settings. Only two medium-sized models (OPT-350M and LLaMA-1B) and two datasets (SciQ, GSM8K) are used, with highly controlled data composition. The synthetic data are assumed to represent model outputs cleanly, without considering realistic mixtures of human and synthetic text (I acknowledge this is discussed in the limitations section). Scaling experiments or additional ablations (e.g., varying model size, task diversity, or realistic data proportions) would make the findings more convincing.
Overall, the experimental content is rather insufficient. The question itself is meaningful, but the paper does not provide much insight in terms of conclusions. However, considering that this may be primarily a theoretical paper, it is difficult for me to assess the practical value of such a theory. Therefore, I have lowered my confidence score to mitigate the possible impact of this uncertainty.
You may refer to the content in the Weaknesses section. If you can address the doubts raised there effectively, I will consider giving a higher score.
I hope valuable work will not be overlooked.
For example, in addition to theoretical explanations based on existing assumptions, it would be great if you could highlight some unique insights proposed in this paper.
Perhaps the paper already includes them, but I did not notice. |
Heavily AI-edited |
|
What happens when generative AI models train recursively on each others' outputs? |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper investigates the phenomenon of data-mediated interactions among different generative AI models. Specifically, it studies what happens when different generative models are trained on each other’s outputs, a realistic scenario given the increasing prevalence of AI-generated content on the internet. The authors first review evidence that modern large language models (LLMs) are trained on overlapping, internet-scraped datasets that increasingly contain synthetic text from other models. Building on this, they formalize an iterative, interactive training framework where multiple entities train models using mixtures of public, private, and synthetically generated data. The authors then give a theoretical analysis in a linear regression setting and derive closed-form dynamics for bias, variance, and convergence properties, showing that cross-model data sharing can promote homogenization while sometimes improving efficiency. Experiments using OPT-350M and LLaMA 3.2-1B fine-tuned on distinct tasks (SciQ and GSM8K) simulate multi-model interactions and confirm the theoretical predictions: moderate mixing ($\alpha = \beta = 0.5$) yields the best balance.
The paper concludes that recursive cross-model training can both diversify and homogenize generative models.
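As a sanity check on my own reading of the theoretical setting, the following toy simulation (my reconstruction, not the authors' code; the noise scales, dimensions, and the way synthetic labels are generated are assumptions) reproduces the qualitative picture of homogenization alongside reasonable average accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, n = 5, 2, 15, 200       # dimension, models, generations, samples/round
alpha = 0.5                       # fraction of labels taken from the other model
theta_true = rng.normal(size=d)
thetas = [theta_true + rng.normal(scale=1.0, size=d) for _ in range(K)]

for t in range(T):
    new_thetas = []
    for k in range(K):
        X = rng.normal(size=(n, d))
        y_real = X @ theta_true + rng.normal(scale=0.5, size=n)
        # synthetic labels come from the *other* model's current fit
        y_synth = X @ thetas[(k + 1) % K] + rng.normal(scale=0.5, size=n)
        mix = rng.random(n) < alpha
        y = np.where(mix, y_synth, y_real)
        new_thetas.append(np.linalg.lstsq(X, y, rcond=None)[0])
    thetas = new_thetas
    gap = np.linalg.norm(thetas[0] - thetas[1])               # homogenization
    err = np.linalg.norm(np.mean(thetas, axis=0) - theta_true)
    print(f"t={t:2d}  inter-model gap={gap:.3f}  mean error={err:.3f}")
```

This is only meant to illustrate the kind of dynamics the closed-form analysis characterizes; it omits the public/private ($\beta$) split and does not reproduce the paper's exact generative process.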
- A novel problem. The paper's most significant contribution is the formalization of a new, previously unstudied problem: “data-mediated interaction” within multi-model systems. This shifts the research focus from the standard “model collapse” setting (a single model consuming its own outputs) to a more realistic and complex scenario where multiple heterogeneous models coexist and interact by training on a shared data pool containing each other's outputs.
- Comprehensive methodology: The paper supports its claims with a comprehensive methodological approach that provides both theoretical and empirical evidence. The authors (1) develop a formal theoretical model (a linear regression setting) to make analytical predictions about the system's long-term dynamics and (2) validate these predictions with a set of well-designed experiments using large language models (OPT-350M and Llama 3.2-1B).
- Drawback of the whole setting. The theoretical and empirical framework assumes that fine-tuning data for each model is randomly sampled according to ($\alpha-\beta$) proportions (new vs. old, public vs. private). However, in practice, major foundation models rely heavily on highly curated, high-quality fine-tuning datasets that are explicitly designed to avoid noise or low-quality synthetic data. This mismatch between the model’s random-mixing assumption and real-world fine-tuning practices limits the external validity of the results — particularly the conclusions about homogenization and performance degradation under synthetic data reuse.
- The theoretical analysis relies entirely on a linear regression model with Gaussian assumptions, which limits its generality to real-world large-scale nonlinear generative models. Although the authors cite “universality” results, the mapping from this toy model to practical LLM training remains speculative.
- After examining the released code (sft-config), it appears that each fine-tuning round uses a very small effective data volume: the batch size is 2, with gradient accumulation over two steps and only 100 optimization steps per generation. This means that each "generation" sees on the order of a few hundred training examples, which is extremely small compared to realistic fine-tuning scales for modern LLMs (see the back-of-the-envelope calculation after this list). Consequently, the observed trends in "cross-model interaction" may reflect under-trained or noisy optimization dynamics rather than genuine long-term convergence effects. Moreover, the experiments involve only K = 2 interacting models and explore just three discrete values for both $\alpha$ and $\beta$ ({0, 0.5, 1}), providing too coarse a sampling to fully characterize the theoretical phase behavior. These limitations substantially weaken the empirical support for the paper's broader claims about multi-model ecosystems.
- The paragraph beginning with "Step 3: Model Updates" (lines 180-188) is misleading. The paper's description of "model updates" incorrectly claims that successive generations of models such as GPT-1/2/3/4 and LLaMA-1/2/3 are typically trained by initializing with the previous generation's weights. Training a family of models on the same datasets does not imply that each model directly descends from its predecessor.
- The proofs contain several typographical and notational issues that impede verification and, in a few places, likely invalidate steps. See more in Questions.
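Regarding the data-volume point above, a quick back-of-the-envelope calculation (the average sequence length is my assumption, not something I verified in the repository):

```python
# Values taken from the released sft-config as cited above.
batch_size = 2
grad_accum_steps = 2
optim_steps_per_generation = 100

# Assumed average tokens per SciQ/GSM8K example; purely illustrative.
assumed_tokens_per_example = 128

examples_per_generation = batch_size * grad_accum_steps * optim_steps_per_generation
tokens_per_generation = examples_per_generation * assumed_tokens_per_example
print(examples_per_generation)   # 400 examples per generation
print(tokens_per_generation)     # ~51k tokens under the length assumption
```

Either way, this is orders of magnitude below typical fine-tuning scales, which is the basis of my concern about under-trained dynamics.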
- Please consider Weakness 1. The interaction of different models is a novel problem, but the analysis framework in this paper is somewhat weak. Could you improve the problem setting to bring it closer to reality?
- There is a grammar mistake in the paper's title. Use \textbf{each other's} instead of \textit{each others'}.
- I have a question on the proof of Lemma 1 (Appendix E.4). In line 1020, the derivation implicitly equates $(\Pi S \Pi)$ with $(I_K \otimes \underline{S})$. These two matrices are not equal: $(\Pi S \Pi)$ is a dense block matrix (specifically $\frac{1}{K}(J \otimes \underline{S})$), whereas $(I_K \otimes \underline{S})$ is block-diagonal. If the step is only meant to hold in spectral norm, I would appreciate a detailed calculation (see the small numerical check after this list).
- Typos. I list several obvious mistakes below; please proofread the whole paper to improve the quality of the presentation.
- Line 276: its most recent parameter estimate $\hat\theta_{t−1,k}$ instead of $\hat\theta_{t−1,t}$.
- Line 291: in the rightmost term, $y_{t1}$ instead of $y_{tk}$.
- The definition of $S_*$ differs between Section 5 in the main text and Appendix E. |
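For the Lemma 1 question above, here is the small numerical check I ran. Note that I am reconstructing the definitions $\Pi = \frac{1}{K}(J_K \otimes I_d)$ and $S = I_K \otimes \underline{S}$ from the appendix notation, so the check is only as informative as that reading:

```python
import numpy as np

# Reconstructed definitions (my reading of Appendix E, not verified):
# Pi = (1/K) (J_K kron I_d),  S = I_K kron S_u, with S_u the per-model block.
K, d = 3, 2
rng = np.random.default_rng(0)
A = rng.normal(size=(d, d))
S_u = A @ A.T                                    # a generic symmetric PSD block
Pi = np.kron(np.ones((K, K)) / K, np.eye(d))
S = np.kron(np.eye(K), S_u)

lhs = Pi @ S @ Pi                                # dense: (1/K) (J kron S_u)
rhs = np.kron(np.eye(K), S_u)                    # block diagonal
print(np.allclose(lhs, np.kron(np.ones((K, K)) / K, S_u)))  # True
print(np.allclose(lhs, rhs))                                # False: the matrices differ
# Under these definitions the spectral norms do coincide (||(1/K) J||_2 = 1),
# so the step may still be valid as an operator-norm statement:
print(np.linalg.norm(lhs, 2), np.linalg.norm(rhs, 2))
```

If the intended claim at line 1020 is only an operator-norm bound, spelling that out would resolve my confusion; if the matrices themselves are being equated, the step appears to need a fix.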
Fully AI-generated |
|
What happens when generative AI models train recursively on each others' outputs? |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors consider what happens when a collection of models is iteratively trained on a combination of public data, private data, and data generated by all the other models. They begin by arguing that this setup reflects reality by surveying training datasets for a variety of generative AI models. Then, they present a simplified linear regression model for this setting. They derive the bias and variance of the collection of models after $t$ training iterations. They find that the models all converge most efficiently when about half of their initial data is public and about half of their data in future iterations consists of prior-generation outputs. This prediction is validated in experiments training OPT and Llama, where each one has a private dataset (SciQ or GSM8k). Models are able to do transfer learning from each other's outputs.
- The overall paper, presentation, and writing quality is high. This is a very well-executed research project.
- The research topic of recursive training dynamics with multiple models is important, timely, and interesting
- The model is well-designed (Figure 1 and Section 4 are great; I wish they had come two pages sooner)
- The experimental setup is clever
- What we learn in the multi-model setting is limited and follows expectations: the dynamics largely mirror what we've seen in single-model collapse with accumulation, except that different private data can lead to some transfer learning in the population
- The paper could do a better job of providing the intuitive takeaways from the theorems.
- The paper is fairly verbose and repetitive; the core of the paper doesn't begin until page 5
### Overall evaluation
This is a tricky paper to evaluate, as it's a very high-quality paper, but what we learn feels limited. Perhaps other reviewers will feel differently. This paper definitely deserves to be published and does contribute to the area of model collapse. The structure of the paper could be improved, getting to the contributions more quickly and providing clearer takeaways from the theory.
1. What exactly are the takeaways from the theoretical analysis? We see that under certain conditions, each model's estimate converges to the true parameter. Is Figure 2 the real takeaway from this section?
2. In addition to loss, did the same patterns also appear in model accuracy on SciQ and GSM8k?
### Comments
- It's a bit jarring for the paper to go from discussing generative AI for so long and then jump to a linear regression model. The rationale for this simplification makes sense, but should be mentioned/justified earlier in the abstract/intro (e.g., in the context of related work).
- Section 3 has a lot of text, references, and tables for some well-known facts about model training. It could be summarized in a paragraph
- the motivation in the intro for why it matters to study recursive training among multiple models vs just one single model is lacking. The intro just says it has "received little attention," but doesn't explain why it matters that there are multiple models rather than just one. Are there new and different dynamics that occur? Otherwise we could just assume it's basically the same and that studying a single model recursively training is sufficient to understand what happens with multiple.
- Some of the intro and related work was repetitive.
- some in-text citations missing an author (gen, 2022), (app, 2024)
- from the related work, it wasn't clear what it means for all models to have a "bound in error $\pi^2/6$"
- Line 100: "long term effects" is unclear; what is meant is iterative training dynamics, right?
- Line 157: ... this doesn't seem overlooked. It seems like a widely understood fact that many models use CommonCrawl, arXiv, GitHub, Wikipedia, etc
- Figure 1 is very nice. |
Fully human-written |
|
What happens when generative AI models train recursively on each others' outputs? |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper describes work investigating the effects of recursively training LLMs on model-generated data (in combination with human-authored data). The authors motivate the need for research in this direction by highlighting several often-overlooked realities, such as the fact that proprietary LLMs are trained mostly on internet-scraped data and that these datasets overlap substantially. The most important motivation the authors emphasize is that future LLMs will inevitably be trained with LLM-generated (including their own) data. The authors back this with existing literature from learning theory on model collapse. To further shed light on the potential good or bad consequences of this model-data interaction across tasks, the authors formalize an interactive training pipeline controlled by alpha (the fraction of new data per iteration) and beta (the public-private data partition). Results from a small-scale experiment with two model providers (K = 2), derived from the Llama and OPT model architectures and run for t = 15 iterations, show that setting alpha and beta to 0.5 (an equal partition) yields seemingly optimal results across tasks (science QA and math QA) compared to other values. Setting alpha to 0, denoting the use of purely human-generated data, results in degraded task generalization, while setting it to 1, denoting purely LLM-generated data, results in comparable degradation on the original task.
Overall, the paper presents a simple and understandable method for simulating model collapse and cross-model interactions under training on LLM-generated data, but my main concern centers on grounding the experiments with more rigor, such as exploring K = 10/20/50 or t = 50. See the feedback below for other concerns with the paper.
The paper proposes a simple yet intuitive method for exploring the training data dynamics of LLMs, as shown in Figure 1. I found the paper fairly readable, and the way it motivates the need to investigate recursive training on model-generated data is useful in contextualizing the study and supporting the experiments. I believe the model collapse research community may find this paper's results beneficial and interesting.
While I appreciate the simplicity and readability of the paper, there are some issues the authors can address to improve the quality and rigor of the study:
First, the current experimental setup seems shortsighted, using only two model providers (K = 2). Likewise, in terms of iterations, a more realistic scenario would involve repeatedly training on model-generated data over a longer horizon, say t = 30, 50, or even 100. The goal is to rigorously investigate how far performance converges and whether phenomena similar to grokking can arise. Model providers these days release new models extremely quickly (almost one every 3-4 months), so I believe a more realistic setup is needed for the study. Likewise, a larger K, such as 10, could also be explored to further investigate the effects of training on data from a diverse collection of models.
I believe the paper lacks a comparably thorough discussion of the implications of the training-data dynamics grounded in the results. For example, the study shows that a perfect split of 0.5 for alpha and beta seems to produce optimal results, but this seems very idealistic and tied to the current experimental setup. More realistic setups might be more nuanced and highly dependent on factors such as training data quality, task diversity, etc. How can the study account for this? This part is underdeveloped in the paper.
I suggest the authors balance the structure of the paper by prioritizing the experimental results. The experiments and discussion are both pushed back to the last two pages of the paper and feel quite rushed and limited. The first four pages motivating the challenge could be condensed further to prioritize the results. I would also appreciate a more expanded and clearer discussion of task generalization, as this also seems underdeveloped.
Please improve the references and cite published articles. In the introduction alone, the main citations supporting interdisciplinary adoption of generative AI are mostly blogs and websites, overlooking more qualified, peer-reviewed literature and prior works.
“While some amount of mixing improves model performance on previously-unseen tasks, homogenization occurs for $D_*$ at all $\alpha$ and for $\tilde{D}_k$ when $\alpha > 0$—everywhere it can” - this sentence is confusing; could you please clarify and expand on it? |
Fully human-written |