ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (33%) | 4.00 | 3.00 | 3126 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (33%) | 8.00 | 4.00 | 2291 |
| Fully human-written | 1 (33%) | 4.00 | 3.00 | 6847 |
| Total | 3 (100%) | 5.33 | 3.33 | 4088 |
Title: Tell me Habibi, is it Real or Fake?
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces ArEnAV, the first large-scale Arabic–English code-switching (CSW) audio-visual deepfake dataset, consisting of 387K videos (765+ hours) with intra-utterance CSW, dialect variation, and multiple manipulation types. The authors propose a multi-stage generation pipeline combining GPT-4.1-mini-based transcript manipulation, four TTS systems, and two diffusion-based lip-sync models. The paper benchmarks several state-of-the-art deepfake detection and localization models and shows a drastic performance drop when evaluated on ArEnAV, demonstrating the dataset's difficulty and relevance. A user study further confirms that humans also struggle to detect these deepfakes (≈60% accuracy). The dataset and code are promised to be released.

Strengths:
1. Existing deepfake datasets are monolingual or multilingual but lack intra-utterance code-switching. The paper clearly identifies this gap and addresses it convincingly.
2. Large-scale, well-engineered dataset: 387K videos, 4 TTS + 2 lip-sync models, stratified splits, strong statistics, and a detailed generation pipeline. The dataset is significantly larger and more diverse than prior multilingual datasets. The authors show that state-of-the-art models (e.g., BA-TFD, LipForensics, Capsule-v2) perform poorly, even close to random guessing in some settings, demonstrating real difficulty.
3. Human performance is ≈60% accuracy with poor localization ability, confirming that deepfakes in code-switching settings are hard even for humans, not only for models.

Weaknesses:
1. The paper does not propose any new detection model or algorithm. I view the work as "engineering + dataset release" rather than a scientific advance.
2. Heavy reliance on closed-source models (GPT-4.1, Whisper, TTS-1, etc.) partially limits reproducibility. If OpenAI APIs change, future users may not be able to regenerate the dataset. This may be flagged in the reproducibility checklist.
3. Although CSW is the main motivation, the paper lacks deeper linguistic validation:
   - Is the LLM-generated code-switching natural, or does it read as synthetic?
   - How does the CSW distribution compare to real-world corpora?
   - Does GPT-4.1 make linguistically plausible switching decisions?
4. The real-vs-fake imbalance is acknowledged but not studied; no experiments show how class imbalance affects model learning. Generalization to other multilingual settings is also not demonstrated.

Questions:
1. Could the authors clearly articulate which parts of the pipeline are technically novel, and whether it is reusable beyond this dataset?
2. If these APIs change or become unavailable, can the dataset still be regenerated?
3. Did the authors run any linguistic validation (human or automatic) to ensure the generated CSW resembles real corpora like ZAEBUC or ArzEn?
4. Can the authors provide quantitative evidence on how different TTS/lip-sync components affect detectability or quality?
5. How does this imbalance affect model training? Did the authors experiment with balancing, reweighting, or sub-sampling (see the sketch after this review)?
6. Can the authors comment on whether the pipeline could scale to other CSW settings (e.g., Hindi-English, Spanish-English)?

EditLens Prediction: Fully AI-generated
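
A minimal sketch of the reweighting option raised in Question 5 above, assuming binary real/fake labels and a class-weighted loss in PyTorch; the class counts and model outputs are hypothetical placeholders, not statistics or code from the submission.

```python
# Illustrative only; the class counts below are hypothetical, not taken from ArEnAV.
import torch
import torch.nn as nn

num_majority, num_minority = 90_000, 30_000               # hypothetical class counts
pos_weight = torch.tensor([num_majority / num_minority])  # up-weight the minority (label=1) class

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)                        # stand-in detector outputs
labels = torch.randint(0, 2, (8, 1)).float()      # stand-in 0/1 labels
loss = criterion(logits, labels)                  # minority-class errors now cost 3x more
print(loss.item())
```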

Title: Tell me Habibi, is it Real or Fake?
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The authors point out the lack of a large-scale multilingual dataset that includes both English and Arabic, particularly one featuring code-switching between the two languages, and emphasize that code-switching is highly prevalent in daily speech across the Arab world. To address this gap, they introduce the first large-scale Arabic–English audio-visual deepfake dataset, which features intra-utterance code-switching, dialectal variation, and monolingual Arabic content. It contains 387k videos and over 765 hours of real and fake videos generated with up-to-date SOTA models.

Strengths:
- It will be released as an open-source, large-scale bilingual dataset. Given that Arabic is spoken by hundreds of millions of people all over the world, the dataset holds significant importance.
- The data generation pipeline is clearly described in the paper, enabling easy reproducibility.
- The quality of the generated fake data is comparable with the well-known AV-Deepfake1M dataset, as evaluated by standard metrics.

Weaknesses:
1. Insufficient direct experimental evidence for the "code-switching" contribution. The paper's central contribution is its focus on a multilingual, code-switching (CSW) dataset. However, the experimental results in Tables 8, 9, and 10a do not directly prove that code-switching is the key factor driving the dataset's difficulty. The authors demonstrate that existing models perform poorly and attribute this failure to the novelty of CSW. In particular, the zero-shot performance drop in the temporal localization task (Table 8) is a weak argument: to my knowledge, models like BA-TFD were not designed for zero-shot generalization to new domains, languages, and generation methods. Their failure on this cross-domain dataset is expected and does not isolate the impact of code-switching itself.
2. Absence of close inspection and qualitative analysis. Connected to the first point, the paper lacks a close inspection of why the models fail. The analysis relies heavily on aggregate metrics (AUC, AP), which only show that models fail, not the cause. To substantiate the claim that CSW and multilingualism are the core challenges, a deeper analysis of the fine-tuned models is necessary. The authors could provide qualitative, frame-level examples of incorrect predictions; in its current state, the paper cannot clearly answer the following: Do the models consistently fail at or around the code-switched regions? Or are the errors more correlated with the high-quality diffusion-based lip-sync, specific audio perturbations, or other artifacts? It is difficult to disentangle the source of the dataset's complexity: it is unclear whether the dataset is challenging because of its CSW properties or because of the careful, high-quality generation techniques and perturbations the authors have introduced.
3. Use of outdated benchmark models. The choice of models for benchmarking (Meso4, MesoInception4, Xception) is a notable weakness. These are significantly outdated models that are widely known to overfit to specific dataset artifacts and lack generalization capability. While BA-TFD is included for the temporal task, the overall detection model suite is quite naive. Relying on these older architectures makes it difficult to assess whether the reported performance drop is due to the dataset's genuine complexity or simply to the known limitations of these models. The benchmark would be far more compelling if more recent, state-of-the-art detection models with demonstrated generalization capabilities were evaluated. Minor: the result for LAA-Net listed under the DFDC dataset in Table 10b is incorrect; that metric was obtained on the DFDC-P (DFDC Pre-processed) dataset, as noted in the original citation by the authors (see: https://github.com/10Ring/LAA-Net/issues/3).
4. The protocol of the data generation pipeline is similar to the approach used to generate fake data for AV-Deepfake1M; no novel architecture for fake data generation is introduced.
5. The authors did not conduct an intra-dataset evaluation to assess how code-switching affects the accuracy of temporal localization or deepfake detection models within the same domain. Instead, they evaluated models trained on a dataset containing a small number of Arabic videos (presumably without code-switching) and then tested them on the proposed ArEnAV dataset. This is a cross-domain evaluation, so the drop in accuracy is expected.
6. The dataset also shows limited instruction following in code-switching scenarios: the authors relied on the GPT model to generate code-switching transcripts, which makes real and fake transcripts too similar and does not always change their meaning.
7. The motivation is sound, but the authors have not considered that models such as Diff2Lip and LatentSync are trained on English-speaking videos, and that natural lip sync for real-life Arabic phonemes differs from English, including the speed at which Arabic is spoken. The authors should specify the limitations of and rationale for using these models.
8. The training protocol for Table 10 is not described. Since the fake part can be anywhere in the video, the split of video clips in this experiment is not defined, nor is it stated how much of each video is used for testing. These experimental details should be disclosed for a fair comparison.
9. The authors used 4 commercial audio generation methods, while for visual manipulations they used only 2, which are also not commercial. Visual manipulation is harder to detect, so more diversity there could help build more generalizable deepfake detection.
10. Section 3.2.2 (audio generation pipeline) is very messy and hard to comprehend; a flow chart would help.
11. The text in Figure 1 is very small and could be enlarged; the fonts used for Arabic could also be improved for better readability.
12. Although language overlap is addressed, the identity split is not defined.
13. Metadata information is not provided.
14. The dataset contains imbalanced samples, which may introduce bias during training.

Questions:
1. Why was there no intra-dataset evaluation of the necessity of code-switching manipulation?
2. Could you provide a few qualitative examples (e.g., video frames or audio spectrograms) of your fine-tuned models' failure cases? We need to see where the models are failing. Are the incorrect predictions concentrated around the temporal boundaries of the code-switch, or are they failing due to other artifacts (like the lip-sync) that are unrelated to the CSW?
3. Table 10a compares the proposed partial-fake dataset with full-fake datasets; is this a fair comparison?
4. Did the authors make the dataset identity-disjoint? Also, does the dataset contain real-fake pair information in the metadata? (See the sketch after this review for what an identity-disjoint split would look like.)

EditLens Prediction: Fully human-written
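
A minimal sketch of the identity-disjoint split asked about in Question 4 above, assuming per-video speaker IDs are available in the metadata; the file names, speaker IDs, and use of scikit-learn's GroupShuffleSplit are illustrative assumptions, not the authors' protocol.

```python
# Illustrative only; video names and speaker IDs are hypothetical placeholders.
from sklearn.model_selection import GroupShuffleSplit

videos   = ["vid_000.mp4", "vid_001.mp4", "vid_002.mp4",
            "vid_003.mp4", "vid_004.mp4", "vid_005.mp4"]
speakers = ["spk_A", "spk_A", "spk_B", "spk_B", "spk_C", "spk_C"]  # one ID per video

# Split by speaker group so that no identity appears in both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(videos, groups=speakers))

train_speakers = {speakers[i] for i in train_idx}
test_speakers  = {speakers[i] for i in test_idx}
assert train_speakers.isdisjoint(test_speakers)  # identity-disjoint by construction
print(sorted(train_speakers), sorted(test_speakers))
```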

Title: Tell me Habibi, is it Real or Fake?
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This work introduces ArEnAV, a large-scale Arabic-English audio-visual deepfake dataset designed to address a critical gap in current research: the challenge of detecting deepfakes within multilingual and code-switched content. The work's core contributions are twofold: 1) the ArEnAV dataset itself, the first large-scale benchmark featuring intra-utterance code-switching and dialectal variations; and 2) a novel data generation pipeline that leverages Large Language Models for content manipulation and integrates SOTA TTS and lip-sync models to generate high-fidelity forgeries. Comprehensive benchmark results highlight the dataset's challenging nature, demonstrating a significant performance drop in state-of-the-art models. These findings validate the limitations of current detectors that are predominantly trained on monolingual data.

Strengths:
1. This work successfully tackles an important and overlooked problem: detecting audio-visual deepfakes in code-switched (CSW) speech. This is a major step towards building deepfake detectors that work in the real world.
2. This work proposes ArEnAV, a new large-scale dataset for this task. The pipeline used to create the data is novel and combines several SOTA models, providing a valuable new resource for the research community.

Weaknesses:
1. The primary evaluation metric (AP@IoU=0.5) may be poorly suited for the dataset's extremely short, single-word forgeries.
2. The "TTS and insert" audio generation method can create unnatural splice artifacts, which may affect the dataset's validity. These artifacts could allow models to detect forgeries using simple audio errors rather than the intended code-switching cues, thus misrepresenting the true nature of the detection challenge.

Questions:
1. The paper notes that the LLM did not always change the meaning in the "meaning + translation" mode. Did you try to fix this? If not, how do you know these samples did not lower the dataset's overall difficulty and affect your final conclusions?
2. Could you provide the average, minimum, and maximum duration of the fake words in your dataset? This information is very important: without it, we cannot be sure whether the poor model performance is because the task is hard, or because the evaluation metric (AP@IoU=0.5) is simply too strict for such short fake clips. (A small worked example follows this review.)

EditLens Prediction: Lightly AI-edited
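
A small worked example of the concern in Weakness 1 and Question 2 above, using assumed segment durations (0.3 s vs. 3.0 s) that are not taken from the paper: the same 0.1 s boundary error that barely matters on a long fake segment already lands a single-word fake exactly at the IoU=0.5 threshold.

```python
# Illustrative only; the segment durations below are assumptions, not ArEnAV statistics.
def temporal_iou(a, b):
    """IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# A 0.3 s fake word vs. a 3.0 s fake segment, each predicted with a 0.1 s shift.
print(temporal_iou((1.0, 1.3), (1.1, 1.4)))  # 0.50  -> right at the AP@IoU=0.5 threshold
print(temporal_iou((1.0, 4.0), (1.1, 4.1)))  # ~0.94 -> comfortably counted as a hit
```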