|
A Large-scale Dataset for Robust Complex Anime Scene Text Detection |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper focuses on addressing the gap in text detection datasets for anime scenes. Current text detection datasets are mainly designed for natural or document scenes. To fill this gap, the authors propose AnimeText, a large-scale dataset consisting of 735K images and 4.2M annotated text blocks. The dataset is equipped with hierarchical annotations and hard negative samples, both specifically tailored to the needs of anime scenarios. To evaluate the dataset, the paper conducts cross-dataset benchmarking using state-of-the-art text detection methods. Experimental results indicate that models trained on AnimeText outperform those trained on existing datasets on text detection in anime scenes, which confirms the dataset's effectiveness in addressing the aforementioned gap.
1) Text detectors trained on existing datasets cannot handle anime text detection well, so constructing a specialized anime text detection dataset is helpful.
2) The scalable annotation workflow reflects a substantial, time-consuming effort, and the accompanying evaluation is useful for subsequent anime text detection research.
3) Cross-dataset experiments and ablation studies are conducted to demonstrate the effectiveness of the proposed dataset.
1) A baseline anime text detection method is absent from the paper. Using only existing detection models to show the effectiveness of the dataset is not enough. What problems can scaling up data alone not solve?
2) In Table 5, the F1-score for anime text detection is higher than that for IC15. Is this due to overfitting, or simply to the abundance of training samples?
See Weaknesses. |
Fully human-written |
|
A Large-scale Dataset for Robust Complex Anime Scene Text Detection |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces AnimeText, a large-scale dataset designed for text detection in anime scenes, aiming to fill the gap between existing scene text detection datasets (mostly focusing on natural or document images) and the unique visual characteristics of anime imagery.
The dataset contains 735K images and 4.2M annotated text blocks, covering five major languages (English, Chinese, Japanese, Korean, Russian). The authors propose a three-stage annotation pipeline combining semi-automatic detection (YOLOv11 pseudo-labels), a CLIP-based hard negative sample classifier, and SAM-assisted polygon refinement. Comprehensive statistical analyses and cross-dataset experiments show that models trained on AnimeText outperform those trained on existing natural scene datasets in anime text detection tasks. Overall, the paper aims to contribute a benchmark dataset rather than a novel model or algorithm.
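To ground the discussion of this pipeline, the sketch below shows one way the CLIP-based hard-negative filtering stage could look in practice. This is the reviewer's illustration under assumptions, not the authors' implementation: the checkpoint name, the binary linear head, and the keep/discard rule are all hypothetical.

```python
# Illustrative sketch of CLIP-based hard-negative filtering of detector crops.
# Assumptions (not from the paper): an openai/clip-vit-base-patch32 backbone and
# an untrained binary head standing in for whatever classifier the authors train.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
head = torch.nn.Linear(clip.config.projection_dim, 2)  # hypothetical text / non-text head

def keep_crop(crop: Image.Image) -> bool:
    """Return True if a YOLO-proposed crop is classified as genuine text."""
    inputs = processor(images=crop, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(pixel_values=inputs["pixel_values"])
    return int(head(feats).argmax(dim=-1)) == 1
```

Under this reading, the filter only prunes detector proposals before human verification, so its false-negative rate directly determines which text never gets annotated.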
1. **Clear motivation and relevance.**
The paper identifies a realistic and under-explored problem domain — text detection in anime scenes — which differs visually and stylistically from natural scenes. The motivation is well aligned with the Datasets & Benchmarks track.
2. **Scale and diversity.**
AnimeText is significantly larger than existing datasets (4.2M instances, 735K images) and includes multilingual text with diverse fonts, densities, and artistic styles. This scale can support downstream text detection and OCR research.
3. **Systematic annotation pipeline.**
The three-stage annotation process (YOLO → CLIP → SAM) is well-engineered and reproducible, balancing automation and human verification. Introducing hard negative samples explicitly is a valuable design choice that may improve model robustness.
1. **Incomplete discussion of existing datasets.**
The claim that “no anime text detection datasets exist” is not entirely accurate.
Prior works such as Manga109, Manga-Text-Detection, and other manga-oriented datasets already include bounding-box or polygon annotations for stylized text.
The authors should explicitly differentiate “anime scene frames” from “manga/comic pages” and summarize all related datasets in a comparative table (language coverage, scene type, annotation type, scale, etc.).
2. **Weak justification of domain uniqueness.**
The cross-dataset experiment (training on natural scenes → testing on AnimeText, and vice versa) only demonstrates a domain gap, not necessarily that anime is a “unique” domain.
Any two datasets from different sources would likely show similar degradation.
To claim anime as a distinct domain, the authors should include an additional control experiment (e.g., natural dataset A → B vs. AnimeText → natural) to show the gap is significantly larger.
3. **Insufficient evidence for MLLM relevance.**
Although the introduction repeatedly mentions benefits for multimodal LLMs, all experiments are conducted with traditional detection models (YOLOv11, DBNet, LRANet).
No multimodal or downstream experiments (e.g., OCR recognition, image-text QA) are provided. The claim that AnimeText will “enhance MLLMs” is therefore overstated.
The authors should either include a small-scale demonstration or tone down the claim.
1. Can the authors clarify the difference between “anime scene” and existing manga text datasets (e.g., Manga109)?
Are there examples where Manga109-trained models fail but AnimeText-trained models succeed?
2. How large is the observed cross-domain gap compared to gaps between other natural scene datasets?
Would similar degradation appear when training on TextOCR and testing on COCO-Text?
3. Could the authors show any downstream example (e.g., OCR recognition, subtitle translation, or VQA) to justify the claimed relevance for multimodal LLMs? |
Fully AI-generated |
|
A Large-scale Dataset for Robust Complex Anime Scene Text Detection |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces the AnimeText dataset, the first large-scale dataset (735K images, 4.2M text blocks) for anime scene text detection, addressing the lack of benchmarks in stylized 2D visual contexts.
The AnimeText dataset is the first large-scale benchmark for anime scene text detection (735K images, 4.2M text blocks), filling a key research gap where existing datasets focus on natural or document images. It is substantially larger than prior benchmarks (26× TextOCR, 72× ICDAR19-ArT) and supports multiple languages, including Japanese, English, Chinese, Korean, and Russian. It features hierarchical polygon-level annotations suitable for fine-grained detection and OCR, and a three-stage construction pipeline combining manual labeling, CLIP-based hard-negative filtering (Acc/F1 ≈ 98%), and multi-granularity tagging. Benchmarks across YOLOv11, DBNet, LRANet, and Bridging Text Spotting demonstrate major performance gains (F1 ↑ from 0.008 to 0.855 for LRANet).
1. While B0/B1 annotation levels are defined, the paper does not specify which level is used for cross-dataset evaluation in Table 5. Since AnimeText employs a coarser, block-level labeling scheme, direct comparison with word-level datasets such as ICDAR15 is not strictly fair or interpretable.
2. No controlled experiments are provided to isolate the effects of annotation granularity on model performance (e.g., by splitting B0 regions into word-level annotations or merging ICDAR15 into block-level). Consequently, it is difficult to disentangle performance differences caused by domain gap versus annotation gap.
3. The paper shows limited comparison with structurally similar datasets. Although CTW-1500 appears in the dataset statistics and uses polygonal, line-level annotations (one box per text line, including whitespace), it is omitted from experimental comparisons. Given its structural similarity to AnimeText’s B0 format, CTW-1500 would serve as a more aligned and informative baseline than ICDAR15.
4. The reported precision gains from hard-negative filtering are not accompanied by statistics on mistakenly removed true text instances (false negatives). This omission obscures the impact on recall and overall F1 performance; a short note after this list makes the resulting bound explicit.
5. While the paper notes that AnimeText exhibits higher text density and peripheral text positioning, it stops short of quantifying how these factors affect detection performance (e.g., through density-stratified F1 curves or ablation analysis).
6. All data originate from online anime sources under research-use licenses. The absence of real-world or user-generated images (e.g., screenshots, edited anime, or derivative fan content) limits the dataset’s robustness and generalization potential for practical applications.
7. AnimeText focuses solely on text localization without providing transcription-level annotations. This restricts its applicability for full OCR pipelines, subtitle translation, or vision-language understanding tasks (e.g., VQA).
8. Several contemporary foundation-level or transformer-based text detection models are omitted from comparison. Including such systems would provide a stronger contextual benchmark and clarify where AnimeText stands relative to current state-of-the-art detectors.
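Regarding weakness 4, and as a reviewer-side note rather than anything stated in the paper: if the hard-negative filter erroneously discards a fraction f of genuine text instances during annotation, then the completeness of the released labels with respect to the text actually present in the images is bounded as

```latex
\text{completeness} \;\le\; 1 - f,
\qquad
f = \frac{\#\{\text{true text instances removed by the filter}\}}
         {\#\{\text{true text instances in the source images}\}}.
```

Models trained on such labels are implicitly taught to ignore the removed text styles, so reporting f alongside the precision gains would make the trade-off of the filtering stage transparent.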
1. Which annotation level (B0 or B1) is used for the cross-dataset evaluation in Table 5? If B1 annotations are used, how is fairness ensured when comparing to word-level datasets such as ICDAR15, given the coarser granularity of AnimeText? Conversely, if B0 annotations are used, what specific rules govern multi-line grouping, intra-line whitespace handling, and segmentation consistency across varying text layouts?
2. While the paper reports a +26.9% increase in precision and improvements in Acc/F1 at the classifier level, it does not disclose the false-negative rate, i.e., the proportion of true text instances mistakenly filtered out. Could the authors provide quantitative results on the percentage of true text erroneously removed and its subsequent impact on recall and overall F1? Such information would help assess whether the precision gains come at the expense of text coverage.
3. Can the authors experimentally isolate the impact of annotation granularity? For example:
   a. Split AnimeText B0 annotations into word-level units and re-evaluate performance.
   b. Merge ICDAR15 word-level boxes into block-level regions and report mAP/F1 for both settings.
This would help disentangle differences attributable to annotation granularity from those caused by domain variation; a minimal merging sketch is given after this list.
4. Why was CTW-1500 excluded from comparative evaluation? As CTW-1500 provides line-level polygon annotations, structurally similar to AnimeText’s B0 format, it would serve as a more directly comparable dataset than ICDAR15 and help validate cross-annotation consistency.
5. AnimeText consists exclusively of static frames and omits temporal continuity information relevant to anime videos—such as motion blur, scene transitions, or persistent subtitle overlays. This limitation significantly reduces applicability to real-world anime text spotting and temporal OCR pipelines. |
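As referenced in question 3, a minimal sketch of the block-merging direction of that experiment is given below. It is the reviewer's illustration under assumptions (axis-aligned word boxes given as (x1, y1, x2, y2) tuples, an arbitrary 5 px padding, and simple overlap-based grouping); it is not the paper's annotation protocol.

```python
# Merge word-level boxes into block-level regions by grouping boxes whose
# padded extents overlap, then taking the bounding box of each group.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def merge_words_to_blocks(boxes: List[Box], pad: float = 5.0) -> List[Box]:
    parent = list(range(len(boxes)))  # union-find over box indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def overlaps(a: Box, b: Box) -> bool:
        return not (a[2] + pad < b[0] or b[2] + pad < a[0]
                    or a[3] + pad < b[1] or b[3] + pad < a[1])

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(boxes[i], boxes[j]):
                parent[find(i)] = find(j)  # union the two groups

    groups = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)
    return [(min(b[0] for b in g), min(b[1] for b in g),
             max(b[2] for b in g), max(b[3] for b in g)) for g in groups.values()]
```

For example, merge_words_to_blocks([(0, 0, 10, 10), (12, 0, 22, 10), (100, 100, 110, 110)]) yields two blocks, one spanning the first two words and one for the isolated word. The reverse direction (splitting B0 blocks into word-level units) would likely require transcriptions or a word segmenter, which is part of why this comparison is non-trivial.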
Fully AI-generated |
|
A Large-scale Dataset for Robust Complex Anime Scene Text Detection |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents AnimeText, a large-scale dataset containing 735K images and 4.2M text annotations for anime scene text detection. While the work addresses an underexplored domain with substantial data collection efforts, the paper suffers from limited task scope (detection-only without OCR capability), insufficient experimental validation, and lack of depth in analyzing what makes anime text unique.
1. The dataset addresses the overlooked domain of anime scene text detection, which has distinct characteristics from natural scene and document images.
2. Multi-granularity hierarchical annotations support different application needs.
3. This work represents a substantial data collection effort with 735K images and 4.2M annotations.
1. Limited task scope severely restricts practical value: the dataset provides only bounding-box annotations without text transcriptions, which prevents end-to-end OCR applications. Modern OCR datasets provide both localization and recognition annotations, and text detection alone is merely a preliminary task in the OCR pipeline. While the authors acknowledge this limitation, transcriptions are a fundamental requirement for a dataset paper in 2025, not an optional extension.
2. Contradicts the motivation of supporting MLLMs: The authors cite Qwen2.5-VL and Gemma 3 to motivate the work, but these VLMs require more than just bounding boxes. Furthermore, models like Monkey-OCR and Qwen2.5-VL have already achieved strong performance on artistic fonts, raising the question: what unique value does a detection-only dataset provide?
3. The paper presents contradictory evidence on task difficulty. Table 5 shows a puzzling result:
> YOLOv11 (ICDAR15 → ICDAR15): F1 = 0.565, mAP₅₀:₉₅ = 0.291
>
> YOLOv11 (AnimeText → AnimeText): F1 = 0.851, mAP₅₀:₉₅ = 0.806
>
The model performs significantly better on AnimeText, which suggests that anime text detection may actually be an easier task than natural scene text detection.
4. There is insufficient analysis of anime text as artistic/stylized fonts. The paper repeatedly mentions "handwritten and stylized artistic fonts" as a key characteristic, but provides no comparison with other datasets containing artistic fonts.
1. What is the quantitative comparison with Manga109?
2. How does anime text differ from other artistic text domains? |
Lightly AI-edited |