Libra-Emo: A Large Dataset for Multimodal Fine-grained Negative Emotion Detection
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary:
The paper introduces Libra-Emo, a large-scale, multimodal corpus designed for the detection of fine-grained negative emotions in videos. The authors expand the conventional, coarse-grained emotion taxonomy (e.g. Ekman's six basic emotions) to include 13 categories, paying particular attention to distinguishing eight nuanced negative emotions (e.g. frustration, despair, hatred). The dataset comprises 62,267 video clips (61,625 for training and 642 for testing) sourced from YouTube and annotated using a collaborative active learning strategy involving humans and machines. The authors evaluate leading multimodal large language models (MLLMs) in zero-shot settings and after instruction tuning on Libra-Emo. They demonstrate that fine-tuning significantly improves performance in both in-domain (Libra-Emo Bench) and out-of-domain (DFEW) evaluations.
Strengths:
1) Fine-grained recognition of negative emotions is crucial for applications such as mental health support and content moderation, yet it is an area that has not been widely explored in multimodal learning.
2) Libra-Emo is the largest video emotion corpus, containing eight distinct negative categories and surpassing prior work in both size and granularity.
3) The paper includes zero-shot, fine-tuned and out-of-domain evaluations.
Weaknesses:
1) The authors cite Wikipedia as the primary source for defining thirteen emotions. This is clearly insufficient for scientific work of this level. There are no references to, or descriptions of, recognized psychological models.
2) Although Libra-Emo Bench employs the votes of eight annotators and a threshold of four, the article does not offer any quantitative metrics of agreement. Without this information, it is unclear how the annotation's reliability can be assessed, even in a test set (see the sketch after this list for the kind of statistic I have in mind).
3) Even with MLLM-based label correction, there is no guarantee that persistent biases do not remain, particularly in subjective categories such as 'hateful' versus 'angry'.
4) Although audio is formally included in the corpus, most open-source models are evaluated only on video plus text (V+T). No deeper analysis of the individual modalities is provided.
5) The experiments explicitly restrict the inputs to video and subtitles. This contradicts the stated 'multimodal' focus and essentially renders the audio component decorative.
6) The Appendix states that 2,549 samples were discarded in Round 0 due to 'quality issues', but provides no details on the criteria used to remove these samples or any examples.
7) The qualifications of the annotators and whether they had undergone training are not entirely clear. For instance, the qualifications of professional psychologists and students differ.
8) No analysis has been conducted to determine whether the errors are related to the characteristics of the data (e.g. cultural differences in emotional expression, subtitle quality or audio noise).
9) The reasons for including frustrated, despairful, and hateful, while excluding guilty, ashamed, and anxious, which are also important in applications such as mental health, are not explained.
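To make point 2) concrete, below is a minimal sketch of the kind of agreement statistic that could be reported for the Bench. The vote matrix here is a random placeholder; only the sizes (642 items, 8 annotators, 13 categories) are taken from the paper.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

rng = np.random.default_rng(0)
n_items, n_raters, n_categories = 642, 8, 13  # sizes from the paper

# Placeholder ratings: each of the 8 annotators picks one of 13 labels per clip.
raw = rng.integers(0, n_categories, size=(n_items, n_raters))
# Convert to the (n_items, n_categories) count table that fleiss_kappa expects.
votes = np.stack([np.bincount(row, minlength=n_categories) for row in raw])

print(f"Fleiss' kappa: {fleiss_kappa(votes, method='fleiss'):.3f}")
```

Reporting a value like this (or Krippendorff's alpha) would let readers judge how reliable the fine-grained test labels actually are.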
Questions:
1) Why were these particular negative emotions selected? What psychological theory justifies this taxonomy beyond Wikipedia?
2) Without a per-modality analysis, how can one draw generalized conclusions about the value of audio?
3) Could you please provide the inter-annotator agreement metrics for the Libra-Emo Bench?
4) How many samples required human adjudication in Round 0 compared to Rounds 1-2?
5) How did you ensure that the initial model-generated labels didn’t introduce systematic bias, given MLLMs’ low zero-shot accuracy?
6) What specific criteria were used to define 'quality issues' and lead to the discarding of 2,549 samples in Round 0?
7) What qualifications did the annotators have? Had they received training in emotional psychology?
8) How was the mapping of the 13 emotions to DFEW's seven classes validated? Might a many-to-one mapping inflate the reported WAR/UAR scores? (A small sketch of what I mean follows below.)
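To illustrate question 8, here is a minimal sketch of how WAR/UAR are typically computed after a many-to-one label mapping. The `to_dfew` mapping and the toy labels are mine, purely for illustration, not the authors' protocol.

```python
from sklearn.metrics import recall_score

# Hypothetical 13 -> 7 mapping (illustration only; not the authors' mapping).
to_dfew = {"frustrated": "angry", "despairful": "sad", "hateful": "angry"}

y_true = ["sad", "angry", "happy", "sad"]               # DFEW gold labels
raw_pred = ["despairful", "hateful", "happy", "angry"]  # fine-grained predictions
y_pred = [to_dfew.get(p, p) for p in raw_pred]          # collapse to 7 classes

war = recall_score(y_true, y_pred, average="weighted")  # weighted average recall
uar = recall_score(y_true, y_pred, average="macro")     # unweighted average recall
print(f"WAR={war:.3f}  UAR={uar:.3f}")
```

Because several fine-grained predictions collapse onto one coarse class, a lenient mapping can turn fine-grained mistakes into coarse-grained hits, which is why the mapping itself needs explicit validation.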
Fully human-written

Libra-Emo: A Large Dataset for Multimodal Fine-grained Negative Emotion Detection
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary:
This paper introduces Libra-Emo, a multimodal dataset containing 62K short video clips labeled with 13 discrete emotion categories, including 8 negative emotions (e.g., frustrated, hateful, despairful). The authors argue that existing emotion recognition datasets are too coarse and that negative emotions are underrepresented. They use an active-learning pipeline with LLM-assisted annotation and explanation generation, claiming improved recognition performance and generalization to external datasets.
While the topic is relevant, the contribution is incremental and poorly grounded. The proposed taxonomy lacks theoretical validity, the dataset overlaps heavily with existing benchmarks, and key state-of-the-art works are ignored. The methodological novelty is minimal, and the psychological reasoning behind the label design is weak.
Strengths:
Large-scale data curation effort: 60K+ multimodal clips, collected and annotated systematically.
One of the first dataset efforts focused specifically on negative emotions.
The experiments show some empirical consistency.
Weaknesses:
1. The main concern is the weak psychological and conceptual foundation
The 13-class taxonomy is not theoretically grounded in any established emotion framework (e.g., Plutchik’s wheel, Shaver’s hierarchy, or OCC appraisal model). According to the emotion wheel theory, there are more than 40 negative emotions.
Why the authors chose these specific 8 negative emotions is unclear; the design details and motivation are not explained.
Several important negative emotions are missing, e.g., bitter, awkward, and awful.
Moreover, on closer inspection the category overlap is severe: Sad and Despairful belong to the same category, differing in intensity rather than kind; Frustrated and Angry differ only in controllability; Ironic is a communicative style, not an emotion.
No Valence–Arousal–Dominance (VAD) validation, appraisal mapping, or inter-annotator reliability (κ/α) is reported.
Therefore, the dataset’s construct validity is questionable.
2. Missing Key Baselines and Model Comparisons
The literature review in the paper is insufficient. The paper omits all major contemporary models in multimodal emotion understanding and efforts addressing the issues of fine-grained emotion recognition:
Important MLLM baselines for multimodal emotion recognition, such as EmoVIT (CVPR 2024) and AffectGPT (ICML 2025), are not referenced.
EmoSet (ICCV 2023) uses rich attribute annotations to support emotion reasoning with LLMs via a detailed dataset.
OVMER (ICML 2025) enables open-vocabulary emotion reasoning and continuous affect generation with more than 200 labels, allowing the community to generate rich, fine-grained emotion descriptions (see the MER 2025 challenge results).
3. Experimental results
The experimental results themselves suggest that the reliability of the proposed emotion categories is doubtful: "our model performs poorly in distinguishing between despairful and sad, as well as hateful and angry." These pairs lie within the same emotional category, so the confusion is expected; it can only be resolved by introducing valence and arousal. (A sketch of the within-category error analysis I have in mind is given below.)
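A rough sketch of that analysis: check what fraction of errors stay inside the same coarse emotion family. The coarse grouping and toy labels below are my own illustrative guess, not the authors' taxonomy or results.

```python
# Coarse families are an illustrative guess, not the paper's taxonomy.
coarse = {"sad": "sadness", "despairful": "sadness",
          "angry": "anger", "hateful": "anger", "frustrated": "anger"}

y_true = ["sad", "despairful", "hateful", "angry", "happy"]  # toy labels
y_pred = ["despairful", "sad", "angry", "hateful", "happy"]

errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
within = sum(coarse.get(t) == coarse.get(p) for t, p in errors)
print(f"{within}/{len(errors)} errors stay within the same coarse family")
```

If most errors fall within the same coarse family, that is evidence the fine-grained boundaries are carrying intensity rather than categorical distinctions.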
4. Methodological and Experimental Deficiencies
The so-called “active learning + explanation generation” pipeline is not novel—it replicates standard LLM-assisted annotation practices (already used in EmoSet and AffectGPT etc.).
The benchmark size (642 videos) is too small for robust evaluation; it's only 1% of the training set, which can be affected by noise easily.
No ablations on modality contributions (video, audio, text), explanation effects, or hierarchical label structures.
No cross-dataset evaluation to demonstrate transferability.
Questions:
Please carefully address the concerns above. I will adjust my rating accordingly.
Lightly AI-edited

Libra-Emo: A Large Dataset for Multimodal Fine-grained Negative Emotion Detection
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
To enhance the model’s sensitivity to fine-grained negative emotions, this paper introduces and annotates a fine-grained negative emotion dataset named Libra-Emo. In this dataset, the authors provide a more detailed categorization of negative emotions, expanding the traditional seven emotion classes to thirteen. Both zero-shot and fine-tuning experiments demonstrate that the Libra-Emo dataset effectively improves model performance. Furthermore, the results indicate that models capable of recognizing thirteen fine-grained emotion categories also achieve superior performance on the classic seven-category emotion classification task, thereby validating the effectiveness and rationality of the proposed fine-grained negative emotion taxonomy.
Strengths:
1. The authors propose a practical human–machine collaborative annotation framework and, based on this approach, construct a fine-grained multimodal emotion dataset comprising approximately 62K samples.
2. Comprehensive comparative and ablation experiments were conducted on the annotated dataset. Under both zero-shot and fine-tuning settings, the results demonstrate that models trained with the Libra-Emo dataset achieve significant performance improvements, thereby validating the effectiveness and reliability of the proposed annotation methodology and the dataset itself.
Weaknesses:
1. Some aspects of the comparative experiment design are suboptimal. In the zero-shot experiments conducted on the DFEW dataset, the results for the baseline models Qwen-2.5-Omni-7B and InternVL-2.5-8B are missing. Since Qwen-2.5-Omni-7B already performs well on this dataset, the absence of these baselines somewhat undermines the completeness and persuasiveness of the experimental comparison.
Questions:
1. It is recommended that the authors consider releasing the corresponding baseline code alongside the open-sourced dataset, so that the research community can better reproduce the experimental results and build upon this work. Do the authors have any plans for such a release?
2. From the distribution of emotion-class data, it is evident that the sample sizes across categories are imbalanced. Whether any sample-balancing strategies were employed during training is not addressed in the paper (see the sketch after this list for one common option).
3. The scraped TV and movie clips may involve copyright and privacy issues.
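As a concrete example for question 2, one common balancing option is inverse-frequency weighted sampling. This is a minimal PyTorch sketch with toy labels, not the authors' training code.

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = [0, 0, 0, 0, 1, 2]  # toy, heavily imbalanced label list
counts = Counter(labels)
# One weight per sample: rarer classes get proportionally larger weights.
weights = torch.tensor([1.0 / counts[y] for y in labels], dtype=torch.double)

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
dataset = TensorDataset(torch.arange(len(labels)), torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for ids, ys in loader:
    pass  # minority classes now appear roughly as often as the majority class
```

Even a short statement of whether such resampling (or loss re-weighting) was used would make the training setup easier to interpret.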
Lightly AI-edited

Libra-Emo: A Large Dataset for Multimodal Fine-grained Negative Emotion Detection
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary:
This paper introduces Libra-Emo, a large-scale multimodal dataset designed for fine-grained negative emotion detection in video. The authors propose a refined taxonomy of 13 emotion categories, including 8 distinct negative emotions, and construct two subsets: Libra-Emo Trainset (61,625 samples) for instruction tuning and Libra-Emo Bench (642 samples) for evaluation. A human–machine collaborative active learning strategy is employed for annotation, and extensive experiments are conducted on leading Multimodal Large Language Models (MLLMs), both in zero-shot and fine-tuned settings. Out-of-domain evaluation is also conducted on DFEW.
Strengths:
1. The paper addresses a significant problem with existing emotion recognition corpora: the limited granularity of emotion labels and the underrepresentation of negative emotions. To solve this problem, the authors expand Ekman's basic emotion categories to 13 more specific ones, including 8 categories for negative emotions, and propose the Libra-Emo corpus, which contains a large number of 62,267 samples.
2. The author proposed an active learning framework that combines model voting with targeted human verification. This strategy helps reduce annotation costs while improving label quality iteratively. Additionally, the authors synthesized label-consistent natural language explanations. The ablation studies confirmed that this boosts performance.
3. The corpus evaluation covers several aspects: (1) zero-shot performance of closed-source and open-source MLLMs on the Libra-Emo Bench; (2) instruction tuning across different model scales and architectures; (3) modality ablation; (4) data scaling analysis.
4. The authors evaluate fine-tuned models in a zero-shot setting on the DFEW corpus.
Weaknesses:
1. Only 24,764 out of 61,625 samples (approximately 40.2%) were manually checked during active learning, with 13,000 being labeled in Round 1 and 11,764 in Round 2. Approximately 59.8% of the samples retained the model's consensus labels without human verification, which raises concerns about the reliability of the labels for fine-grained distinctions.
2. The dataset was derived from 385 source videos. However, the paper does not provide information about: (1) the number of unique speakers; (2) speaker overlap between the Trainset and the Bench; (3) distributions of age, gender, and ethnicity. These limitations make it difficult to assess speaker independence and potential demographic biases. The validity of the experimental design is questionable.
3. The annotation prompt asks for "the emotional label expressed by the people", but no protocol is provided for dealing with conflicting emotions among multiple visible individuals. Given that the clips require faces in more than 99% of frames, multi-person scenes are likely included, yet how label consistency is ensured in such cases remains unclear.
4. The term “Libra-Emo” is used for both the corpus and the fine-tuned models (e.g., Libra-Emo-8B). This conflates data and model artifacts. Model names should explicitly reflect their base architecture (e.g., InternVL-2.5-8B-Libra) to avoid confusion.
5. The zero-shot evaluation on the DFEW dataset (Table 7) includes only one specialized emotion recognition model, Emotion-LLaMA, as a baseline. This limits the ability to fully understand the advantages of the Libra-Emo fine-tuned model. For a more comprehensive comparison, additional emotion-specific models should be included in the evaluation.
Questions:
1. What proportion of the 61,625 Trainset samples were initially labeled by humans in Round 0 (i.e., lacked model consensus)? Can you provide the exact number?
2. How many unique speakers are in Libra-Emo? Is there speaker overlap between Trainset and Bench? Can you share demographic distributions (gender, age, ethnicity)?
3. When multiple individuals express different emotions in a clip, what annotation protocol was followed? Was a specific person (e.g., main speaker) targeted?
4. Would you consider renaming the fine-tuned models to clearly distinguish them from the dataset?
5. Can the authors extend their zero-shot DFEW comparison to include other emotion-specific models?
Lightly AI-edited