Image Embeddings from Social Media: Computer Vision and Human in the Loop Applications for Social Movement Messaging
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 0
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
The paper collects 16,567 Instagram image posts related to the anti-feminicide movement in Mexico and analyzes whether pretrained vision models can group the images into meaningful topics (e.g., protest signs, solidarity posts, informational posters). Human oversight is provided to make sense of the resulting groups or clusters.
- The social problem is highly relevant: we need to understand what is happening in society and how anti-feminicide discourse circulates online.
- Over 16,000 Instagram image posts are collected and studied, which is a substantial corpus.
- The use of multiple vision models (ResNet50, CLIP, BLIP-2) and multiple clustering methods gives the study both depth and breadth.
- There is a good attempt to connect automated clustering with human oversight, which makes sensemaking of the clusters effective.
- The discussion is detailed, e.g., it acknowledges the challenges posed by dense clusters and text-rich images.
- The methodological contribution is minimal. The paper applies a few existing models with standard clustering techniques; there is no new method or novel grounding on the modeling side. The evaluation is descriptive rather than quantitative.
- The clustering yields negative silhouette scores, yet the authors claim highly useful clustering performance.
- There is no comparison to alternatives, such as other clustering algorithms (e.g., k-means) or multimodal pretrained models tuned on social media data.
- The work is purely exploratory rather than hypothesis-driven, and there is little scientific takeaway for the ICLR audience.
- The paper overemphasizes the social-science side. It may be socially valuable, as it tackles an important problem, but for a machine-learning venue the contribution is too limited.
- Why did you conclude "best separation" when all silhouette scores remain negative?
- Did you try OCR + text embeddings? Many of the images contain dense text. While CLIP/BLIP-2 can handle text and image jointly, did you consider posing the problem differently to obtain a better representation? (A minimal sketch of such a pipeline follows these questions.)
- Why did you use only a few clustering approaches?
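To make the OCR question concrete, below is a minimal sketch of the kind of pipeline meant, assuming pytesseract (with a local Tesseract install) and sentence-transformers are available; the encoder choice, the Spanish language code, and the concatenation-based fusion are illustrative assumptions, not anything from the paper.

```python
import numpy as np
from PIL import Image
import pytesseract
from sentence_transformers import SentenceTransformer

# Hypothetical encoder choice; any sentence-level text encoder would do.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def ocr_text_embedding(image_path: str) -> np.ndarray:
    """OCR the image (Spanish assumed for this corpus), then embed the text."""
    text = pytesseract.image_to_string(Image.open(image_path), lang="spa")
    return text_encoder.encode(text if text.strip() else " ")

def fused_embedding(image_emb: np.ndarray, image_path: str) -> np.ndarray:
    """Concatenate the visual embedding with an OCR text embedding,
    L2-normalizing each part so neither modality dominates."""
    text_emb = ocr_text_embedding(image_path)
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return np.concatenate([image_emb, text_emb])
```

Clustering the fused vectors instead of the purely visual ones would directly test whether the dense, text-heavy images separate better; a k-means or spectral baseline over the same embeddings would be a similarly small change.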
Fully human-written |
Image Embeddings from Social Media: Computer Vision and Human in the Loop Applications for Social Movement Messaging
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper deals with the problem of analyzing images shared on social media platforms. More specifically, images related to a specific topic are collected from a single social media platform, and image features extracted from standard image encoders are grouped using HDBSCAN, a standard clustering algorithm in data mining. The extracted clusters are checked and investigated by human inspection, which reveals that different image encoders yield different clusters and thus different semantic groupings.
S1. The research topic dealt with in this paper is significant. Social media has become one of the most influential media platforms, and its significance continues to grow day by day. In this sense, analyzing the characteristics and dynamics of media content distributed on social media platforms is one of the most significant research topics for understanding the shifts in social conditions and public opinion.
W1. If my understanding is correct, the main topic of this paper belongs to social science, not computer science. I therefore strongly recommend that this paper be submitted to conferences related to social science, such as ICWSM or CHI. ICLR focuses on fundamental theories and innovative technologies for machine learning, placing a high priority on theoretical and/or technical novelty.
W2. On the other hand, discovering and demonstrating novel findings with already-known techniques is also valuable for the healthy development of computer science. However, papers focusing on this aspect should provide extensive investigations from various viewpoints and attempt to address nearly all questions derived from the original research question and the experimental results. See, e.g., [Teney+ CVPR2024 https://openaccess.thecvf.com/content/CVPR2024/html/Teney_Neural_Redshift_Random_Networks_are_not_Random_Functions_CVPR_2024_paper.html]. From this viewpoint, the current paper seriously lacks the deep investigation its research question requires.
W3. The organization also needs major revision. For example, the paper devotes excessive space to explaining well-known techniques, while almost all the experimental results required for the main story are relegated to the supplementary material.
Q1. I could not find the reasons why the authors chose the anti-feminicide movement as their research material. This description is required for understanding the paper's philosophy and research question, and for checking whether ethical issues exist.
Fully human-written |
Image Embeddings from Social Media: Computer Vision and Human in the Loop Applications for Social Movement Messaging
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper analyzes 16,567 Instagram images from the anti-feminicide movement in Mexico using unsupervised and self-supervised embedding models combined with HDBSCAN clustering. The authors employ human-in-the-loop content analysis to evaluate cluster quality and understand visual messaging structures. The results show dense, overlapping clusters across all models, with CLIP achieving the best separation metrics.
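For concreteness, here is a minimal sketch of the pipeline as this review understands it (CLIP image embeddings clustered with HDBSCAN), assuming the open_clip and hdbscan packages; the checkpoint name and HDBSCAN parameters are illustrative, not the authors' settings.

```python
import numpy as np
import torch
import open_clip
import hdbscan
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
model.eval()

def embed_images(paths):
    """L2-normalized CLIP image embeddings for a list of file paths."""
    embs = []
    with torch.no_grad():
        for p in paths:
            x = preprocess(Image.open(p).convert("RGB")).unsqueeze(0)
            e = model.encode_image(x)
            embs.append((e / e.norm(dim=-1, keepdim=True)).squeeze(0).numpy())
    return np.stack(embs)

image_paths = ["img_0001.jpg", "img_0002.jpg"]  # placeholder paths
embeddings = embed_images(image_paths)
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(embeddings)
# HDBSCAN marks points it cannot assign with the label -1 ("noise");
# with dense, overlapping embeddings this noise fraction can be large.
```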
- The paper provides a thorough comparison of three embedding approaches and combines quantitative clustering metrics with qualitative human annotation of 185 sample images.
- The application to anti-feminicide social movement messaging represents a meaningful use case for computer vision methods. The dataset of 16,567 Instagram posts provides a substantial corpus, and the human-in-the-loop analysis surfaces insights that the vision-language models alone cannot find.
- The paper applies existing, well-established methods in an off-the-shelf manner and proposes nothing new on the modeling side.
- How do you justify claiming CLIP is "best" when CLIP's Davies-Bouldin Index is worse than ResNet50's and all methods show negative Silhouette Scores? (The sketch below shows how these two metrics are computed and read.)
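To make the metric question concrete, the sketch below (scikit-learn only, with synthetic blobs standing in for the paper's embeddings) shows how the two metrics are computed and in which direction each is read; nothing here reproduces the authors' setup.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Overlapping synthetic clusters as a stand-in for dense image embeddings.
X, _ = make_blobs(n_samples=2000, centers=5, cluster_std=3.0, random_state=0)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)      # in [-1, 1]; higher is better, and a
                                       # negative value means points sit closer
                                       # to another cluster than to their own
dbi = davies_bouldin_score(X, labels)  # in [0, inf); lower is better
print(f"Silhouette={sil:.3f}  Davies-Bouldin={dbi:.3f}")
```

Because the two metrics are read in opposite directions, a claim that one model yields the "best" clustering needs to state which metric it rests on and why.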
Lightly AI-edited |
Image Embeddings from Social Media: Computer Vision and Human in the Loop Applications for Social Movement Messaging
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper investigates the use of computer vision-based image embeddings and clustering for analyzing social movement messaging within a large set (16,567 posts) of Instagram images related to the anti-feminicide movement in Mexico. The study extracts feature embeddings using ResNet50, CLIP, and BLIP-2, applies HDBSCAN for clustering, and evaluates clusters with several quantitative metrics alongside human-in-the-loop inductive content analysis. The results compare representational properties across models and discuss overlap and nuance in image messaging structure, highlighting the strengths and limitations of current representation models for domain-specific, topic-coherent social images.
1. The paper addresses a relevant, underexplored problem at the intersection of machine learning, social movements, and computational social science, providing quantitative insight into the structure of visual messaging in a humanitarian context.
2. It adopts a comparative framework, using both popular (ResNet50, CLIP) and more advanced (BLIP-2) image embedding models, allowing a nuanced analysis of their clustering behavior on real-world activist imagery.
3. Employing multiple established clustering evaluation metrics (Silhouette Score, Calinski-Harabasz Index, Davies-Bouldin Index), in tandem with human-in-the-loop content analysis, demonstrates methodological rigor and brings valuable qualitative depth to quantitative findings.
1. While ResNet50, CLIP, and BLIP-2 are established models, the rationale for selecting them, and in particular for not using more recent or task-specialized models (e.g., multimodal sentiment or abusive-meme detectors), is underspecified. There is also little reflection on how text-in-image content (e.g., hashtags, slogans) is handled beyond embedding, although text is central to social movement images.
2. No non-deep-learning clustering baselines (e.g., classical SIFT/ORB features, PCA+GMM, or manual curation) are reported for context (a minimal PCA+GMM sketch follows this list). Similarly, the study provides limited statistical testing to gauge whether any performance differences (or the cluster count/size differences in the tables and appendices) are meaningful or merely artifacts of parameter tuning.
3. While the analysis (Section 3 and Figures 2–4) is a valuable complement, the use of coding to identify cluster validity serves more as a post hoc rationalization than a systematic validation, potentially overfitting the interpretation to noisy clusters. For instance, the claimed “nuance” within dense groups could mask model or clustering failures. More rigorous validation (possibly co-clustering, cross-validation with held-out hand-labels, or even crowd-sourced validation as secondary annotation) is missing.
4. Given the consistently negative Silhouette Scores (Table 1), what measures (quantitative or qualitative) can you provide to justify that your clusters are not artifacts of parameter tuning, but capture semantically meaningful differences? Would alternative clustering/objective functions mitigate the issue of dense overlap?
5. Can you provide any ablations or error analysis comparing embeddings and clusters for images dominated by text versus those that are mostly visual? How do ResNet50, CLIP, and BLIP-2 differ in treating such cases?
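As a concrete version of the missing baseline in point 2, here is a minimal PCA+GMM sketch using only scikit-learn; the component counts are illustrative and untuned, and `features` is assumed to be a precomputed (n_images, d) embedding matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

def pca_gmm_baseline(features: np.ndarray, n_components: int = 50,
                     n_clusters: int = 20):
    """Reduce features with PCA, cluster with a Gaussian mixture,
    and report the silhouette score in the reduced space."""
    reduced = PCA(n_components=n_components, random_state=0).fit_transform(features)
    labels = GaussianMixture(n_components=n_clusters,
                             random_state=0).fit_predict(reduced)
    return labels, silhouette_score(reduced, labels)

# labels, sil = pca_gmm_baseline(features)
```

Running the same routine over ResNet50, CLIP, and BLIP-2 features (and over classical descriptors) would provide the context the current tables lack.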
As shown in the Weaknesses.
Heavily AI-edited |