A quantitative analysis of semantic information in deep representations of text and images
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper aims to measure where and how semantic information emerges within large language and vision models. The authors use Information Imbalance (II), an asymmetric proxy for mutual information that quantifies how well one representation (e.g., a sentence versus its translation) predicts another. They identify broad "semantic layers" where cross-lingual and cross-modal alignment peaks in DeepSeek-V3 and LLaMA, and find that in these layers semantic information is distributed across many tokens, correlated over long ranges, and exhibits causal asymmetry. A similar analysis of vision transformers shows analogous semantic regions and asymmetries between text and image representations. Overall, this study provides qualitative evidence that large models converge toward shared internal semantic spaces across both languages and modalities.
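To make my reading of the metric concrete: as I understand it, II follows the standard construction from the distance-ranking literature, roughly the k = 1 sketch below (the function name and toy data are mine, and the paper's exact distance and neighborhood choices may differ):

```python
import numpy as np

def information_imbalance(X_a, X_b):
    """Delta(A -> B): how well nearest neighbours in space A predict
    neighbourhoods in space B (~0 = fully predictive, ~1 = uninformative).
    X_a, X_b: (N, d_a) and (N, d_b) arrays of paired representations."""
    n = X_a.shape[0]
    # pairwise Euclidean distance matrices in each space
    d_a = np.linalg.norm(X_a[:, None, :] - X_a[None, :, :], axis=-1)
    d_b = np.linalg.norm(X_b[:, None, :] - X_b[None, :, :], axis=-1)
    np.fill_diagonal(d_a, np.inf)  # exclude self-matches
    np.fill_diagonal(d_b, np.inf)
    nn_a = d_a.argmin(axis=1)      # index of each point's nearest neighbour in A
    # rank (1 = closest) of every point under the distances of space B
    ranks_b = d_b.argsort(axis=1).argsort(axis=1) + 1
    # average, over points, of the B-rank of the A-nearest neighbour
    return 2.0 / n * ranks_b[np.arange(n), nn_a].mean()

# toy check: space B keeps only one of A's two coordinates, so A should
# predict B better than B predicts A (smaller value in the first direction)
rng = np.random.default_rng(0)
a = rng.normal(size=(500, 2))
b = a[:, :1]
print(information_imbalance(a, b), information_imbalance(b, a))
```

A value near 0 in one direction but not the other is what the paper reads as an information asymmetry between the two spaces.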
The layer-wise analysis identifies where semantics arise in the network, revealing that meaning is distributed across many tokens and layers.
The paper tests a variety of models, spanning multilingual LLMs, two vision architectures, and multimodal text-image pairs, with extensive validation and controls such as shuffled data.
The work also shows that larger models exhibit broader and stronger semantic alignment, empirically supporting the Platonic Representation Hypothesis.
This body of work is interesting and supports ideas like the Platonic Representation Hypothesis (PRH). However, the methodological novelty is limited: the contribution is mainly analytical, and the II metric is borrowed rather than newly developed, so the work's strength lies in empirical breadth rather than algorithmic innovation.
Further, the II values are not intuitive to interpret, and some readers may struggle to relate them to concrete semantic similarity or task performance (e.g., does II = 0.2 mean roughly 20% shared features?).
Another note is that observations such as causal asymmetry and token correlation are shown mainly for autoregressive models and translation data; it is unclear whether they would hold for encoder-only or multimodal generative systems. This restricted generality is one of the weaker parts of the work.
While a variety of models are tested, most comparisons involve only two model scales (e.g., 8B vs. 671B), which limits conclusions about scaling trends or architecture dependence.
One last minor note concerns presentation: the paper is occasionally heavy on technical detail without enough intuitive interpretation of the figures or their implications.
Could the authors elaborate on why II was chosen over more standard representational similarity measures such as CKA or CCA-based methods (e.g., SVCCA)? II provides an asymmetric view, which is helpful, but it would be useful to know whether the location of the "semantic layer" stays the same under a symmetric similarity measure, i.e., whether the layers with minimal II also correspond to peaks in a symmetric similarity index. Clarifying this would help establish that these layers are truly special rather than an artifact of the particular metric, and it would also make the absolute II values more interpretable. A minimal sketch of the kind of symmetric baseline I have in mind is included below.
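For concreteness, the symmetric baseline I am thinking of is linear CKA; this is a minimal sketch, and the layer-indexed representation variables in the usage comment are hypothetical placeholders rather than the paper's code:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between paired representation matrices X (N, d1) and Y (N, d2);
    symmetric in its arguments, 1 = identical similarity structure."""
    X = X - X.mean(axis=0)  # centre features
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro"))

# hypothetical usage: reps_en[l] and reps_it[l] would be (N, d) activations for
# paired English/Italian sentences at layer l; one could then check whether the
# layers with minimal II also maximise the symmetric CKA score
# cka_per_layer = [linear_cka(reps_en[l], reps_it[l]) for l in range(num_layers)]
```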
A follow-up question: how precisely is the semantic region defined? Is it by a threshold or by visual inspection of II minima? Could there be multiple disjoint semantic pockets?
It is not clear whether these effects generalize to encoder-only or non-autoregressive models such as BERT or XLM. Could the authors verify this?
Could the discovered semantic alignment be exploited in practice, for example by stitching models or improving multimodal transfer?
Fully human-written
A quantitative analysis of semantic information in deep representations of text and images
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper applies the Information Imbalance metric to study where semantics emerge in large text and vision models. While the work is technically sound and its claims are supported, it offers limited novelty. The contribution is clear, but it would benefit from stronger contextualization and benchmarking against other methods.
1 - Using information imbalance to capture asymmetric predictivity between representation spaces is a good fit for the paper’s goals and is well justified. The paper includes sanity checks and reference experiments to interpret II values.
2 - A solid and reproducible experimental setup is employed, with the II metric applied systematically to multiple large models in various configurations.
3 - The identification of token-distributed semantic encodings, long-range correlations in interior layers, and cross-modal asymmetries is interesting and relevant.
1 - The paper lacks connection to prior work; while it positions its goal against existing work, it does not compare to any previously introduced methods, correlation measures, probing tasks, or feature insights. This omission weakens the claim that this is a novel contribution.
2 - The main observations (semantic information concentrated in middle layers, language-agnostic representations emerging with scale) echo prior findings. The paper does not convincingly show what new understanding II provides beyond known cross-lingual or multimodal analyses.
3 - The authors equate low II with “semantic information” without validating it against a downstream measure of semantics (e.g., translation accuracy, retrieval alignment, or semantic probing). This conceptual step needs empirical justification.
4 - The insights, while valuable, do not sufficiently support the claim that models “converge towards shared representations of the world”, since correlation and causation are different things.
5 - The figures show inconsistencies; no error bars are visible, despite being described in the legend.
1 - Can you show that II correlates with established semantic metrics such as cross-lingual retrieval accuracy or probe performance?
2 - Why not compare directly with previous work on semantic measures across languages to ground the interpretations?
3 - Can you clarify in what sense II reveals new structures that existing analyses could not?
4 - Have you checked whether tokenisation differences across languages/models materially affect the II minima?
Moderately AI-edited
A quantitative analysis of semantic information in deep representations of text and images
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper presents a quantitative analysis of semantic information in deep representations of text and images using the Information Imbalance metric. The authors examine how Large Language Models (LLMs) and vision transformers encode semantic content across different modalities, languages, and architectural layers. Key findings include the identification of "semantic layers" where language-transferable information is concentrated, the observation that semantic information is distributed across multiple tokens with long-range correlations, and the discovery of significant model-dependent asymmetries in information content between image and text representations.
**Overall assessment**
This paper makes valuable contributions to understanding semantic representations in LLMs by applying the Information Imbalance metric. However, the work would benefit significantly from addressing the presentation issues and expanding the experimental scope to support the broader claims.
* The paper identifies inner "semantic" layers containing the most language-transferable information, significantly extending existing research in the field. The authors examine these inner semantics across tokens and input modalities, revealing significant model-dependent asymmetries between representations.
* The paper presents interesting and suggestive findings across two models as well as across multiple language pairs. These findings are promising and lay the groundwork for more systematic exploration across languages, model sizes, and model families.
* The use of the Information Imbalance metric provides an asymmetric measure of relative information content, which is well-suited for comparing representations across different architectures and modalities.
* The article lacks polish in several areas, which impacts readability. Issues include misformatted citations, mentions of error bars that are not visualized, and inconsistent use of abbreviations across figures.
* Multiple strong claims rest on experiments with only two models of different sizes and architectures. For example, the section titled "CORRELATIONS BETWEEN TOKENS AS A HALLMARK OF QUALITY IN SEMANTIC REPRESENTATIONS" (Anonymous, p. 6) and the claim "We further ascertained that, on these 'semantic' layers, long token spans meaningfully contribute to the representation, and long-distance correlations in encoded information cue high-quality representations." (Anonymous, p. 9). These would benefit from more systematic exploration across model sizes within the same family.
* The exploration across languages relies primarily on a limited set of language pairs. The heterogeneity claims (Section 3.1.1) would be strengthened by more systematic sampling across languages with varying levels of representation in training data.
* The paper would benefit from a more systematic exploration of model size as an independent variable. Llama3.1-8B and DeepSeek-V3 differ in multiple respects beyond size, making it difficult to isolate the effect of scale on semantic representation quality.
**Suggestions**
* Please format all citations correctly according to the conference style guide.
* Consider writing out "Information Imbalance" in full rather than using the abbreviation "II," especially given its inconsistent use across figures.
* Figure 2: Error bars are mentioned in the caption but are not visible in the figure. Please either make them visible or revise the caption.
* I would write "opus_books" as "opus books" or "OPUS Books" rather than with an underscore, since you already provide the citation and Appendix D for the exact reference.
* The statement "proving that semantic information is not concentrated in the last tokens, but spread over many of them" (Anonymous, p. 5) makes too strong a claim. Consider revising to "demonstrating that" or "showing that" instead of "proving that."
* The conjecture "We conjecture that this difference is due to the greater online presence of the Spanish language relative to Italian Lan (2025), and the consequent difference in the amount of training data." (Anonymous, p. 5) could be examined more systematically across languages based on their presence in Common Crawl or similar corpora. This would strengthen the paper's empirical foundation.
* Consider adding experiments that:
- Compare multiple models of different sizes within the same family (e.g., Llama3.1 at 8B, 70B, and 405B)
- Sample more systematically across languages with known differences in training data representation
- Test whether the identified patterns hold across additional model families
Moderately AI-edited
A quantitative analysis of semantic information in deep representations of text and images
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper applies Information Imbalance (II), a directional and asymmetric metric, to assess the semantic alignment between representations across languages, modalities, layers, and tokens. The authors utilize II to identify “semantic layers” in both language and vision models, revealing how semantic information is distributed across layers and tokens. The study is empirically grounded on translation pairs from OPUS Books and image-caption pairs from Flickr30k, comparing representations from DeepSeek-V3 (a large mixture-of-experts model) with those from LLaMA-3.1-8B, as well as from DINOv2 and image-GPT.
This paper presents a clearly structured and methodologically consistent investigation of semantic structure in large models. The use of Information Imbalance as a unifying probe is appealing and the findings are generally well-motivated. However, the experimental scope is currently too narrow to support some of the broader claims, and the dependence on a single metric and representation choice reduces the robustness of the conclusions.
1. The topic is timely and of broad relevance to the community, particularly as interpretability and quantitative understanding of large models become increasingly central. In particular, the paper is motivated by a fundamental and well-posed question: where, how, and to what extent semantic information is encoded within and across model modalities.
2. The analysis confirms and consolidates several important intuitions about semantic structure in large models, including the presence of mid-to-late semantic layers, asymmetries in token dependencies, and layerwise cross-modal alignment.
1. The paper’s central question is ambitious, but the empirical evidence is too limited to support it.
The work aims to explain where and how semantic information is encoded across models and modalities, yet the experiments cover only two language models (DeepSeek-V3 (MoE, 671B) and LLaMA-3.1-8B (dense, 8B)) and a single multimodal dataset (Flickr30k). Without intermediate scales, architecture controls, or broader datasets, the conclusions about model size, semantic bands, and cross-modal generality remain under-supported and potentially confounded.
2 - The definition of “semantic information” is overly tied to a single metric (II) and representation choice. The operationalization of II relies on binarized activations and small-k neighbor overlap, which may neglect richer structural information such as real-valued distances or rank-order consistency. Although the supplemental material includes one real-valued comparison, the overall analysis lacks a rigorous metric ablation or validation against rank-sensitive alternatives such as NDCG or Kendall's tau (a sketch of one such check is given after this list). The choice may be acceptable, but it needs side-by-side validation to establish the method's limits.
3. Several findings largely reiterate existing observations, and the related work review is insufficiently comprehensive.
Prior research has already shown that mid-to-late layers encode more transferable and semantically structured features, that larger models exhibit stronger long-range token dependencies, and that cross-modal alignment often emerges at specific intermediate layers. While applying the II metric as a probe for information content is novel and conceptually interesting, the overall advancement over these established insights remains limited. Moreover, the discussion of related work does not sufficiently situate the paper within this literature, making it difficult to assess the true scope of novelty. A more thorough review and positioning of prior probing and representation studies would help clarify the paper’s contribution.
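To make the rank-sensitivity concern in weakness 2 concrete, one possible check, sketched here under my own naming and not as the paper's implementation, is to average Kendall's tau between each point's full distance ordering in the two spaces and track it across layers alongside the II minima:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import kendalltau

def mean_rank_agreement(X_a, X_b):
    """Average Kendall's tau between each point's distance ordering in
    space A and in space B (1 = identical orderings, ~0 = unrelated).
    X_a, X_b: (N, d_a) and (N, d_b) arrays of paired representations."""
    d_a = cdist(X_a, X_a)
    d_b = cdist(X_b, X_b)
    n = len(X_a)
    taus = []
    for i in range(n):
        mask = np.arange(n) != i              # drop the zero self-distance
        tau, _ = kendalltau(d_a[i, mask], d_b[i, mask])
        taus.append(tau)
    return float(np.mean(taus))
```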
1. The comparison between DeepSeek-V3 and LLaMA-3.1-8B conflates scale with architecture. Have the authors considered evaluating a family of dense models (e.g., LLaMA 7B, 34B, 70B or Qwen 7B to 72B) to assess whether the widening of semantic bands correlates with model size independent of MoE structure?
2. All cross-modal results are derived from Flickr30k. Can the authors evaluate whether the trends in II persist on more diverse and challenging datasets, such as COCO or web-scale datasets like CC3M or CC12M?
3. Can II be linked to downstream performance metrics? For instance, are lower II values predictive of improved retrieval performance (e.g., Recall@k, NDCG) in multilingual or image-text tasks? Can II be used to guide pruning or layer selection for distillation?
4 - To what extent do the observed directional asymmetries depend on decoding directionality versus architectural factors? Have the authors tried bi-directional or masked LMs to test whether left-to-right asymmetry persists?
Fully AI-generated |