|
To the Best of Trust: Full-Stage Trusted Multi-modal Clustering |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes FSTMC, a multi-modal clustering method aiming to model and utilize three types of uncertainty (aleatoric $\sigma^2$, epistemic $\tau$, and predictive $u$) in a "full-stage" manner to enhance trustworthiness and robustness. While the paper addresses an important problem and reports strong empirical results on some metrics, I have reservations regarding its technical soundness, methodological novelty, and experimental validity. The core methodology suffers from a circular dependency, key technical descriptions (especially regarding the VAE) are misleading, and the experimental results (specifically the ablation study) contradict the paper's central thesis. My recommendation is Reject.
The paper is relatively clearly written.
The paper's contribution seems derivative, as it primarily consists of an amalgamation of established methods rather than offering significant innovation.
The core methodology has a severe technical flaw and circular dependency. The authors' proposed "virtuous cycle" is logically circular. Point 1: The Aleatoric Uncertainty Loss ($L_{AU}$) relies on pseudo-labels that are filtered by the predictive uncertainty $u$. Point 2: The predictive uncertainty $u$ is generated via DS evidence fusion. Point 3: This DS fusion process is modulated by a "reliability gate" $r^m$. Point 4: This gate $r^m$ is defined as $r^{m}=\exp(-(\sigma^{m})^{2}-\tau^{m})$, which directly depends on the aleatoric uncertainty $\sigma^2$. In short: the learning of $\sigma^2$ (via $L_{AU}$) depends on $u$, but the calculation of $u$ depends on $\sigma^2$. This is an invalid circular argument. The authors fail to explain how this loop is computationally unrolled, making the core mechanism technically unsound.
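One way such a loop could be made explicit is to compute $u$ from the previous iteration's $\sigma^2$ and $\tau$, treated as constants, before updating $\sigma^2$ via $L_{AU}$. The sketch below is illustrative only and is not taken from the paper: `predictive_uncertainty` is a hypothetical stand-in for the DS fusion (gate-weighted evidence summed over modalities, then $u = K/S$), and the cutoff `u_max` is an assumed name for whatever filtering rule the authors use. Only the gate formula $r^{m}=\exp(-(\sigma^{m})^{2}-\tau^{m})$ comes from the submission.

```python
import numpy as np

def reliability_gate(sigma2, tau):
    """r^m = exp(-(sigma^m)^2 - tau^m), the gate quoted from the paper."""
    return np.exp(-sigma2 - tau)

def predictive_uncertainty(evidence, gate):
    """Hypothetical stand-in for the DS fusion step (an assumption).

    evidence: (M, N, K) non-negative evidence per modality, sample, cluster.
    gate:     (M, N) reliability per modality and sample.
    Returns u = K / S per sample, the usual Dirichlet-style uncertainty.
    """
    fused = (gate[..., None] * evidence).sum(axis=0)  # (N, K)
    K = fused.shape[-1]
    return K / (fused.sum(axis=-1) + K)               # (N,)

def unrolled_step(evidence, sigma2_prev, tau_prev, u_max=0.5):
    """One unrolled iteration of the loop.

    u is computed from the PREVIOUS sigma^2 and tau (held fixed), and the
    resulting mask selects which pseudo-labels would enter L_AU next.
    """
    gate = reliability_gate(sigma2_prev, tau_prev)
    u = predictive_uncertainty(evidence, gate)
    mask = u < u_max  # assumed filtering rule: keep low-uncertainty samples
    return u, mask
```

Under this reading the dependency is sequential rather than simultaneous, but the paper would still need to state which quantities are held fixed at each step and whether gradients flow through the gate.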
The paper's description of its VAE-based method is misleading. The paper claims to "adopt a VAE encoder" for $L_{AU}$. However, the defined loss function $\mathcal{L}_{a}=\frac{1}{(\sigma^{m})^{2}}\mathrm{CE}(\dots) + \log(\sigma^{m})^{2}$ is not a VAE loss. A VAE objective must include a reconstruction loss (from a decoder) and a KL divergence term. The paper mentions no decoder, and its loss function is completely different, resembling an aleatoric uncertainty loss for classification (e.g., Kendall & Gal, 2017) rather than a VAE.
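To make the contrast concrete, the sketch below places the two objectives side by side. `attenuated_ce` follows the form of $\mathcal{L}_a$ quoted above, taking the cross-entropy value as an input since its arguments are elided in the quote. `vae_negative_elbo` is the textbook VAE objective with a standard normal prior; the squared-error reconstruction term is an assumption (a Gaussian decoder, up to constants) and is not taken from the paper.

```python
import numpy as np

def attenuated_ce(ce, sigma2):
    """Loss of the quoted form: CE / sigma^2 + log(sigma^2)."""
    return ce / sigma2 + np.log(sigma2)

def vae_negative_elbo(x, x_recon, mu, logvar):
    """Textbook VAE objective: reconstruction error + KL(q(z|x) || N(0, I)).

    x, x_recon: (N, D) inputs and decoder outputs (a decoder is required).
    mu, logvar: (N, Z) encoder mean and log-variance.
    """
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=-1)
    return recon + kl
```

The first function has no decoder output and no KL term. For a fixed cross-entropy value it is minimized at $\sigma^2 = \mathrm{CE}$, which is the loss-attenuation behavior of heteroscedastic classification losses rather than a variational bound.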
The paper makes misleading claims about being "hyperparameter-free". This is entirely false. The pseudo-label threshold $\epsilon=0.6$ is a critical hyperparameter. The MC Dropout rate $\zeta$ is a hyperparameter; the authors even dedicate Section 3.3 and Figure 3 to analyzing it. The number of MC passes $T$ is another hyperparameter. The 1:1 weighting of $(\sigma^{m})^{2}$ and $\tau^{m}$ in $r^m$ is an implicit, fixed hyperparameter choice. These claims are highly misleading.
The use of $L_{EU}$ is questionable. The paper uses $L_{EU}$ to "penalize high-uncertainty predictions" and suppress "abnormally high uncertainty". This is counter-intuitive to Bayesian principles, which aim to quantify uncertainty, not penalize it. Forcing the model to be confident on difficult samples may reduce trustworthiness, not improve it.
The visualization is insufficient. The t-SNE plot in Figure 5 is standard but uninformative for a paper on uncertainty. The authors should have provided t-SNE plots colored by the learned uncertainty values ($\sigma^2$, $\tau$, or $u$) to visually validate that the model correctly identifies ambiguous or noisy samples.
I don't have other questions. |
Lightly AI-edited |
|
To the Best of Trust: Full-Stage Trusted Multi-modal Clustering |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses a critical yet often overlooked issue in the field of multimodal clustering (MMC): end-to-end trustworthiness. The authors propose a unified framework that models three core types of uncertainty, namely aleatoric, epistemic, and predictive, and integrates them coherently throughout both the representation learning and clustering optimization stages. This is a novel and conceptually meaningful approach that carries substantial theoretical and practical significance. From a technical standpoint, the paper skillfully combines Variational Autoencoders (VAE), MC Dropout, and Dempster–Shafer Theory, providing a sound modeling basis for the quantification and utilization of each type of uncertainty.
- The overall structure and organization of the paper are well-designed and logically presented.
- The motivation is clearly stated and easy to follow, effectively highlighting the significance of the work.
- The figures are visually appealing, with color schemes that enhance readability and overall presentation quality.
- The experimental section is thorough and well-executed, offering convincing empirical support for the proposed framework.
- The proposed model integrates multiple components (e.g., VAE and MC Dropout, which requires T forward passes), implying that its training cost may be higher than that of some baseline methods.
- In the description of $L_{EU}$, the paper mentions a “one-sided Huber robust” penalty, but its explicit mathematical formulation is not provided in the main text. It would be beneficial to include a clear and intuitive explanation or a simplified expression of this penalty term in the main body of the paper to help readers better understand its role and behavior.
- It would be highly beneficial to include a visualization of the different types of uncertainties—namely aleatoric, epistemic, and predictive—to illustrate their respective values across representative samples. Such a figure would help readers intuitively understand how the model distinguishes among these uncertainties and why it assigns specific uncertainty levels to different modalities or data instances.
- In Section 2.5, the paper states that “…we first map its latent representation … to an evidence $e^m$….” This evidence serves as the input to the entire evidence fusion module. However, the specific form of this mapping function is not clearly described. Is it implemented as a simple linear layer, or as an MLP with a nonlinear activation function (e.g., Softplus, to ensure the evidence values remain positive)?
- The paper empirically demonstrates that the joint modeling of all three types of uncertainty yields the best performance. However, it lacks a deeper theoretical or intuitive analysis explaining why these three components are indispensable. Why are all three uncertainties necessary?
- What are the potential risks if only aleatoric uncertainty is considered?
- Conversely, what issues would arise if only epistemic uncertainty were taken into account? |
Lightly AI-edited |
|
To the Best of Trust: Full-Stage Trusted Multi-modal Clustering |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes a novel full-stage trusted multi-modal clustering method, namely FSTMC, a framework that integrates three types of uncertainty (aleatoric, epistemic, and predictive) throughout the entire pipeline of representation learning, evidence fusion, and pseudo-label optimization, thus establishing an end-to-end trustworthy clustering paradigm. Overall, the proposed FSTMC method offers some new insights and is theoretically well-founded, and extensive experimental results demonstrate significant performance improvements across multiple benchmark datasets.
1. This paper proposes the novel “Full-Stage Trusted” paradigm, extending uncertainty learning from the representation learning stage to the clustering and pseudo-label optimization stages, thereby achieving an end-to-end trustworthy constraint.
2. The study introduces model uncertainty into the variational autoencoder structure, enabling trustworthy modeling at the latent representation level and enhancing the reliability of feature learning.
3. The model is insensitive to the dropout rate setting, maintaining stable convergence and high clustering accuracy across different configurations.
4. Extensive experiments on multiple large-scale multi-modal datasets demonstrate that the proposed method significantly outperforms existing approaches in terms of clustering accuracy, validating its superior performance and broad applicability.
5. Two types of ablation studies are conducted to comprehensively demonstrate the advantages of computing and utilizing uncertainty within the framework.
1. The selection basis for the pseudo-label confidence threshold $\varepsilon$ is not explained.
2. The paper does not explore the model’s performance under modality-missing scenarios, which limits understanding of its behavior in incomplete multi-modal environments.
3. Some symbols are not consistently defined upon their first appearance, affecting the readability. Also, the text in some figure annotations (e.g., Figure 3) is too small, making it difficult to read.
1. How is the pseudo-label confidence threshold $\varepsilon$ selected, and can a detailed description or experiment be added in the paper to demonstrate the rationality of its choice?
2. Is uncertainty estimation performed within each modality, and what are the advantages of doing so?
3. Does any correlation exist among the three types of uncertainty, and how can it be interpreted?
4. Can this framework be extended to other tasks, such as classification, and if so, how should it be implemented? |
Lightly AI-edited |
|
To the Best of Trust: Full-Stage Trusted Multi-modal Clustering |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a method for clustering data that comes in different modalities. The method is intended to account for aleatoric and epistemic uncertainty in each modality, and to combine modalities in a principled way while accounting for such uncertainty. The key steps are roughly as follows:
1. Each modality is first encoded using a VAE, i.e. each sample of each modality is mapped to a Gaussian of a certain mean (the feature representation) and variance (interpreted as the "aleatoric uncertainty"). A loss $L_{AU}$ is used that is essentially a kind of pseudo-clustering objective that tries to regularize appropriately by the AU.
2. The variance in the feature representation across multiple MC-dropout samples is interpreted as the "epistemic uncertainty". A loss $L_{EU}$ (based on a Huber loss) is used to penalize feature representations that are excessively unstable.
3. After such encoding, a Deep Divergence Clustering (DDC) approach is used to "soft cluster" the embeddings separately per modality. This results in $M$ soft clusterings, where $M$ is the number of modalities.
4. Each soft clustering generated by DDC (one per modality) is mapped to an "evidence" vector. These are then combined across modalities into a Dirichlet distribution using Dempster-Shafer theory. This essentially yields the final multimodal clustering.
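Step 2 of this summary can be sketched as follows, under stated assumptions. The encoder here is a toy linear-ReLU map with input dropout, not the paper's architecture, and the one-sided Huber form is a guess at what the authors intend, since the submission does not give its formula: the penalty is zero below a threshold, quadratic for small excess, and linear beyond `delta`. The names `threshold` and `delta` are hypothetical.

```python
import numpy as np

def mc_dropout_variance(x, W, rate=0.2, T=10, seed=0):
    """Feature variance of a toy encoder over T stochastic dropout passes.

    x: (N, D_in) inputs, W: (D_in, D_out) weights of an assumed toy encoder.
    Returns one scalar per sample: variance over passes, averaged over features.
    """
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(T):
        keep = rng.random(x.shape) >= rate               # Bernoulli keep mask
        h = np.maximum((x * keep / (1.0 - rate)) @ W, 0.0)
        feats.append(h)
    feats = np.stack(feats)                              # (T, N, D_out)
    return feats.var(axis=0).mean(axis=-1)               # (N,)

def one_sided_huber(tau, threshold, delta=1.0):
    """Assumed form: Huber penalty on the excess of tau above a threshold."""
    excess = np.maximum(tau - threshold, 0.0)
    quad = 0.5 * excess ** 2
    lin = delta * (excess - 0.5 * delta)
    return np.where(excess <= delta, quad, lin)
```

With `rate=0.0` every pass is identical and the variance is exactly zero, which illustrates that this quantity measures stability under the dropout perturbation rather than a posterior over model parameters.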
Experiments on various multimodal datasets indicate that this method performs well as compared to prior methods when measured in terms of clustering accuracy and normalized mutual information.
Using Dempster-Shafer theory to combine information across modalities (by viewing each modality's clustering as providing one form of "evidence") seems like an interesting and novel idea. The empirical performance of the method also seems quite favorable at least on the datasets tested.
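For readers unfamiliar with this kind of fusion, a minimal sketch for two modalities is given below. It assumes the subjective-logic formulation commonly used in evidential multi-view learning (evidence mapped to belief masses and an uncertainty mass, then combined with a reduced Dempster rule). Whether the submission uses exactly this rule, and how the reliability gate enters it, is not confirmed here; the function names are illustrative.

```python
import numpy as np

def opinion(evidence):
    """Map non-negative evidence (K,) to belief masses b and uncertainty u.

    Uses Dirichlet parameters alpha = evidence + 1, so S = sum(evidence) + K,
    b_k = e_k / S and u = K / S, which gives sum(b) + u = 1.
    """
    K = evidence.shape[-1]
    S = evidence.sum() + K
    return evidence / S, K / S

def ds_combine(b1, u1, b2, u2):
    """Reduced Dempster rule for two opinions over the same K clusters."""
    # Conflict: mass the two modalities assign to different clusters.
    conflict = b1.sum() * b2.sum() - np.dot(b1, b2)
    scale = 1.0 / (1.0 - conflict)
    b = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    u = scale * (u1 * u2)
    return b, u
```

A short usage example: combining `opinion(np.array([4.0, 0.0, 0.0]))` with `opinion(np.array([2.0, 1.0, 0.0]))` yields a fused opinion whose masses still sum to one and whose uncertainty is lower than that of either input, since both modalities favor the first cluster.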
The paper is unfortunately quite challenging to follow and suffers from a lack of conceptual clarity, with the presentation emphasizing engineering complexity over scientific insight. It combines a variety of techniques (VAEs, pseudolabeling, MC-dropout, Dirichlet evidence modeling, reliability gating, Dempster–Shafer fusion, Deep Divergence Clustering, and many more) without a clear unifying principle or justification. The roles and interactions of these components are insufficiently motivated, and mechanisms such as the reliability gate appear somewhat ad hoc. Moreover, it is not always clear what models are being used in different components and what the learnable parameters of the whole system are.
Overall, while the empirical results are strong, the paper does not manage to communicate the essential scientific insights in an accessible way, and I cannot recommend acceptance in this current state. In particular, the work would benefit from a clearer formulation of the underlying modeling assumptions, a simplified core mechanism, and a more focused and systematic presentation that keeps unnecessary complexity to a minimum. Also, the paper needs to provide a much more detailed comparison with related work.
One of the biggest things missing from the paper is a clear discussion of the modeling assumptions and the meaning of the various kinds of uncertainty. Specifically:
- Aleatoric uncertainty is always a property of the true data distribution, not of the method. What links the estimated aleatoric uncertainty (i.e. the variance of the VAE output) to true aleatoric uncertainty (e.g. noise in the input, etc)? Why would noisy inputs have high variance? This requires a discussion of the modeling assumptions and what the "true" aleatoric uncertainty is under those assumptions.
- Epistemic uncertainty being estimated via MC-Dropout variance is also insufficiently motivated. The original MC-Dropout line of work considered Bayesian models, where there is a full posterior over predictions. As far as I can tell, this work does not use a Bayesian network for the VAE. So what is the form of epistemic uncertainty that is being modeled? Why is MC-Dropout a useful proxy for it? (I suspect a more useful term for it is merely "representation stability", although I am again not sure why we care about stability under random perturbations induced by MC-Dropout.)
- Typically, predictive uncertainty is a combination of aleatoric and epistemic uncertainty. However, in this paper it seems to be an orthogonal kind of uncertainty, and in fact is never formally defined. Can the authors provide a formal definition and discuss what exactly is being captured by it, and how it relates to AU and EU?
Some other questions:
- What exactly is the $L_{E}^{(m)}(x)$ term used in Eq 9? It is never formally defined. How does it relate to $\tau(x)$ defined in Eq 8?
- In Eq 10 and 11 and other places, what is $C$? Is it the same as $K$, the number of clusters? Is the number of clusters known a priori?
- What exactly are the different models and the associated learnable parameters in the whole system? The loss function is simply written down as an algebraic expression without making it clear what the parameters are.
- Minor: Why does the predictive uncertainty component of the loss (Eq 16) have subscript RU instead of PU? |
Fully human-written |