scCMIA: Self-supervised Dual Model for Mitigating Information Loss in Single-cell Cross-Modal Alignment
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes a deep learning framework that is designed to operate on multiple single-cell data modalities. It solves the problem of alignment of single cells across modalities as well as the problem of translating between modalities. The method uses an InfoNCE loss for alignment, and uses a discrete codebook to improve interpretability. Extensive empirical experiments suggest that the method outperforms various state-of-the-art competitors.
The proposed model includes several components (the VQ module and the mutual information module) that are well motivated and seem to provide significant improvements relative to the state of the art.
A major problem with this paper is that the exposition is difficult to follow. For example, the second paragraph of the introduction fails to clarify exactly what problem you are working on. Indeed, by describing multimodal protocols that assay multiple aspects of the same single cell, I was misled about what tasks you are interested in solving. What would help is a precise, formal description of the problems you are addressing. More generally, I found the text very difficult to follow. It would be better if you carefully defined terms before using them. Below I outline some of the questions that arose as I worked through the manuscript.
In general, I think a missing piece here is assessing how well these models generalize beyond the specific data set they are trained on. I think that each model is trained and validated on splits of the same data set (though I don't know for sure, because you don't tell us how this is done). So a reasonable question is whether you can apply the trained model to a new, independent dataset, generated from a different type of cell. The multimodal alignment methods mentioned at the start of Section 2 work directly in such a scenario, whereas a trained model like yours inherently has to worry about generalizability. In practice, to be useful your model has to generalize to single-modality data (i.e., I only measured scRNA-seq, and you tell me what the corresponding scATAC-seq would look like). A discussion of this issue, and some experimental characterization of it, would substantially strengthen the paper.
I thought your description of the challenges associated with multi-modal data (lines 43-49) was imprecise and not very informative. For example, what does it mean to say that there are "substantial discrepancies" between scATAC-seq and scRNA-seq? They measure entirely different things. To my mind, the fact that there are differences in feature spaces is not a "challenge" per se; it's just definitional. You wouldn't say that multimodal analysis of text and images is "challenging" because pixels don't look like words, right?
I don't actually believe your claim (line 55) that if you don't embed data into a shared space, then you "cannot fully exploit potentially complementary information across modalities." This is a very bold claim that requires substantial evidence. Indeed, I don't know how you could conclusively prove such a claim.
I am not convinced that *mean* FOSCTTM is the most useful measure. Have you considered computing a p-value for improvement of the FOSCTTM? You get a FOSCTTM score for each cell, so you could do something like a sign test.
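To make this concrete, a paired sign test along the lines I have in mind might look like the sketch below (the function and argument names are placeholders of mine, not anything from the paper; I am assuming per-cell FOSCTTM scores from two methods evaluated on the same cells, with lower being better):

```python
# Hypothetical sketch: paired sign test on per-cell FOSCTTM scores.
import numpy as np
from scipy.stats import binomtest

def foscttm_sign_test(foscttm_method_a, foscttm_method_b):
    """One-sided sign test: does method A achieve lower (better) per-cell FOSCTTM than method B?"""
    diff = np.asarray(foscttm_method_a) - np.asarray(foscttm_method_b)
    wins = int(np.sum(diff < 0))    # cells where method A is strictly better
    losses = int(np.sum(diff > 0))  # cells where method B is strictly better; ties are dropped
    # Under the null hypothesis of no difference, wins ~ Binomial(wins + losses, 0.5).
    return binomtest(wins, wins + losses, p=0.5, alternative="greater").pvalue
```

(One caveat: per-cell scores within a dataset are not strictly independent, so I would read such a p-value as a sanity check rather than a formal guarantee.)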
In the related work section, the fact that alignment methods "suffer from poor alignment robustness when handling noisy [data]" is not a substantive critique, in my opinion. All methods degrade in performance in the presence of noise.
I do not understand the critique (line 104) of methods that do multimodal reconstruction without relying on a shared embedding space. You say that "their utility for tasks requiring direct cross-modal comparison, querying, and label transfer can be limited." Why? It's pretty straightforward to do, e.g., label transfer with an accurate multimodal reconstruction method: just reconstruct from one space to the other and then use nearest neighbors to transfer. There is no reason you have to do nearest neighbors in a latent space. I think this critique is misguided or needs to be explained much more carefully.
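To spell out what I mean, such a pipeline could be as simple as the following sketch (the `translate_atac_to_rna` function is a stand-in for any accurate cross-modal translation model, not something defined in the submission):

```python
# Hypothetical sketch: label transfer via cross-modal translation, with no shared latent space.
from sklearn.neighbors import KNeighborsClassifier

def transfer_labels(rna_reference, reference_labels, atac_query, translate_atac_to_rna):
    # Translate the query cells from scATAC-seq into the scRNA-seq feature space ...
    rna_predicted = translate_atac_to_rna(atac_query)
    # ... then transfer labels with plain nearest neighbors in that observed space.
    knn = KNeighborsClassifier(n_neighbors=15)
    knn.fit(rna_reference, reference_labels)
    return knn.predict(rna_predicted)
```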
I found the text in lines 144-149 difficult to understand. For example, what is the difference between "modality-specific features" and "semantic characteristics"? What do you mean by the "bounds of MI"? Similarly, the sentence at lines 162-164 is not grammatical. I'm also confused about what it means to be "insufficient for effectively decoupling ... in a directed manner" (lines 167-168).
I wish you had introduced your assumption (line 184) earlier, since it seems to be important to understand the basis of much of this work. I guess this is what you were alluding to when you talked about "modality-specific features" versus "semantic characteristics."
In the description of the datasets, you should indicate which previous papers used these datasets for benchmarking, and state which paper you extracted results from (unless you ran all the tools yourself, in which case say so).
I was surprised that all the discussion of bounds on mutual information ultimately seems to boil down to just an InfoNCE alignment loss.
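To spell out why I say this: the InfoNCE objective is itself a standard lower bound on MI (van den Oord et al., 2018), so once the contrastive loss is written down, the "bound" is already there. In my notation (not the paper's), with batch size $N$ and learned critic $f$,

$$
I(X;Y) \;\ge\; \log N - \mathcal{L}_{\mathrm{InfoNCE}},
\qquad
\mathcal{L}_{\mathrm{InfoNCE}} = -\,\mathbb{E}\!\left[\log \frac{\exp\big(f(x_i, y_i)\big)}{\sum_{j=1}^{N} \exp\big(f(x_i, y_j)\big)}\right].
$$

A clearer framing might be to state up front that the alignment term is exactly this bound, rather than presenting the MI derivation as a separate contribution.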
Minor:
line 192: uses -> use
line 270: objection -> objective
You should delete the sentence at line 293 ("Single-cell multi-omics data are often hindered by complex and sophisticated techniques, low throughput, and high noise levels."). Just say what data you used. It doesn't even make sense to say that data is hindered by something.
Incidentally, I think calling cross-modal translation "reconstruction" is misleading, since reconstruction typically refers to starting and ending from the same place; e.g., reconstructing a scRNA-seq profile from a masked or compressed version thereof. I do recognize that other papers in the literature use "reconstruction" to mean "translation."
Did you compute the performance measures in Tables 1-4, or were some of these taken from previous publications? If the latter, did you use the same cross-validation splits?
How was train/test splitting done for each dataset?
Fully human-written

scCMIA: Self-supervised Dual Model for Mitigating Information Loss in Single-cell Cross-Modal Alignment
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes a new method for cross-modality integration and alignment. Methods for integrating scRNA-seq and scATAC-seq data are already well studied, so it is hard to identify the main contributions of this paper to the field.
The framework is clearly presented.
I have several questions and concerns regarding the current model design and model performance. I think these issues preclude the paper from publication at this conference, at least in its current form.
1. What is the unique contribution of this paper? Using VQ-based methods for multi-omic data integration or representation learning on biological data has already been studied in several papers (https://www.nature.com/articles/s41540-020-00158-2, CVQVAE, or scBeacon). The method lacks innovation, and the training design is not very appealing.
2. The motivation is not well established. The central dogma allows only one-directional information flow, so there is no need to model bidirectional information; RNA can never go back to the chromosome, and thus the method lacks biological interpretability.
3. The benchmarking results are also puzzling. Why do some baselines have variance reported while others do not? The authors should unify the presentation and report variance for every model. Moreover, reconstruction is not a useful metric in single-cell multi-omic analysis, since expression profiles always contain noise. The authors should consider one or two additional tasks for the evaluation. I recommend that the authors read https://www.nature.com/articles/s41592-025-02856-3 and include more baseline methods.
4. The comparison should be fair. The authors need to tune hyperparameters for all methods to ensure a fair comparison.
5. I cannot find information about the scale of the data. Are all the test datasets large-scale or small-scale?
6. How about applying the method to integrate proteomic data such as CITE-seq? Since the authors do not model noise, this framework should work well.
Please see the weaknesses.
Fully human-written

scCMIA: Self-supervised Dual Model for Mitigating Information Loss in Single-cell Cross-Modal Alignment
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces scCMIA, a self-supervised framework designed to address the challenges of integrating single-cell multi-omics data, particularly focusing on cross-modal alignment between scRNA-seq and scATAC-seq modalities. The key innovation lies in leveraging mutual information (MI) principles to decouple modality-specific and semantic features within a unified discrete latent space using a VQ-VAE architecture. The proposed method aims to mitigate information loss during integration by combining intra-modal decoupling (via CLUB-based MI minimization) and inter-modal alignment (via contrastive learning with InfoNCE loss).
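For reference, the CLUB objective referred to here is the standard variational upper bound of Cheng et al. (2020); in the usual notation (my notation, not the paper's),

$$
\hat{I}_{\mathrm{vCLUB}}(X;Z) \;=\; \mathbb{E}_{p(x,z)}\big[\log q_\theta(z \mid x)\big] \;-\; \mathbb{E}_{p(x)}\,\mathbb{E}_{p(z)}\big[\log q_\theta(z \mid x)\big],
$$

which upper-bounds $I(X;Z)$ when the variational network $q_\theta$ approximates $p(z \mid x)$ well. This is the quantity minimized for intra-modal decoupling, while InfoNCE provides the lower bound maximized for cross-modal alignment.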
1. The integration of MI bounds for intra-modal decoupling and cross-modal alignment is theoretically grounded.
2. The paper provides a rigorous evaluation across multiple datasets and tasks (alignment, reconstruction, clustering, label transfer).
1. My main concern is the novelty of this work. The proposed framework is a patchwork of existing techniques and offers no new insights or benefits to the community.
2. While four datasets are used, they primarily focus on well-studied protocols (e.g., 10x Multiome). Broader validation on more complex tissues or rare cell types would strengthen generalizability.
3. The paper lacks comparison with cutting-edge approaches like scButterfly or graph-based methods beyond GLUE. Including these would better contextualize scCMIA’s advancements.
Please see the weaknesses.
Heavily AI-edited

scCMIA: Self-supervised Dual Model for Mitigating Information Loss in Single-cell Cross-Modal Alignment
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces a method for multi-modal alignment between scRNA-seq (single-cell RNA sequencing) and scATAC-seq (single-cell Assay for Transposase-Accessible Chromatin using sequencing) data, built on a VQ-VAE (Vector Quantized Variational Autoencoder) architecture.
The mutual-information-based justification for the modeling is well established.
Limited Novelty
- The justification based on Mutual Information has been thoroughly explored in previous research (e.g., the CLUB paper).
- Techniques like VQ-VAE are all existing methods.
- Are there specific challenges unique to single-cell data, and does the paper introduce a corresponding novel technique to address them?
Decoupling Explanation: More explanation is needed regarding decoupling.
- Why is decoupling necessary?
- The paper should also discuss how the decoupled representations could be used independently, if required.
Applicability to Uni-modal Data: The method was only applied to single-cell multi-modal data. Does it have utility for uni-modal data as well? Experimentally showing that the method performs well even on uni-modal data could further justify the use of multi-modality in the model.
See the weaknesses section.
Moderately AI-edited