ICLR 2026 - Reviews

Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (33%) | 4.00 | 4.00 | 2236 |
| Fully human-written | 2 (67%) | 3.00 | 3.50 | 3087 |
| Total | 3 (100%) | 3.33 | 3.67 | 2803 |
Review 1

Title: Advancing Multimodal Fusion on Heterogeneous Data with Physics-inspired Attention
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes the Efficient Hybrid-fusion Physics-inspired Attention Learning Network (EHPAL-Net), a lightweight and scalable multimodal fusion framework designed for heterogeneous biomedical data such as imaging, multi-omics, and EHR. Its Efficient Hybrid Fusion (EHF) layer integrates modalities sequentially through: 1) Efficient Multimodal Residual Convolution (EMRC) for multi-scale spatial representations, 2) Physics-inspired Cross-Modal Fusion Attention (PCMFA), which combines hyperbolic and quantum-inspired attention to model complex cross-modal interactions, and 3) Shared Information Refinement (SIR) for representational diversity. A large number of heterogeneous medical datasets are used to validate the proposed method.

Strengths:
- The problems addressed (the performance, generalization, and efficiency of multimodal fusion learning) are significant.
- The integration of hyperbolic dual-geometry attention has not, to my knowledge, appeared before in multimodal learning, and it looks novel to me.
- The reported reductions in model parameters (98.3%) and FLOPs (97.6%) are impressive.

Weaknesses:
- The presentation could be improved. As a researcher working on multimodal learning, I find the paper difficult to follow; its logic is diffuse and redundant. Key ideas such as "physics-inspired attention", "dual-geometry interaction", and "hierarchical structure preservation" are mentioned repeatedly, but I could not find precise definitions or theoretical justifications for them. The narrative cycles through the same claims without clarifying or justifying the mechanisms that are supposed to enable the desired effects.
- The framework claims to be "physics-inspired", but the connection to physical principles appears largely metaphorical rather than mechanistic. Neither the framework design nor the choice of submodules is evidently grounded in physical modeling.
- The exact meaning of "modality" in this paper is unclear. Line 450, for example, seems to imply that HAM10000, SIPaKMeD, and PathMNIST are treated as three modalities, each with its own data source and label set; the source code provided by the authors seems to confirm this. The usual multimodal fusion setting is that one sample has data in several modalities, e.g., a patient has both a CT scan and a pathology image, and information from these modalities is combined to make a prediction for that patient. The paper should explicitly define "modality" and clearly detail what the modalities are in each dataset.

Questions:
- Please see the weaknesses section above.

EditLens Prediction: Fully human-written
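The review above flags that terms like "hyperbolic dual-geometry attention" are never precisely defined in the paper. For context, a minimal sketch of one common formulation of hyperbolic attention, assuming the Poincaré-ball model with curvature -1; this is a plausible instantiation of the term, not a reconstruction of what the paper's PCMFA actually computes:

```python
import torch

def poincare_distance(u, v, eps=1e-6):
    # Geodesic distance in the Poincare ball (curvature -1);
    # assumes u and v already lie strictly inside the unit ball.
    sq = ((u - v) ** 2).sum(dim=-1)
    nu = torch.clamp(1.0 - (u * u).sum(dim=-1), min=eps)
    nv = torch.clamp(1.0 - (v * v).sum(dim=-1), min=eps)
    return torch.acosh(1.0 + 2.0 * sq / (nu * nv))

def hyperbolic_attention(q, k, v, tau=1.0):
    # Attention logits are negative hyperbolic distances rather than dot
    # products: queries attend most to keys that are geodesically close,
    # which is what "hierarchical structure preservation" usually refers to.
    d = poincare_distance(q.unsqueeze(-2), k.unsqueeze(-3))  # (..., Lq, Lk)
    return torch.softmax(-d / tau, dim=-1) @ v

# Toy usage: small-norm features stay inside the unit ball.
q = torch.randn(2, 5, 16) * 0.1
k = torch.randn(2, 7, 16) * 0.1
v = torch.randn(2, 7, 16)
out = hyperbolic_attention(q, k, v)  # shape (2, 5, 16)
```

The "quantum-inspired" branch that PCMFA reportedly combines with this is not sketched, since the reviews give no definition to reconstruct it from.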
Review 2

Title: Advancing Multimodal Fusion on Heterogeneous Data with Physics-inspired Attention
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes a physics-inspired attention learning network for the multimodal fusion learning (MFL) task. The authors target multimodal tasks in a medical context, addressing the challenge that different modalities do not share semantics. The key design is the Efficient Hybrid Fusion (EHF) layer, in which a physics-inspired attention module captures cross-modal interactions. Unlike existing early/late fusion methods, the authors propose a sequential structure in which the input modalities are integrated one by one. The overall architecture is somewhat complicated, but unique and novel. The authors claim to have addressed three main challenges: performance, efficiency, and generalization. The reviewer does, however, have questions on some aspects.

Strengths:
1. The proposed EHF layer, especially the Physics-inspired Cross-modal Fusion Attention (PCMFA) module, is both novel and effective. By jointly optimizing cross-modal interactions in hyperbolic and quantum spaces, the network captures rich structural relationships across modalities. This provides an effective way of combining the advantages of hyperbolic neural networks and quantum neural networks.
2. The performance of the proposed model is supported by comparisons with state-of-the-art single-modal and multimodal learning methods, and its generalization is demonstrated on datasets from various sources and modalities.
3. The effectiveness of the key components is supported by both theoretical analysis and ablation studies.

Weaknesses:
1. Concerns on performance comparison. Table 2 shows that existing SOTA methods already achieve high performance (ACC over 98% and AUC over 99%) on many datasets. The concern is whether these datasets can strongly support the authors' claim that the "performance" challenge is solved. More results on datasets like "BRCA" and "ICD9" would be more persuasive.
2. Concerns on efficiency. The authors use two metrics, number of parameters (#P) and floating-point operations (#F), to compare model efficiency. However, the efficiency gain of the proposed method is only significant when ShuffleNet is used as the backbone. Since most of the compared methods use ResNet backbones, I am concerned whether this is a fair comparison. Given that the proposed EHPAL-Net mainly achieves its highest performance with a ResNet50 backbone, this issue is even more significant.

Questions:
1. It is somewhat unclear how the experiments are designed and evaluated. The focus of this paper is multimodal fusion, which combines data from different modalities for the final task, yet the results are reported per dataset, which may confuse readers.
2. Regarding efficiency, a fairer comparison should be provided, ideally with additional metrics such as running time and memory consumption.
3. The authors have clearly spent a lot of effort on both model development and the experiments. However, from the perspective of both a reader and a reviewer, too much content is packed in here, especially for a conference paper. For example, since there are three main components and each is constructed from even deeper hierarchies, readers must frequently consult different appendices to understand how each module works and the related ablation studies. The content of this work seems like a better fit for top-tier journals.

EditLens Prediction: Fully human-written
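For context on the #P / #F metrics and the backbone-fairness concern raised above: a minimal sketch of how parameter and FLOP counts are typically compared across backbones, assuming fvcore and torchvision are available; the paper's own measurement protocol is not specified in the review:

```python
import torch
from torchvision.models import resnet50, shufflenet_v2_x1_0
from fvcore.nn import FlopCountAnalysis  # one common FLOP counter; thop/ptflops work similarly

def count_params(model):
    # #P: trainable parameters only.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

x = torch.randn(1, 3, 224, 224)  # a single 224x224 RGB image
for name, model in [("ResNet50", resnet50()), ("ShuffleNetV2", shufflenet_v2_x1_0())]:
    model.eval()
    flops = FlopCountAnalysis(model, x).total()  # #F for one forward pass
    print(f"{name}: {count_params(model) / 1e6:.1f}M params, {flops / 1e9:.2f} GFLOPs")
```

Backbones differ by an order of magnitude on both metrics, which is why the reviewer argues that comparing a ShuffleNet-based model against ResNet-based baselines conflates the fusion method's efficiency with the backbone's.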
Review 3

Title: Advancing Multimodal Fusion on Heterogeneous Data with Physics-inspired Attention
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes EHPAL-Net, a multimodal fusion framework designed to improve the efficiency and generalizability of multimodal learning, particularly for medical applications. The method introduces EHF layers that first capture modality-specific multi-scale features and then apply physics-informed cross-modal attention to model fine-grained, structure-preserving interactions across modalities. Experiments on 15 heterogeneous medical datasets demonstrate improved performance and reduced computational cost compared to existing fusion methods.

Strengths:
- The experimental evaluation is extensive, covering 15 heterogeneous medical datasets, which provides strong empirical evidence of robustness.
- The proposed method is conceptually novel, combining hybrid fusion and physics-informed attention in a lightweight, scalable architecture.

Weaknesses:
- The claimed challenge that existing methods are "specialized to specific modalities" lacks solid justification. Many recent multimodal frameworks are modality-adaptable and can generalize by substituting modality-specific encoders.
- The scope and positioning of the paper are ambiguous: although the title and introduction suggest a general-domain multimodal framework, all experiments are performed solely on medical datasets. The three challenges identified (cross-modal interaction, modality specialization, and computational cost) are generic and not uniquely tied to healthcare AI, and neither is the method design. The paper would benefit from either (i) reframing the motivation specifically around clinical multimodal learning, or (ii) extending the experiments to both clinical and general-domain multimodal datasets to justify the broader claims.
- The methodology lacks clear conceptual grounding. The design choices are not well connected to the stated challenges and lack strong problem-driven justification.
- The related work section omits more recent and relevant approaches in clinical multimodal fusion, particularly those that explicitly model the interplay between modality-specific and modality-shared representations, which could provide valuable conceptual and empirical comparisons.

Minor comment:
- The purpose of the citation on line 456, page 9, is unclear.

EditLens Prediction: Lightly AI-edited
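To make this review's first weakness concrete: a minimal late-fusion baseline (hypothetical, not taken from the paper or its related work) in which modality-generality comes entirely from substituting per-modality encoders, and in which each sample carries every modality, matching the usual fusion setting Review 1 also raises:

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Generic multimodal baseline: any per-modality encoder can be swapped in."""
    def __init__(self, encoders: dict, feat_dim: int, n_classes: int):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)  # e.g. {"image": cnn, "ehr": mlp}
        self.head = nn.Linear(feat_dim * len(encoders), n_classes)

    def forward(self, inputs: dict) -> torch.Tensor:
        # Encode each modality independently, then fuse by concatenation;
        # sorted keys make the concatenation order deterministic.
        feats = [self.encoders[m](inputs[m]) for m in sorted(inputs)]
        return self.head(torch.cat(feats, dim=-1))

# Usage sketch: each *sample* carries all modalities (the usual fusion setting).
model = LateFusionClassifier(
    encoders={"image": nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64)),
              "ehr": nn.Linear(20, 64)},
    feat_dim=64, n_classes=2,
)
logits = model({"image": torch.randn(4, 3, 32, 32), "ehr": torch.randn(4, 20)})
```

Swapping the image encoder for a ResNet, or adding an omics branch, requires no change to the fusion logic, which is the sense in which the reviewer calls such frameworks "modality-adaptable".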