ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction    Count      Avg Rating   Avg Confidence   Avg Length (chars)
Fully AI-generated     0 (0%)     N/A          N/A              N/A
Heavily AI-edited      0 (0%)     N/A          N/A              N/A
Moderately AI-edited   0 (0%)     N/A          N/A              N/A
Lightly AI-edited      2 (50%)    5.00         3.50             3582
Fully human-written    2 (50%)    9.00         4.00             2834
Total                  4 (100%)   7.00         3.75             3208
Review 1

Title: Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
EditLens Prediction: Lightly AI-edited

Summary:
This work addresses both data and model heterogeneity in Personalized Federated Learning (PFL). The authors propose FedMosaic, a framework that jointly mitigates these challenges through two core components: RELA and PQ-LoRA. RELA (Relevance-guided Aggregation) constructs client-specific global models by weighting updates based on task relatedness, enabling effective knowledge sharing among similar clients while reducing interference across unrelated tasks. PQ-LoRA introduces dimension-invariant low-rank adapters whose parameters depend only on the rank $r$, allowing efficient and architecture-agnostic knowledge sharing across heterogeneous models. To more accurately capture real-world task heterogeneity and distribution shifts, the authors further introduce DRAKE, a comprehensive multi-modal federated learning benchmark.

Strengths:
1. Experimental results demonstrate consistent improvements over strong baselines, indicating the effectiveness of the approach.
2. The appendix provides a thorough and extensive suite of additional experiments, supporting the validity and robustness of the reported findings.
3. The paper is generally well written and clearly organized, making the technical ideas easy to follow.

Weaknesses:
1. The comparison against prior works using non-IID splits of a single dataset may not be entirely equitable, since the contextual settings differ significantly (earlier studies may have targeted models specialized for single-domain or unimodal tasks, rather than fine-tuning billion-parameter foundation models). Moreover, the motivation for exploring multi-modal tasks in PFL requires further clarification. What are the practical or deployment-oriented use cases where clients naturally possess distinct modalities? At present, the setup appears somewhat hypothetical, with each client operating on different data and architectures; in such a scenario, the incentive for federated participation is not clearly articulated.
2. The novelty of RELA is not fully evident. The client-wise gradient update $\hat{g}_i^{(t)} = (1 - \alpha) \hat{g}_i^{(t-1)} + \alpha g_i^{(t)}$ closely resembles a first-order exponential moving average (EMA), similar to adaptive optimization methods such as Adam. Furthermore, the addition of a sanitization or noise component introduces privacy-related implications that warrant more rigorous analysis. If differential-privacy-style noise is applied, the paper should evaluate its robustness against gradient-based privacy attacks and report accuracy trade-offs with and without the noise injection.
3. The paper attempts to address multiple orthogonal challenges simultaneously (data heterogeneity, model heterogeneity, privacy), which can dilute the focus of the contribution. A clearer ablation or modular analysis could help isolate the effects of each component: it is currently unclear how much computing gradients only every $m$ batches matters, or how much the sanitized gradients affect accuracy.

Questions:
1. In the related work section, most PFL citations are listed without discussion. It would be helpful to briefly summarize the current state of the field: what approaches do recent state-of-the-art methods adopt, and what limitations does FedMosaic specifically address beyond them?
2. The preliminaries conclude with a PFL objective, but the formulation of the global-model objective is unclear. How does the given objective differ from the standard local objective, and why does it include terms dependent on other clients' models?
3. In Equation (1) and Figure 2, are the gradients computed with respect to the frozen weights $W_s$?
4. The rationale for computing only the last-layer gradient (based on the proportionality of preceding gradients via the chain rule) requires further justification or empirical support. Are there supporting results in the appendix?
5. The paper mentions that gradients $g_i$ are computed every $m$ batches rather than every batch. What is the observed accuracy trade-off with and without this optimization?
6. For PQ-LoRA, does the method assume that all clients use the same low-rank dimension $r$? If so, the approach still enforces a degree of architectural homogeneity. Given that the core challenge is model heterogeneity, how can the $P$ and $Q$ modules be justified in remaining dimensionally identical across clients?
7. How are the LoRA parameters $A$ and $B$ trained?
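For concreteness, Weakness 2 above quotes the EMA-style update. A minimal sketch of such an update, together with the Gaussian-noise and subsampling sanitization the reviews describe, is shown below; all function names and parameter values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def ema_gradient_update(g_prev, g_curr, alpha=0.1):
    # First-order EMA, matching the quoted formula:
    # g_hat^(t) = (1 - alpha) * g_hat^(t-1) + alpha * g^(t)
    return (1.0 - alpha) * g_prev + alpha * g_curr

def sanitize(g_hat, noise_std=0.01, keep_frac=0.25, rng=None):
    # Hypothetical sanitization: add Gaussian noise, then randomly
    # subsample coordinates to cut bandwidth before uploading.
    rng = rng if rng is not None else np.random.default_rng(0)
    noisy = g_hat + rng.normal(0.0, noise_std, size=g_hat.shape)
    mask = rng.random(g_hat.shape) < keep_frac
    return noisy * mask
```

Written this way, the reviewer's point is visible at a glance: the tracking rule itself is a standard EMA, so any claimed novelty must rest on how the sanitized estimates are used for relevance-guided aggregation.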
Review 2

Title: Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 10: strong accept, should be highlighted at the conference
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
EditLens Prediction: Fully human-written

Summary:
The paper tackles personalized federated learning (PFL) under realistic heterogeneity: data heterogeneity, where each client has distinct multi-modal tasks with temporal shifts, and model heterogeneity, where clients use different model families and sizes. It introduces FedMosaic, which combines relevance-guided aggregation and PQ-LoRA to enable selective knowledge sharing and cross-architecture parameter sharing. The authors also release DRAKE, a multi-modal PFL benchmark with 40 tasks spanning VQA, visual reasoning, and visual relations, including multi-image inputs and unseen-task evaluation under distribution shifts. Empirically, FedMosaic outperforms strong baselines across heterogeneous/static/dynamic and cross-family settings on Self (personalization) and Others (generalization), and improves fast adaptation on unseen tasks; ablations show both RELA and PQ-LoRA contribute meaningfully.

Strengths:
- Well-posed problem and realistic setup. The paper argues that most PFL work oversimplifies heterogeneity; here, clients differ in both data and model (families and depths/sizes), which is closer to practice (agentic AI, device constraints).
- Clear algorithmic design. Each component comes with a clear motivation and a corresponding design choice. For instance, RELA computes client-wise gradients on a small frozen model, applies EMA decay to track shifting client knowledge, and adds Gaussian noise plus random subsampling for privacy and bandwidth.
- Benchmark contribution. DRAKE covers multi-modal, multi-image tasks, temporal shifts, and unseen-task evaluation; the table contrasts prior FL benchmarks along these axes.
- Strong and granular evidence. The experiments are comprehensive across settings such as heterogeneous (same-family) PFL and cross-family heterogeneity, and also analyze performance from a per-client view and test fast adaptation. Detailed ablations are provided, showing that adding PQ-LoRA improves Others and that adding RELA further lifts Self/Others.

Weaknesses:
- While RELA applies EMA, noise, and subsampling to the last-layer gradients from a privacy perspective, an explicit comparison to baselines with stronger privacy guarantees would strengthen the claim.

Questions:
- I'm curious about the task relevance between the tasks in the DRAKE benchmark, e.g., by showing the relevance matrix.
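The relevance matrix this reviewer asks for is straightforward to picture. Assuming relevance is measured by pairwise similarity of the clients' sanitized gradient sketches (an assumption; the paper may define relevance differently), a sketch could look like:

```python
import numpy as np

def relevance_matrix(client_grads):
    # client_grads: list of 1-D gradient sketches, one per client,
    # all reduced to the same (subsampled) dimension.
    G = np.stack([g / (np.linalg.norm(g) + 1e-8) for g in client_grads])
    return G @ G.T  # entry (i, j) in [-1, 1]; larger = more related tasks
```

Plotting this matrix over DRAKE's 40 tasks would directly answer which tasks RELA treats as related.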
Review 3

Title: Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
EditLens Prediction: Lightly AI-edited

Summary:
This paper aims to address the challenges of personalized federated learning in scenarios where both data and models are heterogeneous. The authors propose a framework named FedMosaic, which consists of two core components: RELevance-guided Aggregation (RELA) and PQ-LoRA. RELA is a task-relevance-based model aggregation strategy that constructs customized global models for clients. PQ-LoRA is a module shareable across heterogeneous models, addressing differences in model depth and dimension through block-wise aggregation and weight alignment. Additionally, the authors propose DRAKE, a comprehensive multimodal federated learning benchmark that covers 40 different tasks and simulates real-world task diversity and temporal distribution shifts. Experiments on both multi-modal and text-only benchmarks demonstrate that FedMosaic outperforms prior PFL methods in both personalization and generalization.

Strengths:
Overall, this paper addresses a meaningful problem in personalized federated learning. The authors observe that existing PFL methods are still confined to simplified scenarios with highly homogeneous data and models across clients, while real-world scenarios are more complex. They propose FedMosaic, which addresses simultaneous data and model heterogeneity through a task-correlation-aware model aggregation strategy and dimension-invariant modules. Additionally, they introduce DRAKE, a comprehensive multimodal federated learning benchmark.

Weaknesses:
Overall, this paper has some merits, but a few weaknesses stop me from giving a higher rating. My major concerns are as follows.

Major weaknesses:
(1) The paper claims that FedMosaic does not require high computational costs, and the experiments do include sections on computational cost. However, the weight-alignment process in PQ-LoRA appears relatively complex, and the paper does not report the computational cost of this step.
(2) Section 4.2.1 mentions using CKA to find relative depth alignment and demonstrates this with Llama-1B and Llama-3B, but it lacks sufficient explanation of the method's general applicability.
(3) The weight-alignment part of PQ-LoRA mentions freezing the smaller model as a pivot and updating the larger model; the paper lacks an explanation for why this strategy was adopted.
(4) DRAKE is one of the paper's contributions, but extensive details are relegated to the appendix, with relatively limited space allocated in the main text.

Minor weaknesses:
(1) Figure 2 provides an overview of FedMosaic, but the figure is dense and somewhat hard to read; it could be adjusted.

Questions:
Please clarify my concerns in the Weaknesses section.
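Regarding Major Weakness (2), the CKA-based depth alignment can be made concrete with a short sketch. This assumes linear CKA computed on probe activations and a greedy best-match assignment, which may differ from the paper's exact procedure:

```python
import numpy as np

def linear_cka(X, Y):
    # X: (n, d1), Y: (n, d2) activations of two layers on the same probe inputs.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))

def align_depths(acts_small, acts_large):
    # For each layer of the smaller model, pick the most similar layer
    # of the larger model (e.g., Llama-1B vs. Llama-3B).
    return [max(range(len(acts_large)), key=lambda j: linear_cka(a, acts_large[j]))
            for a in acts_small]
```

Framed this way, the cost question in Major Weakness (1) reduces to the size of the probe set and the number of layer pairs that must be scored.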
Review 4

Title: Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
EditLens Prediction: Fully human-written

Summary:
The paper considers personalized federated learning (pFL) of multimodal large language models (MLLMs) under realistic scenarios that involve not only data heterogeneity, but also model architecture and model family heterogeneity, and task diversity. It designs a new method called FedMosaic that enables FL-style collaboration across clients even in the simultaneous presence of all of these heterogeneities. FedMosaic has two important components, RELA (Relevance-guided Aggregation) and PQ-LoRA (Dimension-invariant Low-Rank Adaptation), which respectively address (data, task) and model heterogeneities. In an effort to make evaluation more realistic, the paper also introduces a new benchmark called DRAKE that incorporates all these heterogeneities and further includes aspects like dynamic distribution shifts and unseen tasks. Extensive experimental evaluation is provided for FedMosaic as well as several state-of-the-art pFL baselines, establishing the superior characteristics of FedMosaic.

Strengths:
(S1) The writing and presentation are very clear in terms of both algorithm design and experimental results. Adequate intuitions are provided throughout the paper and appendices. The problem is well motivated, related work is well cited, and the contributions are contextualized appropriately.
(S2) FedMosaic is an original and interesting solution to a very complex practical problem of multiple heterogeneities in pFL of MLLMs. This is a significant contribution to the field in terms of both ideas and solutions. RELA and PQ-LoRA would likely find use in other problems too.
(S3) The supporting experimental evidence in the paper and appendices is exhaustive and impressive. The paper undertakes a wide diversity of studies to establish the characteristics of FedMosaic from several angles and shows competitive or improved performance w.r.t. all compared baselines.

Weaknesses:
(W1) Introducing a new benchmark in an algorithms paper is counterproductive. The benchmark would be difficult to discover for any reader, and its design is impossible for a reviewer to evaluate when only 10 lines can be allocated to it in the main body. While DRAKE looks extremely useful, there are several nuances that can only be understood by carefully reading multiple sections of the appendices. My opinion is that DRAKE should be submitted as a separate datasets-and-benchmarks-style paper so that it can be properly peer-reviewed as such.
(W2) There is no benchmark called HFLB in (Chen et al., 2024). The name/citation should be corrected.

Questions:
(Q1) Section 4.2.1, Figure 4, and Appendices A.12, A.17: Even though supporting empirical evidence is provided, I don't understand why layer correlations should exist across model families (Llama, Qwen, etc.). Is this exclusively caused by the common training data source from which $D_P$ is sampled? PQ-LoRA would only work if such correlation exists, right? How should one think about the system when a common subset of the pre-training/post-training data is unknown, nonexistent, or inaccessible?
(Q2) Lines 73 and 210: Does the system require a separate model instance on the server for each client? If yes, is that scalable to a large number of clients? If not, what do the experiments suggest about the observed number of model instances on the server per client, across the datasets of interest?
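To make the dimension invariance underlying (Q1) concrete: the reviews describe $P$ and $Q$ as rank-$r$ modules whose size is independent of the hidden width, so only they are shared across model families. A hypothetical adapter along those lines, with shapes inferred from the reviews rather than taken from the paper's code:

```python
import torch
import torch.nn as nn

class PQLoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)  # local, width-dependent
        self.P = nn.Parameter(torch.eye(r))                 # shared, r x r only
        self.Q = nn.Parameter(torch.eye(r))                 # shared, r x r only
        self.B = nn.Parameter(torch.zeros(r, d_out))        # local, width-dependent

    def forward(self, x, base_out):
        # base_out: output of the frozen base linear layer for input x.
        return base_out + x @ self.A @ self.P @ self.Q @ self.B
```

Because $P$ and $Q$ carry $r^2$ parameters regardless of $d_{in}$ and $d_{out}$, clients with different widths can aggregate them directly, which is presumably why the CKA-based depth alignment questioned in (Q1) is needed to decide which blocks to pair across models.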