|
Variational Model Merging for Pareto Front Estimation in Multitask Finetuning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a Bayesian method for model merging. The goal is to approximate the Pareto front of multitask finetuning by efficiently and approximately computing the posterior with a mixture of Gaussians. As a consequence, the method balances the efficiency that a full Gaussian posterior lacks and the utility that an isotropic Gaussian lacks. Experiments on transformers show some improvement, with model merging getting closer to multitask finetuning.
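For reference, here is a rough sketch of how I understood the core recipe (my own notation; the exact prior and scaling terms in the paper may differ). Scalarizing the multitask objective with weights $\alpha_t$ corresponds to a MAP problem on a product of per-task posterior approximations, and with Gaussian approximations $q_t(\theta) = \mathcal{N}(\theta_t, H_t^{-1})$ the merged mode has a closed form:

$$
\min_\theta \sum_{t} \alpha_t \ell_t(\theta)
\;\leftrightarrow\;
\arg\max_\theta \prod_{t} q_t(\theta)^{\alpha_t},
\qquad
\theta(\alpha) = \Big(\sum_{t} \alpha_t H_t\Big)^{-1} \sum_{t} \alpha_t H_t\, \theta_t .
$$

With isotropic $H_t \propto I$ this reduces to simple weight averaging, while the mixture-of-Gaussians family gives a richer surrogate at the cost of extra components.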
This paper is clearly written, with a good introduction and motivation. The Bayesian approach makes sense to me and is original as far as I can tell. Figure 1 does a good job of highlighting the idea. Section 3.2 positions the method appropriately in the literature. Overall the quality is good, with the derivations and reasoning being sound.
1. Methodology
Section 3.4 lists three versions, and different experiments seem to use different ones. For practitioners, it would be beneficial to converge on a single method if possible. If not, can the authors summarize the applicability of each version?
2. Weaker performance than multitask finetuning
The message from this work is two-fold: variational model merging is better than previous model merging, but it is still worse than multitask finetuning (see Table 1 and Figure 5b). While the second part is not a positive result, I think it is very valuable. However, to take the second conclusion seriously, the overall method between lines 292-294 may need a 4th step: launching multitask finetuning from the Pareto estimates.
3. Computational cost
This method is still computationally heavy, e.g., finetuning T models for T tasks. While many model merging methods are costly, this pain point is not alleviated by the proposed method, so I think the significance is limited.
See weaknesses. |
Fully human-written |
|
Variational Model Merging for Pareto Front Estimation in Multitask Finetuning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes variational model merging, a Bayesian approach to estimate Pareto fronts in multitask finetuning by merging task-specific posterior approximations. The key insight is that more flexible posterior families (e.g., full Gaussians, mixtures of Gaussians) yield better Pareto front estimates than simpler ones.
1. Novel theoretical framework: Connecting model merging to Bayesian posterior fusion is novel and provides a principled way to derive new merging strategies. The variational perspective naturally explains why different merging methods exist and how to improve them.
2. Clear theoretical contribution: The theorem showing that more flexible posteriors necessarily yield better estimates is valuable, with the error-reduction property being particularly insightful.
3. Comprehensive experiments: Testing on diverse architectures and tasks (vision, NLP, translation) demonstrates broad applicability.
1. Missing bounds on approximation quality relative to the true Pareto front. Can the authors provide a formal connection between posterior quality and Pareto-front accuracy, i.e., a bound on how approximation quality translates to the quality of the Pareto solutions? Currently, the paper only shows empirically that better posteriors help, but does not establish how much they help or when they are guaranteed to help.
2. Computational costs. (1) Mixture methods require K times more models, which is expensive for large models. (2) As the authors already state, Hessian approximation is a bottleneck for large-model merging; even diagonal approximations require O(P) storage. An analysis of computational costs for the various methods and model sizes would strengthen the paper (see the rough calculation below).
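To make the storage point concrete, a back-of-envelope sketch (the parameter count and fp32 dtype are my assumptions, not numbers from the paper; if curvature is only kept over LoRA adapters, the footprint would be far smaller):

```python
# Rough storage estimate for keeping one diagonal Hessian/Fisher per task model.
# Assumes a full-model diagonal with ~2.5B entries (roughly GEMMA-2B scale) in fp32.
P = 2_500_000_000      # assumed parameter count
BYTES_PER_ENTRY = 4    # fp32
extra_gb = P * BYTES_PER_ENTRY / 1e9
print(f"~{extra_gb:.0f} GB of additional storage per task model")  # ~10 GB
```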
Can this framework handle constraints or preferences on the Pareto front? |
Lightly AI-edited |
|
Variational Model Merging for Pareto Front Estimation in Multitask Finetuning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors employed a Bayesian model-merging approach that efficiently explores various weighting configurations without requiring full retraining for each one. Their method relies on two key components: model merging, which combines the parameters of models individually trained on separate tasks instead of retraining for every configuration, and a Bayesian framework, which enhances the merging process by developing improved surrogate functions for the multitask learning objective. This allows practitioners to effectively explore task-weighting options and find high-performing models at a fraction of the computational cost of traditional retraining.
- The paper's primary strength is its novel conceptualization of model merging as a variational Bayesian inference problem. This original framework is significant because it replaces ad-hoc merging heuristics with a foundation that both explains the relative performance of existing methods and provides a clear recipe for systematically designing new, more accurate ones.
- Extensive empirical validation on modern, large-scale architectures, including Vision Transformers and the GEMMA-2B LLM.
- The paper's primary goal is to provide "fast and cheap methods" to estimate the Pareto set. However, its best-performing and most novel method, Mixture-Weighted Merging (MultiIVON-Hess), has a training cost that scales linearly with the number of mixture components ($K$). This requires $K$ full training runs for each task, which creates a significant tension with the "cheap" objective.
- The number of components $K$ seems to be a critical hyperparameter. The paper uses $K=30$ for ResNet, $K=10$ for ViT, and $K=3$ for RoBERTa and GEMMA. How was $K$ chosen for each experiment? Is there a principled way to select $K$? |
Lightly AI-edited |
|
Variational Model Merging for Pareto Front Estimation in Multitask Finetuning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper frames *model merging* as approximate Bayesian inference to cheaply preview the Pareto set for multitask finetuning. Starting from task-specific posteriors, it proposes estimating Pareto solutions by maximizing a merged posterior. This unifies prior weight-averaging (simple averaging / task arithmetic) with Hessian-weighted schemes and introduces a mixture-of-Gaussians variant solved via a lightweight EM procedure. Empirically, across CIFAR-10 (ResNet-20), CLIP ViT-B/32 transfers, RoBERTa sentiment, and GEMMA-2B LoRA MT, more expressive posteriors (from diagonal Fisher/Hessian to mixtures) produce Pareto fronts closer to multitask finetuning while being much cheaper than retraining across many task-weighting values.
The paper clearly derives how scalarized multi-objective training corresponds to MAP estimation under a merged posterior; it shows that common merging tricks are special cases of exponential-family surrogates (e.g., simple averaging from an isotropic Gaussian; Hessian-weighted merging from a full Gaussian). This gives a principled recipe rather than ad-hoc formulas. There is also a consistent empirical trend: across tasks and model families, a more flexible posterior yields a better front.
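As a sanity check of my reading of the diagonal-Gaussian special case, a minimal toy sketch (my own code, not the authors'; the function and variable names are made up): each task posterior is $\mathcal{N}(\theta_t, \mathrm{diag}(h_t)^{-1})$, and a Pareto candidate for weights $\alpha$ is the precision-weighted average of the task modes, which collapses to simple averaging when all curvatures are equal.

```python
import numpy as np

def merge(thetas, hessians, alpha):
    """Precision-weighted merge of diagonal-Gaussian task posteriors.
    With all hessians equal to 1 this reduces to plain weight averaging
    (the isotropic-Gaussian special case)."""
    num = sum(a * h * th for a, h, th in zip(alpha, hessians, thetas))
    den = sum(a * h for a, h in zip(alpha, hessians))
    return num / den

# Two toy "task models" in a 3-parameter space with diagonal curvature estimates.
thetas = [np.array([1.0, 0.0, 2.0]), np.array([-1.0, 1.0, 0.5])]
hessians = [np.array([5.0, 1.0, 0.1]), np.array([0.1, 2.0, 4.0])]

# Sweeping the scalarization weight previews a one-parameter family of merged
# models, i.e., the cheap Pareto-front preview the paper is after.
for a in np.linspace(0.0, 1.0, 5):
    print(round(a, 2), merge(thetas, hessians, [a, 1.0 - a]))
```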
- The approximation stack is heavy and sometimes crude. Many experiments rely on diagonal Hessians/Fishers or squared-gradient proxies, which can mischaracterize curvature and task interactions (acknowledged by the authors).
- The cost shifts rather than disappears. MoG requires K runs per task plus a few EM steps. While still cheaper than dense alpha sweeps, for large T and K this becomes significant; the paper reports K=3–30, which is nontrivial for big models.
- Hessian quality vs. downstream accuracy. The argument that IVON supplies a “free” diagonal Hessian is practical, but no controlled study links Hessian quality to Pareto-front error beyond rough accuracy differences. A calibration plot (front error vs. curvature error) would strengthen the causal story.
- The method assumes task-specific posteriors are compatible under a common prior. In settings with strong parameter non-identifiability or sharp mode shifts (e.g., safety vs. creativity in LLMs), merging may land off-manifold; the paper hints at such gaps (e.g., shape mismatches) but doesn’t delineate failure modes or detection heuristics.
- Reported times exclude the up-front cost of training each task model (and the K variants for mixtures). For large T, this one-time cost may approach or exceed a modest grid of multitask runs; a more apples-to-apples wall-clock accounting would help. |
Fully AI-generated |