Differential Fine-Tuning Large Language Models Towards Better Diverse Reasoning Abilities
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper investigates the complex problem of task conflict and synergy during LLM fine-tuning. The authors observe that training a model on one reasoning task can often degrade its performance on another. To address this, they introduce a parameter identification method called Delta-Scale Row (DSR) to find the parameters most critical to a specific task. They then propose a new fine-tuning strategy, Differential Fine-Tuning (DiFT), which selectively updates or freezes these DSR parameters based on the learning scenario. The goal is to mitigate the conflicts that arise when mixing tasks or learning them sequentially. The authors provide experimental results across several models (Llama3, Mistral, Qwen2.5) to demonstrate that DiFT outperforms existing baseline methods.
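To make our reading of the method concrete, here is a minimal sketch of the kind of activation-delta row scoring the paper appears to describe; the exact scoring rule and all names below are our assumptions, not the authors' code:

```python
import torch

def dsr_scores(base_acts: torch.Tensor, tuned_acts: torch.Tensor) -> torch.Tensor:
    """Score each output row of a hooked linear layer by how much its
    activation shifts after single-task fine-tuning, normalized by the
    base activation scale. Shapes: (num_tokens, out_dim). Assumed form
    of the Delta-Scale Row criterion, not the paper's implementation."""
    delta = (tuned_acts - base_acts).abs().mean(dim=0)  # per-row mean shift
    scale = base_acts.abs().mean(dim=0) + 1e-8          # avoid divide-by-zero
    return delta / scale

def top_c_rows(scores: torch.Tensor, c: int) -> torch.Tensor:
    """Indices of the C most task-sensitive rows (the assumed DSR set)."""
    return torch.topk(scores, k=c).indices
```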
Clear writing. The motivation, pipeline, and training regimes are easy to follow. Key concepts (DSR, union/difference rules) are defined and used consistently.
New method. DiFT introduces a simple activation-delta criterion to pick “sensitive rows,” then applies union/difference rules for mix-up and continual learning. The idea is lightweight and practical.
Extensive experiments. Results span multiple models and task families, with ablations and sanity checks (e.g., inverse variants). Findings appear consistent across settings.
(1) Assumption validation is improved but still incomplete.
Inverse-DiFT and random-selection baselines support necessity and non-randomness, but stronger causal probes are still missing. Cross-task "row transplant" tests, DSR computed on disjoint data sources, and counterfactual swaps across tasks would better establish causality beyond necessity.
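To be explicit about the transplant probe we have in mind, a minimal sketch (hypothetical helper; the layer path would depend on the architecture):

```python
import torch

@torch.no_grad()
def transplant_rows(base_model, tuned_model, layer_name: str, row_idx: torch.Tensor):
    """Copy only the DSR-selected rows of one weight matrix from the
    task-tuned model into the base model, leaving all other parameters
    untouched. If task accuracy tracks the transplanted rows, the
    selection is causally relevant rather than merely necessary."""
    w_base = dict(base_model.named_parameters())[layer_name]
    w_tuned = dict(tuned_model.named_parameters())[layer_name]
    w_base[row_idx] = w_tuned[row_idx]
    return base_model
```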
(2) Heuristic choices lack broader justification.
The ablation on the number of rows (C) addresses set size. However, the union rule (mix-up) and the difference rule (continual) remain untested against plausible alternatives such as intersections, weighted unions, or trust-region-style variants, and there is no analysis of when overlap helps versus harms. The results would benefit from a principled rationale or a small design sweep in the main text.
(3) Missing continual-learning baselines.
No direct comparisons with established continual-learning methods: importance-weighted regularization (EWC/Online-EWC, L2-SP, MAS, SI), parameter isolation (PackNet), or distillation (LwF). Without them, the method's placement within the CL literature is unclear.
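For reference, the regularizer behind the EWC-family baselines we are requesting is cheap to add under the same budget; this is the standard formulation, not taken from the paper:

```python
import torch

def ewc_penalty(model, fisher: dict, anchor: dict, lam: float = 1.0) -> torch.Tensor:
    """Standard EWC regularizer: a quadratic pull toward the previous
    task's parameters, weighted by diagonal Fisher importance.
    fisher[name] and anchor[name] are precomputed after the old task."""
    loss = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - anchor[name]).pow(2)).sum()
    return 0.5 * lam * loss
```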
(4) Potential signal–target entanglement.
DSR selection uses the same task family that later benefits in training/evaluation. This can inflate average target accuracy (ATA) through selection–target coupling. Results where DSR is computed on held-out variants or adjacent corpora are needed.
(1) Signal–target entanglement.
DSR is computed with data from the target task, then those rows are trained on mixtures including the same family. Could ATA gains partially reflect selection–target leakage?
(2) Relation to importance metrics.
Under identical budgets, how does DSR compare against importance estimates such as Fisher information, diagonal Hessian, or gradient-based scores, as used in continual-learning methods like EWC, L2-SP, PackNet, and LwF?
(3) Layer and granularity choices.
Which layers are hooked—every linear layer or a subset? How sensitive are results to selecting rows in early vs. deep layers?
(4) Continual rule clarity.
For DSR_diff, does training freeze everything except the newly identified rows, or only the historical DSR sets? What remains trainable for rows never selected by any task? Please show the remaining trainable capacity as the task index k grows.
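To pin down the two readings behind question (4), the ambiguity is which mask is applied at update time; a sketch under assumed semantics and names:

```python
import torch

def trainable_rows(dsr_sets: list, k: int, num_rows: int, reading: str) -> set:
    """Two possible freezing semantics at continual step k (assumed).
    Reading A: update only rows exclusive to the new task (DSR_diff).
    Reading B: update every row not claimed by tasks 0..k-1."""
    historical = set().union(*dsr_sets[:k]) if k > 0 else set()
    if reading == "A":
        return dsr_sets[k] - historical
    return set(range(num_rows)) - historical

def mask_row_grads(weight: torch.Tensor, rows: set) -> None:
    """Zero gradients outside `rows` before optimizer.step()."""
    keep = torch.zeros(weight.shape[0], dtype=torch.bool, device=weight.device)
    keep[list(rows)] = True
    weight.grad[~keep] = 0.0
```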
Fully AI-generated |
---
Differential Fine-Tuning Large Language Models Towards Better Diverse Reasoning Abilities
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper introduces **Differential Supervised Fine-Tuning (DiFT)**, a principled framework to mitigate conflicts among multiple reasoning tasks during supervised fine-tuning (SFT) of large language models (LLMs).
The authors observe that when LLMs are fine-tuned jointly or sequentially on heterogeneous reasoning datasets (e.g., math, code, logic, commonsense), performance may degrade due to destructive interference. Through comprehensive experiments on Llama3-8B, Mistral-7B, and Qwen2.5-14B, they show that task combinations can yield both mutual benefits and conflicts, motivating a more nuanced SFT approach.
The paper’s central insight is that each reasoning ability corresponds to a subset of model parameters that are disproportionately influential for that task. To identify these, the authors propose a quantitative metric called the Delta-Scale Row (DSR) score, computed from activation differences between base and fine-tuned models.
Building on the DSR analysis, DiFT updates parameters differentially: the union of the per-task DSR sets in mix-up SFT, and the difference set (new-task rows not claimed by earlier tasks) in continual SFT. This selective updating reduces interference and preserves mutual benefits across tasks.
Empirically, DiFT achieves consistent improvements across math (GSM8k), code (xGLUE), logic (LogiQA2), and commonsense (CSQA) reasoning benchmarks, outperforming strong baselines such as HFT, LoTA, DMT, and CoBa. The proposed analysis and method generalize across both base and instruct LLMs, showing broad applicability.
1. The Delta-Scale Row (DSR) analysis offers a novel and interpretable metric for identifying task-specific sensitive parameters, bridging parameter variation analysis with reasoning-oriented fine-tuning.
2. The discovery of reasoning task interactions—synergistic or conflicting—across math, code, logic, and commonsense reasoning provides valuable empirical insight into multi-ability LLM training.
3. The theoretical motivation for DSR is grounded in Taylor expansion and sensitivity analysis, giving analytical meaning to the empirical findings (our reading of this argument is written out after this list).
4. The paper contributes to a deeper understanding of task interference and transfer in multi-objective fine-tuning, a core challenge for multi-ability LLMs.
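Our reading of the Taylor-based motivation in point 3, written out under a first-order assumption of our own (the paper's derivation may differ): for a hooked linear layer $y = Wx$ with rows $w_i$, the per-row output shift under a weight change is exactly $\Delta y_i = \Delta w_i^\top x$, and a first-order expansion relates it to the task loss via

$$\Delta \mathcal{L} \approx \sum_i \frac{\partial \mathcal{L}}{\partial y_i}\, \Delta w_i^\top x,$$

so rows with large activation deltas on task data are precisely those whose perturbation most moves the loss, which is presumably what the DSR score captures.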
1. While DSR analysis is well-motivated, the connection between row-wise sensitivity and global reasoning dynamics remains empirical. A theoretical characterization of why DSR partitions correlate with reasoning abilities would strengthen the contribution.
2. Although the paper claims DiFT is computationally efficient, quantitative runtime and FLOPs comparisons with baselines are absent; adding explicit cost metrics would improve clarity.
3. Experiments focus on four reasoning tasks; extending to non-symbolic reasoning (e.g., spatial or multi-hop tasks) would further validate generality.
4. Although results generalize to different LLMs, scaling behavior (e.g., how DSR sparsity changes with model size) is not analyzed.
1. How stable are DSR distributions across random seeds or dataset subsamples? Does the identified task-specific subspace remain consistent under noisy data?
2. Can DSR be estimated during training rather than post-hoc, enabling online conflict detection and adaptive freezing?
3. How sensitive is DiFT to the hyperparameter controlling the top-$C$ DSR threshold? Are results robust to different $C$ values?
Fully AI-generated |
---
Differential Fine-Tuning Large Language Models Towards Better Diverse Reasoning Abilities
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper investigates the "task conflict" problem prevalent in multi-task (mixed or continual) supervised fine-tuning of LLMs, where fine-tuning on diverse reasoning datasets often leads to performance degradation in specific abilities. To address this, the paper first introduces an analysis method called "delta-scale row (DSR)," which identifies "exclusive parameters" critical to specific tasks by comparing the parameter activation differences between a base model and models fine-tuned on a single task. Building on this discovery, the authors propose the "Differential Fine-Tuning (DiFT)" framework: for mix-up SFT, only the union of the exclusive parameters for all relevant tasks is updated; for continual SFT, only the difference-set (parameters exclusive to the new task relative to old tasks) is updated. Experimental results on Llama3-8B, Mistral-7B, and Qwen2.5-14B demonstrate that DiFT consistently improves performance in the mix-up setting and effectively mitigates performance collapse in the continual setting.
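In code terms, the two update rules reduce to set operations on per-task row-index sets; a paraphrase with assumed names, not the authors' implementation:

```python
def mixup_rows(dsr_sets: list) -> set:
    """Mix-up SFT: train the union of all tasks' exclusive row sets."""
    return set().union(*dsr_sets)

def continual_rows(dsr_sets: list, k: int) -> set:
    """Continual SFT at task k: train only rows exclusive to the new
    task, i.e., its DSR set minus everything claimed by earlier tasks."""
    return dsr_sets[k] - set().union(*dsr_sets[:k]) if k > 0 else dsr_sets[k]
```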
1. The paper is well-motivated. Through solid preliminary experiments (e.g., Figure 1, Table 1), the authors clearly reproduce and quantify the 'conflict' phenomenon that arises in multi-task reasoning SFT, providing a strong impetus for their subsequent investigation.
2. The paper introduces the DSR diagnostic tool and the DiFT method. Instead of merely observing the superficial phenomenon that 'tasks conflict,' the authors delve deep into the model's internals to pinpoint the specific parameter subsets responsible for these conflicts. This "diagnose-then-treat" research paradigm is rigorous and well-founded.
3. The authors provide validation covering both 'mix-up' and 'continual' fine-tuning scenarios. Crucially, the ablation studies on 'Inverse DiFT' and 'Random DSR' convincingly demonstrate that the performance gains stem from correctly identifying critical parameters via DSR, rather than being a simple side effect of parameter reduction (i.e., regularization).
1. The paper lacks a robust theoretical foundation. While DSR is an interesting empirical discovery, the paper fails to provide a solid theoretical explanation for the underlying mechanism: why do LLMs tend to encode capabilities for different tasks into distinct parameter subsets?
2. The task scalability of the mix-up strategy is questionable. DiFT relies on updating the parameter union, but the experiments are limited to K=2 or 3 tasks. As K increases (e.g., in a practical scenario involving 20 tasks), the size of this union could rapidly inflate, approaching the full parameter set and causing the DiFT strategy to degenerate into standard SFT; see the back-of-the-envelope estimate after this list. The paper lacks an analysis of DiFT's performance when K > 3.
3. The DSR method introduces significant workflow complexity. The framework requires an additional pre-computation stage: a separate "probe model" must be trained for each of the N single tasks before the main fine-tuning can commence. This requirement increases the complexity of the training pipeline.
4. The method's applicability to larger-scale models and different architectures is unproven. The experiments are confined to 7B-14B scale dense models, and it remains unknown if the approach holds for 30B+ models. The authors' discussion of MoE limitations (due to unstable routing) is reasonable, but this concession explicitly restricts the method's current applicability.
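On weakness 2, a back-of-the-envelope estimate, assuming purely for intuition that tasks select rows independently: if each task marks a fraction $c$ of rows, the union after $K$ tasks covers about $1-(1-c)^K$ of the model. With $c = 0.2$ this is roughly 49% at $K = 3$ but 89% at $K = 10$, at which point DiFT would be close to standard full-parameter SFT. Reporting the measured union size as a function of $K$ would settle the question empirically.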
1. DSR identification relies on a "probe model" trained with 1k samples. How stable is this DSR set? Would the identified set of "critical" parameters change significantly if identified using the full 20k dataset, or if the DSR analysis were re-run after the DiFT mix-up SFT? This questions whether DSR is a static property or one that evolves dynamically during training.
2. The mix-up SFT strategy relies on updating the parameter union. Have the authors investigated how the size of this union set scales as the number of tasks, K, increases (e.g., to K=5)? Is there a tipping point at which this union becomes so large that the DiFT strategy effectively degenerates into standard full-parameter SFT?
3. The continual SFT strategy avoids conflicts by freezing parameters from historical tasks. However, for the parameter intersection—those parameters critical to both old and new tasks—does this freezing strategy cause the model to "under-learn" the new task? Or, do the authors believe the benefit of mitigating forgetting always outweighs this risk of under-learning?
4. The authors state in the limitations that DSR is not directly applicable to MoE models due to unstable routing—a reasonable constraint. However, how do the authors conceptualize the relationship between DSR and the MoE architecture? Since MoE itself is a form of architectural parameter specialization, could the "exclusive parameters" identified by DSR be interpreted as "emergent experts"? Put differently, in an MoE model, is it hypothesized that a task's DSR would be primarily concentrated within its corresponding expert?
5. DSR successfully identifies which parameters are in conflict, but not fully why they conflict. Do the authors have a deeper hypothesis? For example, in Figure 7, the DSR pattern for csqa (commonsense) (rows j, k, l) appears distinctly different from the patterns for math, code, and logic (rows a-i). Does this imply that the root cause of the conflict is the model's attempt to encode fundamentally different types of information (e.g., symbolic manipulation vs. factual knowledge) within the same set of parameters (like mlp.gate_proj)?
Moderately AI-edited |
---
Differential Fine-Tuning Large Language Models Towards Better Diverse Reasoning Abilities
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes Differential Fine-Tuning (DiFT), a novel fine-tuning strategy designed to mitigate task conflicts while preserving performance benefits across multiple reasoning tasks in large language models (LLMs). The method identifies task-specific parameters through a Delta-Scale Row (DSR) analysis and selectively updates them to avoid destructive interference among reasoning abilities. Experimental results on Llama3-8B, Mistral-7B, and Qwen2.5-14B demonstrate that DiFT can outperform existing mix-up and continual fine-tuning baselines such as DMT, CoBa, HFT, and LoTA, improving average target accuracy (ATA) across various reasoning benchmarks including GSM8k, xGLUE, LogiQA2, and CSQA. The work is well-motivated and offers an interesting parametric perspective on reasoning conflicts in LLM fine-tuning.
1. The paper addresses an important and relatively underexplored problem: task interference and reasoning conflicts during supervised fine-tuning of LLMs.
2. The DSR-based parameter analysis provides an insightful method to identify task-sensitive parameters, which could inspire future research on interpretable fine-tuning.
3. The proposed DiFT framework is conceptually elegant and achieves consistent empirical gains across diverse reasoning benchmarks and model sizes.
4. The authors also provide comprehensive experiments, ablation studies, and discussions comparing mix-up and continual learning paradigms, which enhance the credibility of their findings.
While I find the idea of DiFT quite promising, I have several concerns and questions that the authors may wish to address to strengthen the work:
1. Scalability with Task Number. I am concerned about the scalability of the proposed DiFT method: maintaining task-specific parameter sets for multiple reasoning tasks may cause memory usage and pipeline overhead to grow as the number of tasks increases. I therefore recommend experiments with different numbers of tasks (e.g., two, three, or more) to evaluate how DiFT performs in large-scale and multi-domain settings. Such experiments would help demonstrate the method's practicality and efficiency in real-world applications involving large models.
2. Relation to Meta-Learning and Mixture-of-Experts Paradigms. I am interested in how DiFT conceptually and empirically compares with existing meta-learning and multi-expert approaches that also address knowledge sharing and task specialization across multiple domains. The paper should explicitly discuss these connections, highlighting the key distinctions between DiFT and meta-learning or expert-based modular frameworks. Additionally, including a comparative performance analysis, at least at a conceptual level, would help clarify the unique contribution and positioning of DiFT within the broader landscape of multi-task learning methods.
3. Substitution of the Key Parameter Identification Mechanism. The DSR mechanism plays a central role in DiFT, but I wonder how it compares with other parameter-importance estimation techniques, such as attention-based attribution or gradient sensitivity analysis. Could the authors provide an ablation that substitutes DSR with an attention- or gradient-based criterion to examine the robustness and generality of the framework? A sketch of the gradient-based alternative we have in mind follows this list.
4. Limited Experimentation on Larger Models. The authors mention that, “Due to hardware limitations, we only conducted experiments on 7/8B and 14B LLMs in this paper, lacking validation on larger-scale (30B+) models that can be complementary.” However, larger-scale validation seems critical to the paper’s claim that DiFT improves diverse reasoning capabilities of LLMs in general. Since scalability is a core aspect of language model research, I strongly recommend that the authors provide results, or at least well-grounded projections, for DiFT across models of different sizes. Without this, it remains unclear whether the proposed approach would maintain its efficiency and stability at scale.
5. Lack of Convergence Analysis in Fine-Tuning. The paper does not provide any theoretical or empirical evidence of convergence during fine-tuning. Since the proposed DiFT method involves differential parameter updating across multiple reasoning tasks, it is unclear whether the optimization process is guaranteed to converge to a stable point. Moreover, when the number of tasks increases, convergence behavior could vary substantially across different parameter subsets. I am particularly concerned that without monitoring the fine-tuning loss or parameter change trajectories, it is difficult to ensure that each reasoning sub-model (or the overall DiFT model) has reached a stable equilibrium. Providing convergence curves for varying task counts (e.g., 2, 3, or more reasoning datasets) would significantly strengthen the technical soundness of this work.
6. Missing Computational Cost and Efficiency Analysis. The paper does not provide sufficient information about how computational efficiency scales with the number of fine-tuning tasks. While Section 5.1 briefly mentions that the computing costs are lighter than task-vector and gradient-based methods, it lacks quantitative evidence showing how training time, token consumption, and GPU (CUDA) utilization change as the number of reasoning tasks increases. Such data are crucial to assess whether DiFT remains efficient and feasible when extended to larger task sets. Including analyses of runtime, token usage, and GPU memory consumption under different task counts would help clarify the scalability and practical efficiency of the proposed method compared to baselines like LoTA or HFT.
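Regarding point 3, the gradient-sensitivity alternative we have in mind could be as simple as the following; a standard importance proxy with placeholder names, not the paper's DSR:

```python
import torch

def grad_row_importance(model, loss_fn, batch, layer_name: str) -> torch.Tensor:
    """Rank output rows of one weight matrix by accumulated |gradient|
    on task data: a standard sensitivity proxy comparable to DSR under
    an equal row budget. `loss_fn` and `layer_name` are placeholders."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    weight = dict(model.named_parameters())[layer_name]
    return weight.grad.abs().sum(dim=1)  # one score per output row
```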
See weaknesses.
Fully AI-generated |