Model Merging

osanseviero's Collections

updated Jun 12

Model merging is a very popular technique for LLMs nowadays. Here is a chronological list of papers in the space that will help you get started with it!

Qualitatively characterizing neural network optimization problems

Paper • 1412.6544 • Published Dec 19, 2014 • 4

Note Analyzes the optimization landscape of training neural networks using linear interpolation experiments.
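
A minimal sketch of a linear interpolation experiment, assuming a toy two-parameter model. The `loss` function and both parameter vectors are hypothetical stand-ins, not taken from the paper: the idea is only to evaluate the loss at evenly spaced points on the line between two parameter settings.

```python
def interpolate(theta_a, theta_b, alpha):
    """Return (1 - alpha) * theta_a + alpha * theta_b, element-wise."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]

def loss(theta):
    # Hypothetical toy loss: a quadratic bowl with its minimum at (1, -2).
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

theta_init = [0.0, 0.0]    # parameters at initialization (assumed)
theta_final = [1.0, -2.0]  # parameters after training (assumed)

# Evaluate the loss at evenly spaced points on the line between the two.
alphas = [i / 10 for i in range(11)]
curve = [loss(interpolate(theta_init, theta_final, a)) for a in alphas]
```

Plotting `curve` against `alphas` gives the kind of 1-D slice through the loss landscape that such experiments inspect. With a real network, `loss` would be an evaluation pass over a dataset.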
Convergent Learning: Do different neural networks learn the same representations?

Paper • 1511.07543 • Published Nov 24, 2015 • 2
Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models

Paper • 1909.11299 • Published Sep 25, 2019 • 1

Note Mixout is a regularization technique that improves stability and performance of LLMs on downstream tasks by stochastically mixing parameters of two models. Mixout acts as an L2 regularizer and prevents catastrophic forgetting/divergence.
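
A rough sketch of the stochastic mixing step, on flat lists of parameters. The rescaling `(swapped - p * u) / (1 - p)`, which keeps the result equal to the current parameters in expectation, is written from memory of the method and should be treated as an assumption, as should the function name and signature.

```python
import random

def mixout(current, pretrained, p, rng):
    """Swap each parameter for its pretrained value with probability p,
    then rescale so the result equals `current` in expectation."""
    mixed = []
    for w, u in zip(current, pretrained):
        swapped = u if rng.random() < p else w
        mixed.append((swapped - p * u) / (1 - p))
    return mixed

rng = random.Random(0)
sample = mixout([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], p=0.5, rng=rng)
```

With `p = 0` the parameters pass through unchanged. As `p` grows, more entries are pulled back to the pretrained values on each forward pass.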
Model Fusion via Optimal Transport

Paper • 1910.05653 • Published Oct 12, 2019 • 1

Note A layer-wise fusion algorithm. This allows one-shot knowledge transfer without retraining and outperforms vanilla averaging. It also enables fusing models of different sizes, which facilitates compression and federated learning.
Federated Learning with Matched Averaging

Paper • 2002.06440 • Published Feb 15, 2020 • 2

Note FedMA is a layer-wise federated learning algorithm for CNNs and LSTMs that averages hidden elements with similar feature extraction signatures.
Merging Models with Fisher-Weighted Averaging

Paper • 2111.09832 • Published Nov 18, 2021 • 1

Note Fisher merging is a weighted averaging method for combining NNs. It performs better than standard (unweighted) parameter averaging in model ensembling. It's a cheaper alternative to traditional transfer learning methods.
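
A small sketch of Fisher-weighted averaging on flat parameter lists, assuming a diagonal Fisher estimate is already available for each model. The `fishers` values below are made up for illustration, and `eps` is a hypothetical guard against division by zero.

```python
def fisher_merge(params, fishers, eps=1e-8):
    """Average each parameter across models, weighted by its Fisher value."""
    merged = []
    for j in range(len(params[0])):
        num = sum(f[j] * p[j] for p, f in zip(params, fishers))
        den = sum(f[j] for f in fishers) + eps
        merged.append(num / den)
    return merged

params = [[1.0, 0.0], [3.0, 4.0]]   # two models, two parameters each
fishers = [[1.0, 1.0], [1.0, 3.0]]  # per-parameter importance (assumed)
merged = fisher_merge(params, fishers)
```

When every Fisher value is equal, this reduces to plain unweighted averaging. Here the second parameter leans toward the second model because its Fisher value is three times larger.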
On Cross-Layer Alignment for Model Fusion of Heterogeneous Neural Networks

Paper • 2110.15538 • Published Oct 29, 2021 • 1

Note CLAFusion is a method that fuses neural networks efficiently through cross-layer alignment and layer balancing. It works with networks of different depths.
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

Paper • 2203.05482 • Published Mar 10, 2022 • 6

Note Average weights of multiple fine-tuned models for computer vision. This yields better accuracy and no extra inference cost, and improves robustness to distribution shift.
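
A minimal sketch of a uniform weight average, assuming each fine-tuned model is represented as a dict that maps parameter names to flat lists of floats (a stand-in for a real state dict). All models must share the same keys and shapes.

```python
def uniform_soup(state_dicts):
    """Average matching parameters across several fine-tuned models."""
    n = len(state_dicts)
    soup = {}
    for name in state_dicts[0]:
        columns = zip(*(sd[name] for sd in state_dicts))
        soup[name] = [sum(values) / n for values in columns]
    return soup

model_a = {"w": [1.0, 2.0], "b": [0.0]}
model_b = {"w": [3.0, 4.0], "b": [1.0]}
soup = uniform_soup([model_a, model_b])
```

The result is a single set of weights, which is why inference costs the same as for one model.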
Fusing finetuned models for better pretraining

Paper • 2204.03044 • Published Apr 6, 2022 • 5

Note Fuse multiple fine-tuned models by averaging their weights. It creates a better base model for future target tasks, showing better results than using a pretrained model. This was published almost at the same time as model soups, but in this paper the approach is to have a generalizable base model that is then fine-tuned on different target tasks.
Diverse Weight Averaging for Out-of-Distribution Generalization

Paper • 2205.09739 • Published May 19, 2022 • 1
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

Paper • 2208.03306 • Published Aug 5, 2022 • 2

Note BTM is an algorithm to independently train expert LMs on different textual domains.
Git Re-Basin: Merging Models modulo Permutation Symmetries

Paper • 2209.04836 • Published Sep 11, 2022 • 1

Note Proposes algorithms to align weights of independently trained models by permuting units.
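
The sketch below only illustrates the symmetry that this kind of alignment exploits, not the paper's matching algorithms: reordering the hidden units of a small MLP, with the next layer's columns reordered to match, leaves its output unchanged. The tiny network and all helper names are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(W1, b1, W2, x):
    hidden = relu([h + b for h, b in zip(matvec(W1, x), b1)])
    return matvec(W2, hidden)

def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units: rows of W1, entries of b1, columns of W2."""
    W1p = [W1[i] for i in perm]
    b1p = [b1[i] for i in perm]
    W2p = [[row[i] for i in perm] for row in W2]
    return W1p, b1p, W2p

W1 = [[1.0, -2.0], [0.5, 0.25], [-1.0, 3.0]]
b1 = [0.0, 1.0, -0.5]
W2 = [[2.0, -1.0, 0.5]]
x = [1.0, 2.0]

original = mlp(W1, b1, W2, x)
permuted = mlp(*permute_hidden(W1, b1, W2, [2, 0, 1]), x)
```

Both calls compute the same function, so two independently trained models can differ by such a reordering and still be equivalent. Averaging them is only meaningful after the units are matched up.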
lo-fi: distributed fine-tuning without communication

Paper • 2210.11948 • Published Oct 19, 2022 • 1

Note Lo-fi achieves similar or better accuracy compared to standard distributed training with communication when fine-tuning vision transformers on image classification and language models on text, without requiring any communication between nodes during training.
ColD Fusion: Collaborative Descent for Distributed Multitask Finetuning

Paper • 2212.01378 • Published Dec 2, 2022 • 1

Note Iterative method to improve models by fusing together fine-tuned models without sharing datasets. First, you pick a base model. Different contributors download the base model and fine-tune on their own dataset. We then fuse/average all the fine-tunes to improve the base model. We repeat this process to generate a stronger base model with better performance and gains in few-shot learning.
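
The loop described above can be sketched as follows. `fine_tune` is a hypothetical stand-in for a contributor's real training run (it just moves the weights part of the way toward a private target), and fusion is assumed to be a plain average.

```python
def fine_tune(base, target, lr=0.5):
    # Hypothetical stand-in for a contributor's fine-tuning run:
    # move the base parameters part of the way toward a private target.
    return [w + lr * (t - w) for w, t in zip(base, target)]

def fuse(models):
    """Average the contributed models parameter by parameter."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

def cold_fusion(base, private_targets, rounds):
    for _ in range(rounds):
        # Each contributor starts from the current shared base model.
        contributions = [fine_tune(base, t) for t in private_targets]
        # Only weights are shared; the private data never leaves its owner.
        base = fuse(contributions)
    return base

one_round = cold_fusion([0.0], [[2.0], [4.0]], rounds=1)
many_rounds = cold_fusion([0.0], [[2.0], [4.0]], rounds=30)
```

In this toy setup the shared base drifts toward a compromise between the contributors' targets as rounds accumulate.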
Model Ratatouille: Recycling Diverse Models for Out-of-Distribution Generalization

Paper • 2212.10445 • Published Dec 20, 2022 • 2

Note Fine-tuning strategy that reuses fine-tunes of the same base model as initializations for parallel fine-tunings on the same target task. This leads to strong out-of-distribution generalization by leveraging diversity across auxiliary tasks. This technique is more robust to the choice of auxiliary tasks than other reuse strategies.
Backward Compatibility During Data Updates by Weight Interpolation

Paper • 2301.10546 • Published Jan 25, 2023 • 2

Note Improves backward compatibility of models when re-finetuning on a bigger dataset.
ZipIt! Merging Models from Different Tasks without Training

Paper • 2305.03053 • Published May 4, 2023 • 2
Resolving Interference When Merging Models

Paper • 2306.01708 • Published Jun 2, 2023 • 13

Note A new method, TIES-MERGING (TRIM, ELECT SIGN & MERGE), that merges multiple models into a single multitask model. TRIM addresses interference that can be caused by redundant and conflicting parameters across models. Incorrect signs at top parameters can lead to huge performance drops, so SIGN resolution helps with this.
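
A simplified sketch of the three steps on flat parameter lists, under stated assumptions: task vectors are differences from a shared base model, the sign is elected from the sum of trimmed values, and `keep_fraction` and `scale` are hypothetical parameter names. Ties at the trimming threshold keep slightly more than the requested fraction.

```python
def trim(task_vector, keep_fraction):
    """Zero out all but the largest-magnitude entries."""
    k = max(1, round(len(task_vector) * keep_fraction))
    threshold = sorted((abs(v) for v in task_vector), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in task_vector]

def ties_merge(base, finetuned_models, keep_fraction=0.2, scale=1.0):
    # Task vector = fine-tuned weights minus base weights.
    task_vectors = [[f - b for f, b in zip(m, base)] for m in finetuned_models]
    trimmed = [trim(tv, keep_fraction) for tv in task_vectors]
    merged = []
    for j, b in enumerate(base):
        column = [tv[j] for tv in trimmed]
        # Elect the sign with the larger total magnitude.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # Merge only the values that agree with the elected sign.
        agreeing = [v for v in column if v * sign > 0]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + scale * delta)
    return merged

base = [0.0, 0.0, 0.0, 0.0]
model_a = [4.0, 0.1, -3.0, 0.0]
model_b = [2.0, -0.2, 5.0, 0.0]
merged = ties_merge(base, [model_a, model_b], keep_fraction=0.5)
```

In the third position the two models disagree in sign. The positive side wins the election, so only `model_b`'s value is merged there instead of the two values cancelling out.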
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

Paper • 2306.04488 • Published Jun 7, 2023 • 2

Note This is a technique to align LLMs using human preferences by interpolating weights fine-tuned on different proxy rewards.
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition

Paper • 2307.13269 • Published Jul 25, 2023 • 31
Unified Model for Image, Video, Audio and Language Tasks

Paper • 2307.16184 • Published Jul 30, 2023 • 14

Note UnIVAL proposes model merging across different modalities (image, video, audio-text tasks).
Model Merging by Uncertainty-Based Gradient Matching

Paper • 2310.12808 • Published Oct 19, 2023 • 6

Note Why model merging works, when it can fail, and how it can be improved by unifying many existing merging schemes.
Averaging Weights Leads to Wider Optima and Better Generalization

Paper • 1803.05407 • Published Mar 14, 2018 • 2
WARM: On the Benefits of Weight Averaged Reward Models

Paper • 2401.12187 • Published Jan 22 • 18
Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging

Paper • 2209.14981 • Published Sep 29, 2022
Early Weight Averaging meets High Learning Rates for LLM Pre-training

Paper • 2306.03241 • Published Jun 5, 2023 • 2
Arcee's MergeKit: A Toolkit for Merging Large Language Models

Paper • 2403.13257 • Published Mar 20 • 20
Evolutionary Optimization of Model Merging Recipes

Paper • 2403.13187 • Published Mar 19 • 50
Editing Models with Task Arithmetic

Paper • 2212.04089 • Published Dec 8, 2022 • 6
Merging Improves Self-Critique Against Jailbreak Attacks

Paper • 2406.07188 • Published Jun 11 • 3
