MoCLE: Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning

1Southern University of Science and Technology, 2Hong Kong University of Science and Technology, 3Huawei Noah's Ark Lab, 4Peng Cheng Laboratory
(*Equal contribution. Corresponding author.)

🔥 First MLLM with MoE for instruction customization and generalization!

Abstract

Instruction tuning of Large Vision-language Models has revolutionized the development of versatile models with zero-shot generalization across a wide range of downstream vision-language tasks. However, the diversity of training tasks from different sources and formats leads to inevitable task conflicts, where different tasks compete for the same set of model parameters, resulting in sub-optimal instruction-following abilities.

To address this, we propose the Mixture of Cluster-conditional LoRA Experts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to activate task-customized model parameters based on instruction clusters. A separate universal expert is further incorporated to improve the generalization of MoCLE to novel instructions. Extensive experiments on 11 zero-shot tasks demonstrate the effectiveness of MoCLE.

Task Conflicts in MLLM Instruction Tuning

Only 2 out of 7 tasks benefit from instruction tuning on all the data combined, while dedicated task experts perform better on the other 5 tasks (i.e., Flickr30K, GQA, HM, SciQA and IconQA).

Cluster-conditional LoRA MoE w/ Universal Expert

We cluster the training instruction data into 64 groups, and expert routing is decided by the cluster embedding of each input. At each layer of the LLM, the input tokens are handled by three modules: the universal expert (promoting instruction generalization), the activated LoRA expert (avoiding task conflicts), and the frozen linear module of the LLM, as sketched below.
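The sketch below illustrates one such layer, assuming PyTorch. It is not the official implementation: the module and parameter names (`ClusterLoRAMoE`, `num_experts`, `rank`, `d_cluster`) are illustrative, and the exact way the routed and universal experts are weighted is an assumption; only the high-level structure (frozen base linear, cluster-conditional top-1 LoRA expert, always-on universal expert) follows the description above.

```python
# Minimal sketch of a cluster-conditional LoRA-MoE layer (not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """A single low-rank adapter: x -> B(A(x)), with B initialized to zero."""
    def __init__(self, d_in, d_out, rank=16):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)

    def forward(self, x):
        return self.B(self.A(x))


class ClusterLoRAMoE(nn.Module):
    """Frozen base linear + top-1 cluster-routed LoRA expert + universal expert."""
    def __init__(self, base_linear: nn.Linear, num_experts=4, rank=16, d_cluster=768):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():          # keep the LLM weights frozen
            p.requires_grad_(False)
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.experts = nn.ModuleList(
            LoRAExpert(d_in, d_out, rank) for _ in range(num_experts)
        )
        self.universal = LoRAExpert(d_in, d_out, rank)   # always-activated universal expert
        self.router = nn.Linear(d_cluster, num_experts)  # routes on the instruction's cluster embedding

    def forward(self, x, cluster_emb):
        # x:           (batch, seq, d_in)   token hidden states
        # cluster_emb: (batch, d_cluster)   embedding of the cluster the instruction belongs to
        scores = F.softmax(self.router(cluster_emb), dim=-1)   # (batch, num_experts)
        gate, idx = scores.max(dim=-1)                         # top-1 expert per sample
        routed = torch.stack(
            [self.experts[i](x[b]) for b, i in enumerate(idx.tolist())]
        )
        # Assumed combination rule: routed expert weighted by its gate value,
        # universal expert by the remainder, added to the frozen base output.
        g = gate.view(-1, 1, 1)
        return self.base(x) + g * routed + (1.0 - g) * self.universal(x)
```

As a usage note, such a layer would replace a target linear layer of the LLM (e.g. `ClusterLoRAMoE(nn.Linear(4096, 4096))`), with the cluster embedding computed once per instruction (the 64 clusters above) and shared by all tokens and layers; since routing depends on the cluster rather than on individual tokens, every token of an instruction is served by the same expert.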

Expert Task Specialization

(a): MoCLE exhibits task specialization across different experts.
(b): A naive sentence-level MoE shows near-uniform routing decisions.

Qualitative Comparison

(Left) MoCLE demonstrates better OCR abilities. (Middle & Right) InstructBLIP is overwhelmed by the image captioning tasks (giving short responses), while our MoCLE better follows users' instructions.

BibTeX

@article{gou2023mixture,
      title={Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning},
      author={Gou, Yunhao and Liu, Zhili and Chen, Kai and Hong, Lanqing and Xu, Hang and Li, Aoxue and Yeung, Dit-Yan and Kwok, James T and Zhang, Yu},
      journal={arXiv preprint arXiv:2312.12379},
      year={2023}
}