Instruction tuning of Large Vision-language Models (LVLMs) has revolutionized the development of versatile models with zero-shot generalization across a wide range of downstream vision-language tasks. However, the diversity of training tasks from different sources and formats inevitably leads to task conflicts, where different tasks compete for the same set of model parameters, resulting in sub-optimal instruction-following abilities.
To address this, we propose the Mixture of Cluster-conditional LoRA Experts (MoCLE), a novel Mixture-of-Experts (MoE) architecture that activates task-customized model parameters based on instruction clusters. A separate universal expert is further incorporated to improve the generalization of MoCLE to novel instructions. Extensive experiments on 11 zero-shot tasks demonstrate the effectiveness of MoCLE.
Only 2 out of 7 tasks benefit from instruction tuning on all the data, while task-specific experts perform better on the other 5 tasks (i.e., Flickr30K, GQA, HM, SciQA and IconQA).
We cluster the training instruction data into 64 groups, and expert routing is determined by the cluster embedding of each input. At each layer of the LLM, the input tokens are handled by 3 modules: the universal expert (promoting instruction generalization), the activated LoRA expert (avoiding task conflicts), and the original linear module of the LLM, as sketched below.
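The PyTorch sketch below illustrates how such a layer can be assembled: a frozen linear module of the LLM, a pool of LoRA experts selected by a gate conditioned on the cluster embedding, and an always-active universal expert. The class names (LoRAExpert, ClusterLoRAMoELayer), the expert count, and the simple top-1 gate are our own simplifications for illustration, not the released implementation.

# Minimal sketch of a cluster-conditional MoE-LoRA layer (hypothetical names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """Low-rank adapter B(A(x)) with rank r << d; contributes zero at init."""
    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)

    def forward(self, x):
        return self.B(self.A(x))

class ClusterLoRAMoELayer(nn.Module):
    """Frozen LLM linear module + universal expert + cluster-routed LoRA experts."""
    def __init__(self, base_linear, num_experts=4, cluster_dim=64, rank=8):
        super().__init__()
        self.base = base_linear  # original linear module of the LLM, kept frozen
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.experts = nn.ModuleList(
            [LoRAExpert(d_in, d_out, rank) for _ in range(num_experts)]
        )
        self.universal = LoRAExpert(d_in, d_out, rank)     # always active
        self.router = nn.Linear(cluster_dim, num_experts)  # gate on the cluster embedding

    def forward(self, x, cluster_emb):
        # x: (batch, seq, d_in); cluster_emb: (batch, cluster_dim), the embedding
        # of the instruction's cluster, shared by all tokens of the same input.
        gate = F.softmax(self.router(cluster_emb), dim=-1)  # (batch, num_experts)
        top1 = gate.argmax(dim=-1)                          # one expert per input
        routed = torch.stack(
            [self.experts[i](x[b]) for b, i in enumerate(top1.tolist())]
        )
        return self.base(x) + routed + self.universal(x)

# Usage: wrap a linear layer of the LLM and feed the cluster embedding of each
# instruction (obtained offline by clustering the instruction data into groups).
layer = ClusterLoRAMoELayer(nn.Linear(512, 512), num_experts=4, cluster_dim=64)
out = layer(torch.randn(2, 16, 512), cluster_emb=torch.randn(2, 64))

Because the gate depends on the instruction's cluster rather than on individual tokens, all tokens of one input share the same expert, which is what keeps conflicting tasks from overwriting each other's parameters.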
(a): MoCLE exhibits task specialization across different experts.
(b): Naive Sentence MoE shows uniform routing decisions.
(Left) MoCLE demonstrates better OCR abilities. (Middle & Right) InstructBLIP is overwhelmed by the image captioning tasks (giving short responses), while our MoCLE better follows users' instructions.
@article{gou2023mixture,
title={Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning},
author={Gou, Yunhao and Liu, Zhili and Chen, Kai and Hong, Lanqing and Xu, Hang and Li, Aoxue and Yeung, Dit-Yan and Kwok, James T and Zhang, Yu},
journal={arXiv preprint arXiv:2312.12379},
year={2023}
}