ECSO: Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

1Southern University of Science and Technology, 2Hong Kong University of Science and Technology, 3Huawei Noah's Ark Lab, 4Peng Cheng Laboratory
(*Equal contribution. Corresponding author.)

🔥1. Make MLLMs safe with neither training nor any external models!
🔥2. A free data engine for aligning an MLLM on its own, with no extra human intervention!

Abstract

Multimodal large language models (MLLMs) have shown impressive reasoning abilities. However, they are also more vulnerable to jailbreak attacks than their LLM predecessors. We observe that although MLLMs can still detect unsafe responses, the safety mechanisms of their pre-aligned LLMs are easily bypassed once image features are introduced.

To construct safe MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protection approach that exploits the inherent safety awareness of MLLMs and generates safer responses by adaptively transforming unsafe images into text, thereby activating the intrinsic safety mechanism of the pre-aligned LLMs inside MLLMs. Experiments with five state-of-the-art (SOTA) MLLMs demonstrate that ECSO significantly enhances model safety (e.g., a 37.6% improvement on MM-SafetyBench (SD+OCR) and 71.3% on VLSafe for LLaVA-1.5-7B), while consistently maintaining utility results on common MLLM benchmarks.

Furthermore, we demonstrate that ECSO can be used as a data engine to generate supervised-finetuning (SFT) data for the alignment of MLLMs without extra human intervention.

Safety persists in MLLMs without images

(left) MLLMs are vulnerable to malicious questions when queried with images. However, when the images are excluded, MLLMs become safe again.
(right) Comparison of harmless rates (%) of model responses to questions in VLSafe-examine, with and without images, across five state-of-the-art MLLMs.

MLLMs are Aware of Unsafe Responses

(left) Although vulnerable to malicious questions, MLLMs are aware that their own responses are unsafe.
(right) Accuracy of MLLMs (with and without images) in judging whether their own responses are safe.

ECSO Overview

Overview of ECSO. Step 1: User queries are answered as usual. Step 2: The MLLM is prompted to judge whether its initial response is safe. Safe answers are returned, while unsafe ones proceed to Steps 3 and 4. Step 3: Images of unsafe queries are converted into text via query-aware image-to-text transformation. Step 4: Malicious content in both the images and the user queries is now represented as plain text, which the pre-aligned LLMs inside MLLMs can handle to generate safe responses.
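The four steps can be wrapped around any MLLM at inference time. Below is a minimal sketch in Python, assuming a hypothetical `mllm.generate(text, image=None)` interface; the prompt strings are illustrative placeholders, not the exact prompts used in the paper.

```python
# Minimal sketch of the ECSO inference pipeline (illustrative prompts only).
# `mllm.generate(text, image=None)` is a hypothetical MLLM interface.

def ecso_respond(mllm, query, image):
    # Step 1: answer the query as usual (image + text).
    answer = mllm.generate(query, image=image)

    # Step 2: let the MLLM judge whether its own answer is unsafe.
    judge_prompt = (
        "Is the following response harmful, unethical, or unsafe? "
        f"Answer yes or no.\nResponse: {answer}"
    )
    verdict = mllm.generate(judge_prompt, image=image)
    if "yes" not in verdict.lower():
        return answer  # safe answers are returned directly

    # Step 3: query-aware image-to-text transformation ("eyes closed"):
    # describe only the image content relevant to the query.
    caption_prompt = f"Describe the content of the image that is relevant to: {query}"
    caption = mllm.generate(caption_prompt, image=image)

    # Step 4: regenerate without the image, so the pre-aligned LLM's
    # text-only safety mechanism can take effect.
    text_only_prompt = f"Image content: {caption}\nQuestion: {query}"
    return mllm.generate(text_only_prompt, image=None)
```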

ECSO Examples

🔥Build your own demo by following the guidelines here

More Examples

ECSO Boosts MLLM Safety Across Models

Harmless rates on MM-SafetyBench with LLaVA-1.5-7B.
Direct: MLLMs' responses when directly prompted to answer.
ECSO: MLLMs' responses using ECSO.

Harmless rates on MM-SafetyBench (SD+OCR) for ShareGPT4V-7B, mPLUG-Owl2-7B, Qwen-VL-Chat, and InternLM-XComposer-7B. Blue and orange shades represent the harmless rates when querying MLLMs directly and with our proposed ECSO, respectively.

ECSO Maintains MLLMs' Utility

Utility scores of MLLMs on MME-P (Perception), MME-C (Cognition), MM-Vet, and MMBench. The safety improvement of ECSO comes without sacrificing utility performance.

ECSO as Data Engine for Safety Alignment

Comparisons of harmless rates on MM-SafetyBench (SD+OCR) across different finetuned models.
ECSO: the original LLaVA-1.5-7B equipped with training-free ECSO.
ECSO_VLGuard: directly prompting LLaVA-1.5-7B after finetuning it on data generated by ECSO from the VLGuard queries, together with utility data.
VLGuard: directly prompting LLaVA-1.5-7B after finetuning it on the ground-truth VLGuard data, together with utility data.
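As a rough illustration of the data-engine workflow (an assumption about the setup, not the paper's exact pipeline), the sketch below reuses the hypothetical `ecso_respond` helper from the overview sketch to turn a set of unsafe query-image pairs into SFT targets without human annotation.

```python
# Sketch: using ECSO as a data engine for safety-alignment SFT data.
# Reuses the hypothetical `ecso_respond` helper from the pipeline sketch above.

def build_sft_dataset(mllm, unsafe_samples):
    """unsafe_samples: iterable of (query, image) pairs, e.g. VLGuard-style queries."""
    sft_data = []
    for query, image in unsafe_samples:
        safe_answer = ecso_respond(mllm, query, image)
        # The MLLM's own safe response becomes the supervision target.
        sft_data.append({"image": image, "query": query, "response": safe_answer})
    return sft_data
```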

BibTeX

@article{gou2024eyes,
      title={Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation},
      author={Gou, Yunhao and Chen, Kai and Liu, Zhili and Hong, Lanqing and Xu, Hang and Li, Zhenguo and Yeung, Dit-Yan and Kwok, James T and Zhang, Yu},
      journal={arXiv preprint arXiv:2403.09572},
      year={2024}
    }