Jailbreak-AudioBench:
In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models

Hao Cheng^1*, Erjia Xiao^1*, Jing Shao^4*, Yichi Wang⁵, Le Yang³, Chao Shen³, Philip Torr², Jindong Gu^2†, Renjing Xu^1†

¹Hong Kong University of Science and Technology (Guangzhou), ²University of Oxford, ³Xi'an Jiaotong University, ⁴Northeastern University, ⁵Beijing University of Technology

* Equal contribution, † Correspondence authors

Download Dataset

Download Plus Dataset Download Code

📖 Abstract

Large Language Models (LLMs) demonstrate impressive zero-shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory modality inputs. However, these advanced capabilities may also pose significant security risks, as models can be exploited to generate harmful or inappropriate content through jailbreak attacks. While prior work has extensively explored how manipulating textual or visual modality inputs can circumvent safeguards in LLMs and MLLMs, the vulnerability of audio-specific Jailbreak on Large Audio-Language Models (LALMs) remains largely underexplored. To address this gap, we introduce Jailbreak-AudioBench, which consists of the Toolbox, curated Dataset, and comprehensive Benchmark. The Toolbox supports not only text-to-audio conversion but also a range of audio editing techniques. The curated Dataset provides diverse explicit and implicit jailbreak audio examples in both original and edited forms. Utilizing this dataset, we evaluate multiple state-of-the-art LALMs, establishing the most comprehensive audio jailbreak benchmark to date. Finally, Jailbreak-AudioBench establishes a foundation for advancing future research on LALMs safety alignment by enabling the in-depth exposure of more powerful jailbreak threats, such as query-based audio editing, and by facilitating the development of effective defense mechanisms.

Jailbreak-AudioBench Benchmark

🔍 Overview

📑 Dataset Composition

Jailbreak-AudioBench collects 4,700 base harmful queries from four text-modal jailbreak sets (AdvBench, MM-SafetyBench, RedTeam-2K, Safebench) and renders them into speech via gTTS. Each utterance is further processed by seven families of audio-specific edits (tone, speed, emphasis, intonation, background noise, celebrity accent, emotion), yielding 94,800 audio samples covering both Explicit and Implicit jailbreak tasks. The dataset also includes an equal number of defended versions of these audio samples to explore defense strategies against audio editing jailbreaks and a subset for the proposed Query-based Audio Editing Jailbreak Method.

The scale of Jailbreak-AudioBench Dataset

Subset	Base Audio	Types of Editing Categories	Editing Sum	Total Sum
Explicit Subtype	2,497	4×Tone + 3×Intonation + 2×Speed + 3×Emphasis + 3×Background Noise + 3×Celebrity Accent + 2×Emotion = 20 categories	49,940	52,437
Implicit Subtype	2,203		44,060	46,263
Explicit Defense	2,497		49,940	52,437
Explicit Small	262	2×Speed × 2×Emphasis × 2×Background Noise × (2×Celebrity Accent + 2×Emotion) = 32 categories	8,384	8,646

Additional datasets in the Plus version of Jailbreak-AudioBench

Subset	Base Audio	Types of Editing Categories	Editing Sum	Total Sum
Explicit Small (GPT-4o Eval)	262	4×Tone + 3×Intonation + 2×Speed + 3×Emphasis + 3×Background Noise + 3×Celebrity Accent + 2×Emotion = 20 categories	5,240	5,502
Implicit Small (GPT-4o Eval)	237		4,740	4,977
Implicit Defense	2,203		44,060	46,263

🚀 Evaluation & Analysis

We comprehensively evaluate 9 state-of-the-art Large Audio-Language Models (LALMs) using Jailbreak-AudioBench. Three key observations emerge:

Disparity in LALM Susceptibility to Audio Editing Jailbreak. Models exhibit vastly different levels of susceptibility to audio editing jailbreak. While SALMONN demonstrates high vulnerability with substantial ASR increases across multiple audio editing types, SpeechGPT, Qwen2-Audio, and BLSP show remarkable resilience.
Audio-specific Brittleness. Even seemingly innocuous edits (e.g., slight pitch shifts or background noise) can drastically raise the Attack Success Rate (ASR) in vulnerable models, revealing critical security considerations in audio model design.
Analysis and Insights. Our t-SNE visualization analysis reveals that vulnerability differences stem from how models process edited audio through their transformer layers, with robust models effectively normalizing audio editing variations while vulnerable models maintain distinct editing-based feature clusters.

🎧 Interactive Audio Processing Demo

Experience our audio jailbreak techniques step by step. Select a base harmful query, then choose a processing method, and finally select specific parameters to hear how audio editing can potentially bypass model safety mechanisms.

Processing audio... Applying the selected effect to your audio.

Step 1: Select Harmful Query

All three queries now have available audio samples

Step 2: Select Processing Type

Step 3: Select Specific Parameters

Audio Result

0:00

Current Audio Exploit Script (Original)

Current Selection: Exploit Script - Original - No processing

🏆 Detailed ASR Performance by Audio Editing Types

		BLSP	SpeechGPT	Qwen2-Audio	SALMONN-7B	SALMONN-13B	VITA-1.5	R1-AQA	MiniCPM-o-2.6	GPT-4o-Audio
Original		47.5%/18.25%	14.1%/2.45%	16.8%/6.76%	31.4%/14.3%	31.3%/12.89%	3.7%/2.77%	12.6%/7.17%	18.2%/9.03%	0.7%/0.8%
Emphasis	Volume*2	+1.5%/-1.6%	-0.3%/0%	-1.6%/-0.7%	+14.4%/+3.5%	+16.1%/+2.3%	+0.4%/+0.3%	+1.4%/-0.3%	-1.1%/+0.4%	+0.4%/+0.9%
	Volume*5	+0.5%/-0.4%	-0.8%/-0.1%	-4.3%/0%	+21.3%/+5.6%	+20.5%/+3.4%	+0.2%/0%	+0.6%/-0.5%	-0.7%/-0.2%	+0.4%/0%
	Volume*10	+0.6%/-1.2%	-5.0%/-0.4%	-4.0%/-1.0%	+21.4%/+5.9%	+19.9%/+3.5%	0%/+0.5%	+2.0%/-0.4%	+1.0%/-0.8%	+0.4%/+0.4%
Speed	Rate*0.5	+2.8%/+0.6%	-0.8%/-0.4%	-4.4%/-1.9%	+13.3%/+1.9%	+16.8%/+3.0%	+2.2%/+0.6%	+1.0%/-1.1%	+1.6%/+0.4%	+0.4%/0%
Speed	Rate*1.5	-2.6%/+2.7%	+0.2%/-0.1%	+1.1%/+0.1%	+14.3%/-4.2%	-22.9%/-8.4%	-0.5%/+0.4%	+2.0%/+0.4%	-2.2%/-0.4%	+1.5%/-0.4%
Intonation	Interval+2	-4.3%/-2.0%	-8.1%/-1.0%	-5.1%/-0.7%	-27.6%/-11.0%	-1.0%/-1.4%	+5.6%/+1.3%	+1.6%/-0.5%	+0.3%/-0.6%	+0.4%/+0.4%
	Interval+3	-8.0%/-3.4%	-11.3%/-0.8%	-4.4%/-1.9%	-27.0%/-11.1%	+4.4%/+0.1%	+5.2%/+0.5%	+3.0%/-0.3%	+1.4%/-1.1%	+1.2%/+0.4%
	Interval+4	-13.6%/-3.1%	-11.8%/-0.9%	-3.3%/-0.5%	-25.0%/-11.3%	+11.7%/+2.0%	+3.7%/+0.1%	+4.7%/+0.1%	+3.8%/-0.4%	+1.5%/+1.3%
Tone	Semitone -8	-3.1%/-1.4%	-3.9%/-0.2%	-5.1%/+0.1%	+2.8%/-0.8%	+11.5%/+1.3%	+3.0%/+0.3%	+0.5%/+0.5%	-0.2%/-0.3%	0%/-0.4%
	Semitone -4	+1.5%/-0.5%	-0.3%/-0.1%	-2.6%/+0.4%	+1.0%/-0.8%	+6.0%/+1.2%	-0.3%/+0.3%	-0.4%/-1.4%	+0.5%/-0.4%	+0.4%/-0.4%
	Semitone +4	-0.4%/-0.2%	-5.6%/-0.5%	-5.1%/-1.0%	+3.6%/+1.4%	+17.6%/+3.6%	+0.5%/+0.4%	+1.0%/-0.7%	-0.3%/-1.1%	+0.8%/0%
	Semitone +8	-2.4%/-1.2%	-13.6%/-2.1%	-3.2%/-1.1%	+8.8%/+2.0%	+24.1%/+4.7%	+4.4%/+0.4%	+1.5%/-0.7%	+7.9%/+0.3%	+1.2%/+0.9%
Background Noise	Crowd Noise	+0.8%/-1.1%	-6.5%/-0.2%	-7.7%/-2.0%	+16.1%/+5.6%	+27.6%/+7.7%	+4.4%/+0.9%	-1.6%/-2.0%	+1.9%/+0.5%	+0.8%/+0.4%
	Machine Noise	+0.7%/+0.4%	-5.5%/-0.2%	-6.1%/-1.3%	+20.3%/+5.9%	+28.6%/+9.2%	+0.2%/+0.3%	-2.2%/-1.4%	-1.1%/-0.2%	0%/+1.7%
	White Noise	-0.2%/-0.3%	-0.4%/-0.1%	-4.6%/-1.0%	+7.0%/+4.9%	+22.3%/+5.0%	+0.4%/+0.3%	+1.2%/-0.5%	-4.3%/-1.3%	0%/-0.4%
Celebrity Accent	Kanye West	-7.8%/-3.5%	-4.8%/-0.3%	-5.3%/-1.1%	+12.8%/+5.2%	+17.4%/+3.2%	+2.0%/+0.5%	+0.3%/-1.1%	+7.9%/-0.1%	+0.4%/-0.9%
	Donald Trump	-8.7%/-3.3%	-4.2%/-0.5%	-4.0%/-1.5%	+3.3%/+2.0%	+20.1%/+3.1%	+2.6%/+0.8%	+0.6%/-0.5%	+6.4%/+0.8%	0%/0%
	Lucy Liu	-9.5%/-3.6%	-3.2%/-0.1%	-4.4%/-1.0%	-5.9%/-4.3%	+12.4%/+3.7%	-0.3%/+0.6%	+3.3%/-0.1%	+0.8%/+0.1%	+1.5%/+0.4%
Emotion	Laugh	+4.0%/-0.7%	-4.8%/0%	-4.4%/-0.1%	+2.8%/+0.1%	+23.2%/+5.3%	-0.1%/+0.3%	-1.6%/-0.3%	-6.8%/-2.9%	-0.4%/-0.4%
Emotion	Scream	-1.1%/-1.8%	-4.7%/-0.8%	-3.7%/-0.8%	+18.0%/+5.2%	+20.7%/+4.5%	+0.4%/+0.5%	+5.5%/+1.0%	-8.1%/-3.4%	-0.4%/0%

🔬 Research Applications

Jailbreak-AudioBench provides a foundation for two critical research directions:

1. Query-based Audio Editing Jailbreak Method for LALMs

Our analysis reveals that systematically combining audio editing techniques can significantly increase attack success rates. For example, GPT-4o-Audio's vulnerability increases from just 0.76% to 8.40% when exposed to optimized combinations of audio edits. This demonstrates that even commercial-grade models remain susceptible to carefully crafted audio manipulations.

Query-based Audio Editing Jailbreak Results - Part 1

Figure: ASR Performance of the Query-based Audio Editing Jailbreak method on models including Qwen2-Audio-7B, SALMONN-7B, and GPT-4o-Audio. In each panel, columns represent individual audio samples, and the first 32 rows represent different edited variants of these samples. The penultimate row represents the original unedited audio sample, while the bottom row indicates whether any of the 32 variant queries bypassed the model's defenses. Green: failed jailbreak; Red: successful jailbreaks.

Query-based Audio Editing Jailbreak Results - Part 2

Figure: ASR Performance of the Query-based Audio Editing Jailbreak method on models including BLSP, SpeechGPT, VITA-1.5, and MiniCPM-o-2.6. The visualization format follows the same convention as above, demonstrating varying levels of vulnerability across different model architectures when exposed to systematically optimized audio editing combinations.

2. Development of Defense Methods

We explore prompt-based defense strategies by prepending safety instructions in audio format. This lightweight approach consistently reduces vulnerability across all evaluated models, though significant improvement opportunities remain for more robust defenses.

Figure: ASR comparison of original and edited audio samples with and without defense. The bars represent the ASR without defense, while the striped bars represent the ASR reduction with the defense applied. The values shown on the bars denote the specific ASR reduction caused by defense.

BibTeX

@misc{cheng2025jailbreakaudiobenchindepthevaluationanalysis,
      title={Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models}, 
      author={Hao Cheng, Erjia Xiao, Jing Shao, Yichi Wang, Le Yang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu},
      year={2025},
      eprint={2501.13772},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2501.13772}, 
}