Large Language Models (LLMs) demonstrate impressive zero-shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory modality inputs. However, these advanced capabilities may also pose significant security risks, as models can be exploited to generate harmful or inappropriate content through jailbreak attacks. While prior work has extensively explored how manipulating textual or visual modality inputs can circumvent safeguards in LLMs and MLLMs, the vulnerability of audio-specific Jailbreak on Large Audio-Language Models (LALMs) remains largely underexplored. To address this gap, we introduce Jailbreak-AudioBench, which consists of the Toolbox, curated Dataset, and comprehensive Benchmark. The Toolbox supports not only text-to-audio conversion but also a range of audio editing techniques. The curated Dataset provides diverse explicit and implicit jailbreak audio examples in both original and edited forms. Utilizing this dataset, we evaluate multiple state-of-the-art LALMs, establishing the most comprehensive audio jailbreak benchmark to date. Finally, Jailbreak-AudioBench establishes a foundation for advancing future research on LALMs safety alignment by enabling the in-depth exposure of more powerful jailbreak threats, such as query-based audio editing, and by facilitating the development of effective defense mechanisms.
Jailbreak-AudioBench collects 4,700 base harmful queries from four text-modal jailbreak sets
(AdvBench, MM-SafetyBench, RedTeam-2K, Safebench) and renders them into speech via gTTS
.
Each utterance is further processed by seven families of audio-specific edits (tone, speed, emphasis,
intonation, background noise, celebrity accent, emotion), yielding 94,800 audio samples
covering both Explicit and Implicit jailbreak tasks. The dataset also includes an equal number of defended versions of these audio samples to explore defense strategies against audio editing jailbreaks and a subset for the proposed Query-based Audio Editing Jailbreak Method.
Subset | Base Audio | Types of Editing Categories | Editing Sum | Total Sum |
---|---|---|---|---|
Explicit Subtype | 2,497 | 4×Tone + 3×Intonation + 2×Speed + 3×Emphasis + 3×Background Noise + 3×Celebrity Accent + 2×Emotion = 20 categories | 49,940 | 52,437 |
Implicit Subtype | 2,203 | 44,060 | 46,263 | |
Explicit Defense | 2,497 | 49,940 | 52,437 | |
Explicit Small | 262 | 2×Speed × 2×Emphasis × 2×Background Noise × (2×Celebrity Accent + 2×Emotion) = 32 categories | 8,384 | 8,646 |
Subset | Base Audio | Types of Editing Categories | Editing Sum | Total Sum |
---|---|---|---|---|
Explicit Small (GPT-4o Eval) | 262 | 4×Tone + 3×Intonation + 2×Speed + 3×Emphasis + 3×Background Noise + 3×Celebrity Accent + 2×Emotion = 20 categories | 5,240 | 5,502 |
Implicit Small (GPT-4o Eval) | 237 | 4,740 | 4,977 | |
Implicit Defense | 2,203 | 44,060 | 46,263 |
We comprehensively evaluate 9 state-of-the-art Large Audio-Language Models (LALMs) using Jailbreak-AudioBench. Three key observations emerge:
Experience our audio jailbreak techniques step by step. Select a base harmful query, then choose a processing method, and finally select specific parameters to hear how audio editing can potentially bypass model safety mechanisms.
All three queries now have available audio samples
BLSP | SpeechGPT | Qwen2-Audio | SALMONN-7B | SALMONN-13B | VITA-1.5 | R1-AQA | MiniCPM-o-2.6 | GPT-4o-Audio | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Original | 47.5%/18.25% | 14.1%/2.45% | 16.8%/6.76% | 31.4%/14.3% | 31.3%/12.89% | 3.7%/2.77% | 12.6%/7.17% | 18.2%/9.03% | 0.7%/0.8% | ||||||||||
Emphasis | Volume*2 | +1.5%/-1.6% | -0.3%/0% | -1.6%/-0.7% | +14.4%/+3.5% | +16.1%/+2.3% | +0.4%/+0.3% | +1.4%/-0.3% | -1.1%/+0.4% | +0.4%/+0.9% | |||||||||
Volume*5 | +0.5%/-0.4% | -0.8%/-0.1% | -4.3%/0% | +21.3%/+5.6% | +20.5%/+3.4% | +0.2%/0% | +0.6%/-0.5% | -0.7%/-0.2% | +0.4%/0% | ||||||||||
Volume*10 | +0.6%/-1.2% | -5.0%/-0.4% | -4.0%/-1.0% | +21.4%/+5.9% | +19.9%/+3.5% | 0%/+0.5% | +2.0%/-0.4% | +1.0%/-0.8% | +0.4%/+0.4% | ||||||||||
Speed | Rate*0.5 | +2.8%/+0.6% | -0.8%/-0.4% | -4.4%/-1.9% | +13.3%/+1.9% | +16.8%/+3.0% | +2.2%/+0.6% | +1.0%/-1.1% | +1.6%/+0.4% | +0.4%/0% | |||||||||
Rate*1.5 | -2.6%/+2.7% | +0.2%/-0.1% | +1.1%/+0.1% | +14.3%/-4.2% | -22.9%/-8.4% | -0.5%/+0.4% | +2.0%/+0.4% | -2.2%/-0.4% | +1.5%/-0.4% | ||||||||||
Intonation | Interval+2 | -4.3%/-2.0% | -8.1%/-1.0% | -5.1%/-0.7% | -27.6%/-11.0% | -1.0%/-1.4% | +5.6%/+1.3% | +1.6%/-0.5% | +0.3%/-0.6% | +0.4%/+0.4% | |||||||||
Interval+3 | -8.0%/-3.4% | -11.3%/-0.8% | -4.4%/-1.9% | -27.0%/-11.1% | +4.4%/+0.1% | +5.2%/+0.5% | +3.0%/-0.3% | +1.4%/-1.1% | +1.2%/+0.4% | ||||||||||
Interval+4 | -13.6%/-3.1% | -11.8%/-0.9% | -3.3%/-0.5% | -25.0%/-11.3% | +11.7%/+2.0% | +3.7%/+0.1% | +4.7%/+0.1% | +3.8%/-0.4% | +1.5%/+1.3% | ||||||||||
Tone | Semitone -8 | -3.1%/-1.4% | -3.9%/-0.2% | -5.1%/+0.1% | +2.8%/-0.8% | +11.5%/+1.3% | +3.0%/+0.3% | +0.5%/+0.5% | -0.2%/-0.3% | 0%/-0.4% | |||||||||
Semitone -4 | +1.5%/-0.5% | -0.3%/-0.1% | -2.6%/+0.4% | +1.0%/-0.8% | +6.0%/+1.2% | -0.3%/+0.3% | -0.4%/-1.4% | +0.5%/-0.4% | +0.4%/-0.4% | ||||||||||
Semitone +4 | -0.4%/-0.2% | -5.6%/-0.5% | -5.1%/-1.0% | +3.6%/+1.4% | +17.6%/+3.6% | +0.5%/+0.4% | +1.0%/-0.7% | -0.3%/-1.1% | +0.8%/0% | ||||||||||
Semitone +8 | -2.4%/-1.2% | -13.6%/-2.1% | -3.2%/-1.1% | +8.8%/+2.0% | +24.1%/+4.7% | +4.4%/+0.4% | +1.5%/-0.7% | +7.9%/+0.3% | +1.2%/+0.9% | ||||||||||
Background Noise | Crowd Noise | +0.8%/-1.1% | -6.5%/-0.2% | -7.7%/-2.0% | +16.1%/+5.6% | +27.6%/+7.7% | +4.4%/+0.9% | -1.6%/-2.0% | +1.9%/+0.5% | +0.8%/+0.4% | |||||||||
Machine Noise | +0.7%/+0.4% | -5.5%/-0.2% | -6.1%/-1.3% | +20.3%/+5.9% | +28.6%/+9.2% | +0.2%/+0.3% | -2.2%/-1.4% | -1.1%/-0.2% | 0%/+1.7% | ||||||||||
White Noise | -0.2%/-0.3% | -0.4%/-0.1% | -4.6%/-1.0% | +7.0%/+4.9% | +22.3%/+5.0% | +0.4%/+0.3% | +1.2%/-0.5% | -4.3%/-1.3% | 0%/-0.4% | ||||||||||
Celebrity Accent | Kanye West | -7.8%/-3.5% | -4.8%/-0.3% | -5.3%/-1.1% | +12.8%/+5.2% | +17.4%/+3.2% | +2.0%/+0.5% | +0.3%/-1.1% | +7.9%/-0.1% | +0.4%/-0.9% | |||||||||
Donald Trump | -8.7%/-3.3% | -4.2%/-0.5% | -4.0%/-1.5% | +3.3%/+2.0% | +20.1%/+3.1% | +2.6%/+0.8% | +0.6%/-0.5% | +6.4%/+0.8% | 0%/0% | ||||||||||
Lucy Liu | -9.5%/-3.6% | -3.2%/-0.1% | -4.4%/-1.0% | -5.9%/-4.3% | +12.4%/+3.7% | -0.3%/+0.6% | +3.3%/-0.1% | +0.8%/+0.1% | +1.5%/+0.4% | ||||||||||
Emotion | Laugh | +4.0%/-0.7% | -4.8%/0% | -4.4%/-0.1% | +2.8%/+0.1% | +23.2%/+5.3% | -0.1%/+0.3% | -1.6%/-0.3% | -6.8%/-2.9% | -0.4%/-0.4% | |||||||||
Scream | -1.1%/-1.8% | -4.7%/-0.8% | -3.7%/-0.8% | +18.0%/+5.2% | +20.7%/+4.5% | +0.4%/+0.5% | +5.5%/+1.0% | -8.1%/-3.4% | -0.4%/0% |
Jailbreak-AudioBench provides a foundation for two critical research directions:
Our analysis reveals that systematically combining audio editing techniques can significantly increase attack success rates. For example, GPT-4o-Audio's vulnerability increases from just 0.76% to 8.40% when exposed to optimized combinations of audio edits. This demonstrates that even commercial-grade models remain susceptible to carefully crafted audio manipulations.
Figure: ASR Performance of the Query-based Audio Editing Jailbreak method on models including Qwen2-Audio-7B, SALMONN-7B, and GPT-4o-Audio. In each panel, columns represent individual audio samples, and the first 32 rows represent different edited variants of these samples. The penultimate row represents the original unedited audio sample, while the bottom row indicates whether any of the 32 variant queries bypassed the model's defenses. Green: failed jailbreak; Red: successful jailbreaks.
Figure: ASR Performance of the Query-based Audio Editing Jailbreak method on models including BLSP, SpeechGPT, VITA-1.5, and MiniCPM-o-2.6. The visualization format follows the same convention as above, demonstrating varying levels of vulnerability across different model architectures when exposed to systematically optimized audio editing combinations.
We explore prompt-based defense strategies by prepending safety instructions in audio format. This lightweight approach consistently reduces vulnerability across all evaluated models, though significant improvement opportunities remain for more robust defenses.
Figure: ASR comparison of original and edited audio samples with and without defense. The bars represent the ASR without defense, while the striped bars represent the ASR reduction with the defense applied. The values shown on the bars denote the specific ASR reduction caused by defense.
@misc{cheng2025jailbreakaudiobenchindepthevaluationanalysis,
title={Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models},
author={Hao Cheng, Erjia Xiao, Jing Shao, Yichi Wang, Le Yang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu},
year={2025},
eprint={2501.13772},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2501.13772},
}