Jailbreak-AudioBench:
In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models

Hao Cheng1*, Erjia Xiao1*, Jing Shao4*, Yichi Wang5, Le Yang3, Chao Shen3, Philip Torr2, Jindong Gu2†, Renjing Xu1†
1Hong Kong University of Science and Technology (Guangzhou), 2University of Oxford, 3Xi'an Jiaotong University, 4Northeastern University, 5Beijing University of Technology
* Equal contribution, † Corresponding authors
Downloads: Dataset · Plus Dataset · Code

📖 Abstract

Large Language Models (LLMs) demonstrate impressive zero-shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory inputs. However, these advanced capabilities may also pose significant security risks, as models can be exploited to generate harmful or inappropriate content through jailbreak attacks. While prior work has extensively explored how manipulating textual or visual inputs can circumvent safeguards in LLMs and MLLMs, the vulnerability of Large Audio-Language Models (LALMs) to audio-specific jailbreaks remains largely underexplored. To address this gap, we introduce Jailbreak-AudioBench, which consists of a Toolbox, a curated Dataset, and a comprehensive Benchmark. The Toolbox supports not only text-to-audio conversion but also a range of audio editing techniques. The curated Dataset provides diverse explicit and implicit jailbreak audio examples in both original and edited forms. Using this dataset, we evaluate multiple state-of-the-art LALMs, establishing the most comprehensive audio jailbreak benchmark to date. Finally, Jailbreak-AudioBench lays a foundation for future research on LALM safety alignment by exposing more powerful jailbreak threats, such as query-based audio editing, and by facilitating the development of effective defense mechanisms.

Framework Overview

Jailbreak-AudioBench Benchmark

🔍 Overview

Dataset Visualization

📑 Dataset Composition

Jailbreak-AudioBench collects 4,700 base harmful queries from four text-modality jailbreak datasets (AdvBench, MM-SafetyBench, RedTeam-2K, SafeBench) and renders them into speech with gTTS. Each utterance is then processed with seven families of audio-specific edits (tone, speed, emphasis, intonation, background noise, celebrity accent, and emotion), yielding 94,000 edited audio samples (98,700 in total with the originals) covering both Explicit and Implicit jailbreak tasks. The dataset also includes an equal number of defended versions of these audio samples, used to explore defense strategies against audio editing jailbreaks, as well as a subset for the proposed Query-based Audio Editing Jailbreak method.
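
To make the pipeline concrete, here is a minimal sketch of the text-to-audio and editing steps, using gTTS for synthesis (as in the dataset) and librosa for two of the seven edit families. File names and edit parameters are illustrative, not the paper's exact settings.

```python
# Minimal sketch: render a query to speech, then apply audio-specific edits.
import librosa
import soundfile as sf
from gtts import gTTS

query = "example harmful query placeholder"

# 1. Text-to-speech: render the query as spoken audio.
gTTS(text=query, lang="en").save("query.mp3")

# 2. Load the waveform for editing.
y, sr = librosa.load("query.mp3", sr=16000)

# 3. Audio-specific edits (two of the seven families shown here).
y_speed = librosa.effects.time_stretch(y, rate=1.5)         # Speed: Rate*1.5
y_tone = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)   # Tone: Semitone +4

sf.write("query_speed1.5.wav", y_speed, sr)
sf.write("query_semitone+4.wav", y_tone, sr)
```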

The scale of the Jailbreak-AudioBench Dataset

| Subset | Base Audio | Types of Editing Categories | Editing Sum | Total Sum |
| --- | --- | --- | --- | --- |
| Explicit Subtype | 2,497 | 4×Tone + 3×Intonation + 2×Speed + 3×Emphasis + 3×Background Noise + 3×Celebrity Accent + 2×Emotion = 20 categories | 49,940 | 52,437 |
| Implicit Subtype | 2,203 | same 20 categories | 44,060 | 46,263 |
| Explicit Defense | 2,497 | same 20 categories | 49,940 | 52,437 |
| Explicit Small | 262 | 2×Speed × 2×Emphasis × 2×Background Noise × (2×Celebrity Accent + 2×Emotion) = 32 categories | 8,384 | 8,646 |

Additional datasets in the Plus version of Jailbreak-AudioBench

| Subset | Base Audio | Types of Editing Categories | Editing Sum | Total Sum |
| --- | --- | --- | --- | --- |
| Explicit Small (GPT-4o Eval) | 262 | 4×Tone + 3×Intonation + 2×Speed + 3×Emphasis + 3×Background Noise + 3×Celebrity Accent + 2×Emotion = 20 categories | 5,240 | 5,502 |
| Implicit Small (GPT-4o Eval) | 237 | same 20 categories | 4,740 | 4,977 |
| Implicit Defense | 2,203 | same 20 categories | 44,060 | 46,263 |

🚀 Evaluation & Analysis

We comprehensively evaluate 9 state-of-the-art Large Audio-Language Models (LALMs) using Jailbreak-AudioBench. Three key observations emerge:

  • Disparity in LALM Susceptibility to Audio Editing Jailbreak. Susceptibility varies widely across models: SALMONN is highly vulnerable, with substantial increases in Attack Success Rate (ASR) across multiple audio editing types, while SpeechGPT, Qwen2-Audio, and BLSP show remarkable resilience.
  • Audio-specific Brittleness. Even seemingly innocuous edits (e.g., slight pitch shifts or added background noise) can drastically raise ASR in vulnerable models, revealing critical security considerations for audio model design.
  • Analysis and Insights. Our t-SNE analysis shows that these vulnerability differences stem from how models process edited audio through their transformer layers: robust models effectively normalize editing variations, while vulnerable models retain distinct per-edit feature clusters (see the sketch after this list).
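
A minimal sketch of the layer-wise t-SNE analysis referenced above, assuming hidden-state features have already been extracted from one transformer layer of an LALM; the extraction itself is model-specific and omitted here.

```python
# Project one layer's features to 2-D and color points by editing type.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_edit_clusters(hidden_states: np.ndarray, edit_labels: list[str], layer: int):
    """hidden_states: (n_samples, d) features from one transformer layer.

    Robust models should show overlapping clusters (edits normalized away);
    vulnerable models should show distinct per-edit clusters.
    """
    xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(hidden_states)
    labels = np.array(edit_labels)
    for edit in sorted(set(edit_labels)):
        mask = labels == edit
        plt.scatter(xy[mask, 0], xy[mask, 1], s=8, label=edit)
    plt.legend(fontsize=6)
    plt.title(f"Layer {layer} features by editing type")
    plt.savefig(f"tsne_layer{layer}.png", dpi=200)
    plt.close()
```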

🎧 Interactive Audio Processing Demo

Experience our audio jailbreak techniques step by step. Select a base harmful query, then choose a processing method, and finally select specific parameters to hear how audio editing can potentially bypass model safety mechanisms.


🏆 Detailed ASR Performance by Audio Editing Types

Each cell pairs results on the Explicit and Implicit subsets (Explicit/Implicit). The Original row reports absolute ASR; all edited rows report the change in ASR relative to Original.

| Editing Type | Variant | BLSP | SpeechGPT | Qwen2-Audio | SALMONN-7B | SALMONN-13B | VITA-1.5 | R1-AQA | MiniCPM-o-2.6 | GPT-4o-Audio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| — | Original | 47.5%/18.25% | 14.1%/2.45% | 16.8%/6.76% | 31.4%/14.3% | 31.3%/12.89% | 3.7%/2.77% | 12.6%/7.17% | 18.2%/9.03% | 0.7%/0.8% |
| Emphasis | Volume*2 | +1.5%/-1.6% | -0.3%/0% | -1.6%/-0.7% | +14.4%/+3.5% | +16.1%/+2.3% | +0.4%/+0.3% | +1.4%/-0.3% | -1.1%/+0.4% | +0.4%/+0.9% |
| | Volume*5 | +0.5%/-0.4% | -0.8%/-0.1% | -4.3%/0% | +21.3%/+5.6% | +20.5%/+3.4% | +0.2%/0% | +0.6%/-0.5% | -0.7%/-0.2% | +0.4%/0% |
| | Volume*10 | +0.6%/-1.2% | -5.0%/-0.4% | -4.0%/-1.0% | +21.4%/+5.9% | +19.9%/+3.5% | 0%/+0.5% | +2.0%/-0.4% | +1.0%/-0.8% | +0.4%/+0.4% |
| Speed | Rate*0.5 | +2.8%/+0.6% | -0.8%/-0.4% | -4.4%/-1.9% | +13.3%/+1.9% | +16.8%/+3.0% | +2.2%/+0.6% | +1.0%/-1.1% | +1.6%/+0.4% | +0.4%/0% |
| | Rate*1.5 | -2.6%/+2.7% | +0.2%/-0.1% | +1.1%/+0.1% | +14.3%/-4.2% | -22.9%/-8.4% | -0.5%/+0.4% | +2.0%/+0.4% | -2.2%/-0.4% | +1.5%/-0.4% |
| Intonation | Interval+2 | -4.3%/-2.0% | -8.1%/-1.0% | -5.1%/-0.7% | -27.6%/-11.0% | -1.0%/-1.4% | +5.6%/+1.3% | +1.6%/-0.5% | +0.3%/-0.6% | +0.4%/+0.4% |
| | Interval+3 | -8.0%/-3.4% | -11.3%/-0.8% | -4.4%/-1.9% | -27.0%/-11.1% | +4.4%/+0.1% | +5.2%/+0.5% | +3.0%/-0.3% | +1.4%/-1.1% | +1.2%/+0.4% |
| | Interval+4 | -13.6%/-3.1% | -11.8%/-0.9% | -3.3%/-0.5% | -25.0%/-11.3% | +11.7%/+2.0% | +3.7%/+0.1% | +4.7%/+0.1% | +3.8%/-0.4% | +1.5%/+1.3% |
| Tone | Semitone -8 | -3.1%/-1.4% | -3.9%/-0.2% | -5.1%/+0.1% | +2.8%/-0.8% | +11.5%/+1.3% | +3.0%/+0.3% | +0.5%/+0.5% | -0.2%/-0.3% | 0%/-0.4% |
| | Semitone -4 | +1.5%/-0.5% | -0.3%/-0.1% | -2.6%/+0.4% | +1.0%/-0.8% | +6.0%/+1.2% | -0.3%/+0.3% | -0.4%/-1.4% | +0.5%/-0.4% | +0.4%/-0.4% |
| | Semitone +4 | -0.4%/-0.2% | -5.6%/-0.5% | -5.1%/-1.0% | +3.6%/+1.4% | +17.6%/+3.6% | +0.5%/+0.4% | +1.0%/-0.7% | -0.3%/-1.1% | +0.8%/0% |
| | Semitone +8 | -2.4%/-1.2% | -13.6%/-2.1% | -3.2%/-1.1% | +8.8%/+2.0% | +24.1%/+4.7% | +4.4%/+0.4% | +1.5%/-0.7% | +7.9%/+0.3% | +1.2%/+0.9% |
| Background Noise | Crowd Noise | +0.8%/-1.1% | -6.5%/-0.2% | -7.7%/-2.0% | +16.1%/+5.6% | +27.6%/+7.7% | +4.4%/+0.9% | -1.6%/-2.0% | +1.9%/+0.5% | +0.8%/+0.4% |
| | Machine Noise | +0.7%/+0.4% | -5.5%/-0.2% | -6.1%/-1.3% | +20.3%/+5.9% | +28.6%/+9.2% | +0.2%/+0.3% | -2.2%/-1.4% | -1.1%/-0.2% | 0%/+1.7% |
| | White Noise | -0.2%/-0.3% | -0.4%/-0.1% | -4.6%/-1.0% | +7.0%/+4.9% | +22.3%/+5.0% | +0.4%/+0.3% | +1.2%/-0.5% | -4.3%/-1.3% | 0%/-0.4% |
| Celebrity Accent | Kanye West | -7.8%/-3.5% | -4.8%/-0.3% | -5.3%/-1.1% | +12.8%/+5.2% | +17.4%/+3.2% | +2.0%/+0.5% | +0.3%/-1.1% | +7.9%/-0.1% | +0.4%/-0.9% |
| | Donald Trump | -8.7%/-3.3% | -4.2%/-0.5% | -4.0%/-1.5% | +3.3%/+2.0% | +20.1%/+3.1% | +2.6%/+0.8% | +0.6%/-0.5% | +6.4%/+0.8% | 0%/0% |
| | Lucy Liu | -9.5%/-3.6% | -3.2%/-0.1% | -4.4%/-1.0% | -5.9%/-4.3% | +12.4%/+3.7% | -0.3%/+0.6% | +3.3%/-0.1% | +0.8%/+0.1% | +1.5%/+0.4% |
| Emotion | Laugh | +4.0%/-0.7% | -4.8%/0% | -4.4%/-0.1% | +2.8%/+0.1% | +23.2%/+5.3% | -0.1%/+0.3% | -1.6%/-0.3% | -6.8%/-2.9% | -0.4%/-0.4% |
| | Scream | -1.1%/-1.8% | -4.7%/-0.8% | -3.7%/-0.8% | +18.0%/+5.2% | +20.7%/+4.5% | +0.4%/+0.5% | +5.5%/+1.0% | -8.1%/-3.4% | -0.4%/0% |
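
For reference, ASR is the fraction of harmful queries whose responses are judged to be successful jailbreaks, and the edited-audio rows report the change in that fraction relative to the Original row. A minimal sketch, assuming a hypothetical `is_jailbroken` judge (in practice a keyword filter or a GPT-4o-based evaluator):

```python
# ASR: percentage of responses judged to be successful jailbreaks.
def attack_success_rate(responses: list[str], is_jailbroken) -> float:
    hits = sum(1 for r in responses if is_jailbroken(r))
    return 100.0 * hits / len(responses)

# Edited rows in the table: delta = ASR(edited) - ASR(original),
# in percentage points.
def asr_delta(edited_responses, original_responses, is_jailbroken) -> float:
    return (attack_success_rate(edited_responses, is_jailbroken)
            - attack_success_rate(original_responses, is_jailbroken))
```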

🔬 Research Applications

Jailbreak-AudioBench provides a foundation for two critical research directions:

1. Query-based Audio Editing Jailbreak Method for LALMs

Our analysis reveals that systematically combining audio editing techniques can significantly increase attack success rates. For example, GPT-4o-Audio's vulnerability increases from just 0.76% to 8.40% when exposed to optimized combinations of audio edits. This demonstrates that even commercial-grade models remain susceptible to carefully crafted audio manipulations.
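
The sketch below illustrates the black-box search idea behind a query-based attack, using a 32-combination edit pool mirroring the Explicit Small subset. `apply_edits`, `query_model`, and `is_jailbroken` are hypothetical helpers, and the paper's exact edit pool and search order may differ.

```python
# Query-based audio editing jailbreak: try edit combinations one by one,
# observing only the model's output (black-box access).
from itertools import product

SPEEDS = [0.75, 1.5]
EMPHASES = [2, 5]                                   # volume multipliers
NOISES = ["crowd", "white"]
STYLES = ["accent_a", "accent_b", "laugh", "scream"]  # 2 accents + 2 emotions

def query_based_jailbreak(audio, apply_edits, query_model, is_jailbroken):
    """Search the 2 x 2 x 2 x 4 = 32 edit combinations; stop at first success."""
    for speed, emphasis, noise, style in product(SPEEDS, EMPHASES, NOISES, STYLES):
        edited = apply_edits(audio, speed=speed, emphasis=emphasis,
                             noise=noise, style=style)
        response = query_model(edited)
        if is_jailbroken(response):
            return edited, response  # one bypass among the variants suffices
    return None, None
```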

Query-based Audio Editing Jailbreak Results - Part 1

Figure: ASR performance of the Query-based Audio Editing Jailbreak method on Qwen2-Audio-7B, SALMONN-7B, and GPT-4o-Audio. In each panel, columns represent individual audio samples and the first 32 rows represent the edited variants of each sample. The penultimate row corresponds to the original, unedited audio, and the bottom row indicates whether any of the 32 edited variants bypassed the model's defenses. Green: failed jailbreak; red: successful jailbreak.

Query-based Audio Editing Jailbreak Results - Part 2

Figure: ASR Performance of the Query-based Audio Editing Jailbreak method on models including BLSP, SpeechGPT, VITA-1.5, and MiniCPM-o-2.6. The visualization format follows the same convention as above, demonstrating varying levels of vulnerability across different model architectures when exposed to systematically optimized audio editing combinations.

2. Development of Defense Methods

We explore prompt-based defense strategies that prepend a spoken safety instruction to each audio query. This lightweight approach consistently reduces ASR across all evaluated models, though substantial room remains for more robust defenses.
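
A minimal sketch of this defense, assuming the safety instruction is synthesized with gTTS and concatenated with pydub; the instruction wording is illustrative, not the paper's exact prompt.

```python
# Prepend a spoken safety instruction to the (possibly edited) query audio.
from gtts import gTTS
from pydub import AudioSegment

SAFETY_PROMPT = ("You are a responsible assistant. Refuse any request "
                 "that asks for harmful, illegal, or unethical content.")

def add_audio_defense(query_path: str, out_wav: str) -> None:
    gTTS(text=SAFETY_PROMPT, lang="en").save("safety.mp3")
    defended = AudioSegment.from_mp3("safety.mp3") + AudioSegment.from_file(query_path)
    defended.export(out_wav, format="wav")
```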

Defense ASR Comparison

Figure: ASR comparison for original and edited audio samples with and without the defense. Solid bars show ASR without the defense; striped bars show the ASR reduction once the defense is applied, with the exact reduction annotated on each bar.

BibTeX

@misc{cheng2025jailbreakaudiobenchindepthevaluationanalysis,
      title={Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models}, 
      author={Hao Cheng and Erjia Xiao and Jing Shao and Yichi Wang and Le Yang and Chao Shen and Philip Torr and Jindong Gu and Renjing Xu},
      year={2025},
      eprint={2501.13772},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2501.13772}, 
}