<a href="https://github.com/QwenLM/Qwen-Audio/raw/main/README_CN.md">中文</a>  |   English  
<img src="https://github.com/QwenLM/Qwen-Audio/raw/main/assets/audio_logo.jpg" width="400"/>
Qwen-Audio <a href="https://www.modelscope.cn/models/qwen/QWen-Audio/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-Audio">🤗</a>  | Qwen-Audio-Chat <a href="https://www.modelscope.cn/models/qwen/QWen-Audio-Chat/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-Audio-Chat">🤗</a>  |    Demo<a href="https://modelscope.cn/studios/qwen/Qwen-Audio-Chat-Demo/summary"> 🤖</a> | <a href="https://huggingface.co/spaces/Qwen/Qwen-Audio">🤗</a> 
  Homepage  |   Paper   |    WeChat   |   Discord  
Qwen-Audio (Qwen Large Audio Language Model) is the multimodal version of the large model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sound, music and song) and text as inputs, and outputs text. The contributions of Qwen-Audio include:
- Flexible multi-turn chat from audio and text input: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage.

We will soon release two models in the Qwen-Audio series: Qwen-Audio and Qwen-Audio-Chat.

We evaluated Qwen-Audio's abilities on 12 standard benchmarks, as follows:
<img src="https://github.com/QwenLM/Qwen-Audio/raw/main/assets/evaluation.png" width="800"/>
The overall performance is shown below:
<img src="https://github.com/QwenLM/Qwen-Audio/raw/main/assets/radar_new.png" width="800"/>
The details of evaluation are as follows:

| Dataset | Model | dev-clean (WER) | dev-other (WER) | test-clean (WER) | test-other (WER) |
|---|---|---|---|---|---|
| Librispeech | SpeechT5 | 2.1 | 5.5 | 2.4 | 5.8 |
| | SpeechNet | - | - | 30.7 | - |
| | SLM-FT | - | - | 2.6 | 5.0 |
| | SALMONN | - | - | 2.1 | 4.9 |
| | Qwen-Audio | 1.8 | 4.0 | 2.0 | 4.2 |


| Dataset | Model | dev (WER) | test (WER) |
|---|---|---|---|
| Aishell1 | MMSpeech-base | 2.0 | 2.1 |
| | MMSpeech-large | 1.6 | 1.9 |
| | Paraformer-large | - | 2.0 |
| | Qwen-Audio | 1.2 (SOTA) | 1.3 (SOTA) |


| Dataset | Model | Mic (WER) | iOS (WER) | Android (WER) |
|---|---|---|---|---|
| Aishell2 | MMSpeech-base | 4.5 | 3.9 | 4.0 |
| | Paraformer-large | - | 2.9 | - |
| | Qwen-Audio | 3.3 | 3.1 | 3.3 |

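
The WER columns above use the usual word error rate: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below shows that definition only; it is not the project's evaluation script (see eval_audio/EVALUATION.md for that). The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and Chinese benchmarks are typically scored per character rather than per whitespace-separated word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j] (row-by-row Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {100 * wer:.1f}%")  # prints "WER = 33.3%"
```

Benchmark results are normally reported as a percentage over the whole test set, that is, total edits divided by total reference tokens, rather than as an average of per-utterance rates.
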

| Dataset | Model | en-de (BLEU) | de-en (BLEU) | en-zh (BLEU) | zh-en (BLEU) | es-en (BLEU) | fr-en (BLEU) | it-en (BLEU) |
|---|---|---|---|---|---|---|---|---|
| CoVoST2 | SALMONN | 18.6 | - | 33.1 | - | - | - | - |
| | SpeechLLaMA | - | 27.1 | - | 12.3 | 27.9 | 25.2 | 25.9 |
| | BLSP | 14.1 | - | - | - | - | - | - |
| | Qwen-Audio | 25.1 | 33.9 | 41.5 | 15.7 | 39.7 | 38.5 | 36.0 |


| Dataset | Model | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|
| Clotho | Pengi | 0.416 | 0.126 | 0.271 |
| | Qwen-Audio | 0.441 | 0.136 | 0.288 |

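
SPIDEr is conventionally defined as the arithmetic mean of the CIDEr and SPICE scores, so the third column of the Clotho results can be sanity-checked from the first two. The snippet below is a minimal illustration of that arithmetic; the helper name `spider` is hypothetical, and the input values are the rounded figures from the table, so the last digit may differ from a score computed on unrounded values.

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr as the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2


# (CIDEr, SPICE) pairs copied from the Clotho rows of the table.
clotho_rows = {
    "Pengi": (0.416, 0.126),
    "Qwen-Audio": (0.441, 0.136),
}

for model, (cider, spice) in clotho_rows.items():
    print(f"{model}: SPIDEr = {spider(cider, spice):.4f}")
```

For Pengi this gives 0.271, matching the reported value. For Qwen-Audio it gives 0.2885, consistent with the reported 0.288 once rounding of the inputs is taken into account.
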

| Dataset | Model | AAC (ms) |
|---|---|---|
| Industrial Data | Force-aligner | 60.3 |
| | Paraformer-large-TP | 65.3 |
| | Qwen-Audio | 51.5 (SOTA) |


| Dataset | Model | ACC |
|---|---|---|
| Cochlscene | Cochlscene | 0.669 |
| | Qwen-Audio | 0.795 (SOTA) |
| TUT2017 | Pengi | 0.353 |
| | Qwen-Audio | 0.649 |


| Dataset | Model | ACC |
|---|---|---|
| Meld | WavLM-large | 0.542 |
| | Qwen-Audio | 0.557 |


| Dataset | Model | ACC | ACC (binary) |
|---|---|---|---|
| ClothoAQA | ClothoAQA | 0.542 | 0.627 |
| | Pengi | - | 0.645 |
| | Qwen-Audio | 0.579 | 0.749 |


| Dataset | Model | ACC |
|---|---|---|
| VocalSound | CLAP | 0.4945 |
| | Pengi | 0.6035 |
| | Qwen-Audio | 0.9289 (SOTA) |


| Dataset | Model | NS. Qualities (MAP) | NS. Instrument (ACC) |
|---|---|---|---|
| NSynth | Pengi | 0.3860 | 0.5007 |
| | Qwen-Audio | 0.4742 | 0.7882 |

We have provided all evaluation scripts to reproduce our results. Please refer to eval_audio/EVALUATION.md for details.
To evaluate the chat abilities of Qwen-Audio-Chat, we provide a TUTORIAL and a demo for users.
Below, we provide simple examples to show how to use Qwen-Audio an
$ claude mcp add Qwen-Audio \
-- python -m otcore.mcp_server <graph>