github.com/QwenLM/Qwen-Audio @main sqlite

281 symbols 790 edges 19 files 38 documented · 14%

README

    <a href="https://github.com/QwenLM/Qwen-Audio/raw/main/README_CN.md">中文 (Chinese)</a> | English













<img src="https://github.com/QwenLM/Qwen-Audio/raw/main/assets/audio_logo.jpg" width="400"/>










    Qwen-Audio <a href="https://www.modelscope.cn/models/qwen/QWen-Audio/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-Audio">🤗</a> | Qwen-Audio-Chat <a href="https://www.modelscope.cn/models/qwen/QWen-Audio-Chat/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-Audio-Chat">🤗</a> | Demo <a href="https://modelscope.cn/studios/qwen/Qwen-Audio-Chat-Demo/summary">🤖</a> | <a href="https://huggingface.co/spaces/Qwen/Qwen-Audio">🤗</a>

Homepage | Paper | WeChat | Discord

Qwen-Audio (Qwen Large Audio Language Model) is the multimodal version of the large model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sound, music, and song) and text as inputs, and outputs text. The contributions of Qwen-Audio include:

Fundamental audio models: Qwen-Audio is a fundamental multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building upon Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogues and supporting diverse audio-oriented scenarios.
Multi-task learning framework for all types of audio: To scale up audio-language pre-training, we address the challenge of variation in textual labels associated with different datasets by proposing a multi-task training framework, enabling knowledge sharing and avoiding one-to-many interference. Our model incorporates more than 30 tasks, and extensive experiments show the model achieves strong performance.
Strong performance: Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Specifically, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, Cochlscene, ClothoAQA, and VocalSound.
Flexible multi-turn chat from audio and text input: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage.

We have released two models in the Qwen-Audio series:

Qwen-Audio: The pre-trained multi-task audio understanding model uses Qwen-7B as the initialization of the LLM, and Whisper-large-v2 as the initialization of the audio encoder.
Qwen-Audio-Chat: A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-Audio-Chat supports more flexible interaction, such as multiple audio inputs, multi-round question answering, and creative capabilities.
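As a minimal sketch of how either checkpoint might be loaded, the snippet below uses the Hugging Face `transformers` auto classes with the model IDs taken from the Hugging Face links above. The helper name `load_qwen_audio` and the `MODEL_IDS` mapping are hypothetical names introduced for illustration. It assumes the checkpoints ship their own modeling code (hence `trust_remote_code=True`) and that `device_map` placement is available in your environment; check the model cards before relying on it.

```python
# Hypothetical helper, not part of the repository.
MODEL_IDS = {
    "base": "Qwen/Qwen-Audio",       # pre-trained multi-task model
    "chat": "Qwen/Qwen-Audio-Chat",  # instruction-tuned assistant
}


def load_qwen_audio(variant: str = "chat", device_map: str = "auto"):
    """Return (tokenizer, model) for the requested variant."""
    if variant not in MODEL_IDS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(MODEL_IDS)}")

    # Imported lazily so the ID table can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = MODEL_IDS[variant]
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map=device_map,   # assumption: adjust (e.g. "cpu") for your hardware
        trust_remote_code=True,  # assumption: the checkpoint provides custom model code
    ).eval()
    return tokenizer, model
```

Calling `load_qwen_audio("chat")` downloads the weights on first use, so expect a large download and a GPU-sized memory footprint for a model initialized from Qwen-7B.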

News and Updates

2023.11.30 🔥 We have released the checkpoints of both Qwen-Audio and Qwen-Audio-Chat on ModelScope and Hugging Face.
2023.11.15 🎉 We released a paper describing the Qwen-Audio and Qwen-Audio-Chat models, including training details and model performance.

Evaluation

We evaluated Qwen-Audio's abilities on 12 standard benchmarks, as follows:

<img src="https://github.com/QwenLM/Qwen-Audio/raw/main/assets/evaluation.png" width="800"/>

The overall performance is shown below:

<img src="https://github.com/QwenLM/Qwen-Audio/raw/main/assets/radar_new.png" width="800"/>

The details of evaluation are as follows:

Automatic Speech Recognition

Dataset	Model	dev-clean (WER)	dev-other (WER)	test-clean (WER)	test-other (WER)
Librispeech	SpeechT5	2.1	5.5	2.4	5.8
Librispeech	SpeechNet	-	-	30.7	-
Librispeech	SLM-FT	-	-	2.6	5.0
Librispeech	SALMONN	-	-	2.1	4.9
Librispeech	Qwen-Audio	1.8	4.0	2.0	4.2

Dataset	Model	dev (WER)	test (WER)
Aishell1	MMSpeech-base	2.0	2.1
Aishell1	MMSpeech-large	1.6	1.9
Aishell1	Paraformer-large	-	2.0
Aishell1	Qwen-Audio	1.2 (SOTA)	1.3 (SOTA)

Dataset	Model	Mic (WER)	iOS (WER)	Android (WER)
Aishell2	MMSpeech-base	4.5	3.9	4.0
Aishell2	Paraformer-large	-	2.9	-
Aishell2	Qwen-Audio	3.3	3.1	3.3
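For readers unfamiliar with the metric, here is a minimal sketch of word error rate (WER): the word-level edit distance between hypothesis and reference, divided by the number of reference words. It assumes whitespace tokenization and reports a percentage, which is how the table values appear to be expressed. It is an illustration only, not the repository's evaluation script, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j]; start with the empty reference prefix.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Real evaluation pipelines usually normalize text (casing, punctuation) before scoring, so numbers from this sketch are not directly comparable to the table.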

Speech-to-text Translation

Dataset	Model	en-de (BLEU)	de-en (BLEU)	en-zh (BLEU)	zh-en (BLEU)	es-en (BLEU)	fr-en (BLEU)	it-en (BLEU)
CoVoST2	SALMONN	18.6	-	33.1	-	-	-	-
CoVoST2	SpeechLLaMA	-	27.1	-	12.3	27.9	25.2	25.9
CoVoST2	BLSP	14.1	-	-	-	-	-	-
CoVoST2	Qwen-Audio	25.1	33.9	41.5	15.7	39.7	38.5	36.0

Automatic Audio Caption

Dataset	Model	CIDEr	SPICE	SPIDEr
Clotho	Pengi	0.416	0.126	0.271
Clotho	Qwen-Audio	0.441	0.136	0.288
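The three caption metrics are related: SPIDEr is defined as the arithmetic mean of CIDEr and SPICE. The short check below recomputes it from the table's rounded values; the function name `spider` is hypothetical, and small differences from the reported column are expected because the inputs are already rounded to three decimals.

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr is the arithmetic mean of the CIDEr and SPICE scores."""
    return (cider + spice) / 2.0


# (CIDEr, SPICE) pairs copied from the Clotho table above.
clotho_rows = {
    "Pengi": (0.416, 0.126),
    "Qwen-Audio": (0.441, 0.136),
}

if __name__ == "__main__":
    for model, (cider, spice) in clotho_rows.items():
        print(f"{model}: SPIDEr = {spider(cider, spice):.4f}")
```

Both recomputed values agree with the reported SPIDEr column to within rounding.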

Speech Recognition with Word-level Timestamp

Dataset	Model	AAC (ms)
Industrial Data	Force-aligner	60.3
Industrial Data	Paraformer-large-TP	65.3
Industrial Data	Qwen-Audio	51.5 (SOTA)

Automatic Scene Classification

Dataset	Model	ACC
Cochlscene	Cochlscene	0.669
Cochlscene	Qwen-Audio	0.795 (SOTA)
TUT2017	Pengi	0.353
TUT2017	Qwen-Audio	0.649

Speech Emotion Recognition

Dataset	Model	ACC
Meld	WavLM-large	0.542
Meld	Qwen-Audio	0.557

Audio Question & Answer

Dataset	Model	ACC	ACC (binary)
ClothoAQA	ClothoAQA	0.542	0.627
ClothoAQA	Pengi	-	0.645
ClothoAQA	Qwen-Audio	0.579	0.749

Vocal Sound Classification

Dataset	Model	ACC
VocalSound	CLAP	0.4945
VocalSound	Pengi	0.6035
VocalSound	Qwen-Audio	0.9289 (SOTA)

Music Note Analysis

Dataset	Model	NS. Qualities (MAP)	NS. Instrument (ACC)
NSynth	Pengi	0.3860	0.5007
NSynth	Qwen-Audio	0.4742	0.7882

We have provided all evaluation scripts to reproduce our results. Please refer to eval_audio/EVALUATION.md for details.

Evaluation of Chat

To evaluate the chat abilities of Qwen-Audio-Chat, we provide a TUTORIAL and a demo for users.

Requirements

Python 3.8 and above
PyTorch 1.12 and above (2.0 and above recommended)
CUDA 11.4 and above recommended (GPU users only)
FFmpeg
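The requirements above can be checked before installing anything heavy. The following is a minimal sketch using only the standard library; the helper names `version_tuple` and `check_environment` are hypothetical, and the CUDA check is left out because it would require importing PyTorch itself.

```python
import re
import shutil
import sys
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn a version string such as '2.1.0+cu118' into (2, 1, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def check_environment() -> dict:
    """Return a mapping from requirement label to True/False."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}

    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None  # PyTorch is not installed
    report["torch>=1.12"] = (
        torch_version is not None and version_tuple(torch_version) >= (1, 12)
    )

    # FFmpeg is an external binary, so look for it on PATH.
    report["ffmpeg on PATH"] = shutil.which("ffmpeg") is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{'OK     ' if ok else 'MISSING'} {requirement}")
```

Each entry in the returned dictionary maps a requirement label to a boolean, so the result can also be consumed by a setup script instead of being printed.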

Quickstart

Below, we provide simple examples to show how to use Qwen-Audio an

Core symbols most depended-on inside this repo

qwen_generation_utils.py

tokenize

called by 4

eval_audio/evaluate_tokenizer.py

Shape

Method 170

Function 62

Class 49

Languages

Python 100%

Modules by API surface

modeling_qwen.py: 54 symbols
tokenization_qwen.py: 28 symbols
audio.py: 27 symbols
eval_audio/heareval_score.py: 24 symbols
eval_audio/evaluate_srwt.py: 19 symbols
qwen_generation_utils.py: 16 symbols
web_demo_audio.py: 12 symbols
eval_audio/evaluate_asr.py: 12 symbols
eval_audio/evaluate_note_analysis.py: 11 symbols
eval_audio/evaluate_caption.py: 11 symbols
eval_audio/evaluate_vocal_sound.py: 10 symbols
eval_audio/evaluate_st.py: 10 symbols

Dependencies from manifests, versioned

gradio 3.39.0 · 1×
transformers 4.32.0 · 1×
transformers_stream_generator 0.0.4 · 1×

For agents

$ claude mcp add Qwen-Audio \
  -- python -m otcore.mcp_server <graph>
