
github.com/QwenLM/Qwen-Audio @main sqlite

281 symbols · 790 edges · 19 files · 38 documented (14%)
README
<a href="https://github.com/QwenLM/Qwen-Audio/raw/main/README_CN.md">中文</a> | English

<img src="https://github.com/QwenLM/Qwen-Audio/raw/main/assets/audio_logo.jpg" width="400"/>

Qwen-Audio <a href="https://www.modelscope.cn/models/qwen/QWen-Audio/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-Audio">🤗</a> | Qwen-Audio-Chat <a href="https://www.modelscope.cn/models/qwen/QWen-Audio-Chat/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-Audio-Chat">🤗</a> | Demo <a href="https://modelscope.cn/studios/qwen/Qwen-Audio-Chat-Demo/summary">🤖</a> | <a href="https://huggingface.co/spaces/Qwen/Qwen-Audio">🤗</a>

Homepage | Paper | WeChat | Discord

Qwen-Audio (Qwen Large Audio Language Model) is the multimodal version of the large model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sounds, music, and song) and text as input, and outputs text. The contributions of Qwen-Audio include:

  • Fundamental audio models: Qwen-Audio is a fundamental multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building upon Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogues and supporting diverse audio-oriented scenarios.
  • Multi-task learning framework for all types of audio: To scale up audio-language pre-training, we address the variation in textual labels across datasets by proposing a multi-task training framework that enables knowledge sharing and avoids one-to-many interference. Our model incorporates more than 30 tasks, and extensive experiments show that it achieves strong performance.
  • Strong performance: Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Specifically, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, Cochlscene, ClothoAQA, and VocalSound.
  • Flexible multi-turn chat from audio and text input: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage.

We have released two models of the Qwen-Audio series:

  • Qwen-Audio: The pre-trained multi-task audio understanding model uses Qwen-7B as the initialization of the LLM, and Whisper-large-v2 as the initialization of the audio encoder.
  • Qwen-Audio-Chat: A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-Audio-Chat supports more flexible interaction, such as multiple audio inputs, multi-round question answering, and creative capabilities.
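
As a minimal sketch of how the two checkpoints listed above might be loaded, the snippet below uses the Hugging Face model IDs that appear in the links at the top of this README. It assumes `transformers` and `torch` are installed and that the model repositories ship custom code, hence `trust_remote_code=True`. The helper name `load_checkpoint` and the `MODEL_IDS` mapping are hypothetical and exist only for illustration.

```python
from typing import Any, Tuple

# Hugging Face model IDs, taken from the links at the top of this README.
MODEL_IDS = {
    "base": "Qwen/Qwen-Audio",
    "chat": "Qwen/Qwen-Audio-Chat",
}


def load_checkpoint(variant: str = "chat", **from_pretrained_kwargs: Any) -> Tuple[Any, Any]:
    """Return (tokenizer, model) for one of the two released checkpoints.

    Hypothetical helper: extra keyword arguments (for example device_map)
    are forwarded unchanged to AutoModelForCausalLM.from_pretrained.
    """
    model_id = MODEL_IDS[variant]  # raises KeyError for an unknown variant

    # Imported lazily so the module can be imported without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, **from_pretrained_kwargs
    ).eval()
    return tokenizer, model
```

Calling `load_checkpoint("chat")` downloads several gigabytes of weights on first use, so it is left out of the sketch itself. The chat-specific helpers that the model's remote code provides are not shown here.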

News and Updates

  • 2023.11.30 🔥 We have released the checkpoints of both Qwen-Audio and Qwen-Audio-Chat on ModelScope and Hugging Face.
  • 2023.11.15 🎉 We released a paper with details about the Qwen-Audio and Qwen-Audio-Chat models, including training details and model performance.

Evaluation

We evaluated Qwen-Audio's abilities on 12 standard benchmarks, as follows:

<img src="https://github.com/QwenLM/Qwen-Audio/raw/main/assets/evaluation.png" width="800"/>

The overall performance is shown below:

<img src="https://github.com/QwenLM/Qwen-Audio/raw/main/assets/radar_new.png" width="800"/>

The details of evaluation are as follows:

Automatic Speech Recognition

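Results (WER):

| Dataset | Model | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|---|
| Librispeech | SpeechT5 | 2.1 | 5.5 | 2.4 | 5.8 |
| Librispeech | SpeechNet | - | - | 30.7 | - |
| Librispeech | SLM-FT | - | - | 2.6 | 5.0 |
| Librispeech | SALMONN | - | - | 2.1 | 4.9 |
| Librispeech | Qwen-Audio | 1.8 | 4.0 | 2.0 | 4.2 |

| Dataset | Model | dev | test |
|---|---|---|---|
| Aishell1 | MMSpeech-base | 2.0 | 2.1 |
| Aishell1 | MMSpeech-large | 1.6 | 1.9 |
| Aishell1 | Paraformer-large | - | 2.0 |
| Aishell1 | Qwen-Audio | 1.2 (SOTA) | 1.3 (SOTA) |

| Dataset | Model | Mic | iOS | Android |
|---|---|---|---|---|
| Aishell2 | MMSpeech-base | 4.5 | 3.9 | 4.0 |
| Aishell2 | Paraformer-large | - | 2.9 | - |
| Aishell2 | Qwen-Audio | 3.3 | 3.1 | 3.3 |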
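
For readers unfamiliar with the metric in the tables above, here is a minimal sketch of corpus-level word error rate (WER): the word-level edit distance between hypothesis and reference, summed over utterances and divided by the total number of reference words. It is an illustration only, not the repository's evaluation script. It assumes whitespace tokenization and skips the text normalization that real evaluation pipelines usually apply. The function names are hypothetical.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token lists (all edits cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            )
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER in percent: total word errors / total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        errors += edit_distance(ref_tokens, hyp.split())
        words += len(ref_tokens)
    return 100.0 * errors / words


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
score = wer(["the cat sat on the mat"], ["the cat sit on mat"])
print(f"WER = {score:.2f}%")
```

For Mandarin corpora such as Aishell1, scoring is commonly done over characters rather than whitespace-separated words; under that assumption the references and hypotheses would be split into characters before calling `edit_distance`.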

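Speech-to-text Translation

Results (BLEU):

| Dataset | Model | en-de | de-en | en-zh | zh-en | es-en | fr-en | it-en |
|---|---|---|---|---|---|---|---|---|
| CoVoST2 | SALMONN | 18.6 | - | 33.1 | - | - | - | - |
| CoVoST2 | SpeechLLaMA | - | 27.1 | - | 12.3 | 27.9 | 25.2 | 25.9 |
| CoVoST2 | BLSP | 14.1 | - | - | - | - | - | - |
| CoVoST2 | Qwen-Audio | 25.1 | 33.9 | 41.5 | 15.7 | 39.7 | 38.5 | 36.0 |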

Automatic Audio Caption

| Dataset | Model | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|
| Clotho | Pengi | 0.416 | 0.126 | 0.271 |
| Clotho | Qwen-Audio | 0.441 | 0.136 | 0.288 |
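
The three captioning columns are related: SPIDEr is conventionally defined as the arithmetic mean of CIDEr and SPICE. The short sketch below checks the reported rows against that definition. The function name is hypothetical, and the small gap in the Qwen-Audio row is presumably rounding or truncation of unrounded scores, which this README does not state.

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr as the equally weighted mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


# (CIDEr, SPICE, reported SPIDEr) rows copied from the table above.
rows = {
    "Pengi": (0.416, 0.126, 0.271),
    "Qwen-Audio": (0.441, 0.136, 0.288),
}

for model, (cider, spice, reported) in rows.items():
    computed = spider(cider, spice)
    print(f"{model}: computed {computed:.4f}, reported {reported:.3f}")
```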

Speech Recognition with Word-level Timestamp

| Dataset | Model | AAC (ms) |
|---|---|---|
| Industrial Data | Force-aligner | 60.3 |
| Industrial Data | Paraformer-large-TP | 65.3 |
| Industrial Data | Qwen-Audio | 51.5 (SOTA) |

Automatic Scene Classification

| Dataset | Model | ACC |
|---|---|---|
| Cochlscene | Cochlscene | 0.669 |
| Cochlscene | Qwen-Audio | 0.795 (SOTA) |
| TUT2017 | Pengi | 0.353 |
| TUT2017 | Qwen-Audio | 0.649 |

Speech Emotion Recognition

| Dataset | Model | ACC |
|---|---|---|
| Meld | WavLM-large | 0.542 |
| Meld | Qwen-Audio | 0.557 |

Audio Question & Answer

| Dataset | Model | ACC | ACC (binary) |
|---|---|---|---|
| ClothoAQA | ClothoAQA | 0.542 | 0.627 |
| ClothoAQA | Pengi | - | 0.645 |
| ClothoAQA | Qwen-Audio | 0.579 | 0.749 |

Vocal Sound Classification

| Dataset | Model | ACC |
|---|---|---|
| VocalSound | CLAP | 0.4945 |
| VocalSound | Pengi | 0.6035 |
| VocalSound | Qwen-Audio | 0.9289 (SOTA) |

Music Note Analysis

| Dataset | Model | NS. Qualities (MAP) | NS. Instrument (ACC) |
|---|---|---|---|
| NSynth | Pengi | 0.3860 | 0.5007 |
| NSynth | Qwen-Audio | 0.4742 | 0.7882 |

We have provided all evaluation scripts to reproduce our results. Please refer to eval_audio/EVALUATION.md for details.

Evaluation of Chat

To evaluate the chat abilities of Qwen-Audio-Chat, we provide a TUTORIAL and a demo for users.

Requirements

  • python 3.8 and above
  • pytorch 1.12 and above (2.0 and above recommended)
  • CUDA 11.4 and above recommended (for GPU users)
  • FFmpeg
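
The requirements above can be checked with a small script before installing anything heavy. This is a minimal sketch using only the standard library: it verifies the Python version, whether a `torch` package is importable, and whether an `ffmpeg` executable is on `PATH`. It does not check the PyTorch or CUDA versions, and the function name `check_environment` is hypothetical.

```python
import importlib.util
import shutil
import sys
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report which of the listed requirements appear to be satisfied."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        # find_spec locates the package without importing it.
        "torch installed": importlib.util.find_spec("torch") is not None,
        # shutil.which returns None when the executable is not on PATH.
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
    }


for name, ok in check_environment().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```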

Quickstart

Below, we provide simple examples to show how to use Qwen-Audio an

Core symbols (most depended-on inside this repo)

  • from_pretrained · called by 23 · modeling_qwen.py
  • encode · called by 13 · audio.py
  • process_audio · called by 11 · tokenization_qwen.py
  • generate · called by 10 · modeling_qwen.py
  • _parse_text · called by 6 · web_demo_audio.py
  • apply_rotary_pos_emb · called by 4 · modeling_qwen.py
  • _tokenize_str · called by 4 · qwen_generation_utils.py
  • tokenize · called by 4 · eval_audio/evaluate_tokenizer.py

Shape

Method 170
Function 62
Class 49

Languages

Python 100%

Modules by API surface

  • modeling_qwen.py · 54 symbols
  • tokenization_qwen.py · 28 symbols
  • audio.py · 27 symbols
  • eval_audio/heareval_score.py · 24 symbols
  • eval_audio/evaluate_srwt.py · 19 symbols
  • qwen_generation_utils.py · 16 symbols
  • web_demo_audio.py · 12 symbols
  • eval_audio/evaluate_asr.py · 12 symbols
  • eval_audio/evaluate_note_analysis.py · 11 symbols
  • eval_audio/evaluate_caption.py · 11 symbols
  • eval_audio/evaluate_vocal_sound.py · 10 symbols
  • eval_audio/evaluate_st.py · 10 symbols

Dependencies from manifests, versioned

  • gradio 3.39.0 · 1×
  • transformers 4.32.0 · 1×
  • transformers_stream_generator 0.0.4 · 1×

For agents

$ claude mcp add Qwen-Audio \
  -- python -m otcore.mcp_server <graph>
