<img src="https://github.com/MoonshotAI/Kimi-Audio/raw/main/assets/kimia_logo.png" width="400"/>
Kimi-Audio-7B 🤗 | Kimi-Audio-7B-Instruct 🤗 | 📑 Paper
We present Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation. This repository contains the official implementation, models, and evaluation toolkit for Kimi-Audio.
Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:
<img src="https://github.com/MoonshotAI/Kimi-Audio/raw/main/assets/kimia_framework.png" width="70%"/>
Kimi-Audio consists of three main components:
git clone https://github.com/MoonshotAI/Kimi-Audio.git
cd Kimi-Audio
git submodule update --init --recursive
pip install -r requirements.txt
Alternatively, Kimi-Audio can be installed directly via pip:
pip install torch
pip install git+https://github.com/MoonshotAI/Kimi-Audio.git
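After either installation route, a quick sanity check can confirm that the packages used in the examples are importable. The snippet below is a minimal sketch, not part of the official toolkit: the helper name `is_installed` is hypothetical, and the module names are taken from the import statements and install commands in this README.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Module names assumed from this README's imports and install commands.
for name in ("torch", "soundfile", "kimia_infer"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

A `MISSING` line only means the module is not visible to the current interpreter; it does not check versions or GPU support.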
This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn.
import soundfile as sf
from kimia_infer.api.kimia import KimiAudio
# --- 1. Load Model ---
model_path = "moonshotai/Kimi-Audio-7B-Instruct"
model = KimiAudio(model_path=model_path, load_detokenizer=True)
# --- 2. Define Sampling Parameters ---
sampling_params = {
"audio_temperature": 0.8,
"audio_top_k": 10,
"text_temperature": 0.0,
"text_top_k": 5,
"audio_repetition_penalty": 1.0,
"audio_repetition_window_size": 64,
"text_repetition_penalty": 1.0,
"text_repetition_window_size": 16,
}
# --- 3. Example 1: Audio-to-Text (ASR) ---
messages_asr = [
# You can provide context or instructions as text
{"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
# Provide the audio file path
{"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"}
]
# Generate only text output
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output) # Expected output: "这并不是告别,这是一个篇章的结束,也是新篇章的开始。"
# --- 4. Example 2: Audio-to-Audio/Text Conversation ---
messages_conversation = [
# Start conversation with an audio query
{"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"}
]
# Generate both audio and text output
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")
# Save the generated audio
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000) # Assuming 24kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output) # Expected output: "当然可以,这很简单。一二三四五六七八九十。"
# --- 5. Example 3: Audio-to-Audio/Text Conversation with Multiturn ---
messages = [
{"role": "user", "message_type": "audio", "content": "test_audios/multiturn/case2/multiturn_q1.wav"},
# This is the first turn output of Kimi-Audio
{"role": "assistant", "message_type": "audio-text", "content": ["test_audios/multiturn/case2/multiturn_a1.wav", "当然可以,这很简单。一二三四五六七八九十。"]},
{"role": "user", "message_type": "audio", "content": "test_audios/multiturn/case2/multiturn_q2.wav"}
]
# Generate both audio and text output for the multi-turn conversation
wav_output, text_output = model.generate(messages, **sampling_params, output_type="both")
# Save the generated audio (separate file so Example 2's output is not overwritten)
output_audio_path = "output_audio_multiturn.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000) # Assuming 24kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output) # Expected output: "没问题,继续数下去就是十一十二十三十四十五十六十七十八十九二十。"
print("Kimi-Audio inference examples complete.")
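The message lists in the examples above follow a simple shape: each entry has a `role`, a `message_type` (`text`, `audio`, or `audio-text`), and a `content` that is either a string or an `[audio_path, text]` pair. The sketch below builds and checks such lists before they are passed to `model.generate`. All helper names here (`text_message`, `audio_message`, `audio_text_message`, `check_messages`) are hypothetical and not part of `kimia_infer`, and the allowed roles and types are only those seen in this README, so the real API may accept more.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]

# Values observed in the README examples; the real API may accept others.
ROLES = {"user", "assistant"}
MESSAGE_TYPES = {"text", "audio", "audio-text"}


def text_message(text: str, role: str = "user") -> Message:
    """A text instruction or context message."""
    return {"role": role, "message_type": "text", "content": text}


def audio_message(path: str, role: str = "user") -> Message:
    """A message whose content is a path to an audio file."""
    return {"role": role, "message_type": "audio", "content": path}


def audio_text_message(path: str, text: str, role: str = "assistant") -> Message:
    """A previous turn that carries both audio and its text."""
    return {"role": role, "message_type": "audio-text", "content": [path, text]}


def check_messages(messages: List[Message]) -> None:
    """Raise ValueError if a message does not match the shape used above."""
    for i, msg in enumerate(messages):
        if msg.get("role") not in ROLES:
            raise ValueError(f"message {i}: unexpected role {msg.get('role')!r}")
        kind = msg.get("message_type")
        if kind not in MESSAGE_TYPES:
            raise ValueError(f"message {i}: unexpected message_type {kind!r}")
        content = msg.get("content")
        if kind == "audio-text":
            ok = (
                isinstance(content, (list, tuple))
                and len(content) == 2
                and all(isinstance(c, str) for c in content)
            )
        else:
            ok = isinstance(content, str)
        if not ok:
            raise ValueError(f"message {i}: content does not match {kind!r}")


# Rebuild the multi-turn history from Example 3.
history = [
    audio_message("test_audios/multiturn/case2/multiturn_q1.wav"),
    audio_text_message(
        "test_audios/multiturn/case2/multiturn_a1.wav",
        "当然可以,这很简单。一二三四五六七八九十。",
    ),
    audio_message("test_audios/multiturn/case2/multiturn_q2.wav"),
]
check_messages(history)
print(len(history), "messages OK")
```

Checking the list up front catches typos such as a misspelled `message_type` before the model is loaded or invoked.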
Kimi-Audio achieves state-of-the-art (SOTA) performance across a wide range of audio benchmarks.
Overall performance is summarized below:
<img src="https://github.com/MoonshotAI/Kimi-Audio/raw/main/assets/kimia_radar_chart.png" width="70%"/>
The tables below report results on individual benchmarks. You can reproduce our results and the baselines with Kimi-Audio-Evalkit (see also Evaluation Toolkit):
| Datasets | Model | Performance (WER↓) |
|---|---|---|
| LibriSpeech (test-clean / test-other) | Qwen2-Audio-base | 1.74 / 4.04 |
| | Baichuan-base | 3.02 / 6.04 |
| | Step-Audio-chat | 3.19 / 10.67 |
| | Qwen2.5-Omni | 2.37 / 4.21 |
| | Kimi-Audio | 1.28 / 2.42 |
| Fleurs (zh / en) | Qwen2-Audio-base | 3.63 / 5.20 |
| | Baichuan-base | 4.15 / 8.07 |
| | Step-Audio-chat | 4.26 / 8.56 |
| | Qwen2.5-Omni | 2.92 / 4.17 |
| | Kimi-Audio | 2.69 / 4.44 |
| AISHELL-1 | Qwen2-Audio-base | 1.52 |
| | Baichuan-base | 1.93 |
| | Step-Audio-chat | 2.14 |
| | Qwen2.5-Omni | 1.13 |
| | Kimi-Audio | 0.60 |
| AISHELL-2 ios | Qwen2-Audio-base | 3.08 |
| | Baichuan-base | 3.87 |
| | Step-Audio-chat | 3.89 |
| | Qwen2.5-Omni | 2.56 |
| | Kimi-Audio | 2.56 |
| WenetSpeech (test-meeting / test-net) | Qwen2-Audio-base | 8.40 / 7.64 |
| | Baichuan-base | 13.28 / 10.13 |
| | Step-Audio-chat | 10.83 / 9.47 |
| | Qwen2.5-Omni | 7.71 / 6.04 |
| | Kimi-Audio | 6.28 / 5.37 |
| Kimi-ASR Internal Testset (subset1 / subset2) | Qwen2-Audio-base | 2.31 / 3.24 |
| | Baichuan-base | 3.41 / 5.60 |
| | Step-Audio-chat | 2.82 / 4.74 |
| | Qwen2.5-Omni | 1.53 / 2.68 |
| | Kimi-Audio | 1.42 / 2.44 |
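
For readers unfamiliar with the metric in the table above, word error rate (WER) is the edit distance between the reference and the hypothesis transcript (substitutions, deletions, and insertions), divided by the reference length and expressed as a percentage. The function below is a minimal sketch of that definition and is not the scoring code used by Kimi-Audio-Evalkit. The name `word_error_rate` and the `char_level` switch are assumptions for illustration, as is the lack of any text normalization (casing, punctuation), which real evaluations usually apply. Character-level scoring is commonly used for Chinese, where words are not separated by spaces.

```python
def word_error_rate(reference: str, hypothesis: str, char_level: bool = False) -> float:
    """Return WER in percent: edit distance / reference length * 100."""
    if char_level:
        # Treat every non-space character as one token (typical for Chinese).
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        ref = reference.split()
        hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # Levenshtein distance with a rolling row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
```

Because the denominator is the reference length, WER can exceed 100 when the hypothesis contains many insertions.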

| Datasets | Model | Performance↑ |
|---|---|---|
| MMAU (music / sound / speech) | Qwen2-Audio-base | 58.98 / 69.07 / 52.55 |
| | Baichuan-chat | 49.10 / 59.46 / 42.47 |
| | GLM-4-Voice | 38.92 / 43.54 / 32.43 |
| | Step-Audio-chat | 49.40 / 53.75 / 47.75 |
| | Qwen2.5-Omni | 62.16 / 67.57 / 53.92 |
| | Kimi-Audio | 61.68 / 73.27 / 60.66 |
| ClothoAQA (test / dev) | Qwen2-Audio-base | 71.73 / 72.63 |
| | Baichuan-chat | 48.02 / 48.16 |
| | Step-Audio-chat | 45.84 / 44.98 |
| | Qwen2.5-Omni | 72.86 / 73.12 |
| | Kimi-Audio | 71.24 / 73.18 |
| VocalSound | Qwen2-Audio-base | 93.82 |
| | Baichuan-base | 58.17 |
| | Step-Audio-chat | 28.58 |
| | Qwen2.5-Omni | 93.73 |
| | Kimi-Audio | 94.85 |
| Nonspeech7k | Qwen2-Audio-base | 87.17 |
| | Baichuan-chat | 59.03 |
| | Step-Audio-chat | 21.38 |
| | Qwen2.5-Omni | 69.89 |
| | Kimi-Audio | 93.93 |
| MELD | Qwen2-Audio-base | 51.23 |
| | Baichuan-chat | 23.59 |
| | Step-Audio-chat | 33.54 |
| | Qwen2.5-Omni | 49.83 |
| | Kimi-Audio | 59.13 |
| TUT2017 | Qwen2-Audio-base | 33.83 |
| | Baichuan-base | 27.9 |
| | Step-Audio-chat | 7.41 |
| | Qwen2.5-Omni | 43.27 |
| | Kimi-Audio | 65.25 |