
github.com/MoonshotAI/Kimi-Audio @main sqlite

289 symbols · 798 edges · 43 files · 71 documented (25%)
README
<img src="https://github.com/MoonshotAI/Kimi-Audio/raw/main/assets/kimia_logo.png" width="400"/>

Kimi-Audio-7B 🤗  | Kimi-Audio-7B-Instruct 🤗  | 📑 Paper   

We present Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation. This repository contains the official implementation, models, and evaluation toolkit for Kimi-Audio.

🔥🔥🔥 News!!

  • May 29, 2025: 👋 We release a finetuning example of Kimi-Audio-7B.
  • April 27, 2025: 👋 We release pretrained model weights of Kimi-Audio-7B.
  • April 25, 2025: 👋 We release the inference code and model weights of Kimi-Audio-7B-Instruct.
  • April 25, 2025: 👋 We release the audio evaluation toolkit Kimi-Audio-Evalkit. Our results and the baselines can be easily reproduced with this toolkit!
  • April 25, 2025: 👋 We release the technical report of Kimi-Audio.

Introduction

Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

  • Universal Capabilities: Handle diverse tasks like automatic speech recognition (ASR), audio question answering (AQA), automatic audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), and end-to-end speech conversation.
  • State-of-the-Art Performance: Achieve SOTA results on numerous audio benchmarks (see Evaluation and the Technical Report).
  • Large-Scale Pre-training: Pre-train on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding.
  • Novel Architecture: Employ a hybrid audio input (continuous acoustic vectors + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
  • Efficient Inference: Feature a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
  • Open-Source: Release the code and model checkpoints for both pre-training and instruction fine-tuning, and release a comprehensive evaluation toolkit to foster community research and development.

Architecture Overview

<img src="https://github.com/MoonshotAI/Kimi-Audio/raw/main/assets/kimia_framework.png" width="70%"/>

Kimi-Audio consists of three main components:

  1. Audio Tokenizer: Converts input audio into:
    • Discrete semantic tokens (12.5Hz) using vector quantization.
    • Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5Hz).
  2. Audio LLM: A transformer-based model (initialized from a pre-trained text LLM like Qwen 2.5 7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.
  3. Audio Detokenizer: Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.
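
The rates above imply some simple arithmetic, and the chunk-wise streaming idea can be pictured as slicing the token sequence into chunks that each carry a little right-hand context. The following is a minimal sketch under stated assumptions, not the repository's detokenizer: the 24 kHz output rate is taken from the Quick Start comment (itself marked as an assumption there), and the names `tokens_for_duration`, `samples_per_token`, and `chunks_with_lookahead`, as well as the `chunk_size` and `look_ahead` parameters, are hypothetical.

```python
import math
from typing import Iterator, List, Tuple

TOKEN_RATE_HZ = 12.5          # semantic token rate stated above
OUTPUT_SAMPLE_RATE = 24_000   # assumption: 24 kHz waveform output


def tokens_for_duration(seconds: float) -> int:
    """Number of 12.5 Hz semantic tokens needed to cover `seconds` of audio."""
    return math.ceil(seconds * TOKEN_RATE_HZ)


def samples_per_token() -> int:
    """Waveform samples represented by one token at the assumed output rate."""
    return int(OUTPUT_SAMPLE_RATE / TOKEN_RATE_HZ)


def chunks_with_lookahead(
    tokens: List[int], chunk_size: int, look_ahead: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (chunk, future_context) pairs.

    Each chunk is paired with up to `look_ahead` upcoming tokens so that a
    decoder could see some context past the chunk boundary.
    """
    for start in range(0, len(tokens), chunk_size):
        end = start + chunk_size
        yield tokens[start:end], tokens[end:end + look_ahead]
```

For example, ten seconds of audio corresponds to `tokens_for_duration(10)` tokens, and a larger `look_ahead` trades extra latency for more context at each chunk boundary.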

Getting Started

Step 1: Get the Code

git clone https://github.com/MoonshotAI/Kimi-Audio.git
cd Kimi-Audio
git submodule update --init --recursive
pip install -r requirements.txt

Alternatively, Kimi-Audio can be installed directly via pip:

pip install torch
pip install git+https://github.com/MoonshotAI/Kimi-Audio.git
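
After either installation route, a quick check that the packages are visible to the interpreter can save a confusing import error later. This is a small sketch using only the standard library; the helper name `is_installed` is hypothetical, and `kimia_infer` is the package name used by the Quick Start import.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if the top-level `package` can be located on the Python path."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    for name in ("torch", "kimia_infer"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```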

Quick Start

This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn.

import soundfile as sf
from kimia_infer.api.kimia import KimiAudio

# --- 1. Load Model ---
model_path = "moonshotai/Kimi-Audio-7B-Instruct" 
model = KimiAudio(model_path=model_path, load_detokenizer=True)

# --- 2. Define Sampling Parameters ---
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

# --- 3. Example 1: Audio-to-Text (ASR) ---
messages_asr = [
    # You can provide context or instructions as text
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    # Provide the audio file path
    {"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"}
]

# Generate only text output
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
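print(">>> ASR Output Text: ", text_output) # Expected output: "这并不是告别,这是一个篇章的结束,也是新篇章的开始。"
# (English: "This is not a farewell; it is the end of one chapter and the beginning of a new one.")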


# --- 4. Example 2: Audio-to-Audio/Text Conversation ---
messages_conversation = [
    # Start conversation with an audio query
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"}
]

# Generate both audio and text output
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")

# Save the generated audio
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000) # Assuming 24kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output) # Expected output: "当然可以,这很简单。一二三四五六七八九十。"
# (English: "Of course, that's easy. One, two, three, four, five, six, seven, eight, nine, ten.")

# --- 5. Example 3: Multi-turn Audio-to-Audio/Text Conversation ---
messages = [
    {"role": "user", "message_type": "audio", "content": "test_audios/multiturn/case2/multiturn_q1.wav"},
    # This is the first-turn output of Kimi-Audio (audio path plus its transcript)
    {"role": "assistant", "message_type": "audio-text", "content": ["test_audios/multiturn/case2/multiturn_a1.wav", "当然可以,这很简单。一二三四五六七八九十。"]},
    {"role": "user", "message_type": "audio", "content": "test_audios/multiturn/case2/multiturn_q2.wav"}
]

# Generate both audio and text output for the multi-turn conversation
wav_output, text_output = model.generate(messages, **sampling_params, output_type="both")

# Save the generated audio (separate file so the audio from Example 2 is not overwritten)
output_audio_path = "output_audio_multiturn.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000) # Assuming 24kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output) # Expected output: "没问题,继续数下去就是十一十二十三十四十五十六十七十八十九二十。"
# (English: "No problem, continuing the count: eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty.")

print("Kimi-Audio inference examples complete.")
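
The `sampling_params` keys in the Quick Start read like standard decoding controls. As a rough illustration of what such knobs usually do, here is a generic sketch of temperature scaling, top-k filtering, and a windowed repetition penalty. It is not taken from the Kimi-Audio code base: the helper names `apply_repetition_penalty` and `sample_top_k` are hypothetical, and treating a temperature of 0.0 as greedy decoding is an assumption about how the library interprets that value.

```python
import math
import random
from typing import List, Optional, Sequence


def apply_repetition_penalty(
    logits: Sequence[float], history: Sequence[int], penalty: float, window: int
) -> List[float]:
    """Lower the scores of tokens seen in the last `window` generated tokens."""
    out = list(logits)
    if window <= 0:
        return out
    for tok in set(history[-window:]):
        # For penalty > 1, dividing positive logits and multiplying
        # negative ones both push the score down; penalty == 1 is a no-op.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def sample_top_k(
    logits: Sequence[float],
    temperature: float,
    top_k: int,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick a token id from the `top_k` highest-scoring logits."""
    if temperature <= 0.0:
        # Assumption: temperature 0 means greedy decoding (argmax).
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    # Subtracting the peak keeps exp() numerically stable.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

Under this reading, the Quick Start settings would make text decoding deterministic (`text_temperature` of 0.0) while audio tokens are sampled from the ten most likely candidates, and both repetition penalties of 1.0 would leave the logits unchanged.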


Evaluation

Kimi-Audio achieves state-of-the-art (SOTA) performance across a wide range of audio benchmarks.

The overall performance is shown below:

<img src="https://github.com/MoonshotAI/Kimi-Audio/raw/main/assets/kimia_radar_chart.png" width="70%"/>

Results on individual benchmarks follow. Our results and the baselines can be easily reproduced with Kimi-Audio-Evalkit (also see Evaluation Toolkit):

Automatic Speech Recognition (ASR)

| Datasets | Model | Performance (WER↓) |
|---|---|---|
| LibriSpeech test-clean / test-other | Qwen2-Audio-base | 1.74 / 4.04 |
| | Baichuan-base | 3.02 / 6.04 |
| | Step-Audio-chat | 3.19 / 10.67 |
| | Qwen2.5-Omni | 2.37 / 4.21 |
| | Kimi-Audio | 1.28 / 2.42 |
| Fleurs zh / en | Qwen2-Audio-base | 3.63 / 5.20 |
| | Baichuan-base | 4.15 / 8.07 |
| | Step-Audio-chat | 4.26 / 8.56 |
| | Qwen2.5-Omni | 2.92 / 4.17 |
| | Kimi-Audio | 2.69 / 4.44 |
| AISHELL-1 | Qwen2-Audio-base | 1.52 |
| | Baichuan-base | 1.93 |
| | Step-Audio-chat | 2.14 |
| | Qwen2.5-Omni | 1.13 |
| | Kimi-Audio | 0.60 |
| AISHELL-2 ios | Qwen2-Audio-base | 3.08 |
| | Baichuan-base | 3.87 |
| | Step-Audio-chat | 3.89 |
| | Qwen2.5-Omni | 2.56 |
| | Kimi-Audio | 2.56 |
| WenetSpeech test-meeting / test-net | Qwen2-Audio-base | 8.40 / 7.64 |
| | Baichuan-base | 13.28 / 10.13 |
| | Step-Audio-chat | 10.83 / 9.47 |
| | Qwen2.5-Omni | 7.71 / 6.04 |
| | Kimi-Audio | 6.28 / 5.37 |
| Kimi-ASR Internal Testset subset1 / subset2 | Qwen2-Audio-base | 2.31 / 3.24 |
| | Baichuan-base | 3.41 / 5.60 |
| | Step-Audio-chat | 2.82 / 4.74 |
| | Qwen2.5-Omni | 1.53 / 2.68 |
| | Kimi-Audio | 1.42 / 2.44 |
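
For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. The sketch below is a generic illustration that reports the rate in percent; it is not the Kimi-Audio-Evalkit implementation, it applies no text normalization, and it splits on whitespace only, so it does not cover character-level scoring. The function names `edit_distance` and `wer` are hypothetical.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: 100 * edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Lower is better, which is what the ↓ in the table header indicates; because insertions count as errors, the value can exceed 100 for very poor hypotheses.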

Audio Understanding

| Datasets | Model | Performance↑ |
|---|---|---|
| MMAU music / sound / speech | Qwen2-Audio-base | 58.98 / 69.07 / 52.55 |
| | Baichuan-chat | 49.10 / 59.46 / 42.47 |
| | GLM-4-Voice | 38.92 / 43.54 / 32.43 |
| | Step-Audio-chat | 49.40 / 53.75 / 47.75 |
| | Qwen2.5-Omni | 62.16 / 67.57 / 53.92 |
| | Kimi-Audio | 61.68 / 73.27 / 60.66 |
| ClothoAQA test / dev | Qwen2-Audio-base | 71.73 / 72.63 |
| | Baichuan-chat | 48.02 / 48.16 |
| | Step-Audio-chat | 45.84 / 44.98 |
| | Qwen2.5-Omni | 72.86 / 73.12 |
| | Kimi-Audio | 71.24 / 73.18 |
| VocalSound | Qwen2-Audio-base | 93.82 |
| | Baichuan-base | 58.17 |
| | Step-Audio-chat | 28.58 |
| | Qwen2.5-Omni | 93.73 |
| | Kimi-Audio | 94.85 |
| Nonspeech7k | Qwen2-Audio-base | 87.17 |
| | Baichuan-chat | 59.03 |
| | Step-Audio-chat | 21.38 |
| | Qwen2.5-Omni | 69.89 |
| | Kimi-Audio | 93.93 |
| MELD | Qwen2-Audio-base | 51.23 |
| | Baichuan-chat | 23.59 |
| | Step-Audio-chat | 33.54 |
| | Qwen2.5-Omni | 49.83 |
| | Kimi-Audio | 59.13 |
| TUT2017 | Qwen2-Audio-base | 33.83 |
| | Baichuan-base | 27.9 |
| | Step-Audio-chat | 7.41 |
| | Qwen2.5-Omni | 43.27 |
| | Kimi-Audio | 65.25 |

Core symbols (most depended-on inside this repo)

  • audio_append · called by 16 · kimia_infer/utils/data.py
  • from_pretrained · called by 13 · kimia_infer/models/detokenizer/bigvgan_wrapper.py
  • text_append · called by 10 · kimia_infer/utils/data.py
  • detokenize_streaming · called by 10 · kimia_infer/models/detokenizer/__init__.py
  • clear_states · called by 8 · kimia_infer/models/detokenizer/__init__.py
  • load_state_dict · called by 8 · kimia_infer/models/detokenizer/flow_matching/ode_wrapper.py
  • state_dict · called by 7 · kimia_infer/models/detokenizer/flow_matching/ode_wrapper.py
  • _shape · called by 7 · kimia_infer/models/tokenizer/whisper_Lv3/modeling_whisper.py

Shape

Method 190
Class 52
Function 46
Route 1

Languages

Python 100%

Modules by API surface

kimia_infer/models/tokenizer/whisper_Lv3/modeling_whisper.py · 49 symbols
finetune_codes/modeling_kimia.py · 34 symbols
kimia_infer/models/detokenizer/flow_matching/model.py · 16 symbols
kimia_infer/models/detokenizer/vocoder/bigvgan.py · 15 symbols
kimia_infer/utils/data.py · 13 symbols
kimia_infer/models/detokenizer/flow_matching/scheduler.py · 12 symbols
kimia_infer/models/detokenizer/flow_matching/dit_block.py · 12 symbols
kimia_infer/models/detokenizer/semantic_fm_prefix_streaming.py · 11 symbols
kimia_infer/models/detokenizer/__init__.py · 11 symbols
kimia_infer/models/tokenizer/whisper_Lv3/whisper.py · 9 symbols
kimia_infer/models/detokenizer/flow_matching/ode_wrapper.py · 9 symbols
finetune_codes/datasets.py · 9 symbols

Dependencies from manifests, versioned

conformer 0.3.2 · 1×
deepspeed 0.16.9 · 1×
flash-attn
flash_attn 2.7.4.post1 · 1×
ninja
omegaconf 2.3.0 · 1×
sacrebleu 1.5.1 · 1×
six 1.16.0 · 1×

For agents

$ claude mcp add Kimi-Audio \
  -- python -m otcore.mcp_server <graph>
