github.com/MoonshotAI/Kimi-Audio @main sqlite

289 symbols 798 edges 43 files 71 documented · 25%

README

<img src="https://github.com/MoonshotAI/Kimi-Audio/raw/main/assets/kimia_logo.png" width="400"/>

Kimi-Audio-7B 🤗 | Kimi-Audio-7B-Instruct 🤗 | 📑 Paper

We present Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation. This repository contains the official implementation, models, and evaluation toolkit for Kimi-Audio.

🔥🔥🔥 News!!

May 29, 2025: 👋 We release a finetuning example of Kimi-Audio-7B.
April 27, 2025: 👋 We release pretrained model weights of Kimi-Audio-7B.
April 25, 2025: 👋 We release the inference code and model weights of Kimi-Audio-7B-Instruct.
April 25, 2025: 👋 We release the audio evaluation toolkit Kimi-Audio-Evalkit. You can easily reproduce our results and baselines with this toolkit!
April 25, 2025: 👋 We release the technical report of Kimi-Audio.

Introduction
Architecture Overview
Quick Start
Evaluation
Speech Recognition
Audio Understanding
Audio-to-Text Chat
Speech Conversation
Finetune
Evaluation Toolkit
Generation Testset
License
Acknowledgements
Citation
Contact Us

Introduction

Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

Universal Capabilities: Handle diverse tasks like automatic speech recognition (ASR), audio question answering (AQA), automatic audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), and end-to-end speech conversation.
State-of-the-Art Performance: Achieve SOTA results on numerous audio benchmarks (see Evaluation and the Technical Report).
Large-Scale Pre-training: Pre-train on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding.
Novel Architecture: Employ a hybrid audio input (continuous acoustic vectors + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
Efficient Inference: Feature a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
Open-Source: Release the code and model checkpoints for both pre-training and instruction fine-tuning, and release a comprehensive evaluation toolkit to foster community research and development.

Architecture Overview

<img src="https://github.com/MoonshotAI/Kimi-Audio/raw/main/assets/kimia_framework.png" width="70%"/>

Kimi-Audio consists of three main components:

Audio Tokenizer: Converts input audio into:
- Discrete semantic tokens (12.5Hz) using vector quantization.
- Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5Hz).
Audio LLM: A transformer-based model (initialized from a pre-trained text LLM like Qwen 2.5 7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.
Audio Detokenizer: Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.
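To make the 12.5 Hz token rate and the chunk-wise streaming idea concrete, here is a minimal sketch. It counts how many semantic tokens a clip would produce at 12.5 Hz and splits a token sequence into chunks that each carry a few look-ahead tokens from the following chunk. The function names, the chunk size, the look-ahead length, and the choice to round the token count up are all illustrative assumptions, not values taken from the Kimi-Audio code.

```python
import math

TOKEN_RATE_HZ = 12.5  # token rate stated in the architecture overview


def num_semantic_tokens(duration_s: float) -> int:
    """Tokens needed to cover `duration_s` seconds (rounding up is an assumption)."""
    return math.ceil(duration_s * TOKEN_RATE_HZ)


def chunk_with_lookahead(tokens, chunk_size=30, lookahead=12):
    """Yield (chunk, lookahead_tokens) pairs.

    chunk_size and lookahead are hypothetical values chosen for illustration.
    """
    for start in range(0, len(tokens), chunk_size):
        end = start + chunk_size
        # The look-ahead is the beginning of the next chunk; it is empty at the end.
        yield tokens[start:end], tokens[end:end + lookahead]


tokens = list(range(num_semantic_tokens(10.0)))  # 10 s of audio
for chunk, future in chunk_with_lookahead(tokens):
    print(len(chunk), len(future))
```

In a streaming setup of this kind, a chunk can be rendered as soon as its look-ahead tokens are available, so latency is bounded by the chunk size plus the look-ahead rather than by the full utterance.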

Getting Started

Step 1: Get the Code

git clone https://github.com/MoonshotAI/Kimi-Audio.git
cd Kimi-Audio
git submodule update --init --recursive
pip install -r requirements.txt

Alternatively, Kimi-Audio can now be installed directly via pip:

pip install torch
pip install git+https://github.com/MoonshotAI/Kimi-Audio.git

Quick Start

This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn.

import soundfile as sf
from kimia_infer.api.kimia import KimiAudio

# --- 1. Load Model ---
model_path = "moonshotai/Kimi-Audio-7B-Instruct" 
model = KimiAudio(model_path=model_path, load_detokenizer=True)

# --- 2. Define Sampling Parameters ---
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

# --- 3. Example 1: Audio-to-Text (ASR) ---
messages_asr = [
    # You can provide context or instructions as text
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    # Provide the audio file path
    {"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"}
]

# Generate only text output
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
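print(">>> ASR Output Text: ", text_output)  # Expected output: "这并不是告别，这是一个篇章的结束，也是新篇章的开始。"
# (English: "This is not a farewell; it is the end of one chapter and the beginning of a new one.")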


# --- 4. Example 2: Audio-to-Audio/Text Conversation ---
messages_conversation = [
    # Start conversation with an audio query
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"}
]

# Generate both audio and text output
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")

# Save the generated audio
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000) # Assuming 24kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output)  # Expected output: "当然可以，这很简单。一二三四五六七八九十。"
# (English: "Of course, that's easy. One, two, three, four, five, six, seven, eight, nine, ten.")

# --- 5. Example 3: Multi-turn Audio-to-Audio/Text Conversation ---
messages_multiturn = [
    {"role": "user", "message_type": "audio", "content": "test_audios/multiturn/case2/multiturn_q1.wav"},
    # First-turn output of Kimi-Audio, passed back as history (audio path + transcript)
    {"role": "assistant", "message_type": "audio-text", "content": ["test_audios/multiturn/case2/multiturn_a1.wav", "当然可以，这很简单。一二三四五六七八九十。"]},
    {"role": "user", "message_type": "audio", "content": "test_audios/multiturn/case2/multiturn_q2.wav"}
]

# Generate both audio and text output for the second turn
wav_output, text_output = model.generate(messages_multiturn, **sampling_params, output_type="both")

# Save the generated audio (separate file so the Example 2 output is not overwritten)
output_audio_path = "output_audio_multiturn.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000) # Assuming 24kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output)  # Expected output: "没问题，继续数下去就是十一十二十三十四十五十六十七十八十九二十。"
# (English: "No problem, counting on: eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty.")

print("Kimi-Audio inference examples complete.")
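The names in `sampling_params` suggest temperature scaling, top-k filtering, and a repetition penalty applied over a window of recent tokens. The sketch below shows those generic techniques in plain Python so the knobs are easier to reason about. It is not Kimi-Audio's sampler: the function names are hypothetical, treating a temperature of 0 as greedy decoding is an assumption, and the penalty rule is the common divide-positive, multiply-negative form.

```python
import math
import random


def apply_repetition_penalty(logits, recent_ids, penalty=1.0):
    """Down-weight tokens seen in the recent window; penalty=1.0 is a no-op."""
    out = list(logits)
    for i in set(recent_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def sample_token(logits, temperature=1.0, top_k=5, rng=random):
    """Pick a token id using temperature scaling and top-k filtering."""
    if temperature <= 0.0:
        # Assumed convention: zero temperature means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Keep only the k highest-scoring token ids.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax weights over the kept logits (max-subtracted for numerical stability).
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


logits = [0.1, 2.0, 1.5, -1.0]
history = [1, 1, 3]  # hypothetical ids generated so far
penalized = apply_repetition_penalty(logits, history[-64:], penalty=1.0)
print(sample_token(penalized, temperature=0.0))           # greedy pick
print(sample_token(penalized, temperature=0.8, top_k=2))  # one of the two best ids
```

Under this reading, `text_temperature` of 0.0 would give deterministic text, while `audio_temperature` of 0.8 with `audio_top_k` of 10 would leave some variation in the generated speech tokens.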

Evaluation

Kimi-Audio achieves state-of-the-art (SOTA) performance across a wide range of audio benchmarks.

The overall performance is shown below:

<img src="https://github.com/MoonshotAI/Kimi-Audio/raw/main/assets/kimia_radar_chart.png" width="70%"/>

Results on individual benchmarks follow. You can reproduce our results and the baselines with Kimi-Audio-Evalkit (see also Evaluation Toolkit):

Automatic Speech Recognition (ASR)

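Datasets	Model	Performance (WER↓)
LibriSpeech test-clean | test-other	Qwen2-Audio-base	1.74 | 4.04
LibriSpeech test-clean | test-other	Baichuan-base	3.02 | 6.04
LibriSpeech test-clean | test-other	Step-Audio-chat	3.19 | 10.67
LibriSpeech test-clean | test-other	Qwen2.5-Omni	2.37 | 4.21
LibriSpeech test-clean | test-other	Kimi-Audio	1.28 | 2.42
Fleurs zh | en	Qwen2-Audio-base	3.63 | 5.20
Fleurs zh | en	Baichuan-base	4.15 | 8.07
Fleurs zh | en	Step-Audio-chat	4.26 | 8.56
Fleurs zh | en	Qwen2.5-Omni	2.92 | 4.17
Fleurs zh | en	Kimi-Audio	2.69 | 4.44
AISHELL-1	Qwen2-Audio-base	1.52
AISHELL-1	Baichuan-base	1.93
AISHELL-1	Step-Audio-chat	2.14
AISHELL-1	Qwen2.5-Omni	1.13
AISHELL-1	Kimi-Audio	0.60
AISHELL-2 ios	Qwen2-Audio-base	3.08
AISHELL-2 ios	Baichuan-base	3.87
AISHELL-2 ios	Step-Audio-chat	3.89
AISHELL-2 ios	Qwen2.5-Omni	2.56
AISHELL-2 ios	Kimi-Audio	2.56
WenetSpeech test-meeting | test-net	Qwen2-Audio-base	8.40 | 7.64
WenetSpeech test-meeting | test-net	Baichuan-base	13.28 | 10.13
WenetSpeech test-meeting | test-net	Step-Audio-chat	10.83 | 9.47
WenetSpeech test-meeting | test-net	Qwen2.5-Omni	7.71 | 6.04
WenetSpeech test-meeting | test-net	Kimi-Audio	6.28 | 5.37
Kimi-ASR Internal Testset subset1 | subset2	Qwen2-Audio-base	2.31 | 3.24
Kimi-ASR Internal Testset subset1 | subset2	Baichuan-base	3.41 | 5.60
Kimi-ASR Internal Testset subset1 | subset2	Step-Audio-chat	2.82 | 4.74
Kimi-ASR Internal Testset subset1 | subset2	Qwen2.5-Omni	1.53 | 2.68
Kimi-ASR Internal Testset subset1 | subset2	Kimi-Audio	1.42 | 2.44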
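For readers unfamiliar with the metric, word error rate is the edit distance between reference and hypothesis divided by the reference length. The sketch below is a generic implementation, not the scoring code of Kimi-Audio-Evalkit, whose text normalization and tokenization may differ. The function names are hypothetical. Passing lists of characters instead of words gives a character error rate, which is a common choice for Chinese test sets.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """(substitutions + deletions + insertions) / reference length, in percent."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# One deleted word out of six reference words.
print(error_rate("the cat sat on the mat".split(), "the cat sat on mat".split()))
```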

Audio Understanding

Datasets	Model	Performance↑
MMAU music | sound | speech	Qwen2-Audio-base	58.98 | 69.07 | 52.55
MMAU music | sound | speech	Baichuan-chat	49.10 | 59.46 | 42.47
MMAU music | sound | speech	GLM-4-Voice	38.92 | 43.54 | 32.43
MMAU music | sound | speech	Step-Audio-chat	49.40 | 53.75 | 47.75
MMAU music | sound | speech	Qwen2.5-Omni	62.16 | 67.57 | 53.92
MMAU music | sound | speech	Kimi-Audio	61.68 | 73.27 | 60.66
ClothoAQA test | dev	Qwen2-Audio-base	71.73 | 72.63
ClothoAQA test | dev	Baichuan-chat	48.02 | 48.16
ClothoAQA test | dev	Step-Audio-chat	45.84 | 44.98
ClothoAQA test | dev	Qwen2.5-Omni	72.86 | 73.12
ClothoAQA test | dev	Kimi-Audio	71.24 | 73.18
VocalSound	Qwen2-Audio-base	93.82
VocalSound	Baichuan-base	58.17
VocalSound	Step-Audio-chat	28.58
VocalSound	Qwen2.5-Omni	93.73
VocalSound	Kimi-Audio	94.85
Nonspeech7k	Qwen2-Audio-base	87.17
Nonspeech7k	Baichuan-chat	59.03
Nonspeech7k	Step-Audio-chat	21.38
Nonspeech7k	Qwen2.5-Omni	69.89
Nonspeech7k	Kimi-Audio	93.93
MELD	Qwen2-Audio-base	51.23
MELD	Baichuan-chat	23.59
MELD	Step-Audio-chat	33.54
MELD	Qwen2.5-Omni	49.83
MELD	Kimi-Audio	59.13
TUT2017	Qwen2-Audio-base	33.83
TUT2017	Baichuan-base	27.9
TUT2017	Step-Audio-chat	7.41
TUT2017	Qwen2.5-Omni	43.27
TUT2017	Kimi-Audio	65.25
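Multi-column rows such as MMAU are easier to compare programmatically than by eye. As a small sketch, the snippet below copies the MMAU scores from the table above into a dictionary and reports the best model per subset. The helper name `best_per_subset` is hypothetical and the code is only a reading aid, not part of the evaluation toolkit.

```python
# MMAU scores copied from the table above: (music, sound, speech)
mmau = {
    "Qwen2-Audio-base": (58.98, 69.07, 52.55),
    "Baichuan-chat": (49.10, 59.46, 42.47),
    "GLM-4-Voice": (38.92, 43.54, 32.43),
    "Step-Audio-chat": (49.40, 53.75, 47.75),
    "Qwen2.5-Omni": (62.16, 67.57, 53.92),
    "Kimi-Audio": (61.68, 73.27, 60.66),
}
subsets = ("music", "sound", "speech")


def best_per_subset(scores, names):
    """Map each subset name to the (model, score) pair with the highest score."""
    best = {}
    for idx, name in enumerate(names):
        model = max(scores, key=lambda m: scores[m][idx])
        best[name] = (model, scores[model][idx])
    return best


for subset, (model, score) in best_per_subset(mmau, subsets).items():
    print(f"{subset}: {model} ({score:.2f})")
```

On these numbers, Qwen2.5-Omni leads on the music subset (62.16 vs. 61.68), while Kimi-Audio leads on sound and speech.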

Core symbols (most depended-on inside this repo)

audio_append · called by 16 · kimia_infer/utils/data.py
from_pretrained · called by 13 · kimia_infer/models/detokenizer/bigvgan_wrapper.py
text_append · called by 10 · kimia_infer/utils/data.py
detokenize_streaming · called by 10 · kimia_infer/models/detokenizer/__init__.py
clear_states · called by 8 · kimia_infer/models/detokenizer/__init__.py
load_state_dict · called by 8 · kimia_infer/models/detokenizer/flow_matching/ode_wrapper.py
state_dict · called by 7 · kimia_infer/models/detokenizer/flow_matching/ode_wrapper.py
_shape · called by 7 · kimia_infer/models/tokenizer/whisper_Lv3/modeling_whisper.py

Shape

Method 190

Class 52

Function 46

Route 1

Languages

Python 100%

Modules by API surface

kimia_infer/models/tokenizer/whisper_Lv3/modeling_whisper.py · 49 symbols
finetune_codes/modeling_kimia.py · 34 symbols
kimia_infer/models/detokenizer/flow_matching/model.py · 16 symbols
kimia_infer/models/detokenizer/vocoder/bigvgan.py · 15 symbols
kimia_infer/utils/data.py · 13 symbols
kimia_infer/models/detokenizer/flow_matching/scheduler.py · 12 symbols
kimia_infer/models/detokenizer/flow_matching/dit_block.py · 12 symbols
kimia_infer/models/detokenizer/semantic_fm_prefix_streaming.py · 11 symbols
kimia_infer/models/detokenizer/__init__.py · 11 symbols
kimia_infer/models/tokenizer/whisper_Lv3/whisper.py · 9 symbols
kimia_infer/models/detokenizer/flow_matching/ode_wrapper.py · 9 symbols
finetune_codes/datasets.py · 9 symbols

Dependencies (from manifests, versioned)

conformer 0.3.2 · 1×
deepspeed 0.16.9 · 1×
diffusers · 1×
flash-attn · 1×
flash_attn 2.7.4.post1 · 1×
huggingface_hub · 1×
librosa · 1×
loguru · 1×
ninja · 1×
omegaconf 2.3.0 · 1×
sacrebleu 1.5.1 · 1×
six 1.16.0 · 1×
