
The lightweight ingestion library for fast, efficient and robust RAG pipelines
Installation • Usage • Chunkers • Integrations • Benchmarks
Tired of making your gazillionth chunker? Sick of the overhead of large libraries? Want to chunk your texts quickly and efficiently? Chonkie the mighty hippo is here to help!
🚀 Feature-rich: All the CHONKs you'd ever need
🔄 End-to-end: Fetch, CHONK, refine, embed and ship straight to your vector DB!
✨ Easy to use: Install, Import, CHONK
⚡ Fast: CHONK at the speed of light! zooooom
🪶 Light-weight: No bloat, just CHONK
🔌 32+ integrations: Works with your favorite tools and vector DBs out of the box!
💬 Multilingual: Out-of-the-box support for 56 languages
☁️ Cloud-Friendly: CHONK locally or in the Cloud
🦛 Cute CHONK mascot: psst it's a pygmy hippo btw
❤️ Moto Moto's favorite Python library
Chonkie is a chunking library that "just works" ✨
Using pip:
pip install chonkie
Or using uv (faster):
uv pip install chonkie
Chonkie follows the rule of minimum installs.
Have a favorite chunker? Read our docs to install only what you need.
Don't want to think about it? Simply install everything (not recommended for production environments).
Using pip:
pip install "chonkie[all]"
Or using uv:
uv pip install "chonkie[all]"
Here's a basic example to get you started:
# First import the chunker you want from Chonkie
from chonkie import RecursiveChunker
# Initialize the chunker
chunker = RecursiveChunker()
# Chunk some text
chunks = chunker("Chonkie is the goodest boi! My favorite chunking hippo hehe.")
# Access chunks
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")
You can also use `chonkie.Pipeline` to chain components together and handle complex workflows. Read more about pipelines in the docs!
from chonkie import Pipeline
# Create a pipeline with multiple chunking and refinement steps
pipe = (
    Pipeline()
    .chunk_with("recursive", tokenizer="gpt2", chunk_size=2048, recipe="markdown")
    .chunk_with("semantic", chunk_size=512)
    .refine_with("overlap", context_size=128)
    .refine_with("embeddings", embedding_model="sentence-transformers/all-MiniLM-L6-v2")
)
# CHONK some Texts!
doc = pipe.run(texts="Chonkie is the goodest boi! My favorite chunking hippo hehe.")
# Access the processed chunks in the `doc` object
for chunk in doc.chunks:
    print(chunk.text)
# Run asynchronously for high-throughput applications
import asyncio
async def main():
    doc = await pipe.arun(texts="Chonkie runs fast!")
    print(len(doc.chunks))
asyncio.run(main())
Check out more usage examples in the docs!
Run Chonkie as a self-hosted REST API for easy integration into any application:
# Install with API dependencies (includes catsu for multi-provider embeddings)
pip install "chonkie[api,semantic,code,catsu]"
# Start the server using the CLI
chonkie serve
# Or with custom options
chonkie serve --port 3000 --reload --log-level debug
# Or directly with uvicorn
uvicorn chonkie.api.main:app --host 0.0.0.0 --port 8000
Or use Docker:
docker compose up
The API provides endpoints for all chunkers and refineries, as well as for pipelines, which are reusable workflow configurations stored in a local SQLite database.
# Create a reusable pipeline
curl -X POST http://localhost:8000/v1/pipelines \
  -H "Content-Type: application/json" \
  -d '{
    "name": "rag-chunker",
    "steps": [
      {"type": "chunk", "chunker": "semantic", "config": {"chunk_size": 512}},
      {"type": "refine", "refinery": "embeddings", "config": {"embedding_model": "text-embedding-3-small"}}
    ]
  }'
# List your pipelines
curl http://localhost:8000/v1/pipelines
Interactive documentation is available at /docs when the server is running.
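
For callers who prefer Python over curl, the same pipeline-creation request can be sketched with the standard library alone. This is a minimal sketch, not an official client: `build_pipeline_request` and `BASE_URL` are hypothetical names, the host and port are assumed from the uvicorn example above, and the shape of the server's response is not shown because the source does not describe it.

```python
import json
import urllib.request

# Assumption: the server listens on the host/port used in the uvicorn example.
BASE_URL = "http://localhost:8000"


def build_pipeline_request(name, steps, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /v1/pipelines endpoint.

    Hypothetical helper for illustration; it mirrors the curl call above.
    """
    payload = json.dumps({"name": name, "steps": steps}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/pipelines",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_pipeline_request(
    "rag-chunker",
    [
        {"type": "chunk", "chunker": "semantic", "config": {"chunk_size": 512}},
        {
            "type": "refine",
            "refinery": "embeddings",
            "config": {"embedding_model": "text-embedding-3-small"},
        },
    ],
)

# Sending needs a running server, so the call is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Building the request separately from sending it keeps the payload easy to inspect before a server is available.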
Chonkie provides several chunkers to help you split your text efficiently for RAG applications. Here's a quick overview of the available chunkers:
| Name | Alias | Description |
|---|---|---|
| `TokenChunker` | `token` | Splits text into fixed-size token chunks. |
| `FastChunker` | `fast` | SIMD-accelerated byte-based chunking at 100+ GB/s. Included in the default install. |
| `SentenceChunker` | `sentence` | Splits text into chunks based on sentences. |
| `RecursiveChunker` | `recursive` | Splits text hierarchically using customizable rules to create semantically meaningful chunks. |
| `SemanticChunker` | `semantic` | Splits text into chunks based on semantic similarity. Inspired by the work of Greg Kamradt. |
| `LateChunker` | `late` | Embeds text and then splits it to have better chunk embeddings. |
| `CodeChunker` | `code` | Splits code into structurally meaningful chunks. |
| `NeuralChunker` | `neural` | Splits text using a neural model. |
| `SlumberChunker` | `slumber` | Splits text using an LLM to find semantically meaningful chunks. Also known as "AgenticChunker". |
More on these methods and the approaches behind them can be found in the docs.
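
To make the simplest strategy in the table concrete, here is a minimal pure-Python sketch of fixed-size token chunking. It is an illustration of the idea only, not Chonkie's `TokenChunker` implementation: `fixed_size_chunks` is a hypothetical function, whitespace-separated words stand in for real tokens, and the `overlap` parameter is an assumption added for the example.

```python
def fixed_size_chunks(text, chunk_size, overlap=0):
    """Split text into windows of at most `chunk_size` whitespace tokens.

    Illustrative sketch: consecutive windows share `overlap` tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end of the token list,
        # so no trailing chunk consists purely of overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks


print(fixed_size_chunks("a b c d e", chunk_size=2))
print(fixed_size_chunks("a b c d e", chunk_size=3, overlap=1))
```

The other chunkers in the table differ mainly in where they place boundaries (sentences, structural rules, embedding similarity) rather than in this basic windowing idea.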
Chonkie boasts 32+ integrations across tokenizers, embedding providers, LLMs, refineries, porters, vector databases, and utilities, ensuring it fits seamlessly into your existing workflow.
👨‍🍳 Chefs & 📁 Fetchers! Text preprocessing and data loading!
Chefs handle text preprocessing, while Fetchers load data from various sources.
| Component | Class | Description | Optional Install |
|---|---|---|---|
| `chef` | `TextChef` | Text preprocessing and cleaning. | `default` |
| `fetcher` | `FileFetcher` | Load text from files and directories. | `default` |
🏭 Refine your CHONKs with Context and Embeddings! Chonkie supports 2+ refineries!
Refineries help you post-process and enhance your chunks after initial chunking.
| Refinery Name | Class | Description | Optional Install |
|---|---|---|---|
| `overlap` | `OverlapRefinery` | Merge overlapping chunks based on similarity. | `default` |
| `embeddings` | `EmbeddingsRefinery` | Add embeddings to chunks using any provider. | `chonkie[semantic]` |
🐴 Exporting CHONKs! Chonkie supports 2+ Porters!
Porters help you save your chunks easily.
| Porter Name | Class | Description | Optional Install |
|---|---|---|---|
| `json` | `JSONPorter` | Export chunks to a JSON file. | `default` |
| `datasets` | `DatasetsPorter` | Export chunks to HuggingFace datasets. | `chonkie[datasets]` |
🤝 Shake hands with your DB! Chonkie connects with 8+ vector stores!
Handshakes provide a unified interface to ingest chunks directly into your favorite vector databases.
| Handshake Name | Class | Description | Optional Install |
|---|---|---|---|
| `chroma` | `ChromaHandshake` | Ingest chunks into ChromaDB. | `chonkie[chroma]` |
| `elastic` | `ElasticHandshake` | Ingest chunks into Elasticsearch. | `chonkie[elastic]` |
| `mongodb` | `MongoDBHandshake` | Ingest chunks into MongoDB. | `chonkie[mongodb]` |
| `pgvector` | `PgvectorHandshake` | Ingest chunks into PostgreSQL with pgvector. | `chonkie[pgvector]` |
| `pinecone` | `PineconeHandshake` | Ingest chunks into Pinecone. | `chonkie[pinecone]` |
| `qdrant` | `QdrantHandshake` | Ingest chunks into Qdrant. | `chonkie[qdrant]` |
| `turbopuffer` | `TurbopufferHandshake` | Ingest chunks into Turbopuffer. | `chonkie[tpuf]` |
| `weaviate` | `WeaviateHandshake` | Ingest chunks into Weaviate. | `chonkie[weaviate]` |
🪓 Slice 'n' Dice! Chonkie supports 5+ ways to tokenize!
Choose from supported tokenizers or provide your own custom token counting function. Flexibility first!
| Name | Description | Optional Install |
|---|---|---|
| `character` | Basic character-level tokenizer. Default tokenizer. | `default` |
| `word` | Basic word-level tokenizer. | `default` |
| `byte` | Byte-level tokenizer operating on UTF-8 encoded bytes. | `default` |
| `tokenizers` | Load any tokenizer from the Hugging Face tokenizers library. | `chonkie[tokenizers]` |
| `tiktoken` | Use OpenAI's tiktoken library (e.g., for gpt-4). | `chonkie[tiktoken]` |
| `transformers` | Load tokenizers via AutoTokenizer from HF transformers. | `chonkie[neural]` |
`default` indicates that the feature is available with the default `pip install chonkie`.
To use a custom token counter, you can pass in any function that takes a string and returns an integer! Something like this:
from chonkie import RecursiveChunker

def custom_token_counter(text: str) -> int:
    return len(text)

chunker = RecursiveChunker(tokenizer=custom_token_counter)
You can use this to extend Chonkie to support any tokenization scheme you want!
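
As a slightly richer illustration of the same contract (a function that takes a string and returns an integer), the sketch below counts words and punctuation marks as separate tokens. The name `word_and_punct_counter` and the regular expression are assumptions made for this example, not part of Chonkie.

```python
import re

# Hypothetical tokenization rule: a run of word characters is one token,
# and every non-space, non-word character (punctuation) is one token.
_WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")


def word_and_punct_counter(text: str) -> int:
    """Return the number of word and punctuation tokens in `text`."""
    return len(_WORD_OR_PUNCT.findall(text))


print(word_and_punct_counter("Chonkie is fast!"))
```

A counter like this can be passed through the `tokenizer` argument in the same way as `custom_token_counter` above.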
🧠 Embed like a boss! Chonkie links up with 9+ embedding pals!
Seamlessly works with various embedding model providers. Bring your favorite embeddings to the CHONK party! Use AutoEmbeddings to load models easily.
| Provider / Alias | Class | Description | Optional Install |
|---|---|---|---|
| `model2vec` | `Model2VecEmbeddings` | Use Model2Vec models. | `chonkie[model2vec]` |
| `sentence-transformers` | `SentenceTransformerEmbeddings` | Use any sentence-transformers model. | `chonkie[st]` |
| `openai` | `OpenAIEmbeddings` | Use OpenAI's embedding API. | `chonkie[openai]` |
| `azure-openai` | `AzureOpenAIEmbeddings` | Use Azure OpenAI embedding service. | `chonkie[azure-openai]` |
| `cohere` | `CohereEmbeddings` | Use Cohere's embedding API. | `chonkie[cohere]` |
| `gemini` | `GeminiEmbeddings` | Use Google's Gemini embedding API. | `chonkie[gemini]` |
| `jina` | `JinaEmbeddings` | Use Jina AI's embedding API. | `chonkie[jina]` |
| `voyageai` | `VoyageAIEmbeddings` | Use Voyage AI's embedding API. | `chonkie[voyageai]` |
litellm |
`LiteLLMEmbedd |
$ claude mcp add chonkie \
-- python -m otcore.mcp_server <graph>