github.com/feyninc/chonkie @v1.6.8 sqlite

2,875 symbols 10,777 edges 216 files 2,528 documented · 88%

README

🦛 Chonkie ✨

The lightweight ingestion library for fast, efficient and robust RAG pipelines

Installation • Usage • Chunkers • Integrations • Benchmarks

Tired of making your gazillionth chunker? Sick of the overhead of large libraries? Want to chunk your texts quickly and efficiently? Chonkie the mighty hippo is here to help!

🚀 Feature-rich: All the CHONKs you'd ever need

🔄 End-to-end: Fetch, CHONK, refine, embed and ship straight to your vector DB!

✨ Easy to use: Install, Import, CHONK

⚡ Fast: CHONK at the speed of light! zooooom

🪶 Light-weight: No bloat, just CHONK

🔌 32+ integrations: Works with your favorite tools and vector DBs out of the box!

💬 Multilingual: Out-of-the-box support for 56 languages

☁️ Cloud-Friendly: CHONK locally or in the Cloud

🦛 Cute CHONK mascot: psst it's a pygmy hippo btw

❤️ Moto Moto's favorite Python library

Chonkie is a chunking library that "just works" ✨

📦 Installation

Basic Installation

Using pip:

pip install chonkie

Or using uv (faster):

uv pip install chonkie

Full Installation

Chonkie follows the rule of minimum installs. Have a favorite chunker? Read our docs to install only what you need. Don't want to think about it? Install everything with the `all` extra (not recommended for production environments).

Using pip:

pip install "chonkie[all]"

Or using uv:

uv pip install "chonkie[all]"

🚀 Usage

Basic Usage

Here's a basic example to get you started:

# First import the chunker you want from Chonkie
from chonkie import RecursiveChunker

# Initialize the chunker
chunker = RecursiveChunker()

# Chunk some text
chunks = chunker("Chonkie is the goodest boi! My favorite chunking hippo hehe.")

# Access chunks
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")

Pipeline Usage

You can also use `chonkie.Pipeline` to chain components together and handle complex workflows. Read more about pipelines in the docs!

from chonkie import Pipeline

# Create a pipeline with multiple chunking and refinement steps
pipe = (
    Pipeline()
    .chunk_with("recursive", tokenizer="gpt2", chunk_size=2048, recipe="markdown")
    .chunk_with("semantic", chunk_size=512)
    .refine_with("overlap", context_size=128)
    .refine_with("embeddings", embedding_model="sentence-transformers/all-MiniLM-L6-v2")
)

# CHONK some Texts!
doc = pipe.run(texts="Chonkie is the goodest boi! My favorite chunking hippo hehe.")

# Access the processed chunks in the `doc` object
for chunk in doc.chunks:
    print(chunk.text)

# Run asynchronously for high-throughput applications
import asyncio

async def main():
    doc = await pipe.arun(texts="Chonkie runs fast!")
    print(len(doc.chunks))

asyncio.run(main())

Check out more usage examples in the docs!

🌐 API Server

Run Chonkie as a self-hosted REST API for easy integration into any application:

# Install with API dependencies (includes catsu for multi-provider embeddings)
pip install "chonkie[api,semantic,code,catsu]"

# Start the server using the CLI
chonkie serve

# Or with custom options
chonkie serve --port 3000 --reload --log-level debug

# Or directly with uvicorn
uvicorn chonkie.api.main:app --host 0.0.0.0 --port 8000

Or use Docker:

docker compose up

The API provides endpoints for all chunkers, refineries, and pipelines. Pipelines are reusable workflow configurations stored in a local SQLite database.

# Create a reusable pipeline
curl -X POST http://localhost:8000/v1/pipelines \
  -H "Content-Type: application/json" \
  -d '{
    "name": "rag-chunker",
    "steps": [
      {"type": "chunk", "chunker": "semantic", "config": {"chunk_size": 512}},
      {"type": "refine", "refinery": "embeddings", "config": {"embedding_model": "text-embedding-3-small"}}
    ]
  }'

# List your pipelines
curl http://localhost:8000/v1/pipelines

Interactive documentation is available at /docs when the server is running.
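
The same request can be sent from Python. The sketch below mirrors the curl call above using only the standard library. The helper names `build_pipeline_payload` and `create_pipeline` are hypothetical, the base URL assumes the local server from the examples, and the reply is assumed to be JSON.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server started as in the examples above


def build_pipeline_payload(name: str, steps: list[dict]) -> bytes:
    """Serialize a pipeline definition in the shape used by the curl example."""
    return json.dumps({"name": name, "steps": steps}).encode("utf-8")


def create_pipeline(name: str, steps: list[dict], base_url: str = BASE_URL) -> dict:
    """POST the pipeline definition to /v1/pipelines and return the decoded reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/pipelines",
        data=build_pipeline_payload(name, steps),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the server answers with a JSON body.
        return json.loads(response.read().decode("utf-8"))


steps = [
    {"type": "chunk", "chunker": "semantic", "config": {"chunk_size": 512}},
    {"type": "refine", "refinery": "embeddings",
     "config": {"embedding_model": "text-embedding-3-small"}},
]

# Needs a running server, so the call is left commented out:
# print(create_pipeline("rag-chunker", steps))
```

Keeping payload construction separate from the network call makes the request body easy to inspect or log before anything is sent.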

✂️ Chunkers

Chonkie provides several chunkers to help you split your text efficiently for RAG applications. Here's a quick overview of the available chunkers:

Name	Alias	Description
`TokenChunker`	`token`	Splits text into fixed-size token chunks.
`FastChunker`	`fast`	SIMD-accelerated byte-based chunking at 100+ GB/s. Included in the default install.
`SentenceChunker`	`sentence`	Splits text into chunks based on sentences.
`RecursiveChunker`	`recursive`	Splits text hierarchically using customizable rules to create semantically meaningful chunks.
`SemanticChunker`	`semantic`	Splits text into chunks based on semantic similarity. Inspired by the work of Greg Kamradt.
`LateChunker`	`late`	Embeds text and then splits it to have better chunk embeddings.
`CodeChunker`	`code`	Splits code into structurally meaningful chunks.
`NeuralChunker`	`neural`	Splits text using a neural model.
`SlumberChunker`	`slumber`	Splits text using an LLM to find semantically meaningful chunks. Also known as "AgenticChunker".

You can find more on these methods and the approaches taken in the docs.
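
To make "splits text hierarchically" concrete, here is a minimal sketch of the general recursive-splitting idea: try the coarsest separator first and fall back to finer ones only for pieces that are still too large. It is not Chonkie's implementation. The function name `recursive_split`, the separator list, and the use of character counts instead of token counts are all assumptions made for illustration.

```python
def recursive_split(text: str, max_chars: int,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split text so no piece exceeds max_chars, trying coarse separators first."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    head, rest = separators[0], separators[1:]
    pieces = text.split(head)
    chunks: list[str] = []
    current = ""
    for index, piece in enumerate(pieces):
        # Re-attach the separator so that joining the chunks restores the text.
        if index < len(pieces) - 1:
            piece += head
        if len(current) + len(piece) <= max_chars:
            current += piece
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # Piece is still too large: recurse with the finer separators.
            chunks.extend(recursive_split(piece, max_chars, rest))
    if current:
        chunks.append(current)
    return chunks


text = ("Chonkie is a chunking library.\n\n"
        "It splits text into chunks. Each chunk stays small.")
for chunk in recursive_split(text, max_chars=40):
    print(repr(chunk))
```

Because separators are re-attached to the piece they follow, concatenating the chunks reproduces the input exactly, which is a convenient property when chunk offsets need to map back to the source text.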

🔌 Integrations

Chonkie boasts 32+ integrations across tokenizers, embedding providers, LLMs, refineries, porters, vector databases, and utilities, ensuring it fits seamlessly into your existing workflow.

👨‍🍳 Chefs & 📁 Fetchers! Text preprocessing and data loading!

Chefs handle text preprocessing, while Fetchers load data from various sources.

Component	Class	Description	Optional Install
`chef`	`TextChef`	Text preprocessing and cleaning.	`default`
`fetcher`	`FileFetcher`	Load text from files and directories.	`default`

🏭 Refine your CHONKs with Context and Embeddings! Chonkie supports 2+ refineries!

Refineries help you post-process and enhance your chunks after initial chunking.

Refinery Name	Class	Description	Optional Install
`overlap`	`OverlapRefinery`	Add overlapping context from neighboring chunks.	`default`
`embeddings`	`EmbeddingsRefinery`	Add embeddings to chunks using any provider.	`chonkie[semantic]`
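
One way to picture context-based refinement, as the `context_size=128` argument in the pipeline example suggests, is to attach a little text from a neighboring chunk to each chunk. The sketch below is illustrative only: `SketchChunk` and `add_suffix_context` are hypothetical names, tokens are approximated by whitespace-separated words, and taking context from the following chunk is an assumption, not a description of `OverlapRefinery`.

```python
from dataclasses import dataclass


@dataclass
class SketchChunk:
    """Hypothetical stand-in for a chunk; not Chonkie's own chunk class."""
    text: str
    context: str = ""


def add_suffix_context(chunks: list[SketchChunk], context_size: int) -> list[SketchChunk]:
    """Give each chunk the first `context_size` words of the chunk that follows it."""
    refined = []
    for index, chunk in enumerate(chunks):
        # The last chunk has no successor, so its context stays empty.
        following = chunks[index + 1].text.split() if index + 1 < len(chunks) else []
        refined.append(SketchChunk(chunk.text, " ".join(following[:context_size])))
    return refined


chunks = [SketchChunk("one two three"), SketchChunk("four five six")]
for chunk in add_suffix_context(chunks, context_size=2):
    print(chunk)
```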

🐴 Exporting CHONKs! Chonkie supports 2+ Porters!

Porters help you save your chunks easily.

Porter Name	Class	Description	Optional Install
`json`	`JSONPorter`	Export chunks to a JSON file.	`default`
`datasets`	`DatasetsPorter`	Export chunks to HuggingFace datasets.	`chonkie[datasets]`
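
For a sense of what exporting chunks to JSON involves, the following sketch writes a list of chunk records with the standard library. It does not reproduce `JSONPorter`'s actual interface or output format. `ExportedChunk` and `export_chunks` are hypothetical names, and the record keeps only the `text` and `token_count` fields used in the usage examples.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ExportedChunk:
    """Hypothetical chunk record with the two fields shown in the usage examples."""
    text: str
    token_count: int


def export_chunks(chunks: list[ExportedChunk], path: str) -> int:
    """Write the chunks to `path` as a JSON array and return how many were written."""
    records = [asdict(chunk) for chunk in chunks]
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(records)
```

Calling `export_chunks([ExportedChunk("hello", 1)], "chunks.json")` would write a one-element array to `chunks.json`.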

🤝 Shake hands with your DB! Chonkie connects with 8+ vector stores!

Handshakes provide a unified interface to ingest chunks directly into your favorite vector databases.

Handshake Name	Class	Description	Optional Install
`chroma`	`ChromaHandshake`	Ingest chunks into ChromaDB.	`chonkie[chroma]`
`elastic`	`ElasticHandshake`	Ingest chunks into Elasticsearch.	`chonkie[elastic]`
`mongodb`	`MongoDBHandshake`	Ingest chunks into MongoDB.	`chonkie[mongodb]`
`pgvector`	`PgvectorHandshake`	Ingest chunks into PostgreSQL with pgvector.	`chonkie[pgvector]`
`pinecone`	`PineconeHandshake`	Ingest chunks into Pinecone.	`chonkie[pinecone]`
`qdrant`	`QdrantHandshake`	Ingest chunks into Qdrant.	`chonkie[qdrant]`
`turbopuffer`	`TurbopufferHandshake`	Ingest chunks into Turbopuffer.	`chonkie[tpuf]`
`weaviate`	`WeaviateHandshake`	Ingest chunks into Weaviate.	`chonkie[weaviate]`

🪓 Slice 'n' Dice! Chonkie supports 5+ ways to tokenize!

Choose from supported tokenizers or provide your own custom token counting function. Flexibility first!

Name	Description	Optional Install
`character`	Basic character-level tokenizer. Default tokenizer.	`default`
`word`	Basic word-level tokenizer.	`default`
`byte`	Byte-level tokenizer operating on UTF-8 encoded bytes.	`default`
`tokenizers`	Load any tokenizer from the Hugging Face `tokenizers` library.	`chonkie[tokenizers]`
`tiktoken`	Use OpenAI's `tiktoken` library (e.g., for `gpt-4`).	`chonkie[tiktoken]`
`transformers`	Load tokenizers via `AutoTokenizer` from HF `transformers`.	`chonkie[neural]`

`default` indicates that the feature is available with the default `pip install chonkie`.

To use a custom token counter, you can pass in any function that takes a string and returns an integer! Something like this:

from chonkie import RecursiveChunker

def custom_token_counter(text: str) -> int:
    # Count one token per character
    return len(text)

chunker = RecursiveChunker(tokenizer=custom_token_counter)

You can use this to extend Chonkie to support any tokenization scheme you want!
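
As a slightly richer illustration of the same idea, the counter below treats each word and each punctuation mark as one token. It is only a sketch: the name `regex_token_counter` and the regular expression are assumptions chosen for the example, not a scheme that Chonkie ships.

```python
import re

# One token per run of word characters, or per single punctuation character.
_WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")


def regex_token_counter(text: str) -> int:
    """Count word-like runs and standalone punctuation marks as tokens."""
    return len(_WORD_OR_PUNCT.findall(text))


print(regex_token_counter("Chonkie is fast!"))
```

Since it takes a string and returns an integer, it can be passed in the same way as above: `RecursiveChunker(tokenizer=regex_token_counter)`.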

🧠 Embed like a boss! Chonkie links up with 9+ embedding pals!

Seamlessly works with various embedding model providers. Bring your favorite embeddings to the CHONK party! Use AutoEmbeddings to load models easily.

Provider / Alias	Class	Description	Optional Install
`model2vec`	`Model2VecEmbeddings`	Use `Model2Vec` models.	`chonkie[model2vec]`
`sentence-transformers`	`SentenceTransformerEmbeddings`	Use any `sentence-transformers` model.	`chonkie[st]`
`openai`	`OpenAIEmbeddings`	Use OpenAI's embedding API.	`chonkie[openai]`
`azure-openai`	`AzureOpenAIEmbeddings`	Use Azure OpenAI embedding service.	`chonkie[azure-openai]`
`cohere`	`CohereEmbeddings`	Use Cohere's embedding API.	`chonkie[cohere]`
`gemini`	`GeminiEmbeddings`	Use Google's Gemini embedding API.	`chonkie[gemini]`
`jina`	`JinaEmbeddings`	Use Jina AI's embedding API.	`chonkie[jina]`
`voyageai`	`VoyageAIEmbeddings`	Use Voyage AI's embedding API.	`chonkie[voyageai]`
`litellm`	`LiteLLMEmbedd

Core symbols most depended-on inside this repo

Symbol	Called by	File
`get`	111	`src/chonkie/cloud/pipeline.py`
`chunk_with`	73	`src/chonkie/cloud/pipeline.py`
`run`	54	`src/chonkie/cloud/pipeline.py`
`refine`	53	`src/chonkie/refinery/overlap.py`
`get_logger`	42	`src/chonkie/logger.py`
`_generate_id`	37	`src/chonkie/handshakes/base.py`
`register_model`	36	`src/chonkie/embeddings/registry.py`
`get_embeddings`	31	`src/chonkie/embeddings/auto.py`

Shape

Method 1,374

Function 1,143

Class 311

Route 47

Languages

Python 100%

Modules by API surface

tests/pipeline/test_pipeline_coverage.py 121 symbols

tests/test_tokenizer.py 91 symbols

src/chonkie/tokenizer.py 75 symbols

tests/chunkers/test_slumber_chunker.py 68 symbols

tests/test_viz.py 67 symbols

tests/test_pipeline.py 66 symbols

tests/refinery/test_overlap_refinery.py 51 symbols

tests/genie/test_base_genie.py 51 symbols

tests/embeddings/test_embeddings_registry.py 50 symbols

tests/test_cli.py 46 symbols

tests/chunkers/test_table_chunker.py 43 symbols

tests/chunkers/test_neural_chunker.py 43 symbols

Dependencies from manifests, versioned

chonkie-core 0.10.2 · 1×

httpx 0.28.1 · 1×

numpy 2.0.0 · 1×

tenacity 8.0.0 · 1×

tokie 0.0.10 · 1×

tqdm 4.64.0 · 1×

Datastores touched

(mongodb) Database · 1 repo

db Database · 1 repo

test_db Database · 1 repo

For agents

$ claude mcp add chonkie \
  -- python -m otcore.mcp_server <graph>