github.com/Ontos-AI/knowhere @v2026.06.18.2 sqlite

3,686 symbols 17,575 edges 576 files 1,330 documented · 36%

README

Prepare unstructured data for AI Agents

🔗 Website | 📄 Docs | 🏠 Self-Host | 🖥️ Dashboard

Overview

Knowhere is the memory layer between complex, dirty documents and AI agents.

It ingests unstructured documents and produces persistent, navigable memory: parsing, hierarchy extraction, multi-modal structuring, and graph construction in a single pipeline. Every chunk retains full semantic context, making the output a natural fit for Agentic RAG, vector-based RAG, or any LLM workflow.

[!NOTE] Get started in seconds with Knowhere Cloud. Avoid the complexity of self-deployment. Use our managed API at knowhereto.ai and enjoy $5 in free credits upon registration.

📢 News

June 1, 2026: 📚 Knowhere now supports ultra-long PDFs and atlas-style documents. The parsing pipeline can process long-form PDFs with hundreds of pages (for example, 300, 500, or more) and route technical atlases or drawing collections through a dedicated layout-aware parser.
May 7, 2026: 🚀 Knowhere is now Open Source! We have open-sourced our entire stack for document ingestion, parsing, and agentic RAG. You can now self-host the full platform using knowhere-self-hosted. Check out our Contribution Guide to get involved!

How it Works

Knowhere runs in two steps: build memory from documents, then let agents retrieve from it.

Step 1: Parse and Build Memory

Parse: Route PDFs, Office files, images, tables, Markdown, and text to specialized parsers.
Structure: Our proprietary tree-like algorithm reconstructs the full document hierarchy instead of flattening it into a sequence, which prevents semantic fragmentation across chunks.
Build Memory: Store chunks, navigation trees, summaries, and graph links as agent-ready context.
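
As a rough illustration of the Structure step (not Knowhere's proprietary algorithm, which this README does not describe), the sketch below rebuilds a heading hierarchy from parser Markdown and stamps every chunk with its full heading path. `Chunk` and `chunk_with_paths` are hypothetical names, and the sketch assumes ATX-style `#` headings.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    path: tuple[str, ...]  # heading trail from the document root
    text: str


def chunk_with_paths(markdown: str) -> list[Chunk]:
    """Split Markdown into chunks that each remember their heading path."""
    stack: list[tuple[int, str]] = []  # (level, title) of currently open headings
    buffer: list[str] = []
    chunks: list[Chunk] = []

    def flush() -> None:
        # Emit the buffered body text under the headings open right now.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(Chunk(tuple(title for _, title in stack), text))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks
```

Because each chunk carries its path, a snippet pulled from deep inside a document still states which section it belongs to, which is the property the Structure step is described as preserving.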

Step 2: Agentic Retrieval

Discover: Fuse keyword, path, content, and semantic signals for broad first-pass coverage.
Navigate: Walk section trees and graph links to drill into the most relevant document regions.
Cite Evidence: Return traceable results with source document, section, chunk, and linked assets.
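
The Navigate and Cite Evidence steps can be pictured with a toy section tree. This is a minimal sketch under stated assumptions, not Knowhere's retrieval engine: `Section`, `overlap`, `best_in_subtree`, and `navigate` are hypothetical names, and plain keyword overlap stands in for the fused signals described above.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    text: str = ""
    children: list["Section"] = field(default_factory=list)


def overlap(query: str, section: Section) -> int:
    """Number of query terms found in this section's own title or text."""
    terms = set(query.lower().split())
    words = set(f"{section.title} {section.text}".lower().split())
    return len(terms & words)


def best_in_subtree(query: str, section: Section) -> int:
    """Best overlap score found in this section or anywhere below it."""
    return max([overlap(query, section)]
               + [best_in_subtree(query, child) for child in section.children])


def navigate(query: str, root: Section) -> dict:
    """Greedily descend into the most promising child and return traceable evidence."""
    node, path = root, [root.title]
    while node.children:
        child = max(node.children, key=lambda c: best_in_subtree(query, c))
        child_score = best_in_subtree(query, child)
        # Stop when no child looks better than the section we are already in.
        if child_score == 0 or child_score < overlap(query, node):
            break
        node = child
        path.append(node.title)
    return {"path": path, "evidence": node.text}
```

The returned `path` plays the role of a citation: it records which sections were walked to reach the evidence, instead of returning an isolated snippet.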

FAQ

Q: What is Knowhere's relationship with MinerU?

A: Knowhere uses MinerU as its default parser because it performs best in our tests, but any tool that outputs Markdown works. A parser only gets you raw Markdown; Knowhere's value is what comes after: hierarchy reconstruction, multi-modal normalization, and cross-document graph construction.

Q: What LLM / VLM dependencies does Knowhere have?

A: By default, DeepSeek (deepseek-chat) handles text and table summarization, and Qwen-VL (qwen3.6-flash) handles image OCR and descriptions. Knowhere is model-agnostic. Swap in OpenAI, DashScope, Zhipu, or Volcengine via environment variables.

Q: How is Agentic Retrieval different from traditional RAG?

A: Traditional RAG does a flat vector lookup and returns isolated snippets. Knowhere's agents navigate the document's section tree and cross-document graph, drilling into the most relevant regions the way a human reader would, returning traceable, well-contextualized evidence.

Q: Does it handle images and tables?

A: Yes. Knowhere extracts them, runs them through VLMs for summarization and feature extraction, and links them back to their source chunks so agents can retrieve and cite multi-modal assets at inference time.

Performance Benchmark

Agents using Knowhere outperform those working from raw documents or from MarkItDown, Unstructured, or MinerU output on real-world tasks: searching, modifying, and answering questions.

Benchmark Performance: Agent + Knowhere vs Others

We're not developing the next MinerU; we're building document memory infrastructure that agents can consume effectively.

Key Advantages

Accuracy: +36% first-try accuracy and +11% recall over raw documents.
Reliability: 79% accuracy with feedback, vs. a ~53% ceiling on raw docs.
Efficiency: Fewer loops, fewer tokens, less time. Agents navigate a structured graph instead of reading monolithic text.

(Internal evaluation across identical agentic RAG tasks. Baselines: raw documents and parser output fed directly to agents.)

[!NOTE] 📊 Benchmarks are actively expanding. More parsers and retrieval baselines coming soon.

Ecosystem

Repository	Description
knowhere	This repo. Backend API and worker: document ingestion, parsing, graph construction, and retrieval.
🖥️ knowhere-dashboard	The web UI. Connects to the API for the full product experience.
🐳 knowhere-self-hosted	Docker Compose stack for self-hosted deployments. Packages the API, worker, and dashboard together.
🐍 knowhere-python-sdk	Official Python SDK for the Knowhere Cloud API.
🦕 knowhere-node-sdk	Official Node.js SDK for the Knowhere Cloud API.

Features

Multi-modal Parsing: High-fidelity extraction from PDF, Office, and images, preserving headings, tables, and hierarchical paths.
Lightweight Memory Graph: Context-aware organization that links documents and chunks for better relationship understanding.
Agentic RAG: A hybrid retrieval engine that combines traditional search, merged with Reciprocal Rank Fusion (RRF), and autonomous agent navigation.
Evidence-based Citations: Every result is backed by traceable source paths, ensuring reliability for AI Agent decision-making.
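
For readers unfamiliar with RRF, here is a minimal sketch of standard Reciprocal Rank Fusion, where each document scores the sum of 1 / (k + rank) over every ranking it appears in. The function name, the example IDs, and the conventional default `k = 60` are assumptions for illustration; the README does not state how Knowhere configures its fusion.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ordering."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list earn the largest contributions.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]
semantic_hits = ["b", "c", "a"]
path_hits = ["b", "a", "d"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits, path_hits])
```

RRF only needs rank positions, so signals with incomparable score scales (keyword, path, semantic) can be merged without normalization.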

Supported Formats

✅ Supported

[x] .pdf .docx .pptx .xlsx .csv
[x] .jpg .png
[x] .md .txt .json
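
The supported list can be read as a routing table from file extension to parser. The sketch below is illustrative only: the extensions come from the list above, but the parser family names (`document`, `table`, `image`, `text`) and the `route` function are hypothetical and do not reflect Knowhere's internal module layout.

```python
from pathlib import Path

# Extension -> hypothetical parser family; extensions taken from the supported list.
SUPPORTED = {
    ".pdf": "document", ".docx": "document", ".pptx": "document",
    ".xlsx": "table", ".csv": "table",
    ".jpg": "image", ".png": "image",
    ".md": "text", ".txt": "text", ".json": "text",
}


def route(filename: str) -> str:
    """Return the parser family for a file, or raise if the format is unsupported."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported format: {filename}")
    return SUPPORTED[suffix]
```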

⏳ Coming Soon

[ ] .epub .html .xml
[ ] .mp4 .mp3
[ ] .skills.md

Want to see a new format supported? Adding a parser is a great first contribution. Check out CONTRIBUTING.md to get started.

Prerequisites

Python 3.11+
uv
Docker with docker compose

Quick Start

Sync the workspace dependencies:

uv sync --all-packages

Copy the environment examples:

cp apps/api/.env.example apps/api/.env
cp apps/worker/.env.example apps/worker/.env

Update the copied .env files with the values you need for local work:
database and Redis connection settings
S3-compatible storage credentials
at least one LLM provider key: DS_KEY, ALI_API_KEYS, GPT_API_KEY, or GLM_API_KEY
MINERU_API_KEYS if you need PDF parsing
a vision-capable model provider if you need image summaries, OCR, atlas classification, or image-aware retrieval
any optional billing or webhook providers you want to enable

Most parser and retrieval tuning values have code defaults. Start with the required external services first, then override model names, provider URLs, budgets, or concurrency limits only when your deployment needs different behavior. See docs/external-services.md for the full dependency matrix.
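
Before starting the services, it can help to sanity-check the copied `.env` files. The helper below is a hypothetical sketch, not part of the repository: it uses only the variable names listed above, parses simple `KEY=VALUE` lines, and does not handle the full dotenv syntax (multi-line values, `export` prefixes, interpolation).

```python
LLM_KEYS = ("DS_KEY", "ALI_API_KEYS", "GPT_API_KEY", "GLM_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_env(values: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the basics are set."""
    problems = []
    if not any(values.get(key) for key in LLM_KEYS):
        problems.append("set at least one LLM provider key: " + ", ".join(LLM_KEYS))
    if not values.get("MINERU_API_KEYS"):
        problems.append("MINERU_API_KEYS is empty; it is needed for PDF parsing")
    return problems
```

A typical call would be `check_env(parse_env(open("apps/api/.env").read()))`, repeated for `apps/worker/.env`.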

Start the local infrastructure stack:

./deploy/local-dev/start-dev.sh

Start the API and worker in separate terminals:

cd apps/api && uv run main.py
cd apps/worker && uv run worker.py

The API runs migrations during startup.

For API-only development without the dashboard, create an API-only user/key after the API service starts:

cd apps/api
uv run scripts/init_user.py --email you@example.com

If you plan to use the dashboard, register through the dashboard instead of using scripts/init_user.py.

The API is now running at http://localhost:5005. If you want the full product experience with a UI, run the knowhere-dashboard alongside it; it connects to this API out of the box.

Quality Checks

Run lint checks from the repository root:

make lint

Apply safe Ruff fixes:

make lint-fix

Run type checks across the API, worker, and shared source code:

make typecheck

Run both lint and type checks:

make check

Local Endpoints

API: http://localhost:5005
OpenAPI docs: http://localhost:5005/docs
LocalStack: http://localhost:4566
PostgreSQL: localhost:5432
Redis: localhost:6379
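
To confirm the local stack is up, a small TCP probe against the ports above is usually enough. This is a hedged sketch using only the Python standard library; `is_listening` and `report` are hypothetical helpers, and an open port shows only that something is listening, not that the service is healthy.

```python
import socket

# Host and port pairs taken from the Local Endpoints list.
LOCAL_SERVICES = {
    "API": ("localhost", 5005),
    "LocalStack": ("localhost", 4566),
    "PostgreSQL": ("localhost", 5432),
    "Redis": ("localhost", 6379),
}


def is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(services: dict[str, tuple[str, int]] = LOCAL_SERVICES) -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, (host, port) in services.items()}
```

Calling `report()` after `./deploy/local-dev/start-dev.sh` and the API have started should show which of the endpoints are reachable.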

Additional Guides

External dependency guide: docs/external-services.md

Citation

If you use Knowhere in your research, please cite it as:

@software{knowhere2026,
  author       = {Ontos AI},
  title        = {Knowhere: Prepare Unstructured Data for AI Agents},
  year         = {2026},
  publisher    = {GitHub},
  url          = {https://github.com/Ontos-AI/knowhere},
  version      = {2026.04.30.1},
  license      = {Apache-2.0}
}

Communication

GitHub Discussions for questions, ideas, and general conversation.
GitHub Issues for bug reports and feature requests.

Contribution

Any contributions to Knowhere are more than welcome!

If you are new to the project, check out the good first issues. They are well-defined, relatively simple, and a great way to get familiar with the codebase and the contribution workflow.

For general guidelines on branching, commit conventions, and the review process, take a look at CONTRIBUTING.md.

Other useful references:

SECURITY.md: how to report vulnerabilities responsibly.
CODE_OF_CONDUCT.md: community behavior expectations.
LICENSE and NOTICE: Apache 2.0.

👋 We're Hiring!

We're building the knowledge layer for the Agent era. If that sounds like work you want to do, reach out. Decode the address below and drop us a line:

echo 'dGVhbUBrbm93aGVyZXRvLmFp' | base64 --decode

Core symbols most depended-on inside this repo

append

called by 692

apps/worker/app/services/document_parser/support/parser_rows.py

get

called by 654

packages/shared-python/shared/services/retrieval/agentic/navigation/actions.py

get

called by 488

apps/worker/app/services/document_agent/registry.py

get

called by 284

packages/shared-python/shared/services/redis/redis_service.py

error

called by 264

packages/shared-python/shared/services/retrieval/agentic/core/types.py

get

called by 153

apps/api/app/repositories/base_repository.py

execute

called by 123

apps/api/tests/support/contract_database.py

execute

called by 122

packages/shared-python/shared/services/retrieval/execution/plan.py

Shape

Function 1,709

Method 1,396

Class 537

Route 44

Languages

Python 100%

Modules by API surface

packages/shared-python/shared/core/exceptions/domain_exceptions.py · 70 symbols

apps/worker/tests/contract/test_doc_profile_anatomy_contract.py · 49 symbols

packages/shared-python/shared/testing/contract_runtime.py · 47 symbols

packages/shared-python/shared/services/redis/redis_sync_service.py · 42 symbols

apps/worker/app/services/document_agent/structure/hierarchy_locator.py · 42 symbols

packages/shared-python/shared/services/redis/key_builder.py · 38 symbols

apps/api/tests/contract/test_job_creation_contract.py · 37 symbols

apps/api/tests/contract/test_self_hosted_telemetry_contract.py · 36 symbols

apps/api/app/services/demo/source_projection.py · 35 symbols

apps/worker/app/services/document_agent/manifest.py · 32 symbols

packages/shared-python/shared/services/redis/redis_service.py · 31 symbols

packages/shared-python/shared/services/retrieval/agentic/navigation/actions.py · 29 symbols

Dependencies from manifests, versioned

PyJWT 2.12.0 · 1×

aiohappyeyeballs 2.6.1 · 1×

aiohttp 3.13.4 · 1×

aiosignal 1.4.0 · 1×

alembic 1.13.1 · 1×

aliyun-python-sdk-core 2.16.0 · 1×

aliyun-python-sdk-kms 2.16.5 · 1×

amqp 5.3.1 · 1×

annotated-doc 0.0.4 · 1×

annotated-types 0.7.0 · 1×

anyio 4.13.0 · 1×

argon2-cffi 23.1.0 · 1×

For agents

$ claude mcp add knowhere \
  -- python -m otcore.mcp_server <graph>
