██╗ ██╗███████╗ █████╗ ██████╗ ██████╗ ██████╗ ██████╗ ███╗ ███╗
██║ ██║██╔════╝██╔══██╗██╔══██╗██╔══██╗██╔═══██╗██╔═══██╗████╗ ████║
███████║█████╗ ███████║██║ ██║██████╔╝██║ ██║██║ ██║██╔████╔██║
██╔══██║██╔══╝ ██╔══██║██║ ██║██╔══██╗██║ ██║██║ ██║██║╚██╔╝██║
██║ ██║███████╗██║ ██║██████╔╝██║ ██║╚██████╔╝╚██████╔╝██║ ╚═╝ ██║
╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝╚═════╝ ╚═╝ ╚═╝ ╚═════╝ ╚═════╝ ╚═╝ ╚═╝
The context compression layer for AI agents
60–95% fewer tokens · library · proxy · MCP · content-aware compressors · local-first · reversible
Docs · Install · Proof · Agents · Discord · llms.txt
AI agents / LLMs: read /llms.txt here, or fetch the live index / full docs blob.
Headroom compresses everything your AI agent reads — tool outputs, logs, RAG chunks, files, and conversation history — before it reaches the LLM. Same answers, fraction of the tokens.

Live: 10,144 → 1,260 tokens — same FATAL found.
compress(messages) in Python or TypeScript, inline in any appheadroom proxy --port 8787, zero code changes, any languageheadroom wrap claude|codex|copilot|cursor|aider|opencode|cline|continue|goose|openhands|openclaw|vibe in one command; undo with headroom unwrap <tool>headroom_compress, headroom_retrieve, headroom_stats for any MCP clientheadroom learn — mines failed sessions, writes corrections to CLAUDE.local.md (default, gitignored) or CLAUDE.md / AGENTS.md / GEMINI.md Your agent / app
(Claude Code, Cursor, Codex, LangChain, Agno, Strands, your own code…)
│ prompts · tool outputs · logs · RAG results · files
▼
┌────────────────────────────────────────────────────┐
│ Headroom (runs locally — your data stays here) │
│ ──────────────────────────────────────────────── │
│ CacheAligner → ContentRouter → CCR │
│ ├─ SmartCrusher (JSON) │
│ ├─ CodeCompressor (AST) │
│ └─ Kompress-v2-base (text, HF) │
│ │
│ Cross-agent memory · headroom learn · MCP │
└────────────────────────────────────────────────────┘
│ compressed prompt + retrieval tool
▼
LLM provider (Anthropic · OpenAI · Bedrock · …)
headroom_retrieve if it needs them→ Architecture · CCR reversible compression · Kompress-v2-base model card
# 1 — Install
pip install "headroom-ai[all]" # Python
npm install headroom-ai # Node / TypeScript
# 2 — Pick your mode
headroom wrap claude # wrap a coding agent
headroom proxy --port 8787 # drop-in proxy, zero code changes
# or: from headroom import compress # inline library
# 3 — Verify setup and see the savings
headroom doctor # health check — confirms routing is working
headroom perf
headroom dashboard # live savings dashboard (proxy must be running)
Granular extras: [proxy], [mcp], [ml], [code], [memory], [vector] (optional HNSW backend — needs a C++ toolchain, not in [all]), [relevance], [image], [agno], [langchain], [evals], [pytorch-mps] (Apple-GPU memory-embedder offload — set HEADROOM_EMBEDDER_RUNTIME=pytorch_mps). Requires Python 3.10+.
Savings on real agent workloads:
| Workload | Before | After | Savings |
|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
| Codebase exploration | 78,502 | 41,254 | 47% |
Accuracy preserved on standard benchmarks:
| Benchmark | Category | N | Baseline | Headroom | Delta |
|---|---|---|---|---|---|
| GSM8K | Math | 100 | 0.870 | 0.870 | ±0.000 |
| TruthfulQA | Factual | 100 | 0.530 | 0.560 | +0.030 |
| SQuAD v2 | QA | 100 | — | 97% | 19% compression |
| BFCL | Tools | 100 | — | 97% | 32% compression |
Reproduce: python -m headroom.evals suite --tier 1 · Full benchmarks & methodology
Everything above shrinks the prompt you send. But you also pay for every token the model writes back — and on Opus-class models output costs 5× input. A lot of that output is waste: "Great, let me…" preambles, re-printing code you just showed it, and deep "thinking" on routine steps like reading a file.
Headroom can trim that too, from the proxy, without you changing any code:
Turn it on:
export HEADROOM_OUTPUT_SHAPER=1 # off by default
headroom proxy --port 8787
Already running a proxy? These switches are read live on every request, so a proxy that
headroom wrapreused (rather than started) would not see a value you export afterwards — its environment was snapshotted at launch.headroom wrapnow hot-syncs your current settings to the running proxy via a loopbackPOST /admin/runtime-env, so they take effect immediately with no restart (no cold start, no dropped requests, no lost caches). Set them before youwrap. On a shared proxy these overrides are global — the last explicit setting wins.
Learn the right terseness for you. People don't say how terse they want
answers — they show it (they interrupt long replies, or move on before they
could have read them). headroom learn --verbosity reads your past sessions and
picks the level automatically:
headroom learn --verbosity # preview what it found (dry run)
headroom learn --verbosity --apply # save it; the proxy uses it from now on
See how many output tokens you saved. Output savings are counterfactual — we never see what the model would have written — so Headroom reports an honest estimate with a confidence range, never a made-up number:
headroom output-savings
# Reduction: 31.7% (95% CI 27.7% … 35.7%) [estimated]
Want a measured number instead of an estimate? Leave 10% of conversations
unshaped as a control group: export HEADROOM_OUTPUT_HOLDOUT=0.1. The dashboard
shows an Output Tokens Saved card next to input compression, labelled
measured or estimated with the confidence band.
→ Full write-up incl. the measurement methodology: Output token reduction
| Agent | headroom wrap |
Notes |
|---|---|---|
| Claude Code | ✅ | --memory · --code-graph · --1m · --tool-search |
| Codex | ✅ | shares memory with Claude |
| Cursor | Manual setup | starts proxy and prints base URLs for Cursor settings |
| Aider | ✅ | starts proxy + launches |
| Copilot CLI | ✅ | starts proxy + launches |
| OpenClaw | ✅ | installs as ContextEngine plugin |
| OpenCode | ✅ | injects config · starts proxy + launches |
| Cline | ✅ | starts proxy + injects config |
| Continue | ✅ | starts proxy + injects config |
| Goose | ✅ | starts proxy + launches |
| OpenHands | ✅ | starts proxy + launches |
| Mistral Vibe | ✅ | starts proxy + launches |
| Cortex Code | Library only | 60–65% savings (library mode; no wrap) |
Any OpenAI-compatible client works via headroom proxy. MCP-native: headroom mcp install.
Undo durable wrapping with headroom unwrap <tool> (supports: claude, copilot, codex, opencode, openclaw).
Headroom can route GitHub Copilot CLI subscription traffic through the local proxy:
headroom copilot-auth login
headroom wrap copilot --subscription -- --model gpt-4o
This lets Headroom intercept OpenAI-compatible Copilot CLI requests and apply the same proxy compression pipeline before forwarding to GitHub Copilot's hosted API. The wrapper exchanges Headroom's reusable GitHub OAuth token for Copilot's short-lived API token and prints the upstream endpoint as COPILOT_PROVIDER_API_URL=... during launch.
headroom copilot-auth login stores a Headroom-specific Copilot OAuth token.
This avoids relying on generic GitHub or Copilot CLI tokens that can read
Copilot account metadata but may still be rejected by Copilot's token-exchange
endpoint.
For GitHub Enterprise Server or custom-domain Copilot deployments, set the deployment domain before launching:
export GITHUB_COPILOT_ENTERPRISE_DOMAIN=ghe.example.com
For GitHub.com Enterprise Cloud URLs such as
github.com/enterprises/your-enterprise, do not set an enterprise-domain
override. Headroom uses GitHub's normal token-exchange endpoint and the Copilot
API endpoint advertised for the signed-in account.
Platform support note: macOS auth reuse via Copilot CLI Keychain storage has been smoke-tested. Windows Credential Manager, Linux Secret Service / secret-tool, and Docker/CI token-injection paths are implemented or planned as auth-discovery paths, but still need real OS validation before they should be considered fully vetted. For Docker and CI, prefer passing an explicit GITHUB_COPILOT_TOKEN or GITHUB_COPILOT_GITHUB_TOKEN rather than relying on host keychain access.
Great fit if you… - run AI coding agents daily and want savings without changing your code - work across multiple agents and want shared memory - need reversible compression — originals are retrievable via CCR within the configured TTL
Skip it if you… - only use a single provider's native compaction and don't need cross-agent memory - work in a sandboxed environment where local processes can't run
Integrations — drop Headroom into any stack
| Your setup | Hook in with |
|---|---|
| Any Python |
$ claude mcp add headroom \
-- python -m otcore.mcp_server <graph>