MCPcopy
hub / github.com/TheTom/turboquant_plus

github.com/TheTom/turboquant_plus @v0.3.2.3 sqlite

repository ↗ · DeepWiki ↗ · release v0.3.2.3 ↗
1,637 symbols 5,939 edges 81 files 619 documented · 38%
README

TurboQuant+

Getting Started Guide | Configuration Recommendations | llama.cpp Fork | Swift MLX Fork | vllm-swift | Commercial Support

🍎 On Apple Silicon and want the fastest path? Use ekryski/mlx-swift-lm — Eric Kryski's Swift MLX implementation that I've been actively collaborating on. Native Swift, ~2.5x faster decode than Python mlx-lm, full TurboQuant+ support including turbo4v2 (4-bit K + 2-bit V). 144 tok/s on Qwen3.5-35B-A3B MoE at 4K on M5 Max. For OpenAI-compatible serving, use vllm-swift — a native Swift/Metal backend for vLLM built on mlx-swift-lm. No Python in the inference hot path, works with Hermes, OpenCode, and any OpenAI client. This llama.cpp repo is for cross-platform deployment (CUDA, ROCm, CPU, Metal).

Prebuilt Binaries

Download ready-to-run builds from Releases. No build tools needed.

Platform Download
Mac (Apple Silicon, Metal) turboquant-plus-*-macos-arm64-metal.tar.gz
Windows (CUDA 12.4) turboquant-plus-*-windows-x64-cuda12.4.zip

Unpack and run. Includes llama-server, llama-bench, llama-cli, and all tools.


Implementation of TurboQuant (ICLR 2026) with implementation work, experiments, and follow-on findings beyond the base paper. KV cache compression for local LLM inference.

Note

This repository is an experimental integration and research workspace for TurboQuant-related work targeting llama.cpp. The goal is to make it easier to compare approaches, collect reproducible benchmark and quality data, and share implementation details across hardware and backends. It is not intended as a separate long-term fork or a proposal to merge the branch as a whole.

If individual pieces prove useful and stable, the intent is to upstream them incrementally as small, reviewable patches in line with llama.cpp's normal contribution process.

What's In This Branch

  • Experimental TurboQuant-related integrations for llama.cpp
  • Benchmark and quality validation across models, contexts, and hardware
  • Backend-specific implementation work and performance experiments
  • Documentation and writeups intended to make testing and reproduction easier
  • Candidate ideas that may be worth upstreaming individually if they prove stable

Current Findings

Three follow-on findings in this branch have been independently validated by multiple researchers across different hardware and backends:

  1. V compression is free. Compressing the value cache (even down to 2 bits) has zero measurable effect on attention quality when key precision is maintained. Confirmed on Metal (M5 Max), CUDA RTX 4090 (@sztlink), and CUDA RTX 3090 (@HyperionMS2040). See asymmetric K/V paper.
  2. All quality degradation comes from K compression. This is why asymmetric configs (q8_0-K + turbo-V) rescue models where symmetric fails. Validated across Qwen, Llama, Mistral, and Command-R+ families. See M5 Max stress test.
  3. Boundary layers are disproportionately sensitive. Protecting the first 2 + last 2 layers at higher precision recovers 37-91% of the quality gap. See Boundary V paper.

Additional experiments and writeups: Sparse V dequant (+22.8% decode), block size optimization (5.12x compression), turbo4 resurrection (QJL hurts, PolarQuant works), EDEN optimal-S response (rotation is first-order, scale is second-order).

Compresses transformer KV cache 3.8-6.4x using PolarQuant + Walsh-Hadamard rotation. Near q8_0 prefill speed and ~0.9x decode throughput at long context (Apple Silicon). Full format family: turbo2 (2-bit, 6.4x), turbo3 (3-bit, 4.6-5.1x), turbo4 (4-bit, 3.8x). turbo3 compression depends on storage block size; see block size study.

Sparse V: Attention-gated KV cache decoding that skips low-weight V positions during inference. Up to +22.8% decode speed at 32K context, validated on wikitext-103 (50 chunks, CI +/-0.021) with no measurable PPL change. Not TurboQuant-specific; validated across q8_0, q4_0, and turbo3 KV formats. ~1% perplexity increase vs q8_0 from compression; Sparse V itself introduces no additional degradation (ON/OFF delta = 0.000).

Validated end-to-end from 1.5B to 104B on M5 Max via llama.cpp Metal. 104B at 128K context on a MacBook with turbo3 (PPL 4.024, 74 GB peak memory).

Status: v1 Complete, Speed Optimized, Community-Tested

  • 511+ Python tests, 100% code coverage on diagnostics
  • C port integrated into llama.cpp with Metal GPU kernels
  • --cache-type-k turbo3 --cache-type-v turbo3 works on Apple Silicon (turbo2/turbo3/turbo4 all supported)
  • turbo2 Metal support: 2-bit, 6.4x compression, +6.48% PPL — for extreme memory pressure or asymmetric K/V
  • q8_0 prefill speed parity achieved (2747 vs 2694 tok/s)
  • Norm correction: PPL beats q8_0 on CUDA (-1.17%), +1.1% on Metal (ported from @spiritbuun)
  • 4-mag LUT: auto-detected on M1/M2/M3/M4, +38-45% decode at long context
  • Layer-adaptive mode 2: q8_0 quality at 3.5x compression (last 8 layers at q8_0)
  • Temporal decay: 30-34% memory savings at long context (experiment branch)
  • NIAH retrieval: 9/9 single needle with sparse V (vs 7/9 baseline), 100% multi-key through 32K. 30/30 on Llama-70B, 10/10 on Command-R+ 104B
  • 14 decode approaches tested on M2 Pro — comprehensive hardware analysis
  • Stress tested up to 104B: Command-R+ 104B Q4_K_M at 128K context (PPL 4.024). Llama-70B Q4_K_M at 48K (PPL 4.019). turbo3 prefill faster than q8_0 at 32K on both models
  • Community: 30+ testers across M1/M2/M3/M5 Mac, RTX 3080 Ti/3090/4090/5090, AMD 6800 XT/9070 XT
  • Rotation Gaussianization validated on real Qwen3 KV tensors (kurtosis 900 → 2.9)

Quality and Speed (M5 Max 128GB)

Top-of-Tree Results

Cache Type Bits/val Compression PPL (wikitext-2, 512c) vs q8_0
f16 16.0 1.0x 6.121 -0.16%
q8_0 8.5 1.9x 6.111 baseline
turbo4 4.25 3.8x 6.125 +0.23%
q4_0 4.5 3.6x 6.142 +0.52%
turbo3 3.5† 4.6x† 6.176 +1.06%
turbo2 2.5 6.4x 6.507 +6.48%

turbo4 (4-bit PolarQuant) has the best quality after q8_0 — closer to q8_0 than q4_0, at better compression. turbo3 trades quality for maximum compression. turbo2 (2-bit) trades more quality for extreme compression — best used asymmetrically.

†turbo3 at default block_size=32. At block_size=128, turbo3 achieves 3.125 bits/val and 5.12x compression with identical PPL, validated on Metal across 3 model architectures, 3 context lengths (512–32K), and 2 Apple Silicon platforms. Tested on both asymmetric (q8_0-K + turbo3-V) and symmetric (turbo3/turbo3) paths. On the tested M2 Pro setup (Qwen2.5-1.5B, q8_0-K + turbo3-V), block_size=128 also improved decode by 3–7%; this gain was not observed on M5 Max. Earlier turbo3 figures (4.6x) reflect the block_size=32 default. CUDA not yet validated. See block size study.

Important: choosing the right config for your model. TurboQuant quality depends on your base weight quantization. Models with Q8_0+ weights work well with symmetric turbo (e.g., -ctk turbo3 -ctv turbo3). Some low-bit models with Q4_K_M weights may benefit from asymmetric K/V: use -ctk q8_0 -ctv turbo4 to keep K precision high while compressing V (tested on Qwen2.5-7B Q4_K_M). K precision is the dominant quality factor because it controls attention routing via softmax. Note: not all Q4_K_M models are sensitive — Mistral-24B, Llama-70B, and Command-R+ 104B all handle symmetric turbo fine. Bigger models absorb quantization stacking better (104B: +3.6% vs 70B: +11.4% for turbo3). Validate on your specific model. See Configuration Recommendations for the full tested matrix and practical guidance.

Validated on Metal (Apple Silicon). CUDA mixed q8_0 × turbo parity is not yet verified.

Asymmetric K/V (NEW)

TurboQuant supports independent K and V cache types. In current testing, keeping K at q8_0 while compressing V with turbo rescues quality on low-bit models where symmetric turbo degrades:

Model (weights) K V PPL vs q8_0
Qwen2.5-7B (Q4_K_M) q8_0 turbo4 6.64 +1.0%
Qwen2.5-7B (Q4_K_M) q8_0 turbo3 6.71 +2.0%
Qwen2.5-7B (Q4_K_M) turbo3 turbo3 3556 catastrophic
# Validated starting point for low-bit models
# (tested on Qwen2.5-7B Q4_K_M; not all Q4_K_M models need this)
llama-server -m model-Q4_K_M.gguf -ctk q8_0 -ctv turbo4 -fa 1

Boundary V (Layer-Aware V Compression)

Not all V layers need the same precision. Boundary V protects the first 2 + last 2 layers with q8_0-V while compressing all remaining layers with turbo2-V. 15 lines of code, no speed penalty.

Model turbo2 PPL Boundary V PPL turbo3 PPL Quality recovered
phi-4-Q8_0 (40L) 4.835 4.784 4.742 55%
Qwen2.5-7B Q4_K_M (28L) 6.911 6.835 6.707 37%
Qwen3.5-35B MoE (64L) 5.257 5.148 5.137 91%
Qwen3.5-27B Dense (36L) 6.534 6.423 6.273 42%

Validated at 512 and 8K context. NIAH retrieval passed. Benefit scales with model depth (91% on 64-layer MoE). Independently validated by @Corianas_ on NanoGPT.

Enabled by default. Activate manually on older builds with TURBO_LAYER_ADAPTIVE=7 env var.

# Boundary V — boundary layers q8_0-V, rest turbo2-V
llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

See full paper.

Prefill Context Scaling (Verified 2K-32K)

Context turbo4 tok/s turbo3 tok/s q8_0 tok/s turbo4/q8_0 turbo3/q8_0
2K 2682 2708 2665 1.01x 1.02x
4K 2370 2289 2255 1.05x 1.01x
8K 2041 2054 2002 1.02x 1.03x
16K 1621 1698 1605 1.01x 1.06x
32K 1141 1204 1098 1.04x 1.10x

Prefill: both turbo3 and turbo4 match or exceed q8_0 speed. Compressed cache uses less bandwidth.

Decode Speed — MoE (M5 Max 128GB, Qwen3.5-35B-A3B, Sparse V)

Config Short (tg128) pp32768+tg128 Short vs q8_0
q8_0 85.71 tok/s 1173.91 tok/s
turbo4 79.87 tok/s 1060.12 tok/s 0.93x
turbo3 76.84 tok/s 1141.74 tok/s 0.90x

turbo4 decode is faster than turbo3 due to simpler nibble packing and direct-extract dequant.

Real-world server benchmark (70-page PDF, ~24K context):

Config Prefill tok/s Decode tok/s Decode vs q8_0
q8_0 1449.9 68.2
turbo4 1405.9 63.7 0.93x
turbo3 1417.8 53.3 0.78x

NIAH Retrieval (turbo4)

Test q8_0 turbo4 turbo3 + sparse V
Single needle (33 positions) 30/33 (90.9%) 31/33 (93.9%) 9/9 (3-pos)

turbo4 beats q8_0 on retrieval (31/33 vs 30/33). Shared failure at 8K/100% is a model weakness, not quantization. See turbo4 resurrection for the full investigation.

Large Model Stress Tests (M5 Max 128GB)

Model Params Weights Config PPL vs q8_0 Max Context NIAH
Llama-3.1-70B 70B Q4_K_M turbo4/turbo4 3.461 +6.3% 48K 30/30
Llama-3.1-70B 70B Q4_K_M turbo3/turbo3 3.629 +11.4% 48K 30/30
Command-R+ 104B 104B Q4_K_M turbo4/turbo4 6.312 +1.9% 128K 10/10
Command-R+ 104B 104B Q4_K_M turbo3/turbo3 6.415 +3.6% 128K 10/10

turbo3 prefill is faster than q8_0 at 32K on both models (70B: 80.8 vs 75.2 t/s, 104B: 64.5 vs 62.3 t/s). Smaller KV cache = less memory bandwidth during attention.

104B at 128K requires raising macOS GPU memory cap: sudo sysctl iogpu.wired_limit_mb=117964 (90% of 128GB). Without this, Metal stalls at ~49K context on 70B+ models. See Getting Started Guide for per-RAM values.

See M5 Max stress test for the full data.

KL Divergence vs f16

Cache Mean KLD Δp RMS Same top-p %
q8_0 0.001549 1.23% 98.43%
turbo4 0.009633 2.71% 95.98%
q4_0 0.008091 2.75% 95.83%
turbo3 0.016145 4.09% 94.31%

turbo4 KLD is 40% lower than turbo3. Same top-p agreement matches q4_0.

Decode Speed — Dense (M5 Max 128GB, Qwen3.5-27B, Sparse V)

Test With sparse V Without Delta
Short (tg128) 16.73 16.61 +0.7%
8K (pp8192+tg128) 298.27 294.52 +1.3%
16K (pp16384+tg128) 316.98 311.24 +1.8%

Dense models see smaller gains (attention is <5% of decode — FFN dominates). No regressions. Safe to enable by default.

Sparse V dequant skips V dequantization fo

Core symbols most depended-on inside this repo

write
called by 261
scripts/turbo_hardware_diag.py
close
called by 147
scripts/turbo_hardware_diag.py
composite_score
called by 59
refract/score.py
warning
called by 48
scripts/turbo_hardware_diag.py
_esc
called by 42
refract/report_html.py
parse
called by 37
refract/runner.py
make_kld
called by 33
refract/tests/_fixtures.py
make_gtm
called by 30
refract/tests/_fixtures.py

Shape

Function 736
Method 685
Class 146
Route 70

Languages

Python100%

Modules by API surface

tests/test_turbo_hardware_diag.py362 symbols
tests/test_niah.py153 symbols
scripts/turbo_hardware_diag.py90 symbols
tests/test_hw_replay.py47 symbols
refract/tests/test_backends.py46 symbols
refract/tests/test_report_html.py45 symbols
refract/tests/test_axes_logic.py44 symbols
refract/tests/test_mlx_backend.py42 symbols
refract/tests/test_runner.py40 symbols
refract/tests/test_trajectory.py35 symbols
scripts/niah_test.py33 symbols
refract/tests/test_cli_runs.py30 symbols

For agents

$ claude mcp add turboquant_plus \
  -- python -m otcore.mcp_server <graph>

⬇ download graph artifact