github.com/GeeeekExplorer/nano-vllm @main sqlite

137 symbols · 376 edges · 21 files · 0 documented (0%) · 1 cross-repo link
README

Nano-vLLM

A lightweight vLLM implementation built from scratch.

Key Features

  • 🚀 Fast offline inference - Comparable inference speeds to vLLM
  • 📖 Readable codebase - Clean implementation in ~1,200 lines of Python
  • Optimization Suite - Prefix caching, Tensor Parallelism, Torch compilation, CUDA graph, etc.
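
As a rough illustration of the prefix-caching idea named in the feature list, the sketch below hashes a prompt in fixed-size blocks and chains each block's hash to the one before it, so a cached block is only reused when the entire prefix matches. This is not the project's code: `BLOCK_SIZE`, `block_hash`, and `cached_prefix_blocks` are hypothetical names, and the standard-library `hashlib` stands in for whatever hash function the real implementation uses.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical block size, kept small so the example is easy to trace


def block_hash(token_ids, prefix_hash=b""):
    """Hash one full block of token ids, chained to the hash of the preceding blocks."""
    h = hashlib.sha256()
    h.update(prefix_hash)
    h.update(",".join(str(t) for t in token_ids).encode())
    return h.digest()


def cached_prefix_blocks(token_ids, cache):
    """Count leading full blocks already present in `cache`; register the rest.

    `cache` is a plain set of block hashes standing in for a real KV-cache index.
    """
    hits = 0
    prefix = b""
    still_matching = True
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE  # ignore the partial tail block
    for start in range(0, full_len, BLOCK_SIZE):
        prefix = block_hash(token_ids[start:start + BLOCK_SIZE], prefix)
        if still_matching and prefix in cache:
            hits += 1
        else:
            still_matching = False  # once one block misses, later blocks cannot be reused
            cache.add(prefix)
    return hits


cache = set()
first = cached_prefix_blocks([1, 2, 3, 4, 5, 6, 7, 8, 9], cache)
second = cached_prefix_blocks([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102], cache)
print(first, second)
```

Because each hash depends on the previous one, two prompts that share a block of tokens at the same position but differ earlier will not collide, which is the property a prefix cache relies on.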

Installation

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

Model Download

To download the model weights manually, use the following command:

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method:

from nanovllm import LLM, SamplingParams
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
outputs[0]["text"]
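
The snippet above reads the completion from `outputs[0]["text"]`. A small helper for printing several results might look like the sketch below. It assumes only that each output is a dict with a `"text"` key, as the snippet shows, and that outputs come back in the same order as the prompts, which is an assumption not confirmed here. `format_results` and the stand-in output list are hypothetical, so the example runs without a GPU or model weights.

```python
def format_results(prompts, outputs):
    """Pair each prompt with its completion text (assumes outputs follow prompt order)."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        f"Prompt: {prompt!r}\nCompletion: {output['text']!r}"
        for prompt, output in zip(prompts, outputs)
    ]


# Stand-in for the result of llm.generate(prompts, sampling_params).
prompts = ["Hello, Nano-vLLM."]
stand_in_outputs = [{"text": "Hi there!"}]

for entry in format_results(prompts, stand_in_outputs):
    print(entry)
```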

Benchmark

See bench.py for the benchmark script.

Test Configuration:

  • Hardware: RTX 4070 Laptop (8GB)
  • Model: Qwen3-0.6B
  • Total Requests: 256 sequences
  • Input Length: Randomly sampled between 100–1024 tokens
  • Output Length: Randomly sampled between 100–1024 tokens
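
A workload with this shape could be generated along the following lines. This is a sketch under stated assumptions rather than the contents of bench.py: the seed, the vocabulary size `VOCAB_SIZE`, and the function name `make_workload` are hypothetical, and prompts are represented as lists of random token ids.

```python
import random

NUM_SEQS = 256               # total requests, from the configuration above
MIN_LEN, MAX_LEN = 100, 1024  # length bounds, from the configuration above
VOCAB_SIZE = 10_000           # hypothetical: upper bound for random token ids


def make_workload(seed=0):
    """Return (prompts, output_lens) with lengths drawn uniformly from [MIN_LEN, MAX_LEN]."""
    rng = random.Random(seed)  # local RNG so the workload is reproducible
    prompts = [
        [rng.randrange(VOCAB_SIZE) for _ in range(rng.randint(MIN_LEN, MAX_LEN))]
        for _ in range(NUM_SEQS)
    ]
    output_lens = [rng.randint(MIN_LEN, MAX_LEN) for _ in range(NUM_SEQS)]
    return prompts, output_lens


prompts, output_lens = make_workload()
print(len(prompts), min(map(len, prompts)), max(map(len, prompts)))
```

Fixing the seed keeps the comparison fair: both engines see the same prompts and are asked for the same number of output tokens.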

Performance Results:

| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|------------------|---------------|----------|-----------------------|
| vLLM             | 133,966       | 98.37    | 1361.84               |
| Nano-vLLM        | 133,966       | 93.41    | 1434.13               |
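
The throughput column is output tokens divided by wall-clock time. The short check below recomputes it from the reported figures; the function and variable names are illustrative only, and small differences in the last digits are expected because the reported times are rounded.

```python
def throughput(output_tokens: int, seconds: float) -> float:
    """Tokens generated per second of wall-clock time."""
    return output_tokens / seconds


# (output tokens, time in seconds) as reported in the results table
results = {
    "vLLM": (133_966, 98.37),
    "Nano-vLLM": (133_966, 93.41),
}

rates = {name: throughput(tokens, secs) for name, (tokens, secs) in results.items()}
speedup = rates["Nano-vLLM"] / rates["vLLM"]

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} tokens/s")
print(f"Nano-vLLM relative to vLLM: {speedup:.3f}x")
```

On these numbers Nano-vLLM comes out roughly 5% ahead of vLLM in this single-GPU, offline setting.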

Core symbols (most depended-on inside this repo)

  • update (nanovllm/engine/block_manager.py) - called by 4
  • divide (nanovllm/layers/linear.py) - called by 4
  • get_context (nanovllm/utils/context.py) - called by 3
  • set_context (nanovllm/utils/context.py) - called by 3
  • add (nanovllm/engine/scheduler.py) - called by 3
  • compute_hash (nanovllm/engine/block_manager.py) - called by 3
  • call (nanovllm/engine/model_runner.py) - called by 3
  • block (nanovllm/engine/sequence.py) - called by 3

Shape

  • Method: 96
  • Class: 29
  • Function: 12

Languages

Python 100%

Modules by API surface

  • nanovllm/layers/linear.py - 22 symbols
  • nanovllm/models/qwen3.py - 16 symbols
  • nanovllm/engine/model_runner.py - 16 symbols
  • nanovllm/engine/sequence.py - 15 symbols
  • nanovllm/engine/block_manager.py - 15 symbols
  • nanovllm/layers/embed_head.py - 7 symbols
  • nanovllm/engine/scheduler.py - 7 symbols
  • nanovllm/engine/llm_engine.py - 7 symbols
  • nanovllm/layers/rotary_embedding.py - 5 symbols
  • nanovllm/layers/layernorm.py - 5 symbols
  • nanovllm/layers/attention.py - 5 symbols
  • nanovllm/utils/context.py - 4 symbols

Dependencies (from manifests, versioned)

  • flash-attn
  • torch 2.4.0 · 1×
  • transformers 4.51.0 · 1×
  • triton 3.0.0 · 1×
  • xxhash

For agents

$ claude mcp add nano-vllm \
  -- python -m otcore.mcp_server <graph>