github.com/GeeeekExplorer/nano-vllm @main sqlite

137 symbols · 376 edges · 21 files · 0 documented (0%) · 1 cross-repo link

README

Nano-vLLM

A lightweight vLLM implementation built from scratch.

Key Features

🚀 Fast offline inference - Comparable inference speeds to vLLM
📖 Readable codebase - Clean implementation in ~1,200 lines of Python code
⚡ Optimization Suite - Prefix caching, Tensor Parallelism, Torch compilation, CUDA graph, etc.
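
To make the prefix caching feature above more concrete, here is a minimal sketch of one common approach: hash each full block of token IDs together with the hash of the preceding block, so that a block's hash identifies the entire prefix up to that point. This is an illustration under assumptions, not the repository's implementation. The names `chained_block_hash`, `block_hashes`, `count_cached_blocks` and the tiny `BLOCK_SIZE` are hypothetical, and the standard-library `hashlib` stands in for whatever hash function the project actually uses.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical block size, kept tiny for illustration


def chained_block_hash(token_ids, prefix_hash=b""):
    """Hash one full block of token IDs, chained to the previous block's hash."""
    h = hashlib.sha256()
    h.update(prefix_hash)
    for t in token_ids:
        h.update(t.to_bytes(8, "little"))
    return h.digest()


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Return chained hashes for every full block; a trailing partial block is skipped."""
    hashes = []
    prev = b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        prev = chained_block_hash(token_ids[start:start + block_size], prev)
        hashes.append(prev)
    return hashes


def count_cached_blocks(token_ids, cache, block_size=BLOCK_SIZE):
    """Count leading blocks already present in the cache, then register the new ones."""
    hits = 0
    matching = True
    for h in block_hashes(token_ids, block_size):
        if matching and h in cache:
            hits += 1
        else:
            matching = False
            cache.add(h)
    return hits


cache = set()
first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
second = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13]  # shares two full blocks with `first`
first_hits = count_cached_blocks(first, cache)
second_hits = count_cached_blocks(second, cache)
print(first_hits, second_hits)
```

Because each hash is chained to its predecessor, two sequences share a cached block only when everything before that block also matches, which is the property a prefix cache needs before it can reuse stored key/value entries.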

Installation

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

Model Download

To download the model weights manually, use the following command:

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method:

from nanovllm import LLM, SamplingParams
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
outputs[0]["text"]

Benchmark

See bench.py for benchmark.

Test Configuration:
- Hardware: RTX 4070 Laptop (8GB)
- Model: Qwen3-0.6B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100–1024 tokens
- Output Length: Randomly sampled between 100–1024 tokens
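
The workload described in the test configuration can be sketched as follows. This is only an illustration of the sampling scheme, assuming uniform integer sampling with inclusive bounds. The helper `make_workload` and the fixed seed are hypothetical, and bench.py may construct its requests differently.

```python
import random


def make_workload(num_seqs=256, min_len=100, max_len=1024, seed=0):
    """Return (input_length, output_length) pairs sampled uniformly, bounds inclusive."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    return [
        (rng.randint(min_len, max_len), rng.randint(min_len, max_len))
        for _ in range(num_seqs)
    ]


workload = make_workload()
total_output_tokens = sum(out_len for _, out_len in workload)
print(len(workload), total_output_tokens)
```

With 256 requests and output lengths averaging around the midpoint of the range, a total in the low-to-mid 100,000s of output tokens is expected, which is the same order of magnitude as the reported figure.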

Performance Results:

| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|------------------|---------------|----------|-----------------------|
| vLLM             | 133,966       | 98.37    | 1361.84               |
| Nano-vLLM        | 133,966       | 93.41    | 1434.13               |
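
The throughput column is output tokens divided by wall-clock time. The short check below recomputes it from the reported figures. The last digits can differ slightly from the table because the published times are rounded to two decimals.

```python
# Reported figures: (output tokens, elapsed seconds)
results = {
    "vLLM": (133_966, 98.37),
    "Nano-vLLM": (133_966, 93.41),
}


def throughput(tokens, seconds):
    """Output tokens generated per second of wall-clock time."""
    return tokens / seconds


tput = {name: throughput(*figures) for name, figures in results.items()}
speedup = tput["Nano-vLLM"] / tput["vLLM"]

for name, value in tput.items():
    print(f"{name}: {value:.1f} tokens/s")
print(f"Nano-vLLM relative to vLLM: {speedup:.3f}x")
```

On these numbers, Nano-vLLM's throughput is roughly 5% higher than vLLM's for this single configuration.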

Core symbols (most depended-on inside this repo)

update · called by 4 · nanovllm/engine/block_manager.py
divide · called by 4 · nanovllm/layers/linear.py
get_context · called by 3 · nanovllm/utils/context.py
set_context · called by 3 · nanovllm/utils/context.py
add · called by 3 · nanovllm/engine/scheduler.py
compute_hash · called by 3 · nanovllm/engine/block_manager.py
call · called by 3 · nanovllm/engine/model_runner.py
block · called by 3 · nanovllm/engine/sequence.py

Shape

Method 96

Class 29

Function 12

Languages

Python 100%

Modules by API surface

nanovllm/layers/linear.py · 22 symbols

nanovllm/models/qwen3.py · 16 symbols

nanovllm/engine/model_runner.py · 16 symbols

nanovllm/engine/sequence.py · 15 symbols

nanovllm/engine/block_manager.py · 15 symbols

nanovllm/layers/embed_head.py · 7 symbols

nanovllm/engine/scheduler.py · 7 symbols

nanovllm/engine/llm_engine.py · 7 symbols

nanovllm/layers/rotary_embedding.py · 5 symbols

nanovllm/layers/layernorm.py · 5 symbols

nanovllm/layers/attention.py · 5 symbols

nanovllm/utils/context.py · 4 symbols

Dependencies from manifests, versioned

flash-attn · 1×

torch 2.4.0 · 1×

transformers 4.51.0 · 1×

triton 3.0.0 · 1×

xxhash · 1×

For agents

$ claude mcp add nano-vllm \
  -- python -m otcore.mcp_server <graph>