Run multiple generative AI models on your machine and hot-swap between them on demand. llama-swap works with any OpenAI and Anthropic API compatible server and is used by thousands of people to power their local AI workflows.
Built in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. Get started in minutes - just one binary and one configuration file.
v1/completionsv1/chat/completionsv1/responsesv1/embeddingsv1/models - list available modelsv1/audio/speech (#36)v1/audio/transcriptions (docs)v1/audio/voicesv1/images/generationsv1/images/editsv1/messagesv1/messages/count_tokensv1/rerank, v1/reranking, /rerank/infill - for code infilling/completion - for completion endpoint/props - requires ?model={model_id} query parameter to be provided. The autoload parameter is not supported and will be ignored./sdapi/v1/txt2img/sdapi/v1/img2img/sdapi/v1/loras - requires model in request body to fetch the correct loras/ui - web UI/upstream/:model_id - direct access to upstream server (demo)/running - list currently running models (#61)POST /api/models/unload - manually unload all running models (#58)POST /api/models/unload/:model_id - unload a specific model/logs - remote log monitoringGET /logs returns buffered plain text logs.Accept: text/html is sent, /logs redirects to /ui/.GET /logs/stream keeps the connection open for live log streaming.?no-history to stream only new lines.GET /logs/stream/proxy streams proxy logs only.GET /logs/stream/upstream streams upstream process logs only.GET /logs/stream/{model_id} streams logs for one model (including IDs with slashes, like author/model)./health - just returns "OK"/metrics - system and GPU metrics for prometheusttlcmd and cmdStop togetherhooks (#235)stripParams, setParams and setParamsByIDllama-swap includes a real time web interface with a playground for testing out all sorts of local models:
View detailed token metrics:
Inspect request and responses:
Manually load and unload models:
Real time log streaming:
llama-swap can be installed in multiple ways
Two types of container images are built nightly for llama-swap:
$ docker pull ghcr.io/mostlygeek/llama-swap:unified-cuda
# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
-v /path/to/models:/models \
-v /path/to/custom/config.yaml:/etc/llama-swap/config/config.yaml \
ghcr.io/mostlygeek/llama-swap:unified-cuda
$ docker pull ghcr.io/mostlygeek/llama-swap:cuda
# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
-v /path/to/models:/models \
-v /path/to/custom/config.yaml:/app/config.yaml \
ghcr.io/mostlygeek/llama-swap:cuda
more examples
# pull latest images per platform
docker pull ghcr.io/mostlygeek/llama-swap:cpu
docker pull ghcr.io/mostlygeek/llama-swap:cuda
docker pull ghcr.io/mostlygeek/llama-swap:vulkan
docker pull ghcr.io/mostlygeek/llama-swap:intel
docker pull ghcr.io/mostlygeek/llama-swap:musa
# tagged llama-swap, platform and llama-server version images
docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795
# non-root cuda
docker pull ghcr.io/mostlygeek/llama-swap:cuda-non-root
brew tap mostlygeek/llama-swap
brew install llama-swap
llama-swap --config path/to/config.yaml --listen localhost:8080
[!NOTE] Maintained by MacPorts community - llama-swap port. It is not an official part of llama-swap.
sudo port install llama-swap
llama-swap --config path/to/config.yaml --listen localhost:8080
[!NOTE] WinGet is maintained by community contributor Dvd-Znf (#327). It is not an official part of llama-swap.
# install
C:\> winget install llama-swap
# upgrade
C:\> winget upgrade llama-swap
Binaries are available on the release page for Linux, Mac, Windows and FreeBSD.
git clone https://github.com/mostlygeek/llama-swap.gitmake clean allbuild/ subdirectory for the llama-swap binary# minimum viable config.yaml
models:
model1:
cmd: llama-server --port ${PORT} --model /path/to/model.gguf
That's all you need to get started:
models - holds all model configurationsmodel1 - the ID used in API callscmd - the command to run to start the server.${PORT} - an automatically assigned port numberAlmost all configuration settings are optional and can be added one step at a time:
matrix to run concurrent models with a custom swap logic DSLhooks to run things on startupmacros reusable snippetsttl to automatically unload modelsaliases to use familiar model names (e.g., "gpt-4o-mini")env to pass custom environment variables to inference serverscmdStop gracefully stop Docker/Podman containersuseModelName to override model names sent to upstream servers${PORT} automatic port variables for dynamic port assignmentfilters rewrite parts of requests before sending to the upstream serverSee the configuration documentation for all options.
When a request is made to an OpenAI compatible endpoint, llama-swap will extract the model value and load the appropriate server configuration to serve it. If the wrong upstream server is running, it will be replaced with the correct one. This is where the "swap" part comes in. The upstream server is automatically swapped to handle the request correctly.
In the most basic configuration llama-swap handles one model at a time. For more advanced use cases, using a matrix allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.
If you deploy llama-swap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses which breaks Server‑Sent Events (SSE) and streaming chat completion. (#236)
Recommended nginx configuration snippets:
# SSE for UI events/logs
location /api/events {
proxy_pass http://your-llama-swap-backend;
proxy_buffering off;
proxy_cache off;
}
# Streaming chat completions (stream=true)
location /v1/chat/completions {
proxy_pass http://your-llama-swap-backend;
proxy_buffering off;
proxy_cache off;
}
As a safeguard, llama-swap also sets X-Accel-Buffering: no on SSE responses. However, explicitly disabling proxy_buffering at your reverse proxy is still recommended for reliable streaming behavior.
# sends up to the last 10KB of logs
$ curl http://host/logs
# streams combined logs
curl -Ns http://host/logs/stream
# stream llama-swap's proxy status logs
curl -Ns http://host/logs/stream/proxy
# stream logs from upstream processes that llama-swap loads
curl -Ns http://host/logs/stream/upstream
# stream logs only from a specific model
curl -Ns http://host/logs/stream/{model_id}
# stream and filter logs with linux pipes
curl -Ns http://host/logs/stream | grep 'eval time'
# appending ?no-history will disable sending buffered history first
curl -Ns 'http://host/logs/stream?no-history'
Any OpenAI compatible server would work. llama-swap was originally designed for llama-server and it is the best supported.
For Python based inference servers like vllm or tabbyAPI it is recommended to run them via podman or docker. This provides clean environment isolation as well as responding correctly to SIGTERM signals for proper shutdown.
[!NOTE] Thank you to everyone who has given this project a ⭐️!
$ claude mcp add llama-swap \
-- python -m otcore.mcp_server <graph>