
🚀 2-10x accuracy improvements on reasoning tasks with zero training
🤗 HuggingFace Space • 📓 Colab Demo • 💬 Discussions
OptiLLM is an OpenAI API-compatible optimizing inference proxy that implements 20+ state-of-the-art techniques to dramatically improve LLM accuracy and performance on reasoning tasks - without requiring any model training or fine-tuning.
These techniques spend additional compute at inference time, which makes it possible to beat frontier models across diverse tasks. The CePO approach from Cerebras is a good example of how to combine them.
Get powerful reasoning improvements in 3 simple steps:
# 1. Install OptiLLM
pip install optillm
# 2. Start the server
export OPENAI_API_KEY="your-key-here"
optillm
# 3. Use with any OpenAI client - just change the model name!
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1")
# Add 'moa-' prefix for Mixture of Agents optimization
response = client.chat.completions.create(
    model="moa-gpt-4o-mini",  # MOA on GPT-4o-mini; reported to match GPT-4 on Arena-Hard-Auto
    messages=[{"role": "user", "content": "Solve: If 2x + 3 = 7, what is x?"}],
)
Before OptiLLM: "x = 1" ❌
After OptiLLM: "Let me work through this step by step: 2x + 3 = 7, so 2x = 4, therefore x = 2" ✅
OptiLLM delivers measurable improvements across diverse benchmarks:
| Technique | Base Model | Improvement | Benchmark |
|---|---|---|---|
| MARS | Gemini 2.5 Flash Lite | +30.0 points | AIME 2025 (43.3→73.3) |
| CePO | Llama 3.3 70B | +18.6 points | Math-L5 (51.0→69.6) |
| AutoThink | DeepSeek-R1-1.5B | +9.34 points | GPQA-Diamond (21.72→31.06) |
| LongCePO | Llama 3.3 70B | +13.6 points | InfiniteBench (58.0→71.6) |
| MOA | GPT-4o-mini | Matches GPT-4 | Arena-Hard-Auto |
| PlanSearch | GPT-4o-mini | +20% pass@5 | LiveCodeBench |
Full benchmark results below ⬇️
pip install optillm
optillm
2024-10-22 07:45:05,612 - INFO - Loaded plugin: privacy
2024-10-22 07:45:06,293 - INFO - Loaded plugin: memory
2024-10-22 07:45:06,293 - INFO - Starting server with approach: auto
docker pull ghcr.io/algorithmicsuperintelligence/optillm:latest
docker run -p 8000:8000 ghcr.io/algorithmicsuperintelligence/optillm:latest
2024-10-22 07:45:05,612 - INFO - Loaded plugin: privacy
2024-10-22 07:45:06,293 - INFO - Loaded plugin: memory
2024-10-22 07:45:06,293 - INFO - Starting server with approach: auto
Available Docker image variants:
- latest: Includes all dependencies for local inference and plugins
- latest-proxy: Lightweight image without local inference capabilities
- latest-offline: Self-contained image with pre-downloaded models (spaCy) for fully offline operation

# Proxy-only (smallest)
docker pull ghcr.io/algorithmicsuperintelligence/optillm:latest-proxy
# Offline (largest, includes pre-downloaded models)
docker pull ghcr.io/algorithmicsuperintelligence/optillm:latest-offline
Clone the repository with git and use pip install to set up the dependencies.
git clone https://github.com/algorithmicsuperintelligence/optillm.git
cd optillm
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
OptiLLM supports SSL certificate verification configuration for working with self-signed certificates or corporate proxies.
Disable SSL verification (development only):
# Command line
optillm --no-ssl-verify
# Environment variable
export OPTILLM_SSL_VERIFY=false
optillm
Use custom CA certificate:
# Command line
optillm --ssl-cert-path /path/to/ca-bundle.crt
# Environment variable
export OPTILLM_SSL_CERT_PATH=/path/to/ca-bundle.crt
optillm
⚠️ Security Note: Disabling SSL verification is insecure and should only be used in development. For production environments with custom CAs, use --ssl-cert-path instead. See SSL_CONFIGURATION.md for details.
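
To make the two options concrete, the sketch below shows one way a client could turn these settings into the `verify` argument that the `requests` library accepts (a boolean or a path to a CA bundle). The helper name `resolve_ssl_verify`, the accepted "false" spellings, and the precedence (the disable switch wins over a custom CA path) are assumptions for illustration, not OptiLLM's actual implementation. Only the two environment variable names come from the text above.

```python
import os
from typing import Mapping, Optional, Union


def resolve_ssl_verify(env: Optional[Mapping[str, str]] = None) -> Union[bool, str]:
    """Return a value usable as the `verify` argument of requests.

    Hypothetical helper: the precedence below is an assumption.
    """
    env = os.environ if env is None else env

    # Assumed convention: "false", "0" or "no" (any case) disables verification.
    if env.get("OPTILLM_SSL_VERIFY", "true").strip().lower() in {"false", "0", "no"}:
        return False

    # Otherwise prefer a custom CA bundle when one is configured.
    cert_path = env.get("OPTILLM_SSL_CERT_PATH", "").strip()
    if cert_path:
        return cert_path

    # Default: verify against the system trust store.
    return True
```

Passing the environment as a plain mapping keeps the function easy to test without touching the real process environment.
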
| Approach | Slug | Description |
|---|---|---|
| MARS (Multi-Agent Reasoning System) | mars | Multi-agent reasoning with diverse temperature exploration, cross-verification, and iterative improvement |
| Cerebras Planning and Optimization | cepo | Combines Best of N, Chain-of-Thought, Self-Reflection, Self-Improvement, and various prompting techniques |
| CoT with Reflection | cot_reflection | Implements chain-of-thought reasoning with \<thinking>, \<reflection> and \<output> sections |
| PlanSearch | plansearch | Implements a search algorithm over candidate plans for solving a problem in natural language |
| ReRead | re2 | Implements rereading to improve reasoning by processing queries twice |
| Self-Consistency | self_consistency | Implements an advanced self-consistency method |
| Z3 Solver | z3 | Utilizes the Z3 theorem prover for logical reasoning |
| R* Algorithm | rstar | Implements the R* algorithm for problem-solving |
| LEAP | leap | Learns task-specific principles from few shot examples |
| Round Trip Optimization | rto | Optimizes responses through a round-trip process |
| Best of N Sampling | bon | Generates multiple responses and selects the best one |
| Mixture of Agents | moa | Combines responses from multiple critiques |
| Monte Carlo Tree Search | mcts | Uses MCTS for decision-making in chat responses |
| PV Game | pvg | Applies a prover-verifier game approach at inference time |
| Deep Confidence | N/A for proxy | Implements confidence-guided reasoning with multiple intensity levels for enhanced accuracy |
| CoT Decoding | N/A for proxy | Implements chain-of-thought decoding to elicit reasoning without explicit prompting |
| Entropy Decoding | N/A for proxy | Implements adaptive sampling based on the uncertainty of tokens during generation |
| Thinkdeeper | N/A for proxy | Implements the reasoning_effort param from OpenAI for reasoning models like DeepSeek R1 |
| AutoThink | N/A for proxy | Combines query complexity classification with steering vectors to enhance reasoning |
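
The slug is what goes in front of the model name in a request, as in `moa-gpt-4o-mini` from the quick start. The helper below is a small sketch of that naming convention. `with_approach` and `PROXY_SLUGS` are hypothetical names that are not part of OptiLLM, and the slug set is simply copied from the table above, so it is an assumption about what any given installed version accepts.

```python
# Slugs copied from the table above; rows marked "N/A for proxy" are left out.
PROXY_SLUGS = {
    "mars", "cepo", "cot_reflection", "plansearch", "re2",
    "self_consistency", "z3", "rstar", "leap", "rto",
    "bon", "moa", "mcts", "pvg",
}


def with_approach(slug: str, model: str) -> str:
    """Build a model name such as 'moa-gpt-4o-mini' (hypothetical helper)."""
    if slug not in PROXY_SLUGS:
        raise ValueError(f"unknown approach slug: {slug!r}")
    return f"{slug}-{model}"


model_name = with_approach("moa", "gpt-4o-mini")
```

The resulting string is what you would pass as `model` to `client.chat.completions.create`.
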
| Plugin | Slug | Description |
|---|---|---|
| System Prompt Learning | spl | Implements what Andrej Karpathy called the third paradigm for LLM learning, which enables the model to acquire problem-solving knowledge and strategies |
| Deep Think | deepthink | Implements a Gemini-like Deep Think approach using inference time scaling for reasoning LLMs |
| Long-Context Cerebras Planning and Optimization | longcepo | Combines planning and divide-and-conquer processing of long documents to enable infinite context |
| Majority Voting | majority_voting | Generates k candidate solutions and selects the most frequent answer through majority voting (default k=6) |
| MCP Client | mcp | Implements the model context protocol (MCP) client, enabling you to use any LLM with any MCP Server |
| Router | router | Uses the optillm-modernbert-large model to route requests to different approaches based on the user prompt |
| Chain-of-Code | coc | Implements a chain of code approach that combines CoT with code execution and LLM based code simulation |
| Memory | memory | Implements a short term memory layer, enables you to use unbounded context length with any LLM |
| Privacy | privacy | Anonymize PII data in request and deanonymize it back to original value in response |
| Read URLs | readurls | Reads all URLs found in the request, fetches the content at the URL and adds it to the context |
| Execute Code | executecode | Enables use of code interpreter to execute python code in requests and LLM generated responses |
| JSON | json | Enables structured outputs using the outlines library, supports pydantic types and JSON schema |
| GenSelect | genselect | Generative Solution Selection - generates multiple candidates and selects the best based on quality criteria |
| Web Search | web_search | Performs Google searches using Chrome automation (Selenium) to gather search results and URLs |
| Deep Research | deep_research | Implements Test-Time Diffusion Deep Researcher (TTD-DR) for comprehensive research reports using iterative refinement |
| Proxy | proxy | Load balancing and failover across multiple LLM providers with health monitoring and round-robin routing |
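
As an illustration of the Majority Voting row, here is a minimal sketch of the selection step only. It assumes the k candidate answers have already been generated and extracted as strings. `majority_vote` is a hypothetical name, and this is not the plugin's actual code.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer among the candidates."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    # Strip surrounding whitespace so " x = 2 " and "x = 2" count as one vote.
    counts = Counter(answer.strip() for answer in answers)
    return counts.most_common(1)[0][0]


# Six candidates, matching the default k=6 mentioned in the table.
candidates = ["x = 2", "x = 2", "x = 1", "x = 2 ", "x = 4", "x = 1"]
winner = majority_vote(candidates)
```

A real implementation would also need to normalize answers more carefully (for example, treating "2" and "x = 2" as equal), which this sketch does not attempt.
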
We support all major LLM providers and models for inference. You need to set the correct environment variable and the proxy will pick the corresponding client.
| Provider | Required Environment Variables | Additional Notes |
|---|---|---|
| OptiLLM | OPTILLM_API_KEY | Uses the inbuilt local server for inference, supports logprobs and decoding techniques like cot_decoding & entropy_decoding |
| OpenAI | OPENAI_API_KEY | You can use this with any OpenAI compatible endpoint (e.g. OpenRouter) by setting the base_url |
| Cerebras | CEREBRAS_API_KEY | You can use this for fast inference with supported models, see docs for details |
| Azure OpenAI | AZURE_OPENAI_API_KEY, AZURE_API_VERSION, AZURE_API_BASE | - |
| Azure OpenAI (Managed Identity) | AZURE_API_VERSION, AZURE_API_BASE | Login required using az login, see docs for details |
| LiteLLM | depends on the model | See docs for details |
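
The idea of picking a client from the environment can be sketched as follows. This is a hedged illustration, not the proxy's real selection logic: `detect_provider` is a hypothetical function, the order in which providers are checked is an assumption, and the Managed Identity and LiteLLM rows are left out because they do not map to a single required key.

```python
import os
from typing import Mapping, Optional

# (provider, required variables) pairs taken from the table above.
# The checking order is an assumption made for this example.
PROVIDER_ENV_VARS = [
    ("OptiLLM", ("OPTILLM_API_KEY",)),
    ("Cerebras", ("CEREBRAS_API_KEY",)),
    ("OpenAI", ("OPENAI_API_KEY",)),
    ("Azure OpenAI", ("AZURE_OPENAI_API_KEY", "AZURE_API_VERSION", "AZURE_API_BASE")),
]


def detect_provider(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first provider whose required variables are all set and non-empty."""
    env = os.environ if env is None else env
    for name, required in PROVIDER_ENV_VARS:
        if all(env.get(var) for var in required):
            return name
    return None
```

For example, with only `OPENAI_API_KEY` exported, `detect_provider()` would return `"OpenAI"` under these assumptions.
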
You can then run the optillm proxy as follows.
```bash
python optillm.py
2024-09-06 07:57:14,191 - INFO - Starting server with approach: auto
2024-09-06 07:57:14,191 - INFO - Server configuration: {'approach': 'auto', 'mcts_simulations': 2, 'mcts_exploration': 0.2, 'mcts_depth': 1, 'best_of_n': 3, 'model': 'gpt-4o-mini', 'rstar_max_depth': 3, 'rstar_num_rollouts': 5, 'rstar_c': 1.4, 'base_url': '', 'host': '127.0.0.1'}
 * Serving Flask app 'optillm'
$ claude mcp add optillm \
-- python -m otcore.mcp_server <graph>