EvalPlus(📖) => 📚
<a href="https://evalplus.github.io/leaderboard.html"><img src="https://img.shields.io/badge/%F0%9F%8F%86-leaderboard-8A2BE2"></a>
<a href="https://openreview.net/forum?id=1qvx610Cu7"><img src="https://img.shields.io/badge/EvalPlus-NeurIPS'23-a55fed.svg"></a>
<a href="https://openreview.net/forum?id=IBCBMeAhmC"><img src="https://img.shields.io/badge/EvalPerf-COLM'24-a55fed.svg"></a>
<a href="https://huggingface.co/evalplus/"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-evalplus-%23ff8811.svg"></a>
<a href="https://pypi.org/project/evalplus/"><img src="https://img.shields.io/pypi/v/evalplus?color=g"></a>
<a href="https://hub.docker.com/r/ganler/evalplus" title="Docker"><img src="https://img.shields.io/docker/image-size/ganler/evalplus"></a>
<a href="#-news">📰News</a> •
<a href="#-quick-start">🔥Quick Start</a> •
<a href="#-llm-backends">🚀LLM Backends</a> •
<a href="#-documents">📚Documents</a> •
<a href="#-citation">📜Citation</a> •
<a href="#-acknowledgement">🙏Acknowledgement</a>
EvalPlus is a rigorous evaluation framework for LLM4Code, with:
Why EvalPlus?
Want to know more details? Read our papers & materials!
Notable updates of EvalPlus are tracked below:
- `v0.3.1`: EvalPlus v0.3.1 is officially released! Release highlights include (i) code efficiency evaluation via EvalPerf, (ii) one command to run the whole pipeline (generation + post-processing + evaluation), and (iii) support for more inference backends such as Google Gemini and Anthropic.
- `v0.3.0`: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to EvalArena.
- `v0.3.0`: MBPP+ is upgraded to v0.2.0 by removing some broken tasks (399 -> 378 tasks). An improvement of about 4pp in pass@1 can be expected.
- `v0.2.1`: You can use EvalPlus datasets via bigcode-evaluation-harness! HumanEval+ oracle fixes (32).
- `v0.2.0`: MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
- `v0.1.7`: Leaderboard release; HumanEval+ contract and input fixes (32/166/126/6).
- `v0.1.6`: Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140).
- `v0.1.5`: HumanEval+ mini is released for ultra-fast evaluation when you have too many samples!
- `v0.1.1`: Optimizing user experiences: evaluation speed, PyPI package, Docker, etc.
- `v0.1.0`: HumanEval+ is released!

pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--greedy
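
The `[humaneval|mbpp]` placeholder above stands for one dataset per invocation. A minimal sketch for running both datasets in sequence, using only the flags shown above; `build_cmd` and `DRY_RUN` are hypothetical names introduced for illustration and are not part of EvalPlus:

```shell
# Hypothetical helper: print the evalplus.evaluate command for one dataset.
# $1 = model ID, $2 = dataset name
build_cmd() {
  printf 'evalplus.evaluate --model "%s" --dataset %s --backend vllm --greedy' "$1" "$2"
}

MODEL="ise-uiuc/Magicoder-S-DS-6.7B"
for dataset in humaneval mbpp; do
  cmd="$(build_cmd "$MODEL" "$dataset")"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$cmd"    # default: only show what would run
  else
    eval "$cmd"    # set DRY_RUN=0 to actually launch the evaluation
  fi
done
```

With the default `DRY_RUN=1` the loop only prints the two commands, which makes it easy to review them before committing GPU time.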
Code execution within Docker:
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset humaneval \
--backend vllm \
--greedy
# Code execution within Docker
docker run --rm -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evaluate --dataset humaneval \
--samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
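
The file name passed to `--samples` appears to follow a pattern: the model ID with `/` replaced by `--`, followed by the backend and the temperature. A minimal sketch, assuming that pattern holds for other models (it is inferred from the example above, not taken from EvalPlus documentation); `samples_path` is a hypothetical helper:

```shell
# Hypothetical helper: derive the samples file path relative to the results directory.
# Assumes the "<dataset>/<org>--<name>_<backend>_temp_<T>.jsonl" layout seen above.
# $1 = model ID, $2 = dataset, $3 = backend, $4 = temperature
samples_path() {
  printf '%s/%s_%s_temp_%s.jsonl' "$2" "$(printf '%s' "$1" | sed 's|/|--|g')" "$3" "$4"
}

SAMPLES="$(samples_path "ise-uiuc/Magicoder-S-DS-6.7B" humaneval vllm 0.0)"
echo "host path:      evalplus_results/$SAMPLES"
echo "container path: /app/$SAMPLES"
```

Listing `evalplus_results/` after generation is the reliable way to confirm the actual file name before passing it to the container.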
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--backend vllm
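
Before launching EvalPerf it can help to confirm that the `perf_event_paranoid` setting took effect. A minimal sketch, assuming a value of 0 or lower is sufficient because 0 is what the `sudo` command above writes; `perf_enabled` is a hypothetical helper, and the optional file argument exists only so the check can be tried against a test file:

```shell
# Hypothetical helper: succeed when the paranoid level in the given file is <= 0.
# $1 = file to read (defaults to the kernel setting used above)
perf_enabled() {
  file="${1:-/proc/sys/kernel/perf_event_paranoid}"
  [ -r "$file" ] && [ "$(cat "$file")" -le 0 ] 2>/dev/null
}

if perf_enabled; then
  echo "perf_event_paranoid is 0 or lower"
else
  echo "perf_event_paranoid is above 0 or unreadable; run the sudo command first"
fi
```

The setting written through `/proc` does not persist across reboots, so the check is worth repeating on a freshly started machine.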
Code execution within Docker:
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset evalperf \
--backend vllm \
--temperature 1.0 \
--n-samples 100
# Code execution within Docker
docker run --cap-add PERFMON --rm -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl
transformers backend:
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--greedy
> [!Note]
>
> EvalPlus uses different prompts for base and chat models. By default, the model type is detected via `tokenizer.chat_template` when using `hf`/`vllm` as the backend. For other backends, only chat mode is allowed.
>
> Therefore, if your base model comes with a `tokenizer.chat_template`, please add `--force-base-prompt` to avoid being evaluated in chat mode.
Enable Flash Attention 2:
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problem, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases
# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--attn-implementation [flash_attention_2|sdpa] \
--greedy
vllm backend:
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--tp [TENSOR_PARALLEL_SIZE] \
--greedy
openai compatible servers (e.g., vLLM):
# Launch a model server first: e.g., https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend openai \
--base-url http://localhost:8000/v1 \
--greedy
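
Since this backend depends on an already running server, a quick reachability check can save a failed run. A minimal sketch, assuming the server exposes the standard OpenAI-style `GET /v1/models` endpoint; `models_url` is a hypothetical helper and `BASE_URL` simply mirrors the `--base-url` value above:

```shell
# Hypothetical helper: build the model-listing URL from a base URL.
models_url() {
  printf '%s/models' "${1%/}"    # drop one trailing slash, then append /models
}

BASE_URL="http://localhost:8000/v1"
# --fail makes curl return non-zero on HTTP errors; --max-time bounds the wait.
if curl --silent --fail --max-time 5 "$(models_url "$BASE_URL")" >/dev/null 2>&1; then
  echo "server reachable at $BASE_URL"
else
  echo "no server at $BASE_URL; launch one before running evalplus.evaluate"
fi
```

The response of that endpoint also lists the served model names, which should match the value given to `--model`.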
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
--dataset [humaneval|mbpp] \
--backend openai \
--greedy
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
--dataset [humaneval|mbpp] \
--backend anthropic \
--greedy
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
--dataset [humaneval|mbpp] \
--backend google \
--greedy
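
Each hosted backend above reads its API key from an environment variable. A minimal sketch of a pre-flight check for those variables; `require_env` is a hypothetical helper, not an EvalPlus command:

```shell
# Hypothetical helper: fail with a message when the named variable is unset or empty.
# $1 = variable name (only pass trusted names, since eval is used for the lookup)
require_env() {
  eval "value=\${$1:-}"
  if [ -z "$value" ]; then
    echo "missing $1" >&2
    return 1
  fi
}

for var in OPENAI_API_KEY ANTHROPIC_API_KEY GOOGLE_API_KEY; do
  if require_env "$var" 2>/dev/null; then
    echo "$var is set"
  else
    echo "$var is not set"
  fi
done
```

Only the variable for the backend actually in use needs to be set; the loop just reports the state of all three.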
You can check out the generated samples and results in `evalplus_results/[humaneval|mbpp]/`.
⏬ Using EvalPlus as a local repo?
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
To learn more about how to use EvalPlus, please refer to:
@inproceedings{evalplus,
title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
year = {2023},
url = {https://openreview.net/forum?id=1qvx610Cu7},
}
@inproceedings{evalperf,
title = {Evaluating Language Models for Efficient Code Generation},
author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
booktitle = {First Conference on Language Modeling},
year = {2024},
url = {https://openreview.net/forum?id=IBCBMeAhmC},
}