hub / github.com/evalplus/evalplus

github.com/evalplus/evalplus @v0.3.1 sqlite

repository ↗ · DeepWiki ↗ · release v0.3.1 ↗

357 symbols 1,732 edges 89 files 31 documented · 9%

README

`EvalPlus(📖) => 📚`

<a href="https://evalplus.github.io/leaderboard.html"><img src="https://img.shields.io/badge/%F0%9F%8F%86-leaderboard-8A2BE2"></a>
<a href="https://openreview.net/forum?id=1qvx610Cu7"><img src="https://img.shields.io/badge/EvalPlus-NeurIPS'23-a55fed.svg"></a>
<a href="https://openreview.net/forum?id=IBCBMeAhmC"><img src="https://img.shields.io/badge/EvalPerf-COLM'24-a55fed.svg"></a>
<a href="https://huggingface.co/evalplus/"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-evalplus-%23ff8811.svg"></a>
<a href="https://pypi.org/project/evalplus/"><img src="https://img.shields.io/pypi/v/evalplus?color=g"></a>
<a href="https://hub.docker.com/r/ganler/evalplus" title="Docker"><img src="https://img.shields.io/docker/image-size/ganler/evalplus"></a>







<a href="#-news">📰News</a> •
<a href="#-quick-start">🔥Quick Start</a> •
<a href="#-llm-backends">🚀LLM Backends</a> •
<a href="#-documents">📚Documents</a> •
<a href="#-citation">📜Citation</a> •
<a href="#-acknowledgement">🙏Acknowledgement</a>

About

EvalPlus is a rigorous evaluation framework for LLM4Code, with:

✨ HumanEval+: 80x more tests than the original HumanEval!
✨ MBPP+: 35x more tests than the original MBPP!
✨ EvalPerf: evaluating the efficiency of LLM-generated code!
✨ Framework: our packages/images/tools can easily and safely evaluate LLMs on above benchmarks.

Why EvalPlus?

✨ Precise evaluation & ranking: See our leaderboard for latest LLM rankings before & after rigorous evaluation.
✨ Coding rigorousness: Look at the score differences! esp. before and after using EvalPlus tests! Less drop is better as it means more rigorousness and less laxity in code generation; while a big drop means the generated code tends to be fragile.
✨ Code efficiency: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.

Want to know more details? Read our papers & materials!

EvalPlus: NeurIPS'23 paper, Google Slides, Poster
EvalPerf: COLM'24 paper, Poster, Documentation

📰 News

Below tracks the notable updates of EvalPlus:

[2024-10-20 v0.3.1]: EvalPlus v0.3.1 is officially released! Release highlights includes (i) Code efficiency evaluation via EvalPerf, (ii) one command to run the whole pipline (generation + post-processing + evaluation), (iii) support for more inference backends such as Google Gemini & Anthropic, etc.
[2024-06-09 pre v0.3.0]: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to EvalArena.
[2024-04-17 pre v0.3.0]: MBPP+ is upgraded to v0.2.0 by removing some broken tasks (399 -> 378 tasks). ~4pp pass@1 improvement could be expected.
Earlier:
(v0.2.1) You can use EvalPlus datasets via bigcode-evaluation-harness! HumanEval+ oracle fixes (32).
(v0.2.0) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
(v0.1.7) Leaderboard release; HumanEval+ contract and input fixes (32/166/126/6)
(v0.1.6) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140)
(v0.1.5) HumanEval+ mini is released for ultra-fast evaluation when you have too many samples!
(v0.1.1) Optimizing user experiences: evaluation speed, PyPI package, Docker, etc.
(v0.1.0) HumanEval+ is released!

🔥 Quick Start

Code correctness evaluation: HumanEval(+) or MBPP(+)

pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend vllm                         \
                  --greedy

Code execution within Docker :: click to expand ::

# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset humaneval                    \
                 --backend vllm                         \
                 --greedy

# Code execution within Docker
docker run --rm ganler/evalplus:latest -v $(pwd)/evalplus_results:/app \
           evalplus.evaluate --dataset humaneval                       \
           --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl

Code efficiency evaluation: EvalPerf (*nix only)

pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --backend vllm

Code execution within Docker :: click to expand ::

# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset evalperf                     \
                 --backend vllm                         \
                 --temperture 1.0                       \
                 --n-samples 100

# Code execution within Docker
docker run --cap-add PERFMON --rm ganler/evalplus:latest -v $(pwd)/evalplus_results:/app \
           evalplus.evalperf --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl

🚀 LLM Backends

HuggingFace models

transformers backend:

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend hf                           \
                  --greedy

[!Note]

EvalPlus uses different prompts for base and chat models. By default it is detected by tokenizer.chat_template when using hf/vllm as backend. For other backends, only chat mode is allowed.

Therefore, if your base models come with a tokenizer.chat_template, please add --force-base-prompt to avoid being evaluated in a chat mode.

Enable Flash Attention 2 :: click to expand ::

# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problem, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B"         \
                  --dataset [humaneval|mbpp]                     \
                  --backend hf                                   \
                  --attn-implementation [flash_attention_2|sdpa] \
                  --greedy

vllm backend:

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend vllm                         \
                  --tp [TENSOR_PARALLEL_SIZE]            \
                  --greedy

openai compatible servers (e.g., vLLM):

# Launch a model server first: e.g., https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend openai                       \
                  --base-url http://localhost:8000/v1    \
                  --greedy

OpenAI models

Access OpenAI APIs from OpenAI Console

export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o"            \
                  --dataset [humaneval|mbpp]  \
                  --backend openai            \
                  --greedy

Anthropic models

Access Anthropic APIs from Anthropic Console

export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
                  --dataset [humaneval|mbpp]        \
                  --backend anthropic               \
                  --greedy

Google Gemini models

Access Gemini APIs from Google AI Studio

export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro"    \
                  --dataset [humaneval|mbpp]  \
                  --backend google            \
                  --greedy

You can checkout the generation and results at evalplus_results/[humaneval|mbpp]/

⏬ Using EvalPlus as a local repo? :: click to expand ::

git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt

📚 Documents

To learn more about how to use EvalPlus, please refer to:

📜 Citation

@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}

@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}

🙏 Acknowledgement

Core symbols most depended-on inside this repo

check_id

called by 39

tools/mbpp/fix_v010.py

get_human_eval_plus

called by 28

evalplus/data/humaneval.py

readlines

called by 22

evalplus/eval/utils.py

get_mbpp_plus

called by 17

evalplus/data/mbpp.py

evalplus/eval/utils.py

evalplus/eval/utils.py

Shape

Function 246

Method 81

Class 22

Route 8

Languages

Python100%

Modules by API surface

tools/_experimental/type_mut_for_eff.py29 symbols

evalplus/gen/type_mut.py19 symbols

tools/collect_valid_solutions.py16 symbols

evalplus/eval/utils.py13 symbols

tools/tsr/minimization.py11 symbols

evalplus/sanitize.py11 symbols

evalplus/perf/select_pe_tasks.py10 symbols

evalplus/evalperf.py10 symbols

evalplus/perf/profile.py9 symbols

tools/_experimental/evaluate_coverage.py8 symbols

evalplus/eval/__init__.py8 symbols

evalplus/data/utils.py8 symbols

Dependencies from manifests, versioned

mutmut2.1.0 · 1×

tree_sitter0.22.0 · 1×

For agents

$ claude mcp add evalplus \
  -- python -m otcore.mcp_server <graph>

⬇ download graph artifact