🤗 Hugging Face • 🤖 ModelScope • 💬 WeChat• 🧩 Modelers
[2023.12.29] 🎉🎉🎉 We have released Baichuan2-13B-Chat v2 version. In this version: - Significantly improved the model's overall capabilities, especially in mathematics and logical reasoning, and complex instruction following.
The specific released versions and download links are shown in the table below:
| Base Models | Aligned Models | Aligned Models 4bits Quantized | |
|---|---|---|---|
| 7B | 🤗 Baichuan2-7B-Base | 🤗 Baichuan2-7B-Chat | 🤗 Baichuan2-7B-Chat-4bits |
| 13B | 🤗 Baichuan2-13B-Base | 🤗 Baichuan2-13B-Chat | 🤗 Baichuan2-13B-Chat-4bits |
We conducted extensive testing on authoritative Chinese, English and multi-language datasets across six domains: general, legal, medical, mathematics, code, and multi-language translation.
In the general domain, we conducted 5-shot tests on the following datasets: - C-Eval is a comprehensive Chinese basic model evaluation dataset, covering 52 disciplines and four levels of difficulty. We used the dev set of this dataset as the source for few-shot learning and tested on the test set. Our evaluation approach followed that of Baichuan-7B. - MMLU is an English evaluation dataset comprising 57 tasks, encompassing elementary math, American history, computer science, law, etc. The difficulty ranges from high school level to expert level. It's a mainstream LLM evaluation dataset. We used its open-source evaluation approach. - CMMLU is a comprehensive Chinese evaluation benchmark covering 67 topics, specifically designed to assess language models' knowledge and reasoning capabilities in a Chinese context. We adopted its official evaluation approach. - Gaokao is a dataset utilizing China's college entrance examination questions to evaluate large language models' abilities, focusing on linguistic proficiency and logical reasoning. We retained only its single-choice questions and conducted random partitioning. Our evaluation method is similar to that of C-Eval. - AGIEval aims to evaluate a model's general abilities in cognition and problem-solving related tasks. We retained only its four-option single-choice questions and did random partitioning. We used an evaluation scheme similar to C-Eval. - BBH is a challenging task subset of Big-Bench. Big-Bench currently includes 204 tasks. Task themes involve linguistics, child development, mathematics, common sense reasoning, biology, physics, societal biases, software development, etc. BBH consists of benchmark tasks extracted from the 204 Big-Bench tasks in which large models did not perform well.
| C-Eval | MMLU | CMMLU | Gaokao | AGIEval | BBH | |
|---|---|---|---|---|---|---|
| 5-shot | 5-shot | 5-shot | 5-shot | 5-shot | 3-shot | |
| GPT-4 | 68.40 | 83.93 | 70.33 | 66.15 | 63.27 | 75.12 |
| GPT-3.5 Turbo | 51.10 | 68.54 | 54.06 | 47.07 | 46.13 | 61.59 |
| LLaMA-7B | 27.10 | 35.10 | 26.75 | 27.81 | 28.17 | 32.38 |
| LLaMA2-7B | 28.90 | 45.73 | 31.38 | 25.97 | 26.53 | 39.16 |
| MPT-7B | 27.15 | 27.93 | 26.00 | 26.54 | 24.83 | 35.20 |
| Falcon-7B | 24.23 | 26.03 | 25.66 | 24.24 | 24.10 | 28.77 |
| ChatGLM2-6B | 50.20 | 45.90 | 49.00 | 49.44 | 45.28 | 31.65 |
| Baichuan-7B | 42.80 | 42.30 | 44.02 | 36.34 | 34.44 | 32.48 |
| Baichuan2-7B-Base | 54.00 | 54.16 | 57.07 | 47.47 | 42.73 | 41.56 |
| C-Eval | MMLU | CMMLU | Gaokao | AGIEval | BBH | |
|---|---|---|---|---|---|---|
| 5-shot | 5-shot | 5-shot | 5-shot | 5-shot | 3-shot | |
| GPT-4 | 68.40 | 83.93 | 70.33 | 66.15 | 63.27 | 75.12 |
| GPT-3.5 Turbo | 51.10 | 68.54 | 54.06 | 47.07 | 46.13 | 61.59 |
| LLaMA-13B | 28.50 | 46.30 | 31.15 | 28.23 | 28.22 | 37.89 |
| LLaMA2-13B | 35.80 | 55.09 | 37.99 | 30.83 | 32.29 | 46.98 |
| Vicuna-13B | 32.80 | 52.00 | 36.28 | 30.11 | 31.55 | 43.04 |
| Chinese-Alpaca-Plus-13B | 38.80 | 43.90 | 33.43 | 34.78 | 35.46 | 28.94 |
| XVERSE-13B | 53.70 | 55.21 | 58.44 | 44.69 | 42.54 | 38.06 |
| Baichuan-13B-Base | 52.40 | 51.60 | 55.30 | 49.69 | 43.20 | 43.01 |
| Baichuan2-13B-Base | 58.10 | 59.17 | 61.97 | 54.33 | 48.17 | 48.78 |
In the legal domain, we used the JEC-QA dataset. The JEC-QA dataset originates from China's National Judicial Examination. We retained only the multiple-choice questions from it. Our evaluation method was similar to that of C-Eval.
In the medical domain, we used medical-related subjects from general domain datasets (C-Eval, MMLU, CMMLU), as well as MedQA and MedMCQA. We followed an evaluation scheme similar to C-Eval. - For testing convenience, we used the val set from C-Eval for testing. - The MedQA dataset comes from medical exams in the US and China. We tested the USMLE and MCMLE subsets from the MedQA dataset, and used a version with five candidates. - The MedMCQA dataset originates from entrance exams of medical colleges in India. We retained only the multiple-choice questions. Since the test set doesn't have answers, we used the dev set for testing. - Medical-related subjects included in the general domain datasets are as follows: - C-Eval: clinical_medicine, basic_medicine - MMLU: clinical_knowledge, anatomy, college_medicine, college_biology, nutrition, virology, medical_genetics, professional_medicine - CMMLU: anatomy, clinical_knowledge, college_medicine, genetics, nutrition, traditional_chinese_medicine, virology
We conducted 5-shot tests on the above datasets.
| JEC-QA | CEval-MMLU-CMMLU | MedQA-USMLE | MedQA-MCMLE | MedMCQA | |
|---|---|---|---|---|---|
| 5-shot | 5-shot | 5-shot | 5-shot | 5-shot | |
| GPT-4 | 59.32 | 77.16 | 80.28 | 74.58 | 72.51 |
| GPT-3.5 Turbo | 42.31 | 61.17 | 53.81 | 52.92 | 56.25 |
| LLaMA-7B | 27.45 | 33.34 | 24.12 | 21.72 | 27.45 |
| LLaMA2-7B | 29.20 | 36.75 | 27.49 | 24.78 | 37.93 |
| MPT-7B | 27.45 | 26.67 | 16.97 | 19.79 | 31.96 |
| Falcon-7B | 23.66 | 25.33 | 21.29 | 18.07 | 33.88 |
| ChatGLM2-6B | 40.76 | 44.54 | 26.24 | 45.53 | 30.22 |
| Baichuan-7B | 34.64 | 42.37 | 27.42 | 39.46 | 31.39 |
| Baichuan2-7B-Base | 44.46 | 56.39 | 32.68 | 54.93 | 41.73 |
| JEC-QA | CEval-MMLU-CMMLU | MedQA-USMLE | MedQA-MCMLE | MedMCQA | |
|---|---|---|---|---|---|
| 5-shot | 5-shot | 5-shot | 5-shot | 5-shot | |
| GPT-4 | 59.32 | 77.16 | 80.28 | 74.58 | 72.51 |
| GPT-3.5 Turbo | 42.31 | 61.17 | 53.81 | 52.92 | 56.25 |
| LLaMA-13B | 27.54 | 35.14 | 28.83 | 23.38 | 39.52 |
| LLaMA2-13B | 34.08 | 47.42 | 35.04 | 29.74 | 42.12 |
| Vicuna-13B | 28.38 | 40.99 | 34.80 | 27.67 | 40.66 |
| Chinese-Alpaca-Plus-13B | 35.32 | 46.31 | 27.49 | 32.66 | 35.87 |
| XVERSE-13B | 46.42 | 58.08 | 32.99 | 58.76 | 41.34 |
| Baichuan-13B-Base | 41.34 | 51.77 | 29.07 | 43.67 | 39.60 |
| Baichuan2-13B-Base | 47.40 | 59.33 | 40.38 | 61.62 | 42.86 |
In the mathematics domain, we used the OpenCompass evaluation framework and conducted 4-shot tests on the GSM8K and MATH datasets.
For the code domain, we used the HumanEval and MBPP datasets. Using OpenCompass, we performed a 0-shot test on HumanEval and a 3-shot test on the MBPP dataset. - Tasks in HumanEval include programming tasks encompassing language understanding, reasoning, algorithms, and basic math to evaluate the functional correctness of models and measure their problem-solving capability. - MBPP consists of a dataset with 974 Python short functions, textual descriptions of programs, and test cases to check their functional correctness.
| GSM8K | MATH | HumanEval | MBPP | |
|---|---|---|---|---|
$ claude mcp add Baichuan2 \
-- python -m otcore.mcp_server <graph>