
github.com/baichuan-inc/Baichuan2 @main sqlite

20 symbols 78 edges 4 files 1 documented · 5%
README

Baichuan 2

🤗 Hugging Face • 🤖 ModelScope • 💬 WeChat • 🧩 Modelers


Update

[2023.12.29] 🎉🎉🎉 We have released Baichuan2-13B-Chat v2. This version significantly improves the model's overall capabilities, especially in mathematics, logical reasoning, and complex instruction following.

Models Introduction

  • Baichuan 2 is the new generation of open-source large language models launched by Baichuan Intelligent Technology. It was trained on a high-quality corpus with 2.6 trillion tokens.
  • Baichuan 2 achieved the best performance among models of the same size on multiple authoritative Chinese, English, and multi-language general and domain-specific benchmarks.
  • This release includes Base and Chat versions at 7B and 13B, plus 4-bit quantized versions of the Chat models.
  • All versions are fully open for academic research. For commercial use, developers only need to apply by email and obtain official commercial permission; commercial use is then free of charge.
  • For more information, see our technical report, Baichuan 2: Open Large-scale Language Models.

The specific released versions and download links are shown in the table below:

| | Base Models | Aligned Models | Aligned Models 4bits Quantized |
|---|---|---|---|
| 7B | 🤗 Baichuan2-7B-Base | 🤗 Baichuan2-7B-Chat | 🤗 Baichuan2-7B-Chat-4bits |
| 13B | 🤗 Baichuan2-13B-Base | 🤗 Baichuan2-13B-Chat | 🤗 Baichuan2-13B-Chat-4bits |

Benchmark Results

We conducted extensive testing on authoritative Chinese, English and multi-language datasets across six domains: general, legal, medical, mathematics, code, and multi-language translation.

General Domain

In the general domain, we conducted 5-shot tests (3-shot for BBH, as the result tables note) on the following datasets:

  • C-Eval is a comprehensive Chinese basic model evaluation dataset, covering 52 disciplines and four levels of difficulty. We used the dev set of this dataset as the source for few-shot learning and tested on the test set. Our evaluation approach followed that of Baichuan-7B.
  • MMLU is an English evaluation dataset comprising 57 tasks, encompassing elementary math, American history, computer science, law, etc. The difficulty ranges from high school level to expert level. It's a mainstream LLM evaluation dataset. We used its open-source evaluation approach.
  • CMMLU is a comprehensive Chinese evaluation benchmark covering 67 topics, specifically designed to assess language models' knowledge and reasoning capabilities in a Chinese context. We adopted its official evaluation approach.
  • Gaokao is a dataset that uses China's college entrance examination questions to evaluate large language models' abilities, focusing on linguistic proficiency and logical reasoning. We retained only its single-choice questions and partitioned them randomly. Our evaluation method is similar to that of C-Eval.
  • AGIEval aims to evaluate a model's general abilities in cognition and problem-solving tasks. We retained only its four-option single-choice questions and partitioned them randomly. We used an evaluation scheme similar to C-Eval.
  • BBH is a challenging subset of Big-Bench. Big-Bench currently includes 204 tasks, with themes spanning linguistics, child development, mathematics, common sense reasoning, biology, physics, societal biases, software development, etc. BBH consists of the benchmark tasks, drawn from those 204, on which large models did not perform well.
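The few-shot setup described above (worked examples drawn from a dev split, followed by one unanswered test question) can be sketched as follows. This is a minimal illustration, not the evaluation code used for these results: the `Question` type, the helper names, and the `Answer:` prompt layout are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical container for one multiple-choice item."""
    text: str
    choices: dict  # option letter -> option text, e.g. {"A": "...", "B": "..."}
    answer: str    # correct option letter


def format_question(q: Question, with_answer: bool) -> str:
    """Render a question; solved examples include the answer letter."""
    lines = [q.text]
    lines += [f"{letter}. {option}" for letter, option in q.choices.items()]
    lines.append(f"Answer: {q.answer}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(dev_examples, test_question, k=5):
    """Prepend k solved dev-set examples to one unsolved test question."""
    shots = [format_question(q, with_answer=True) for q in dev_examples[:k]]
    query = format_question(test_question, with_answer=False)
    return "\n\n".join(shots + [query])
```

With `k=5` this yields a 5-shot prompt; passing `k=3` gives the 3-shot variant. The model's next-token prediction after the final `Answer:` is then compared with the reference letter.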

7B Model Results

| | C-Eval | MMLU | CMMLU | Gaokao | AGIEval | BBH |
|---|---|---|---|---|---|---|
| | 5-shot | 5-shot | 5-shot | 5-shot | 5-shot | 3-shot |
| GPT-4 | 68.40 | 83.93 | 70.33 | 66.15 | 63.27 | 75.12 |
| GPT-3.5 Turbo | 51.10 | 68.54 | 54.06 | 47.07 | 46.13 | 61.59 |
| LLaMA-7B | 27.10 | 35.10 | 26.75 | 27.81 | 28.17 | 32.38 |
| LLaMA2-7B | 28.90 | 45.73 | 31.38 | 25.97 | 26.53 | 39.16 |
| MPT-7B | 27.15 | 27.93 | 26.00 | 26.54 | 24.83 | 35.20 |
| Falcon-7B | 24.23 | 26.03 | 25.66 | 24.24 | 24.10 | 28.77 |
| ChatGLM2-6B | 50.20 | 45.90 | 49.00 | 49.44 | 45.28 | 31.65 |
| Baichuan-7B | 42.80 | 42.30 | 44.02 | 36.34 | 34.44 | 32.48 |
| Baichuan2-7B-Base | 54.00 | 54.16 | 57.07 | 47.47 | 42.73 | 41.56 |
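
To read the 7B results as generation-over-generation gains, the two Baichuan rows can be subtracted column by column. The snippet below is only a worked calculation on the numbers reported above; the variable and function names are illustrative.

```python
BENCHMARKS = ["C-Eval", "MMLU", "CMMLU", "Gaokao", "AGIEval", "BBH"]

# Scores copied from the 7B results reported above.
BAICHUAN_7B = [42.80, 42.30, 44.02, 36.34, 34.44, 32.48]
BAICHUAN2_7B_BASE = [54.00, 54.16, 57.07, 47.47, 42.73, 41.56]


def score_gains(old, new):
    """Per-benchmark difference in points, rounded to two decimals."""
    return [round(n - o, 2) for o, n in zip(old, new)]


gains = dict(zip(BENCHMARKS, score_gains(BAICHUAN_7B, BAICHUAN2_7B_BASE)))

for name, gain in gains.items():
    print(f"{name:8s} +{gain:.2f}")
```

The largest gain is on CMMLU (13.05 points) and the smallest on AGIEval (8.29 points).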

13B Model Results

| | C-Eval | MMLU | CMMLU | Gaokao | AGIEval | BBH |
|---|---|---|---|---|---|---|
| | 5-shot | 5-shot | 5-shot | 5-shot | 5-shot | 3-shot |
| GPT-4 | 68.40 | 83.93 | 70.33 | 66.15 | 63.27 | 75.12 |
| GPT-3.5 Turbo | 51.10 | 68.54 | 54.06 | 47.07 | 46.13 | 61.59 |
| LLaMA-13B | 28.50 | 46.30 | 31.15 | 28.23 | 28.22 | 37.89 |
| LLaMA2-13B | 35.80 | 55.09 | 37.99 | 30.83 | 32.29 | 46.98 |
| Vicuna-13B | 32.80 | 52.00 | 36.28 | 30.11 | 31.55 | 43.04 |
| Chinese-Alpaca-Plus-13B | 38.80 | 43.90 | 33.43 | 34.78 | 35.46 | 28.94 |
| XVERSE-13B | 53.70 | 55.21 | 58.44 | 44.69 | 42.54 | 38.06 |
| Baichuan-13B-Base | 52.40 | 51.60 | 55.30 | 49.69 | 43.20 | 43.01 |
| Baichuan2-13B-Base | 58.10 | 59.17 | 61.97 | 54.33 | 48.17 | 48.78 |

Law and Medicine

In the legal domain, we used the JEC-QA dataset. The JEC-QA dataset originates from China's National Judicial Examination. We retained only the multiple-choice questions from it. Our evaluation method was similar to that of C-Eval.

In the medical domain, we used medical-related subjects from general domain datasets (C-Eval, MMLU, CMMLU), as well as MedQA and MedMCQA. We followed an evaluation scheme similar to C-Eval.

  • For testing convenience, we used the val set from C-Eval for testing.
  • The MedQA dataset comes from medical exams in the US and China. We tested the USMLE and MCMLE subsets from the MedQA dataset, and used a version with five candidates.
  • The MedMCQA dataset originates from entrance exams of medical colleges in India. We retained only the multiple-choice questions. Since the test set doesn't have answers, we used the dev set for testing.
  • Medical-related subjects included in the general domain datasets are as follows:
      • C-Eval: clinical_medicine, basic_medicine
      • MMLU: clinical_knowledge, anatomy, college_medicine, college_biology, nutrition, virology, medical_genetics, professional_medicine
      • CMMLU: anatomy, clinical_knowledge, college_medicine, genetics, nutrition, traditional_chinese_medicine, virology

We conducted 5-shot tests on the above datasets.
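
The medical subject lists above amount to a filter over per-question results from the three general-domain datasets. A minimal sketch of that filter follows. The `(dataset, subject, is_correct)` record layout is an assumption, and so is the pooling: the text does not state how the combined CEval-MMLU-CMMLU score is aggregated, so the question-level average used here is only one plausible choice.

```python
# Subject names copied from the lists above.
MEDICAL_SUBJECTS = {
    "C-Eval": ["clinical_medicine", "basic_medicine"],
    "MMLU": [
        "clinical_knowledge", "anatomy", "college_medicine", "college_biology",
        "nutrition", "virology", "medical_genetics", "professional_medicine",
    ],
    "CMMLU": [
        "anatomy", "clinical_knowledge", "college_medicine", "genetics",
        "nutrition", "traditional_chinese_medicine", "virology",
    ],
}


def medical_accuracy(results):
    """Accuracy (in percent) over medical subjects only.

    `results` is an iterable of (dataset, subject, is_correct) tuples.
    Pooling all kept questions together is an assumption, not the
    documented aggregation method.
    """
    kept = [
        is_correct
        for dataset, subject, is_correct in results
        if subject in MEDICAL_SUBJECTS.get(dataset, ())
    ]
    if not kept:
        raise ValueError("no medical-subject results found")
    return 100.0 * sum(kept) / len(kept)
```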

7B Model Results

| | JEC-QA | CEval-MMLU-CMMLU | MedQA-USMLE | MedQA-MCMLE | MedMCQA |
|---|---|---|---|---|---|
| | 5-shot | 5-shot | 5-shot | 5-shot | 5-shot |
| GPT-4 | 59.32 | 77.16 | 80.28 | 74.58 | 72.51 |
| GPT-3.5 Turbo | 42.31 | 61.17 | 53.81 | 52.92 | 56.25 |
| LLaMA-7B | 27.45 | 33.34 | 24.12 | 21.72 | 27.45 |
| LLaMA2-7B | 29.20 | 36.75 | 27.49 | 24.78 | 37.93 |
| MPT-7B | 27.45 | 26.67 | 16.97 | 19.79 | 31.96 |
| Falcon-7B | 23.66 | 25.33 | 21.29 | 18.07 | 33.88 |
| ChatGLM2-6B | 40.76 | 44.54 | 26.24 | 45.53 | 30.22 |
| Baichuan-7B | 34.64 | 42.37 | 27.42 | 39.46 | 31.39 |
| Baichuan2-7B-Base | 44.46 | 56.39 | 32.68 | 54.93 | 41.73 |

13B Model Results

| | JEC-QA | CEval-MMLU-CMMLU | MedQA-USMLE | MedQA-MCMLE | MedMCQA |
|---|---|---|---|---|---|
| | 5-shot | 5-shot | 5-shot | 5-shot | 5-shot |
| GPT-4 | 59.32 | 77.16 | 80.28 | 74.58 | 72.51 |
| GPT-3.5 Turbo | 42.31 | 61.17 | 53.81 | 52.92 | 56.25 |
| LLaMA-13B | 27.54 | 35.14 | 28.83 | 23.38 | 39.52 |
| LLaMA2-13B | 34.08 | 47.42 | 35.04 | 29.74 | 42.12 |
| Vicuna-13B | 28.38 | 40.99 | 34.80 | 27.67 | 40.66 |
| Chinese-Alpaca-Plus-13B | 35.32 | 46.31 | 27.49 | 32.66 | 35.87 |
| XVERSE-13B | 46.42 | 58.08 | 32.99 | 58.76 | 41.34 |
| Baichuan-13B-Base | 41.34 | 51.77 | 29.07 | 43.67 | 39.60 |
| Baichuan2-13B-Base | 47.40 | 59.33 | 40.38 | 61.62 | 42.86 |

Mathematics and Code

In the mathematics domain, we used the OpenCompass evaluation framework and conducted 4-shot tests on the GSM8K and MATH datasets.

  • GSM8K is a dataset released by OpenAI, consisting of 8.5K high-quality, linguistically diverse grade-school math word problems. Each problem takes between 2 and 8 steps of basic arithmetic to solve and has a single numeric answer.
  • The MATH dataset contains 12,500 math problems (of which 7,500 belong to the training set and 5,000 to the test set). These problems are collected from math competitions like AMC 10, AMC 12, AIME.

For the code domain, we used the HumanEval and MBPP datasets. Using OpenCompass, we performed a 0-shot test on HumanEval and a 3-shot test on the MBPP dataset.

  • Tasks in HumanEval include programming tasks encompassing language understanding, reasoning, algorithms, and basic math to evaluate the functional correctness of models and measure their problem-solving capability.
  • MBPP consists of a dataset with 974 Python short functions, textual descriptions of programs, and test cases to check their functional correctness.
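
Both code benchmarks score a model by functional correctness: a generated function counts as correct only if it passes the task's test cases. The sketch below illustrates that idea in its simplest form. It is not the OpenCompass harness, and the helper names `passes_tests` and `pass_rate` are hypothetical. Real harnesses run each sample in a sandboxed subprocess with a timeout, because `exec` on model output is unsafe and cannot stop an infinite loop.

```python
def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Return True if the candidate code runs and every assertion holds.

    WARNING: exec() on untrusted code is unsafe; this is for illustration only.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)  # define the generated function
        exec(test_source, namespace)       # run the task's assert statements
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False
    return True


def pass_rate(problems):
    """Percentage of (candidate_source, test_source) pairs that pass."""
    results = [passes_tests(candidate, tests) for candidate, tests in problems]
    return 100.0 * sum(results) / len(results)
```

For example, a candidate `def add(a, b): return a + b` passes the test `assert add(1, 2) == 3`, while a candidate that returns `a - b` fails it.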

7B Model Results

GSM8K MATH HumanEval MBPP

Core symbols most depended-on inside this repo

| Symbol | Called by | File |
|---|---|---|
| clear_screen | 2 | cli_demo.py |
| preprocessing | 2 | fine-tune/fine-tune.py |
| generate_response | 1 | OpenAI_api.py |
| init_model | 1 | cli_demo.py |
| vim_input | 1 | cli_demo.py |
| main | 1 | cli_demo.py |
| init_model | 1 | web_demo.py |
| init_chat_history | 1 | web_demo.py |

Shape

Function 11
Class 4
Method 4
Route 1

Languages

Python 100%

Modules by API surface

fine-tune/fine-tune.py: 9 symbols
web_demo.py: 4 symbols
cli_demo.py: 4 symbols
OpenAI_api.py: 3 symbols

Dependencies from manifests, versioned

torch 2.0.0 · 1×

For agents

$ claude mcp add Baichuan2 \
  -- python -m otcore.mcp_server <graph>
