
github.com/baichuan-inc/Baichuan2 @main sqlite


20 symbols · 78 edges · 4 files · 1 documented (5%)

README

Baichuan 2

🤗 Hugging Face • 🤖 ModelScope • 💬 WeChat • 🧩 Modelers

English | 中文

📖 Models Introduction
📊 Benchmark Results 🥇🥇🔥🔥
⚙️ Inference and Deployment
🛠️ Fine-tuning the Model
💾 Intermediate Checkpoints 🔥🔥
👥 Community and Ecosystem
📜 Disclaimer, License and Citation

Update

[2023.12.29] 🎉🎉🎉 We have released Baichuan2-13B-Chat v2. In this version:
- The model's overall capabilities are significantly improved, especially in mathematics, logical reasoning, and complex instruction following.

Models Introduction

Baichuan 2 is the new generation of open-source large language models from Baichuan Intelligent Technology. It was trained on a high-quality corpus of 2.6 trillion tokens.
Baichuan 2 achieved the best performance among models of its size on multiple authoritative Chinese, English, and multilingual benchmarks, both general and domain-specific.
This release includes Base and Chat versions at 7B and 13B, plus a 4-bit quantized version of each Chat model.
All versions are fully open for academic research. Developers can also use them commercially free of charge after applying by email and obtaining an official commercial license.
For more information, see our technical report, Baichuan 2: Open Large-scale Language Models.

The specific released versions and download links are shown in the table below:

Size	Base Models	Aligned Models	Aligned Models (4-bit Quantized)
7B	🤗 Baichuan2-7B-Base	🤗 Baichuan2-7B-Chat	🤗 Baichuan2-7B-Chat-4bits
13B	🤗 Baichuan2-13B-Base	🤗 Baichuan2-13B-Chat	🤗 Baichuan2-13B-Chat-4bits
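
The table maps onto six checkpoint names that follow one pattern. As a minimal sketch, the helper below assembles a name from a size and a variant. The function name and the `baichuan-inc` organization prefix are assumptions made for illustration; the table itself lists only the model names.

```python
# Sizes and variants as listed in the release table.
SIZES = ("7B", "13B")
VARIANTS = ("Base", "Chat", "Chat-4bits")


def checkpoint_name(size: str, variant: str, org: str = "baichuan-inc") -> str:
    """Return an identifier such as 'baichuan-inc/Baichuan2-7B-Chat'.

    `org` is an assumed hub organization prefix, not taken from the table.
    """
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return f"{org}/Baichuan2-{size}-{variant}"


# Every combination in the table: 2 sizes x 3 variants.
ALL_CHECKPOINTS = [checkpoint_name(s, v) for s in SIZES for v in VARIANTS]
```

Rejecting unknown sizes and variants early keeps a typo from turning into a confusing download error later.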

Benchmark Results

We conducted extensive testing on authoritative Chinese, English and multi-language datasets across six domains: general, legal, medical, mathematics, code, and multi-language translation.

General Domain

In the general domain, we conducted 5-shot tests on the following datasets:
- C-Eval is a comprehensive Chinese evaluation dataset for foundation models, covering 52 disciplines and four levels of difficulty. We used the dev set of this dataset as the source for few-shot examples and tested on the test set. Our evaluation approach followed that of Baichuan-7B.
- MMLU is an English evaluation dataset comprising 57 tasks, encompassing elementary math, American history, computer science, law, etc. The difficulty ranges from high school level to expert level. It is a mainstream LLM evaluation dataset. We used its open-source evaluation approach.
- CMMLU is a comprehensive Chinese evaluation benchmark covering 67 topics, specifically designed to assess language models' knowledge and reasoning capabilities in a Chinese context. We adopted its official evaluation approach.
- Gaokao is a dataset that uses China's college entrance examination questions to evaluate large language models' abilities, focusing on linguistic proficiency and logical reasoning. We retained only its single-choice questions and partitioned them randomly. Our evaluation method is similar to that of C-Eval.
- AGIEval aims to evaluate a model's general abilities in cognition and problem-solving tasks. We retained only its four-option single-choice questions and partitioned them randomly. We used an evaluation scheme similar to C-Eval.
- BBH is a subset of challenging tasks from Big-Bench. Big-Bench currently includes 204 tasks, with themes spanning linguistics, child development, mathematics, common sense reasoning, biology, physics, societal biases, software development, etc. BBH consists of the tasks among those 204 on which large models did not perform well.
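
The few-shot setup described above (exemplars drawn from a dev set, questions taken from a test set) can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used for the reported numbers: the record fields (`question`, `A` to `D`, `answer`), the prompt layout, and the random sampling of exemplars are all hypothetical.

```python
import random
from typing import Dict, List, Sequence

CHOICES = ("A", "B", "C", "D")  # four-option single-choice questions


def format_example(ex: Dict[str, str], with_answer: bool) -> str:
    """Render one question; the answer letter is shown only for exemplars."""
    lines = [ex["question"]]
    lines += [f"{label}. {ex[label]}" for label in CHOICES]
    lines.append("Answer:" + (f" {ex['answer']}" if with_answer else ""))
    return "\n".join(lines)


def build_prompt(dev_set: List[Dict[str, str]], test_item: Dict[str, str],
                 k: int = 5, seed: int = 0) -> str:
    """Prepend k solved dev-set exemplars to one unsolved test question."""
    shots = random.Random(seed).sample(dev_set, k)
    parts = [format_example(ex, with_answer=True) for ex in shots]
    parts.append(format_example(test_item, with_answer=False))
    return "\n\n".join(parts)


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predicted letters that match the reference letters."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Toy data, invented purely to exercise the functions.
dev = [{"question": f"Toy question {i}?", "A": "w", "B": "x", "C": "y",
        "D": "z", "answer": "A"} for i in range(6)]
item = {"question": "Toy test question?", "A": "w", "B": "x", "C": "y", "D": "z"}
prompt = build_prompt(dev, item)
```

The prompt ends with a bare `Answer:` so that the model's next token can be read as its chosen option.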

7B Model Results

Model	C-Eval (5-shot)	MMLU (5-shot)	CMMLU (5-shot)	Gaokao (5-shot)	AGIEval (5-shot)	BBH (3-shot)
GPT-4	68.40	83.93	70.33	66.15	63.27	75.12
GPT-3.5 Turbo	51.10	68.54	54.06	47.07	46.13	61.59
LLaMA-7B	27.10	35.10	26.75	27.81	28.17	32.38
LLaMA2-7B	28.90	45.73	31.38	25.97	26.53	39.16
MPT-7B	27.15	27.93	26.00	26.54	24.83	35.20
Falcon-7B	24.23	26.03	25.66	24.24	24.10	28.77
ChatGLM2-6B	50.20	45.90	49.00	49.44	45.28	31.65
Baichuan-7B	42.80	42.30	44.02	36.34	34.44	32.48
Baichuan2-7B-Base	54.00	54.16	57.07	47.47	42.73	41.56
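
To read the last two rows side by side, the short script below computes the per-benchmark gain of Baichuan2-7B-Base over Baichuan-7B. The scores are copied from the table; the variable names are illustrative, and the differences are simple arithmetic rather than separately reported results.

```python
BENCHMARKS = ["C-Eval", "MMLU", "CMMLU", "Gaokao", "AGIEval", "BBH"]

# Scores copied from the 7B results table.
baichuan_7b = [42.80, 42.30, 44.02, 36.34, 34.44, 32.48]
baichuan2_7b_base = [54.00, 54.16, 57.07, 47.47, 42.73, 41.56]

# Absolute gain in points, rounded to the table's two decimals.
gains = {
    name: round(new - old, 2)
    for name, old, new in zip(BENCHMARKS, baichuan_7b, baichuan2_7b_base)
}

for name, gain in gains.items():
    print(f"{name:8s} +{gain:.2f}")
```

The largest gain is on CMMLU (+13.05 points) and the smallest on AGIEval (+8.29 points).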

13B Model Results

Model	C-Eval (5-shot)	MMLU (5-shot)	CMMLU (5-shot)	Gaokao (5-shot)	AGIEval (5-shot)	BBH (3-shot)
GPT-4	68.40	83.93	70.33	66.15	63.27	75.12
GPT-3.5 Turbo	51.10	68.54	54.06	47.07	46.13	61.59
LLaMA-13B	28.50	46.30	31.15	28.23	28.22	37.89
LLaMA2-13B	35.80	55.09	37.99	30.83	32.29	46.98
Vicuna-13B	32.80	52.00	36.28	30.11	31.55	43.04
Chinese-Alpaca-Plus-13B	38.80	43.90	33.43	34.78	35.46	28.94
XVERSE-13B	53.70	55.21	58.44	44.69	42.54	38.06
Baichuan-13B-Base	52.40	51.60	55.30	49.69	43.20	43.01
Baichuan2-13B-Base	58.10	59.17	61.97	54.33	48.17	48.78

Law and Medicine

In the legal domain, we used the JEC-QA dataset. The JEC-QA dataset originates from China's National Judicial Examination. We retained only the multiple-choice questions from it. Our evaluation method was similar to that of C-Eval.

In the medical domain, we used medical-related subjects from general domain datasets (C-Eval, MMLU, CMMLU), as well as MedQA and MedMCQA. We followed an evaluation scheme similar to C-Eval.
- For testing convenience, we used the val set from C-Eval for testing.
- The MedQA dataset comes from medical exams in the US and China. We tested the USMLE and MCMLE subsets from the MedQA dataset, and used a version with five candidates.
- The MedMCQA dataset originates from entrance exams of medical colleges in India. We retained only the multiple-choice questions. Since the test set doesn't have answers, we used the dev set for testing.
- Medical-related subjects included in the general domain datasets are as follows:
  - C-Eval: clinical_medicine, basic_medicine
  - MMLU: clinical_knowledge, anatomy, college_medicine, college_biology, nutrition, virology, medical_genetics, professional_medicine
  - CMMLU: anatomy, clinical_knowledge, college_medicine, genetics, nutrition, traditional_chinese_medicine, virology

We conducted 5-shot tests on the above datasets.
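
The medical subset of the general-domain datasets can be expressed as a filter over per-question results. The sketch below uses the subject names listed above. The record layout `(dataset, subject, is_correct)` is hypothetical, and pooling all retained questions into one accuracy figure is an assumption, since the text does not say how the combined CEval-MMLU-CMMLU score is aggregated.

```python
from typing import Iterable, Tuple

# Subject names as listed for each general-domain dataset.
MEDICAL_SUBJECTS = {
    "C-Eval": {"clinical_medicine", "basic_medicine"},
    "MMLU": {
        "clinical_knowledge", "anatomy", "college_medicine", "college_biology",
        "nutrition", "virology", "medical_genetics", "professional_medicine",
    },
    "CMMLU": {
        "anatomy", "clinical_knowledge", "college_medicine", "genetics",
        "nutrition", "traditional_chinese_medicine", "virology",
    },
}


def medical_accuracy(records: Iterable[Tuple[str, str, bool]]) -> float:
    """Accuracy (%) over medical-subject questions only.

    Assumption: questions are pooled across datasets (a micro-average).
    """
    kept = [
        is_correct
        for dataset, subject, is_correct in records
        if subject in MEDICAL_SUBJECTS.get(dataset, set())
    ]
    if not kept:
        raise ValueError("no medical-subject records found")
    return 100.0 * sum(kept) / len(kept)


# Toy records, invented for illustration; the third one is filtered out.
toy_records = [
    ("C-Eval", "clinical_medicine", True),
    ("MMLU", "anatomy", False),
    ("MMLU", "us_history", True),
    ("CMMLU", "virology", True),
]
toy_score = medical_accuracy(toy_records)
```

A per-dataset average followed by a mean of the three values would be an equally plausible reading and can give a different number when the subsets differ in size.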

7B Model Results

Model	JEC-QA (5-shot)	CEval-MMLU-CMMLU (5-shot)	MedQA-USMLE (5-shot)	MedQA-MCMLE (5-shot)	MedMCQA (5-shot)
GPT-4	59.32	77.16	80.28	74.58	72.51
GPT-3.5 Turbo	42.31	61.17	53.81	52.92	56.25
LLaMA-7B	27.45	33.34	24.12	21.72	27.45
LLaMA2-7B	29.20	36.75	27.49	24.78	37.93
MPT-7B	27.45	26.67	16.97	19.79	31.96
Falcon-7B	23.66	25.33	21.29	18.07	33.88
ChatGLM2-6B	40.76	44.54	26.24	45.53	30.22
Baichuan-7B	34.64	42.37	27.42	39.46	31.39
Baichuan2-7B-Base	44.46	56.39	32.68	54.93	41.73

13B Model Results

Model	JEC-QA (5-shot)	CEval-MMLU-CMMLU (5-shot)	MedQA-USMLE (5-shot)	MedQA-MCMLE (5-shot)	MedMCQA (5-shot)
GPT-4	59.32	77.16	80.28	74.58	72.51
GPT-3.5 Turbo	42.31	61.17	53.81	52.92	56.25
LLaMA-13B	27.54	35.14	28.83	23.38	39.52
LLaMA2-13B	34.08	47.42	35.04	29.74	42.12
Vicuna-13B	28.38	40.99	34.80	27.67	40.66
Chinese-Alpaca-Plus-13B	35.32	46.31	27.49	32.66	35.87
XVERSE-13B	46.42	58.08	32.99	58.76	41.34
Baichuan-13B-Base	41.34	51.77	29.07	43.67	39.60
Baichuan2-13B-Base	47.40	59.33	40.38	61.62	42.86

Mathematics and Code

In the mathematics domain, we used the OpenCompass evaluation framework and conducted 4-shot tests on the GSM8K and MATH datasets.

GSM8K is a dataset released by OpenAI, consisting of 8.5K high-quality, linguistically diverse grade school math word problems. Each problem takes several steps of basic arithmetic to solve and has a single numeric final answer.
The MATH dataset contains 12,500 math problems (7,500 in the training set and 5,000 in the test set), collected from math competitions such as AMC 10, AMC 12, and AIME.
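
Scoring a GSM8K-style problem comes down to comparing final numbers. GSM8K reference solutions end with a line of the form `#### <answer>`. The sketch below extracts that value and compares it with the last number in a model's output. Taking the last number is a common heuristic and an assumption here; it is not necessarily what OpenCompass does, and the function names are illustrative.

```python
import re
from typing import Optional

# An optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Return the text after the final '####' marker, without commas."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(generation: str) -> Optional[str]:
    """Heuristic (assumption): the last number in the output is the answer."""
    matches = _NUMBER.findall(generation)
    return matches[-1].replace(",", "") if matches else None


def is_correct(generation: str, solution: str) -> bool:
    """Compare numerically so that '7' and '7.0' count as the same answer."""
    pred = predicted_answer(generation)
    return pred is not None and float(pred) == float(reference_answer(solution))


# Toy problem, invented for illustration (not taken from the dataset).
toy_solution = "There are 3 + 4 = 7 apples.\n#### 7"
```

For example, `is_correct("3 plus 4 makes 7.", toy_solution)` holds, while an output containing no number at all is scored as incorrect.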

For the code domain, we used the HumanEval and MBPP datasets. Using OpenCompass, we performed a 0-shot test on HumanEval and a 3-shot test on MBPP.
- HumanEval consists of programming tasks spanning language understanding, reasoning, algorithms, and basic math. It evaluates the functional correctness of generated code and measures a model's problem-solving capability.
- MBPP is a dataset of 974 short Python functions, each with a textual description of the program and test cases that check its functional correctness.
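
Functional correctness means that a generated program counts as correct only if it passes the task's test cases. A minimal sketch of that check follows, assuming the tests are given as `assert` statements in the style of MBPP. The function name is illustrative. The sketch runs code in-process with `exec` and has no timeout, so it is only suitable for trusted toy inputs; real harnesses isolate generated code in a sandbox.

```python
from typing import List


def passes_tests(candidate_code: str, test_cases: List[str]) -> bool:
    """Return True if the candidate defines code that satisfies every test.

    Any exception (syntax error, runtime error, failed assert) counts as a
    failure. WARNING: exec runs arbitrary code; do not use on untrusted input.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function(s)
        for test in test_cases:
            exec(test, namespace)         # each test is an assert statement
    except Exception:
        return False
    return True


# Toy task, invented for illustration.
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
```

Here `passes_tests(good, tests)` succeeds and `passes_tests(bad, tests)` fails on the first assertion.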

7B Model Results

	GSM8K	MATH	HumanEval	MBPP

Core symbols most depended-on inside this repo

fine-tune/fine-tune.py

Shape

Function 11

Class 4

Method 4

Route 1

Languages

Python 100%

Modules by API surface

fine-tune/fine-tune.py: 9 symbols

web_demo.py: 4 symbols

cli_demo.py: 4 symbols

OpenAI_api.py: 3 symbols

Dependencies from manifests, versioned

torch 2.0.0 · 1×

For agents

$ claude mcp add Baichuan2 \
  -- python -m otcore.mcp_server <graph>

