XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.
For technical details and experimental results, please refer to our paper:
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
(*: equal contribution)
Preprint 2019
As of June 19, 2019, XLNet outperforms BERT on 20 tasks and achieves state-of-the-art results on 18 tasks. Below are some comparisons between XLNet-Large and BERT-Large, which have similar model sizes:
| Model | RACE accuracy | SQuAD1.1 EM | SQuAD2.0 EM |
|---|---|---|---|
| BERT-Large | 72.0 | 84.1 | 78.98 |
| XLNet-Base | | | 80.18 |
| XLNet-Large | 81.75 | 88.95 | 86.12 |
We use SQuAD dev results in the table to exclude other factors such as using additional training data or other data augmentation techniques. See SQuAD leaderboard for test numbers.
| Model | IMDB | Yelp-2 | Yelp-5 | DBpedia | Amazon-2 | Amazon-5 |
|---|---|---|---|---|---|---|
| BERT-Large | 4.51 | 1.89 | 29.32 | 0.64 | 2.63 | 34.17 |
| XLNet-Large | 3.79 | 1.55 | 27.80 | 0.62 | 2.40 | 32.26 |
The above numbers are error rates.
| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B |
|---|---|---|---|---|---|---|---|---|
| BERT-Large | 86.6 | 92.3 | 91.3 | 70.4 | 93.2 | 88.0 | 60.6 | 90.0 |
| XLNet-Base | 86.8 | 91.7 | 91.4 | 74.0 | 94.7 | 88.2 | 60.2 | 89.5 |
| XLNet-Large | 89.8 | 93.9 | 91.8 | 83.8 | 95.6 | 89.2 | 63.6 | 91.8 |
We use single-task dev results in the table to exclude other factors such as multi-task learning or using ensembles.
As of July 16, 2019, the following models have been made available:
* XLNet-Large, Cased: 24-layer, 1024-hidden, 16-heads
* XLNet-Base, Cased: 12-layer, 768-hidden, 12-heads. This model is trained on full data (different from the one in the paper).
We only release cased models for now because on the tasks we consider, we found: (1) for the base setting, cased and uncased models have similar performance; (2) for the large setting, cased models are a bit better in some tasks.
Each .zip file contains three items:
* A TensorFlow checkpoint (xlnet_model.ckpt) containing the pre-trained weights (which is actually 3 files).
* A Sentence Piece model (spiece.model) used for (de)tokenization.
* A config file (xlnet_config.json) which specifies the hyperparameters of the model.
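
As a quick sanity check after unpacking, the three items listed above can be verified with a small script. This is a minimal Python 3 sketch, not part of the released code: the helper name `missing_release_items` is hypothetical, and because the checkpoint is a prefix shared by several files, it is matched by prefix rather than by exact file names.

```python
import glob
import os

# Items listed for each released .zip file.
REQUIRED_FILES = ["spiece.model", "xlnet_config.json"]
CHECKPOINT_PREFIX = "xlnet_model.ckpt"


def missing_release_items(model_dir):
    """Return the names of expected items that are not found in model_dir."""
    missing = [name for name in REQUIRED_FILES
               if not os.path.isfile(os.path.join(model_dir, name))]
    # The checkpoint consists of several files sharing one prefix.
    if not glob.glob(os.path.join(model_dir, CHECKPOINT_PREFIX + "*")):
        missing.append(CHECKPOINT_PREFIX)
    return missing


if __name__ == "__main__":
    # LARGE_DIR is the directory the model was unpacked to (assumed to be set).
    model_dir = os.environ.get("LARGE_DIR", ".")
    print(missing_release_items(model_dir) or "all items present")
```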
We also plan to continuously release more pretrained models under different settings, including:
* A pretrained model that is finetuned on Wikipedia. This can be used for tasks with Wikipedia text such as SQuAD and HotpotQA.
* Pretrained models with other hyperparameter configurations, targeting specific downstream tasks.
* Pretrained models that benefit from new techniques.
To receive notifications about updates, announcements and new releases, we recommend subscribing to the XLNet Google Group.
As of June 19, 2019, this code base has been tested with TensorFlow 1.13.1 under Python 2.
It is difficult to reproduce the XLNet-Large SOTA results in the paper using GPUs with 12GB - 16GB of RAM, because a 16GB GPU is only able to hold a single sequence with length 512 for XLNet-Large. Therefore, a large number of GPUs (ranging from 32 to 128, equal to batch_size) is required to reproduce many results in the paper.

Given the memory issue mentioned above, using the default finetuning scripts (run_classifier.py and run_squad.py), we benchmarked the maximum batch size on a single 16GB GPU with TensorFlow 1.13.1:
| System | Seq Length | Max Batch Size |
|---|---|---|
| XLNet-Base | 64 | 120 |
| ... | 128 | 56 |
| ... | 256 | 24 |
| ... | 512 | 8 |
| XLNet-Large | 64 | 16 |
| ... | 128 | 8 |
| ... | 256 | 2 |
| ... | 512 | 1 |
In most cases, it is possible to reduce the batch size train_batch_size or the maximum sequence length max_seq_length to fit in given hardware. The decrease in performance depends on the task and the available resources.
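
To make the hardware requirement concrete, the benchmark numbers above can be turned into a rough estimate of how many 16GB GPUs a given total batch size needs. This is a minimal sketch that assumes plain data parallelism with no gradient accumulation; the function name `gpus_needed` is hypothetical.

```python
# Maximum per-GPU batch sizes from the benchmark table
# (single 16GB GPU, TensorFlow 1.13.1), keyed by (model, sequence length).
MAX_BATCH_PER_GPU = {
    ("XLNet-Base", 64): 120, ("XLNet-Base", 128): 56,
    ("XLNet-Base", 256): 24, ("XLNet-Base", 512): 8,
    ("XLNet-Large", 64): 16, ("XLNet-Large", 128): 8,
    ("XLNet-Large", 256): 2, ("XLNet-Large", 512): 1,
}


def gpus_needed(model, seq_len, total_batch_size):
    """Smallest number of GPUs whose combined capacity covers total_batch_size."""
    per_gpu = MAX_BATCH_PER_GPU[(model, seq_len)]
    # Integer ceiling division.
    return (total_batch_size + per_gpu - 1) // per_gpu


print(gpus_needed("XLNet-Large", 512, 32))   # 32
print(gpus_needed("XLNet-Large", 512, 128))  # 128
print(gpus_needed("XLNet-Base", 512, 32))    # 4
```

With a single length-512 sequence per GPU, XLNet-Large needs as many GPUs as the batch size, which matches the range of 32 to 128 stated above.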
The code used to perform classification/regression finetuning is in run_classifier.py. It also contains examples for standard one-document classification, one-document regression, and document pair classification. Here, we provide two concrete examples of how run_classifier.py can be used.
From here on, we assume XLNet-Large and XLNet-Base have been downloaded to $LARGE_DIR and $BASE_DIR respectively.
Download the GLUE data by running this script and unpack it to some directory $GLUE_DIR.
Perform multi-GPU (4 V100 GPUs) finetuning with XLNet-Large by running
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py \
  --do_train=True \
  --do_eval=False \
  --task_name=sts-b \
  --data_dir=${GLUE_DIR}/STS-B \
  --output_dir=proc_data/sts-b \
  --model_dir=exp/sts-b \
  --uncased=False \
  --spiece_model_file=${LARGE_DIR}/spiece.model \
  --model_config_path=${LARGE_DIR}/xlnet_config.json \
  --init_checkpoint=${LARGE_DIR}/xlnet_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=4 \
  --learning_rate=5e-5 \
  --train_steps=1200 \
  --warmup_steps=120 \
  --save_steps=600 \
  --is_regression=True
```
Evaluate the finetuning results with a single GPU by running

```shell
CUDA_VISIBLE_DEVICES=0 python run_classifier.py \
  --do_train=False \
  --do_eval=True \
  --task_name=sts-b \
  --data_dir=${GLUE_DIR}/STS-B \
  --output_dir=proc_data/sts-b \
  --model_dir=exp/sts-b \
  --uncased=False \
  --spiece_model_file=${LARGE_DIR}/spiece.model \
  --model_config_path=${LARGE_DIR}/xlnet_config.json \
  --max_seq_length=128 \
  --eval_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=1 \
  --eval_all_ckpt=True \
  --is_regression=True

# Expected performance: "eval_pearsonr 0.916+ "
```
Notes:
* num_core_per_host denotes the number of GPUs to use.
* train_batch_size refers to the per-GPU batch size.
* eval_all_ckpt allows one to evaluate all saved checkpoints (save frequency is controlled by save_steps) after training finishes and choose the best model based on dev performance.
* data_dir and output_dir refer to the directories of the "raw data" and "preprocessed tfrecords" respectively, while model_dir is the working directory for saving checkpoints and TensorFlow events. model_dir should be a folder separate from init_checkpoint.
* To try XLNet-Base, set --train_batch_size=32 and --num_core_per_host=1, along with the corresponding changes in init_checkpoint and model_config_path.
* For GPUs with less memory, proportionally decrease train_batch_size and increase num_core_per_host to use the same training setting.
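
Because train_batch_size is a per-GPU value, the total number of examples per training step in the GPU setting is its product with num_core_per_host. The following minimal sketch illustrates the arithmetic behind the notes above; the helper name `total_batch_size` is hypothetical, and the last configuration is only an illustration of the "proportional" rule, not a tested setting.

```python
def total_batch_size(train_batch_size, num_core_per_host):
    """Examples per step when train_batch_size is the per-GPU batch size."""
    return train_batch_size * num_core_per_host


# STS-B example above: 4 GPUs with 8 examples each.
print(total_batch_size(train_batch_size=8, num_core_per_host=4))   # 32
# Single GPU with --train_batch_size=32.
print(total_batch_size(train_batch_size=32, num_core_per_host=1))  # 32
# Halving the per-GPU batch and doubling the GPUs keeps the total unchanged.
print(total_batch_size(train_batch_size=4, num_core_per_host=8))   # 32
```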
Download and unpack the IMDB dataset by running

```shell
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```
Launch a Google Cloud TPU v3-8 instance (see the Google Cloud TPU tutorial for how to set up Cloud TPUs).
Set up your Google storage bucket path $GS_ROOT and move the IMDB dataset and pretrained checkpoint into your Google storage.
Perform TPU finetuning with XLNet-Large by running
```shell
python run_classifier.py \
  --use_tpu=True \
  --tpu=${TPU_NAME} \
  --do_train=True \
  --do_eval=True \
  --eval_all_ckpt=True \
  --task_name=imdb \
  --data_dir=${IMDB_DIR} \
  --output_dir=${GS_ROOT}/proc_data/imdb \
  --model_dir=${GS_ROOT}/exp/imdb \
  --uncased=False \
  --spiece_model_file=${LARGE_DIR}/spiece.model \
  --model_config_path=${GS_ROOT}/${LARGE_DIR}/xlnet_config.json \
  --init_checkpoint=${GS_ROOT}/${LARGE_DIR}/xlnet_model.ckpt \
  --max_seq_length=512 \
  --train_batch_size=32 \
  --eval_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=8 \
  --learning_rate=2e-5 \
  --train_steps=4000 \
  --warmup_steps=500 \
  --save_steps=500 \
  --iterations=500

# Expected performance: "eval_accuracy 0.962+ "
```
Notes:
* data_dir and spiece_model_file both use a local path rather than a Google Storage path. The reason is that data preprocessing is actually performed locally. Hence, using local paths leads to a faster preprocessing speed.

The code for the SQuAD dataset is included in run_squad.py.
To run the code:
(1) Download the SQuAD2.0 dataset into $SQUAD_DIR by:
```shell
mkdir -p ${SQUAD_DIR} && cd ${SQUAD_DIR}
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```
(2) Perform data preprocessing using the script scripts/prepro_squad.sh.
This will take quite some time in order to accurately map character positions (raw data) to sentence piece positions (used for training).
For faster parallel preprocessing, please refer to the flags --num_proc and --proc_id in run_squad.py.
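
One common way for a pair of flags like --num_proc and --proc_id to divide work is strided sharding, where worker proc_id takes every num_proc-th example. The sketch below only illustrates that idea under this assumption; it does not describe the actual logic in run_squad.py, and the function name `shard` is hypothetical.

```python
def shard(examples, num_proc, proc_id):
    """Return the examples handled by worker proc_id (0-based) out of num_proc."""
    if not 0 <= proc_id < num_proc:
        raise ValueError("proc_id must be in [0, num_proc)")
    # Strided slice: worker i takes examples i, i + num_proc, i + 2 * num_proc, ...
    return examples[proc_id::num_proc]


examples = list(range(10))
shards = [shard(examples, 3, i) for i in range(3)]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Under this scheme every example is assigned to exactly one worker, so the workers can run independently and their outputs can be combined afterwards.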
(3) Perform training and evaluation.
For the best performance, XLNet-Large uses sequence length 512 and batch size 48 for training.
As a result, reproducing the best result with GPUs is quite difficult.
For training with one TPU v3-8, one can simply run the script scripts/tpu_squad_large.sh after both the TPU and Google storage have been set up.
run_squad.py will automatically perform threshold searching on the dev set of SQuAD and output the score. With scripts/tpu_squad_large.sh, the expected F1 score should be around 88.6 (median of our multiple runs).

Alternatively, one can use XLNet-Base with GPUs (e.g. three V100). One set of reasonable hyper-parameters can be found in the script scripts/gpu_squad_base.sh.
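
For readers unfamiliar with threshold searching on SQuAD2.0, the idea can be sketched as follows: each question gets a "no answer" score difference, and the threshold above which the model abstains is chosen to maximize the dev score. This is a simplified illustration under assumed inputs, not the code in run_squad.py; the tuple layout and the name `best_threshold` are hypothetical.

```python
def best_threshold(examples):
    """Pick the abstention threshold that maximizes the average dev score.

    examples: list of (null_score_diff, answerable, score_if_answered), where a
    larger null_score_diff means the model leans more towards "no answer".
    Returns (best_average_score, best_threshold).
    """
    def average_score(threshold):
        total = 0.0
        for diff, answerable, score in examples:
            if diff > threshold:
                # Abstain: correct only if the question has no answer.
                total += 0.0 if answerable else 1.0
            else:
                # Answer with the predicted span: wrong if no answer exists.
                total += score if answerable else 0.0
        return total / len(examples)

    # Trying each observed difference (plus "always abstain") covers all outcomes.
    candidates = [float("-inf")] + sorted(diff for diff, _, _ in examples)
    return max(((average_score(t), t) for t in candidates), key=lambda p: p[0])


dev = [(-2.0, True, 1.0), (-1.0, True, 0.5), (1.0, False, 0.0), (3.0, False, 0.0)]
print(best_threshold(dev))  # (0.875, -1.0)
```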
The code for the reading comprehension task RACE is included in run_race.py.
To run the code:
(1) Download the RACE dataset from the official website and unpack the raw data to $RACE_DIR.
(2) Perform training and evaluation:
* For the batch size 32 setting, refer to the script scripts/tpu_race_large_bsz32.sh.
* A batch size 8 setting is also available (see scripts/tpu_race_large_bsz8.sh).

[An exampl