The purpose of this repository is to explore text classification methods in NLP with deep learning.
Customize an NLP API in three minutes, for free: NLP API Demo
Language Understanding Evaluation benchmark for Chinese (CLUE benchmark): run 10 tasks & 9 baselines with one line of code, with a detailed performance comparison.
Released pre-trained ALBERT_Chinese models (xxlarge, xlarge and more), trained on 30G+ of raw Chinese corpus and targeting state-of-the-art performance in Chinese; released 2019-Oct-7, during the National Day of China!
Large Amount of Chinese Corpus for NLP Available!
Google's BERT achieved new state-of-the-art results on more than 10 NLP tasks by pre-training a language model and then
fine-tuning it. Pre-train TextCNN: the idea from BERT for language understanding, with running code and data set.
This repository has all kinds of baseline models for text classification.
It also supports multi-label classification, where multiple labels are associated with a sentence or document.
Many of these models are simple and may not get you to the top level of a task, but some of them are
classic, so they serve well as baseline models. Each model has a test function under the model class. You can run
it to perform a toy task first. The models are independent of the data set.
Check here for a formal report on large-scale multi-label text classification with deep learning.
Several models here can also be used for modelling question answering (with or without context), or for sequence generation.
We explore two seq2seq models (seq2seq with attention, and the Transformer from "Attention Is All You Need") for text classification.
These two models can also be used for sequence generation and other tasks. If your task is multi-label classification,
you can cast the problem as sequence generation.
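One way to picture casting multi-label classification as sequence generation, as a minimal sketch: the labels of each input become the decoder's target sequence, framed by start and end tokens and padded to a fixed length. The token names `_GO`, `_END`, `_PAD` and the helper `labels_to_target` are illustrative assumptions, not necessarily what the models in this repository use.

```python
# Hypothetical special tokens; the repository's own names may differ.
GO, END, PAD = "_GO", "_END", "_PAD"

def labels_to_target(labels, max_len):
    """Turn a list of labels into (decoder_input, decoder_target) of length max_len."""
    labels = list(labels)[: max_len - 1]      # leave room for the END token
    decoder_input = [GO] + labels             # what the decoder reads at each step
    decoder_target = labels + [END]           # what the decoder should emit at each step
    pad = max_len - len(decoder_target)
    return decoder_input + [PAD] * pad, decoder_target + [PAD] * pad

dec_in, dec_out = labels_to_target(["l1", "l2", "l3"], max_len=5)
print(dec_in)   # ['_GO', 'l1', 'l2', 'l3', '_PAD']
print(dec_out)  # ['l1', 'l2', 'l3', '_END', '_PAD']
```

At prediction time the generated tokens up to `_END` would be read back as the predicted label set.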
We implement two memory networks. One is the dynamic memory network, which previously reached state-of-the-art results in question
answering, sentiment analysis and sequence generation tasks. It is a so-called single model for several different tasks
that reaches high performance. It has four modules. The key component is the episodic memory module: it uses a gate mechanism to
perform attention, uses a gated GRU to update the episode memory, and then has another GRU (in a vertical direction) to
perform the hidden state update. It is able to do transitive inference.
The second memory network we implemented is the recurrent entity network (Tracking the State of the World). It has blocks of
key-value pairs as memory, which run in parallel, and it achieved a new state of the art. It can be used for modelling question
answering with context (or history). For example, you can let the model read some sentences (as context), ask a
question (as query), and then ask the model to predict an answer; if you feed in the story as the query as well, it can do a
classification task.
To discuss ML/DL/NLP problems and get tech support from each other, you can join QQ group: 836811304
1) fastText
2) TextCNN
3) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
4) TextRNN
5) RCNN
6) Hierarchical Attention Network
7) seq2seq with attention
8) Transformer ("Attention Is All You Need")
9) Dynamic Memory Network
10) EntityNetwork: Tracking the State of the World
11) Ensemble models
12) Boosting:
For a single model, stack identical models together; each layer is a model. The result is based on the layers' logits added together. The only connection between layers is the label weights: the front layer's prediction error rate for each label becomes the weight for the next layer, so labels with a high error rate get a large weight. Later layers therefore pay more attention to the mis-predicted labels and try to fix the mistakes of the former layer. As a result, we get a much stronger model.
check a00_boosting/boosting.py
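The weighting idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, and normalizing the error rates so that the weights average to 1 is a choice made for this sketch; a00_boosting/boosting.py may compute the weights differently.

```python
def label_error_rates(y_true, y_pred, num_labels):
    """Per-label error rate: fraction of examples where that label was predicted wrongly."""
    errors = [0] * num_labels
    for true_row, pred_row in zip(y_true, y_pred):
        for j in range(num_labels):
            if true_row[j] != pred_row[j]:
                errors[j] += 1
    return [e / float(len(y_true)) for e in errors]

def next_layer_weights(error_rates, eps=1e-8):
    """Assumed normalization: scale error rates so the weights average to about 1."""
    total = sum(error_rates) + eps
    return [len(error_rates) * e / total for e in error_rates]

def combine_logits(per_layer_logits):
    """Final score of each label is the sum of that label's logits over all layers."""
    return [sum(values) for values in zip(*per_layer_logits)]

# Toy multi-hot targets and predictions of the front layer (4 examples, 3 labels).
y_true = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

rates = label_error_rates(y_true, y_pred, num_labels=3)
weights = next_layer_weights(rates)
print(rates)    # [0.0, 0.25, 0.5]
print(weights)  # the third label, with the highest error rate, gets the largest weight
```

The next layer would multiply its per-label loss by `weights`, and `combine_logits` would add the logits of all layers to produce the final prediction.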
and other models:
1) BiLstmTextRelation;
2) twoCNNTextRelation;
3) BiLstmTextRelationTwoRNN
(multi-label prediction task; the model is asked to predict the top 5 labels; 3 million training examples; full score: 0.5)
| Model | fastText | TextCNN | TextRNN | RCNN | HierAtteNet | Seq2seqAttn | EntityNet | DynamicMemory | Transformer |
|---|---|---|---|---|---|---|---|---|---|
| Score | 0.362 | 0.405 | 0.358 | 0.395 | 0.398 | 0.322 | 0.400 | 0.392 | 0.322 |
| Training | 10m | 2h | 10h | 2h | 2h | 3h | 3h | 5h | 7h |
The BERT model achieves 0.368 on the validation set after the first 9 epochs.
Ensemble of TextCNN,EntityNet,DynamicMemory: 0.411
Ensemble of EntityNet,DynamicMemory: 0.403
Notice:
m stands for minutes; h stands for hours;
HierAtteNet means Hierarchical Attention Network;
Seq2seqAttn means Seq2seq with attention;
DynamicMemory means DynamicMemoryNetwork;
Transformer stands for the model from 'Attention Is All You Need'.
1) model is in xxx_model.py
2) run python xxx_train.py to train the model
3) run python xxx_predict.py to do inference(test).
Each model has a test method under the model class. You can run the test method first to check whether the model works properly.
python 2.7+ tensorflow 1.8
(tensorflow 1.1 to 1.13 should also work; most of the models should also work fine in other tensorflow versions, since we
use very few features bound to a certain version.)
If you use python3, it will be fine as long as you change the print statements and try/except syntax wherever you hit an error.
The TextCNN model has already been converted to python 3.6.
To help you run this repository, we have re-generated the training/validation/test data and vocabulary/labels, and saved
them as cache files using h5py. We suggest you download them from the above link.
The download contains everything you need to run this repository: the data is pre-processed, so you can start training a model in a minute.
It's a zip file of about 1.8G that contains 3 million training examples. Although it is quite big after unzipping, with the help of
hdf5 it only needs a normal amount of memory (e.g. 8G or less) during training.
We use a jupyter notebook, pre-processing.ipynb, to pre-process the data. You can get a better understanding of this task and
its data by taking a look at it. You can also generate the data yourself in the way you want by changing a few lines of code
in this jupyter notebook.
If you want to try a model now, you can download the cached files from above, then go to the folder 'a02_TextCNN' and run
python p7_TextCNN_train.py
It will use data from the cached files to train the model, and print the loss and F1 score periodically.
Old sample data source: if you need some sample data and word embeddings pre-trained with word2vec, you can find them in closed issues, such as issue 3.
You can also find some sample data in the folder "data". It contains two files: 'sample_single_label.txt', which contains 50k examples
with a single label, and 'sample_multiple_label.txt', which contains 20k examples with multiple labels. The input and labels are separated by " __label__".
If you want to know more about the text classification data set or the task these models can be used for, one option is below:
https://biendata.com/competition/zhihu/
One way you can use this repository:
step 1: read through this article. You will get a general idea of the various classic models used for text classification.
step 2: pre-process the data and/or download the cached files.
a. take a look at the jupyter notebook ('pre-processing.ipynb'), where you can become familiar with this text
classification task and data set. You will also see how we pre-process the data and generate the training/validation/test
sets. There is a list of things you can try at the end of this notebook.
b. download the zip file that contains the cached files, so you will have all the necessary data and can start training models.
step 3: run some of the models listed here, and change some code and configurations as you like to get good performance.
Record the performance, the things you did that worked, and the things that did not.
For example, you can explore in this sequence:
1) fasttext---> 2)TextCNN---> 3)Transformer---> 4)BERT
Additionally, write your own article about this topic; you can follow the style of a paper. You may need to read some papers
along the way; many of them are listed in the Reference section at the end of this article. Or join a machine learning
competition and apply what you've learned.
Replace the data in 'data/sample_multiple_label.txt', and make sure the format is as below:
'word1 word2 word3 __label__l1 __label__l2 __label__l3'
where part1, 'word1 word2 word3', is the input (X), and part2, '__label__l1 __label__l2 __label__l3',
represents three labels: [l1,l2,l3]. Between part1 and part2 there should be a single space: ' '.
For example, each line (with multiple labels) looks like:
'w5466 w138990 w1638 w4301 w6 w470 w202 c1834 c1400 c134 c57 c73 c699 c317 c184 __label__5626661657638885119 __label__4921793805334628695 __label__8904735555009151318'
where '5626661657638885119', '4921793805334628695' and '8904735555009151318' are the three labels associated with the input string 'w5466 w138990...c699 c317 c184'
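As a minimal sketch of how a line in this format can be split into input tokens and labels (the helper name `parse_line` is hypothetical; this is not the repository's load_data_multilabel(), which also handles vocabulary lookup and padding):

```python
def parse_line(line, label_prefix="__label__"):
    """Split one line into (input tokens, label strings) using the label prefix."""
    tokens = line.strip().split()
    words = [t for t in tokens if not t.startswith(label_prefix)]
    labels = [t[len(label_prefix):] for t in tokens if t.startswith(label_prefix)]
    return words, labels

words, labels = parse_line("word1 word2 word3 __label__l1 __label__l2 __label__l3")
print(words)   # ['word1', 'word2', 'word3']
print(labels)  # ['l1', 'l2', 'l3']
```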
Notice:
Some utility functions are in data_util.py; check load_data_multilabel() in data_util for how the input and labels are processed from raw data.
There is a function to load a pretrained word embedding and assign it to the model, where the word embedding is pretrained with word2vec or fastText.
If word2vec.load does not work, you can load the pretrained word embedding, especially a Chinese word embedding, with the following lines:
import gensim
from gensim.models import KeyedVectors
word2vec_model = KeyedVectors.load_word2vec_format(word2vec_model_path, binary=True, unicode_errors='ignore')
Or you can set the flag for using pretrained word embeddings to false to disable loading them.
Implementation of Bag of Tricks for Efficient Text Classification
After each word in the sentence is embedded, the word representations are averaged into a text representation, which is in turn fed to a linear classifier. It uses the softmax function to compute the probability distribution over the predefined classes, and cross entropy is then used to compute the loss. A bag-of-words representation does not consider word order; to take word order into account, n-gram features are used to capture some partial information about the local word order. When the number of classes is large, computing the linear classifier is computationally expensive, so the paper uses hierarchical softmax to speed up training.
In this implementation we: 1) use bi-grams and/or tri-grams; 2) use NCE loss to speed up the softmax computation (instead of the hierarchical softmax of the original paper).
Result: performance is as good as in the paper, and it is also very fast.
check: p5_fastTextB_model.py
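The forward pass described above (embed, average, linear classifier, softmax, cross entropy) can be sketched in plain Python. This is an illustration only: the function names and toy values are assumptions, it uses a full softmax rather than NCE loss or hierarchical softmax, and it is not the code in p5_fastTextB_model.py.

```python
import math

def with_bigrams(tokens):
    """Append bi-gram features so that some local word order is kept."""
    return tokens + [a + "_" + b for a, b in zip(tokens, tokens[1:])]

def average_embeddings(tokens, embedding):
    """Average the vectors of all known tokens (assumes at least one token is known)."""
    vectors = [embedding[t] for t in tokens if t in embedding]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def linear(x, weights, bias):
    """One logit per class: dot(weights[c], x) + bias[c]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over the class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Negative log-probability of the correct class."""
    return -math.log(probs[target])

# Toy embedding table with uni-gram and bi-gram entries (made-up values).
embedding = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "good_movie": [1.0, 1.0]}
weights, bias = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]

features = with_bigrams(["good", "movie"])
text_vector = average_embeddings(features, embedding)
probs = softmax(linear(text_vector, weights, bias))
loss = cross_entropy(probs, target=0)
print(features, probs, loss)
```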
Implementation of Convolutional Neural Networks for Sentence Classification
Structure:embedding--->conv--->max pooling--->fully connected layer-------->softmax
Check: p7_TextCNN_model.py
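The structure above (embedding, convolution, max pooling, fully connected layer, softmax) can be traced with a small plain-Python sketch. The function names and toy numbers are assumptions made for illustration; the actual TensorFlow model is in p7_TextCNN_model.py.

```python
import math

def conv1d_max_pool(embedded, filt, bias=0.0):
    """Slide one filter (width x dim) over the sequence, apply ReLU, then max over time.
    Assumes the sequence is at least as long as the filter is wide."""
    width = len(filt)
    feats = []
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        s = bias + sum(w * x for f_row, e_row in zip(filt, window)
                       for w, x in zip(f_row, e_row))
        feats.append(max(0.0, s))
    return max(feats)

def softmax(logits):
    """Numerically stable softmax over the class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def text_cnn_forward(token_ids, embedding, filters, fc_weights, fc_bias):
    """embedding -> conv -> max pooling -> fully connected layer -> softmax."""
    embedded = [embedding[t] for t in token_ids]
    pooled = [conv1d_max_pool(embedded, f) for f in filters]  # one feature per filter
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(fc_weights, fc_bias)]
    return softmax(logits)

# Toy setup: 3-word vocabulary, 2-dim embeddings, one filter of width 2 and one of width 3.
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]
fc_weights, fc_bias = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]

probs = text_cnn_forward([0, 1, 2, 0], embedding, filters, fc_weights, fc_bias)
print(probs)
```

Using several filter widths side by side lets the model pick up n-gram patterns of different lengths, and max pooling keeps only the strongest match of each filter anywhere in the sentence.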
To get very good results with TextCNN, you also need to read this paper carefully: A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification. It gives you some insight into things that can affect performance, although you will need to change some settings according to your specific task.
Convolutional Neural Networks are the main building block for solving computer vision problems. Now we will show how CNN ca