hub / github.com/brightmart/text_classification


Text Classification

The purpose of this repository is to explore text classification methods in NLP with deep learning.

Update:

Customize an NLP API in three minutes, for free: NLP API Demo

Language Understanding Evaluation benchmark for Chinese (CLUE benchmark): run 10 tasks & 9 baselines with one line of code, performance comparison with details.

Releasing pre-trained models of ALBERT_Chinese, trained on 30G+ of raw Chinese corpus: xxlarge, xlarge and more. The target is to match state-of-the-art performance in Chinese. Released 2019-Oct-7, during the National Day of China!

Large Amount of Chinese Corpus for NLP Available!

Google's BERT achieved new state-of-the-art results on more than 10 NLP tasks by pre-training a language model and then fine-tuning it. Pre-train TextCNN: idea from BERT for language understanding with running code and data set

Introduction

It has all kinds of baseline models for text classification.

It also supports multi-label classification, where multiple labels are associated with a sentence or document.

Many of these models are simple and may not get you to the top of a task, but some of them are classic, so they can serve well as baseline models. Each model has a test function under its model class; you can run it on a toy task first. The models are independent of the data set.

Check here for a formal report on large-scale multi-label text classification with deep learning.

Several models here can also be used for modelling question answering (with or without context), or for sequence generation.

We explore two seq2seq models (seq2seq with attention, and the Transformer from 'Attention Is All You Need') for text classification. These two models can also be used for sequence generation and other tasks. If your task is multi-label classification, you can cast the problem as sequence generation.

We implement two memory networks. One is the dynamic memory network. Previously it reached state of the art in question answering, sentiment analysis and sequence generation tasks. It is a so-called single model for several different tasks that still reaches high performance. It has four modules. The key component is the episodic memory module: it uses a gate mechanism to perform attention and a gated GRU to update the episodic memory, then another GRU (in a vertical direction) to update the hidden state. It is able to do transitive inference.

The second memory network we implemented is the recurrent entity network ('tracking the state of the world'). Its memory consists of blocks of key-value pairs that run in parallel, and it achieved a new state of the art. It can be used for modelling question answering with context (or history). For example, you can let the model read some sentences (as context), ask a question (as query), and then ask the model to predict an answer; if you feed the story in as the query too, it can do a classification task.

To discuss ML/DL/NLP problems and get tech support from each other, you can join QQ group: 836811304

Models:

1) fastText
2) TextCNN
3) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
4) TextRNN
5) RCNN
6) Hierarchical Attention Network
7) seq2seq with attention
8) Transformer ("Attention Is All You Need")
9) Dynamic Memory Network
10) EntityNetwork: tracking state of the world
11) Ensemble models
12) Boosting:

For a single model, stack identical models together; each layer is a model. The result is based on the logits of all layers added together. The only connection between layers is the label weights: the front layer's prediction error rate for each label becomes the weight for the next layer. Labels with a high error rate get a big weight, so later layers pay more attention to the mis-predicted labels and try to fix the mistakes of the former layer. As a result, we get a much stronger model.
Check a00_boosting/boosting.py
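
The label re-weighting idea described above can be sketched roughly as follows. This is a minimal illustration and not the code in a00_boosting/boosting.py: the function names and the choice of `1 + error rate` as the weight are assumptions made for the example.

```python
def label_error_rates(y_true, y_pred, num_labels):
    """Per-label error rate of one layer: fraction of examples where the
    layer's decision for that label (present / absent) was wrong."""
    errors = [0] * num_labels
    for true_set, pred_set in zip(y_true, y_pred):
        for label in range(num_labels):
            if (label in true_set) != (label in pred_set):
                errors[label] += 1
    return [e / len(y_true) for e in errors]


def next_layer_weights(error_rates):
    """Hypothetical weighting rule: a higher error rate gives a bigger weight."""
    return [1.0 + e for e in error_rates]


def combine_logits(per_layer_logits):
    """Final score per label is the sum of the logits of all layers."""
    return [sum(values) for values in zip(*per_layer_logits)]


# Toy data: 4 examples, 3 labels, each example has a set of true labels.
y_true = [{0, 1}, {1}, {2}, {0}]
y_pred = [{0}, {1}, {1}, {0}]  # predictions of the front layer

rates = label_error_rates(y_true, y_pred, num_labels=3)
weights = next_layer_weights(rates)  # passed to the next layer's loss
summed = combine_logits([[1.0, 2.0, 0.5], [0.5, -1.0, 0.25]])
```

In a real setup the weights would multiply the per-label loss of the next layer, so labels the front layer got wrong contribute more to its gradient.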

and other models:

1) BiLstmTextRelation;

2) twoCNNTextRelation;

3) BiLstmTextRelationTwoRNN

Performance

(multi-label prediction task: predict the top 5 labels; 3 million training examples; full score: 0.5)

Model     fastText  TextCNN  TextRNN  RCNN   HierAtteNet  Seq2seqAttn  EntityNet  DynamicMemory  Transformer
Score     0.362     0.405    0.358    0.395  0.398        0.322        0.400      0.392          0.322
Training  10m       2h       10h      2h     2h           3h           3h         5h             7h

The BERT model achieves 0.368 on the validation set after the first 9 epochs.

Ensemble of TextCNN,EntityNet,DynamicMemory: 0.411

Ensemble EntityNet,DynamicMemory: 0.403


Notice:

m stands for minutes; h stands for hours;

HierAtteNet means Hierarchical Attention Network;

Seq2seqAttn means Seq2seq with attention;

DynamicMemory means DynamicMemoryNetwork;

Transformer stands for the model from 'Attention Is All You Need'.

Usage:

1) the model is in xxx_model.py
2) run python xxx_train.py to train the model
3) run python xxx_predict.py to do inference (test).

Each model has a test method under the model class. You can run the test method first to check whether the model works properly.


Environment:

python 2.7+, tensorflow 1.8

(tensorflow 1.1 to 1.13 should also work; most of the models should also work fine in other tensorflow versions, since we use very few features bound to a certain version.)

If you use python3, it will be fine as long as you change the print and try/except statements wherever you meet an error.

The TextCNN model has already been converted to python 3.6.

Sample data: cached file on Baidu or Google Drive: send me an email

To help you run this repository, we have re-generated the training/validation/test data and the vocabulary/labels, and saved them as cache files using h5py. We suggest you download them from the link above.

The cache contains everything you need to run this repository: the data is pre-processed, so you can start training a model in a minute.

It is a zip file of about 1.8G containing 3 million training examples. Although it is quite big after unzipping, thanks to hdf5 it only needs a normal amount of memory (e.g. 8G or less) during training.

We use a jupyter notebook, pre-processing.ipynb, to pre-process the data. You can get a better understanding of this task and data by taking a look at it. You can also generate the data yourself in the way you want by changing a few lines of code in this notebook.

If you want to try a model now, you can download the cached file from above, then go to the folder 'a02_TextCNN' and run

 python  p7_TextCNN_train.py

it will use data from cached files to train the model, and print loss and F1 score periodically.

Old sample data source: if you need some sample data and word embeddings pre-trained with word2vec, you can find them in closed issues, such as issue 3.

You can also find some sample data in the folder "data". It contains two files: 'sample_single_label.txt', with 50k examples with a single label, and 'sample_multiple_label.txt', with 20k examples with multiple labels. Input and labels are separated by " __label__".

If you want to know more about a text classification data set, or a task these models can be used for, one choice is below:

https://biendata.com/competition/zhihu/

Road Map

One way you can use this repository:

step 1: read through this article. You will get a general idea of the various classic models used for text classification.

step 2: pre-process data and/or download the cached file.

  a. take a look at the jupyter notebook ('pre-processing.ipynb'), where you can get familiar with this text classification task and data set. You will also see how we pre-process data and generate the training/validation/test sets. There is a list of things you can try at the end of the notebook.

  b. download the zip file that contains the cached files, so you have all the necessary data and can start to train models.

step 3: run some of the models listed here, and change code and configurations as you like to get good performance. Record the performance, what you did that worked, and what did not.

  for example, you can explore in this sequence:

  1) fastText ---> 2) TextCNN ---> 3) Transformer ---> 4) BERT

Additionally, write your own article about this topic; you can follow the style of a paper. You may need to read some papers on the way; many of them are listed in the Reference section at the end of this article. Or join a machine learning competition and apply what you've learned.

Use Your Own Data:

Replace the data in 'data/sample_multiple_label.txt', and make sure the format is as below:

'word1 word2 word3 __label__l1 __label__l2 __label__l3'

where part1, 'word1 word2 word3', is the input (X), and part2, '__label__l1 __label__l2 __label__l3', represents three labels: [l1,l2,l3]. Between part1 and part2 there should be a single space: ' '.

for example: each line (multiple labels) like:

'w5466 w138990 w1638 w4301 w6 w470 w202 c1834 c1400 c134 c57 c73 c699 c317 c184 __label__5626661657638885119 __label__4921793805334628695 __label__8904735555009151318'

where '5626661657638885119', '4921793805334628695' and '8904735555009151318' are the three labels associated with the input string 'w5466 w138990...c699 c317 c184'.
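
As a rough illustration of the line format above, one line could be split into input tokens and labels like this. It is a sketch only, not the repository's load_data_multilabel(); `parse_line` and `to_multi_hot` are hypothetical helper names.

```python
def parse_line(line, label_prefix="__label__"):
    """Split one line into (input tokens, label strings)."""
    words, labels = [], []
    for token in line.strip().split():
        if token.startswith(label_prefix):
            labels.append(token[len(label_prefix):])
        else:
            words.append(token)
    return words, labels


def to_multi_hot(labels, label2index):
    """Encode a list of labels as a 0/1 vector over the label vocabulary."""
    vector = [0] * len(label2index)
    for label in labels:
        vector[label2index[label]] = 1
    return vector


line = "w5466 w138990 w1638 __label__5626661657638885119 __label__4921793805334628695"
words, labels = parse_line(line)

# Label vocabulary built from this single line, only for the example.
label2index = {label: i for i, label in enumerate(sorted(set(labels)))}
target = to_multi_hot(labels, label2index)
```

In practice the label vocabulary would be built once from the whole training file rather than from a single line.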

Notice:

Some utility functions are in data_util.py; check load_data_multilabel() in data_util to see how input and labels are processed from raw data.

There is a function to load pretrained word embeddings and assign them to the model, where the word embeddings are pretrained with word2vec or fastText.

Pretrained Word Embedding:

If word2vec.load does not work, you can load the pretrained word embedding (especially a Chinese word embedding) with the following lines:

from gensim.models import KeyedVectors

# word2vec_model_path: path to your word2vec file in binary format
word2vec_model = KeyedVectors.load_word2vec_format(word2vec_model_path, binary=True, unicode_errors='ignore')

Or you can set the use-pretrained-word-embedding flag to false to disable loading the word embedding.
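
To illustrate how pretrained vectors might be assigned to a model's vocabulary, here is a minimal sketch. It is not the repository's function: a plain dict stands in for the loaded word2vec model, and `build_embedding_matrix`, the random range and the seed are assumptions for the example.

```python
import random


def build_embedding_matrix(index2word, pretrained, embed_size, seed=0):
    """Return (matrix, found): one row per vocabulary word.

    Words with a pretrained vector of the right size reuse it; all other
    words get a random vector (the range used here is an assumption).
    """
    rng = random.Random(seed)
    bound = 3.0 ** 0.5
    matrix, found = [], 0
    for word in index2word:
        vector = pretrained.get(word)
        if vector is not None and len(vector) == embed_size:
            matrix.append(list(vector))
            found += 1
        else:
            matrix.append([rng.uniform(-bound, bound) for _ in range(embed_size)])
    return matrix, found


# Toy stand-in for vectors loaded from a word2vec file.
pretrained = {"good": [0.1, 0.2, 0.3], "bad": [-0.1, -0.2, -0.3]}
index2word = ["PAD", "good", "bad", "unseen"]

matrix, found = build_embedding_matrix(index2word, pretrained, embed_size=3)
```

The resulting matrix would then be used as the initial value of the model's embedding variable.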

Models Detail:

1.fastText:

Implementation of Bag of Tricks for Efficient Text Classification.

After each word in the sentence is embedded, the word representations are averaged into a text representation, which is in turn fed to a linear classifier. It uses the softmax function to compute the probability distribution over the predefined classes, and cross entropy to compute the loss. A bag-of-words representation does not consider word order, so n-gram features are used to capture some partial information about the local word order. When the number of classes is large, computing the linear classifier is computationally expensive, so the paper uses hierarchical softmax to speed up training. In this implementation we: 1) use bi-grams and/or tri-grams; 2) use NCE loss to speed up the softmax computation (instead of the hierarchical softmax of the original paper).

Result: performance is as good as in the paper, and it is also very fast.

check: p5_fastTextB_model.py
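
The forward pass described above (average the word embeddings, apply a linear classifier, softmax, cross entropy) can be sketched in plain Python. This is an illustration under assumptions, not the code in p5_fastTextB_model.py: it uses a full softmax rather than NCE loss or hierarchical softmax, and all names and toy weights are made up for the example.

```python
import math


def add_bigrams(tokens):
    """Append bi-gram features so some local word order is captured."""
    return tokens + [a + "_" + b for a, b in zip(tokens, tokens[1:])]


def average_embedding(token_ids, embedding):
    """Average the embedding rows of all tokens into one text vector."""
    dim = len(embedding[0])
    total = [0.0] * dim
    for idx in token_ids:
        for d in range(dim):
            total[d] += embedding[idx][d]
    return [t / len(token_ids) for t in total]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def predict_proba(token_ids, embedding, weights, bias):
    """Linear classifier on top of the averaged text representation."""
    text_vec = average_embedding(token_ids, embedding)
    logits = [sum(w * x for w, x in zip(row, text_vec)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def cross_entropy(probs, true_class):
    """Loss for a single example with one true class."""
    return -math.log(probs[true_class])


embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocabulary of 3, dim 2
weights = [[1.0, 0.0], [0.0, 1.0]]                # 2 classes
bias = [0.0, 0.0]

probs = predict_proba([0, 2], embedding, weights, bias)
loss = cross_entropy(probs, true_class=0)
```

With a large label set, the full softmax in this sketch is the expensive step that NCE loss or hierarchical softmax is meant to avoid.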


2.TextCNN:

Implementation of Convolutional Neural Networks for Sentence Classification

Structure:embedding--->conv--->max pooling--->fully connected layer-------->softmax

Check: p7_TextCNN_model.py

To get very good results with TextCNN, you should also read this paper carefully: A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification. It gives some insight into the things that can affect performance, although you will need to change some settings according to your specific task.
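
The structure embedding ---> conv ---> max pooling ---> fully connected layer ---> softmax can be sketched in plain Python as below. It is a toy illustration, not the code in p7_TextCNN_model.py: the ReLU after the convolution, the helper names and the hand-picked weights are assumptions, and the input is assumed to be at least as long as the widest filter.

```python
import math


def conv_max_pool(embedded, filt, bias=0.0):
    """Slide one filter (filter_size x embed_dim) over the sentence,
    apply ReLU, and keep the maximum activation over all positions."""
    size = len(filt)
    best = 0.0  # ReLU outputs are never below zero
    for start in range(len(embedded) - size + 1):
        act = bias
        for i in range(size):
            act += sum(w * x for w, x in zip(filt[i], embedded[start + i]))
        best = max(best, max(act, 0.0))
    return best


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def text_cnn_forward(token_ids, embedding, filters, fc_weights, fc_bias):
    """embedding -> conv + max pooling (one feature per filter) -> fc -> softmax."""
    embedded = [embedding[i] for i in token_ids]
    features = [conv_max_pool(embedded, f) for f in filters]
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(fc_weights, fc_bias)]
    return features, softmax(logits)


embedding = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # id 0 is padding
filters = [
    [[1.0, 0.0], [0.0, 1.0]],  # width 2: responds to token 1 followed by token 2
    [[0.0, 1.0]],              # width 1: responds to token 2
]
fc_weights = [[1.0, 0.0], [0.0, 1.0]]  # 2 classes
fc_bias = [0.0, 0.0]

features, probs = text_cnn_forward([1, 2, 0], embedding, filters, fc_weights, fc_bias)
```

Using several filter widths side by side, as in the two toy filters here, is what lets the model pick up n-gram patterns of different lengths.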

The Convolutional Neural Network is the main building block for solving computer vision problems. Now we will show how CNN ca
