github.com/shibing624/pycorrector @1.1.2 sqlite

493 symbols · 1,960 edges · 121 files · 191 documented (39%)

README

pycorrector: useful python text correction toolkit

pycorrector: Chinese Text Error Correction Toolkit.

pycorrector uses a language model to detect errors, then uses pinyin and character-shape features to correct Chinese text errors. It can be applied to errors introduced by Chinese pinyin and stroke input methods.

Features

language model

Kenlm
RNNLM

deep model

rnn_attention
seq2seq_attention
conv_seq2seq
transformer
bert
electra

Install

Automatic: pip install pycorrector
Manual:

git clone https://github.com/shibing624/pycorrector.git
cd pycorrector
python setup.py install

Install Requires

install kenlm

pip install https://github.com/kpu/kenlm/archive/master.zip

install others

pip install -r requirements.txt

Usage

Text Correction

import pycorrector

corrected_sent, detail = pycorrector.correct('少先队员因该为老人让坐')
print(corrected_sent, detail)

output:

少先队员应该为老人让座 [[('因该', '应该', 4, 6)], [('坐', '座', 10, 11)]]
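Each entry in `detail` reads as `(wrong, right, begin_pos, end_pos)`. As a minimal sketch, assuming the nested-list shape shown in the output above, the hypothetical helper `apply_details` (not part of pycorrector) replays those edits on the original sentence. This can be useful for checking that the reported offsets line up with the corrected text.

```python
from itertools import chain


def apply_details(sentence, detail):
    """Replay (wrong, right, begin, end) edits on `sentence`.

    Hypothetical helper, not part of pycorrector. It assumes `detail` has the
    nested shape shown above: a list of lists of 4-tuples.
    """
    # Apply edits from right to left so earlier offsets stay valid.
    edits = sorted(chain.from_iterable(detail), key=lambda e: e[2], reverse=True)
    chars = list(sentence)
    for wrong, right, begin, end in edits:
        # Guard against offsets that do not match the original text.
        if sentence[begin:end] != wrong:
            raise ValueError(f"span {begin}:{end} is not {wrong!r}")
        chars[begin:end] = list(right)
    return "".join(chars)


detail = [[('因该', '应该', 4, 6)], [('坐', '座', 10, 11)]]
fixed = apply_details('少先队员因该为老人让坐', detail)
print(fixed)
```

Applying edits from the end of the sentence backwards keeps the remaining offsets correct even when a replacement changes the length of the text.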

The language model is loaded from ~/.pycorrector/datasets/zh_giga.no_cna_cmn.prune01244.klm. If it is not downloaded automatically, download the file (2.8G) manually.

Error Detection

import pycorrector

idx_errors = pycorrector.detect('少先队员因该为老人让坐')
print(idx_errors)

output:

[['因该', 4, 6, 'word'], ['坐', 10, 11, 'char']]

Returns a list in which each item is [error_word, begin_pos, end_pos, error_type]. Position indexes start at 0.
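The positions can be used to mark the detected spans in the original sentence. The sketch below assumes, based on the sample output above, that `end_pos` is exclusive; `mark_errors` is a hypothetical helper and not part of pycorrector.

```python
def mark_errors(sentence, idx_errors):
    """Wrap each detected span in [[...]] so errors stand out in plain text.

    Hypothetical helper, not part of pycorrector. Assumes each item is
    [error_word, begin_pos, end_pos, error_type] with 0-based positions and
    an exclusive end_pos, as the sample output suggests.
    """
    out, cursor = [], 0
    for _word, begin, end, err_type in sorted(idx_errors, key=lambda e: e[1]):
        out.append(sentence[cursor:begin])                    # untouched text
        out.append(f"[[{sentence[begin:end]}|{err_type}]]")   # marked span
        cursor = end
    out.append(sentence[cursor:])
    return "".join(out)


idx_errors = [['因该', 4, 6, 'word'], ['坐', 10, 11, 'char']]
marked = mark_errors('少先队员因该为老人让坐', idx_errors)
print(marked)
```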

English Spelling Error Correction

import pycorrector

sent_lst = ['what', 'hapenning', 'how', 'to', 'speling', 'it', 'you', 'can', 'gorrect', 'it']
for i in sent_lst:
    print(i, '=>', pycorrector.en_correct(i))

output:

what => what
hapenning => happening
how => how
to => to
speling => spelling
it => it
you => you
can => can
gorrect => correct
it => it

Command Line Usage

Command line

python -m pycorrector -h
usage: __main__.py [-h] -o OUTPUT [-n] [-d] input

@description:

positional arguments:
  input                 the input file path, file encode need utf-8.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        the output file path.
  -n, --no_char         disable char detect mode.
  -d, --detail          print detail info

Example:

python -m pycorrector input.txt -o out.txt -n -d

Input file: input.txt; output file: out.txt
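For readers who want to wrap the same interface in their own script, the help text above can be reproduced with `argparse`. This is only a sketch that mirrors the documented options; it is not the project's actual `__main__.py`, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    # Sketch only: mirrors the documented help text, not pycorrector's source.
    parser = argparse.ArgumentParser(prog="pycorrector")
    parser.add_argument("input", help="the input file path (UTF-8 encoded)")
    parser.add_argument("-o", "--output", required=True, help="the output file path")
    parser.add_argument("-n", "--no_char", action="store_true",
                        help="disable char detect mode")
    parser.add_argument("-d", "--detail", action="store_true",
                        help="print detail info")
    return parser


# Same arguments as the example command above.
args = build_parser().parse_args(["input.txt", "-o", "out.txt", "-n", "-d"])
print(args.input, args.output, args.no_char, args.detail)
```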

Future work

P(c), the language model. We could create a better language model by collecting more data, and perhaps by using a little English morphology (such as adding "ility" or "able" to the end of a word).
P(w|c), the error model. So far, the error model has been trivial: the smaller the edit distance, the smaller the error. Clearly we could use a better model of the cost of edits. We could get a corpus of spelling errors and count how likely it is to make each insertion, deletion, or alteration, given the surrounding characters.
It turns out that in many cases it is difficult to make a decision based only on a single word. This is most obvious when there is a word that appears in the dictionary, but the test set says it should be corrected to another word anyway: correction('where') => 'where' (123); expected 'were' (452). We can't possibly know that correction('where') should be 'were' in at least one case but should remain 'where' in other cases. But if the query had been correction('They where going'), then it seems likely that "where" should be corrected to "were".
Finally, we could improve the implementation by making it much faster, without changing the results. We could re-implement in a compiled language rather than an interpreted one. We could cache the results of computations so that we don't have to repeat them multiple times. One word of advice: before attempting any speed optimizations, profile carefully to see where the time is actually going.
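The ideas in the list above can be illustrated with a small noisy-channel sketch: P(c) comes from word counts, the error model only considers candidates one edit away, a bigram count adds context, and `functools.lru_cache` avoids repeating work. The counts in `WORDS` and `BIGRAMS` are invented for illustration, and this is not how `pycorrector.en_correct` is implemented.

```python
from collections import Counter
from functools import lru_cache

# Toy counts invented for this sketch; a real model would be trained on a corpus.
WORDS = Counter({"they": 6, "were": 9, "where": 4, "going": 3,
                 "spelling": 5, "correct": 4})
BIGRAMS = Counter({("they", "were"): 5})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    return {w for w in words if w in WORDS}


@lru_cache(maxsize=None)  # cache results so repeated words are not recomputed
def correction(word):
    """Single-word correction: prefer the word itself, then one-edit candidates."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORDS[w])


def correction_in_context(prev, word):
    """Rank by bigram count first, then keep the original word, then P(c)."""
    candidates = (known([word]) | known(edits1(word))) or {word}
    return max(candidates, key=lambda w: (BIGRAMS[(prev, w)], w == word, WORDS[w]))


print(correction("speling"))                   # unknown word, one insert away
print(correction("where"))                     # known word stays unchanged
print(correction_in_context("they", "where"))  # context prefers another word
```

With these toy counts, the single-word model leaves "where" alone because it is in the dictionary, while the context-aware variant can pick "were" after "they", which is the case the third item describes.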

Cite

@software{pycorrector,
  author = {Xu Ming},
  title = {{pycorrector: Text Error Correction Tool}},
  year = {2020},
  url = {https://github.com/shibing624/pycorrector},
}

License

Apache License 2.0

Core symbols most depended-on inside this repo

correct · called by 71 · pycorrector/corrector.py
register_conv_template · called by 28 · pycorrector/gpt/gpt_utils.py
eval_model_batch · called by 24 · pycorrector/utils/evaluate_utils.py
update · called by 17 · pycorrector/utils/get_file.py
is_chinese_char · called by 14 · pycorrector/utils/text_utils.py
detect · called by 13 · pycorrector/detector.py
check_detector_initialized · called by 12 · pycorrector/detector.py
correct_batch · called by 12 · pycorrector/corrector.py

Shape

Method 260

Function 160

Class 70

Route 3

Languages

Python 100%

Modules by API surface

pycorrector/deepcontext/deepcontext_utils.py · 31 symbols
pycorrector/utils/langconv.py · 28 symbols
pycorrector/detector.py · 23 symbols
pycorrector/proper_corrector.py · 21 symbols
pycorrector/seq2seq/conv_seq2seq_model.py · 18 symbols
pycorrector/macbert/base_model.py · 18 symbols
pycorrector/gpt/gpt_utils.py · 17 symbols
pycorrector/corrector.py · 17 symbols
pycorrector/utils/text_utils.py · 16 symbols
pycorrector/seq2seq/conv_seq2seq_utils.py · 15 symbols
pycorrector/macbert/lr_scheduler.py · 12 symbols
pycorrector/en_spell_corrector.py · 12 symbols

Dependencies from manifests, versioned

fairseq 0.12.2 · 1×
modelscope 1.16.0 · 1×
pytorch-lightning 1.1.2 · 1×
scikit-learn 0.19.1 · 1×
torch 1.3.1 · 1×

For agents

$ claude mcp add pycorrector \
  -- python -m otcore.mcp_server <graph>
