
github.com/bazingagin/npc_gzip @v0.1.1 sqlite


152 symbols · 501 edges · 27 files · 70 documented (46%)

README

Code for Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

This paper was accepted to Findings of ACL 2023.

Getting Started

This codebase is available on pypi.org via:

pip install npc-gzip

Usage

See the examples directory for example usage.
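
For orientation, the idea named in the paper title (classification with a compressor and no trained parameters) can be sketched with the standard library alone. This is a minimal illustration, not the npc_gzip API: the helper names `compressed_len`, `ncd`, and `classify`, the toy training set, and the choice to join the two texts with a space are all assumptions made for this sketch. It uses the normalized compression distance with a k-nearest-neighbour vote.

```python
import gzip
from collections import Counter


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + " " + y)  # texts joined with a space (assumption)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(text, train, k=1):
    """Majority label among the k training texts closest to `text`."""
    nearest = sorted(train, key=lambda pair: ncd(text, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data, for illustration only.
train = [
    ("the team won the football match with a late goal in the second half", "sports"),
    ("the striker scored twice and the coach praised the defence", "sports"),
    ("the central bank raised interest rates to fight rising inflation", "finance"),
    ("shares fell sharply after the company reported weak quarterly earnings", "finance"),
]
print(classify("the team won the football match with a late goal", train))
```

Texts that share long substrings compress well together, so their distance is small. The package's own classes and signatures are shown in the examples directory and should be preferred over this sketch.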

Testing

This package uses poetry to manage its dependencies and pytest to run its tests. To run the tests:

poetry shell
poetry install
pytest

Original Codebase

Requirements

See requirements.txt.

Install requirements in a clean environment:

conda create -n npc python=3.7
conda activate npc
pip install -r requirements.txt

Run

python main_text.py

By default, this uses only 100 training and test samples per class as a quick demo. Change these numbers with --num_train and --num_test.

--compressor <gzip, lzma, bz2>
--dataset <AG_NEWS, SogouNews, DBpedia, YahooAnswers, 20News, Ohsumed_single, R8, R52, kinnews, kirnews, swahili, filipino> [Note that for small datasets like kinnews, the default 100-shot setting is too large; set --num_test and --num_train to smaller values.]
--num_train <INT>
--num_test <INT>
--data_dir <DIR> [This needs to be specified for R8, R52 and Ohsumed.]
--all_test [This will use the whole test dataset.]
--all_train
--record [This will record the distance matrix and save it for future use. It's helpful when you want to run on the whole dataset.]
--test_idx_start <INT>
--test_idx_end <INT> [These two arguments restrict the run to a certain range of the test set. They are also helpful for calculating the distance matrix on the whole dataset.]
--para [This will use multiprocessing to accelerate.]
--output_dir <DIR> [The output directory to save information of tested indices or distance matrix.]
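
As a rough illustration of how --test_idx_start and --test_idx_end can split a full-dataset run, the sketch below only builds command lines and does not execute them. The helper `chunked_commands` is hypothetical, the particular combination of flags is an assumption based on the descriptions above, and the README does not say whether --test_idx_end is inclusive, so check main_text.py before relying on the boundaries. 7600 is the size of the AG News test split.

```python
import shlex


def chunked_commands(dataset, total_test, chunk_size, output_dir, compressor="gzip"):
    """Build one main_text.py command line per slice of the test set."""
    commands = []
    for start in range(0, total_test, chunk_size):
        # End index treated as exclusive here (assumption, not stated in the README).
        end = min(start + chunk_size, total_test)
        args = [
            "python", "main_text.py",
            "--dataset", dataset,
            "--compressor", compressor,
            "--all_train",
            "--record",
            "--test_idx_start", str(start),
            "--test_idx_end", str(end),
            "--output_dir", output_dir,
        ]
        # Quote each argument so the line can be pasted into a shell.
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


for line in chunked_commands("AG_NEWS", total_test=7600, chunk_size=2000, output_dir="distances"):
    print(line)
```

Each printed line can then be run separately, for example on different machines, with the recorded distances collected under the same output directory.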

Calculate Accuracy (Optional)

To calculate accuracy from a recorded distance file <DISTANCE DIR>, run

python main_text.py --record --score --distance_fn <DISTANCE DIR>

Otherwise, accuracy is calculated automatically by the command in the previous section.
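
To make the scoring step concrete, here is a minimal sketch of turning a distance matrix into an accuracy figure by k-nearest-neighbour voting. It is not the repository's scoring code: the function `knn_accuracy` is hypothetical, it works on an in-memory list of lists, and it assumes nothing about the on-disk format of the recorded distance file.

```python
from collections import Counter


def knn_accuracy(distances, train_labels, test_labels, k=1):
    """Accuracy of k-nearest-neighbour voting over a precomputed distance matrix.

    distances[i][j] is the distance from test sample i to training sample j.
    """
    correct = 0
    for row, truth in zip(distances, test_labels):
        # Indices of the k training samples with the smallest distance.
        nearest = sorted(range(len(row)), key=row.__getitem__)[:k]
        votes = Counter(train_labels[j] for j in nearest)
        if votes.most_common(1)[0][0] == truth:
            correct += 1
    return correct / len(test_labels)


# Hypothetical 2 x 3 matrix: two test samples, three training samples.
distances = [
    [0.2, 0.9, 0.8],
    [0.7, 0.3, 0.6],
]
print(knn_accuracy(distances, ["a", "b", "b"], ["a", "a"]))
```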

Use Custom Dataset

To use your own dataset, pass custom to --dataset, pass the directory that contains train.txt and test.txt to --data_dir, and pass the number of classes to --class_num.

Both train.txt and test.txt are expected to have the format {label}\t{text} per line.

You can change the delimiter to suit your dataset by changing delimiter in load_custom_dataset() in data.py.
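
The sketch below writes train.txt and test.txt in the {label}\t{text} format described above. The helper `write_split` and the sample rows are hypothetical, and using integer labels is an assumption, since the README does not state which label values the loader accepts.

```python
import os
import tempfile


def write_split(path, rows, delimiter="\t"):
    """Write (label, text) pairs as one `{label}<delimiter>{text}` line each."""
    with open(path, "w", encoding="utf-8") as fh:
        for label, text in rows:
            # Collapse all whitespace runs to single spaces so the text
            # cannot contain a tab or a newline.
            clean = " ".join(str(text).split())
            fh.write(f"{label}{delimiter}{clean}\n")


# Temporary directory used as the data directory in this illustration.
data_dir = tempfile.mkdtemp()
write_split(os.path.join(data_dir, "train.txt"), [(0, "first training\ttext"), (1, "second one")])
write_split(os.path.join(data_dir, "test.txt"), [(1, "a test\nsentence")])
print(sorted(os.listdir(data_dir)))
```

The resulting directory is what --data_dir would point to, together with --dataset custom and --class_num set to the number of distinct labels (2 in this toy case).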

Core symbols (most depended-on inside this repo)

process · called by 16 · original_codebase/data.py

get_compressed_len · called by 11 · original_codebase/compressors.py

_compress · called by 9 · npc_gzip/compressors/base.py

predict · called by 8 · npc_gzip/knn_classifier.py

get_compressed_length · called by 7 · npc_gzip/compressors/base.py

read_torch_text_labels · called by 5 · original_codebase/data.py

Shape

Method 82

Function 45

Class 25

Languages

Python 100%

Modules by API surface

original_codebase/data.py · 20 symbols

npc_gzip/exceptions.py · 18 symbols

original_codebase/utils.py · 12 symbols

tests/test_knn_classifier.py · 11 symbols

tests/test_distance.py · 11 symbols

original_codebase/experiments.py · 10 symbols

npc_gzip/distance.py · 10 symbols

tests/test_base_compressor.py · 6 symbols

npc_gzip/knn_classifier.py · 6 symbols

npc_gzip/compressors/base.py · 6 symbols

tests/test_lzma_compressor.py · 4 symbols

tests/test_gzip_compressor.py · 4 symbols

Dependencies (from manifests, versioned)

Unidecode 1.3.6 · 1×

datasets 2.13.1 · 1×

numpy 1.21.6 · 1×

pathos 0.3.0 · 1×

scikit-learn 1.0.2 · 1×

scipy 1.7.3 · 1×

torch 1.13.1 · 1×

torchdata 0.5.1 · 1×

torchtext 0.14.1 · 1×

tqdm 4.65.0 · 1×

For agents

$ claude mcp add npc_gzip \
  -- python -m otcore.mcp_server <graph>