
github.com/shibing624/text2vec @1.2.9 sqlite


356 symbols 1,405 edges 63 files 117 documented · 33%

README


Text2vec: Text to Vector

Text2vec (Text to Vector) computes sentence embeddings. It vectorizes text, representing words, sentences, and paragraphs as vectors.

text2vec implements Word2Vec, RankBM25, BERT, Sentence-BERT, CoSENT, and other text representation and text similarity models, and compares their performance on the text semantic matching (similarity calculation) task.

Guide - Feature - Evaluation - Install - Usage - Contact - Reference

Feature

Text vector representation models

Word2Vec: uses the lightweight release of the large-scale, high-quality Chinese word vectors open-sourced by Tencent AI Lab (8 million Chinese words; file name: light_Tencent_AILab_ChineseEmbedding.bin, password: tawe) for word vector lookup. This project builds the word2vec representation of a sentence as the average of its word vectors.
SBERT (Sentence-BERT): a sentence vector representation model that balances performance and efficiency. Training adds a supervised classification layer on top of the encoder; at prediction time, text matching uses the cosine similarity of the sentence vectors directly. This project reproduces the training and prediction of the Sentence-BERT model in PyTorch.
CoSENT (Cosine Sentence): the CoSENT model proposes a ranking loss function that brings the training process closer to prediction. Its convergence speed and results are better than those of Sentence-BERT. This project implements the training and prediction of the CoSENT model in PyTorch.
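
The ranking idea behind CoSENT can be illustrated with a small NumPy sketch. It follows the loss as it is commonly described: for every two sentence pairs whose gold labels say one should score higher than the other, a penalty is added when the predicted cosine similarities are in the wrong order. The function name `cosent_loss` and the scale value 20 are assumptions for illustration, and this is not the project's PyTorch training code.

```python
import numpy as np


def cosent_loss(cos_scores, labels, scale=20.0):
    """Ranking loss over a batch of sentence pairs (illustrative sketch).

    cos_scores[k] is the predicted cosine similarity of pair k and
    labels[k] is its gold similarity. The scale of 20 is an assumed value.
    """
    cos_scores = np.asarray(cos_scores, dtype=float) * scale
    labels = np.asarray(labels, dtype=float)
    # diff[i, j] = scaled score of pair j minus scaled score of pair i
    diff = cos_scores[None, :] - cos_scores[:, None]
    # Keep only the orderings the labels require: label_i > label_j,
    # so a positive diff means pair j was wrongly scored above pair i.
    mask = labels[:, None] > labels[None, :]
    terms = diff[mask]
    # log(1 + sum(exp(terms))), computed in a numerically stable way
    all_terms = np.concatenate(([0.0], terms))
    m = all_terms.max()
    return float(m + np.log(np.exp(all_terms - m).sum()))


# Scores that agree with the labels give a loss near zero;
# scores in the wrong order give a large loss.
print(cosent_loss([0.9, 0.1], [1, 0]))
print(cosent_loss([0.1, 0.9], [1, 0]))
```

Because the loss depends only on the ordering of the cosine scores, training optimizes the same quantity that is used at prediction time, which is the motivation stated above.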

Evaluation

Text matching

Evaluation results on the English matching dataset:

Arch	Backbone	Model Name	English-STS-B
GloVe	glove	Avg_word_embeddings_glove_6B_300d	61.77
BERT	bert-base-uncased	BERT-base-cls	20.29
BERT	bert-base-uncased	BERT-base-first_last_avg	59.04
BERT	bert-base-uncased	BERT-base-first_last_avg-whiten(NLI)	63.65
SBERT	sentence-transformers/bert-base-nli-mean-tokens	SBERT-base-nli-cls	73.65
SBERT	sentence-transformers/bert-base-nli-mean-tokens	SBERT-base-nli-first_last_avg	77.96
SBERT	xlm-roberta-base	paraphrase-multilingual-MiniLM-L12-v2	84.42
CoSENT	bert-base-uncased	CoSENT-base-first_last_avg	69.93
CoSENT	sentence-transformers/bert-base-nli-mean-tokens	CoSENT-base-nli-first_last_avg	79.68

Evaluation results on the Chinese matching datasets:

Arch	Backbone	Model Name	ATEC	BQ	LCQMC	PAWSX	STS-B	Avg	QPS
CoSENT	hfl/chinese-macbert-base	CoSENT-macbert-base	50.39	72.93	79.17	60.86	80.51	68.77	3008
CoSENT	Langboat/mengzi-bert-base	CoSENT-mengzi-base	50.52	72.27	78.69	12.89	80.15	58.90	2502
CoSENT	bert-base-chinese	CoSENT-bert-base	49.74	72.38	78.69	60.00	80.14	68.19	2653
SBERT	bert-base-chinese	SBERT-bert-base	46.36	70.36	78.72	46.86	66.41	61.74	3365
SBERT	hfl/chinese-macbert-base	SBERT-macbert-base	47.28	68.63	79.42	55.59	64.82	63.15	2948
CoSENT	hfl/chinese-roberta-wwm-ext	CoSENT-roberta-ext	50.81	71.45	79.31	61.56	81.13	68.85	-
SBERT	hfl/chinese-roberta-wwm-ext	SBERT-roberta-ext	48.29	69.99	79.22	44.10	72.42	62.80	-

Chinese matching evaluation results for the models released by this project:

Arch	Backbone	Model Name	ATEC	BQ	LCQMC	PAWSX	STS-B	Avg	QPS
Word2Vec	word2vec	w2v-light-tencent-chinese	20.00	31.49	59.46	2.57	55.78	33.86	23769
SBERT	xlm-roberta-base	paraphrase-multilingual-MiniLM-L12-v2	18.42	38.52	63.96	10.14	78.90	41.99	3138
CoSENT	hfl/chinese-macbert-base	shibing624/text2vec-base-chinese	31.93	42.67	70.16	17.21	79.30	48.25	3008
CoSENT	hfl/chinese-lert-large	GanymedeNil/text2vec-large-chinese	32.61	44.59	69.30	14.51	79.44	48.08	1046

Evaluation conclusion:
- All result values are Spearman correlation coefficients.
- Each model is trained only on the train split of the dataset and evaluated on the test split, without using external data.
- The shibing624/text2vec-base-chinese model is trained with the CoSENT method on the Chinese STS-B data, based on MacBERT, and reaches SOTA on the Chinese STS-B test set. Run examples/training_sup_text_matching_model.py to train the model. The model file has been uploaded to the HuggingFace model hub as shibing624/text2vec-base-chinese and is recommended for Chinese semantic matching tasks.
- The SBERT-macbert-base model is trained with the SBERT method; run examples/training_sup_text_matching_model.py to train it.
- The full name of the paraphrase-multilingual-MiniLM-L12-v2 model is sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. It is trained with SBERT, is a multilingual version of the paraphrase-MiniLM-L12-v2 model, and supports Chinese, English, and other languages.
- w2v-light-tencent-chinese is the Word2Vec model built on the Tencent word vectors. It loads and runs on CPU and is suitable for Chinese literal matching tasks and cold-start situations where data is scarce.
- Any pre-trained model can be loaded through transformers, for example the MacBERT model: --model_name hfl/chinese-macbert-base, or the RoBERTa model: --model_name uer/roberta-medium-wwm-chinese-cluecorpussmall
- The Chinese matching datasets can be downloaded from the link in the dataset section.
- The Chinese matching experiments show that the best pooling is first_last_avg, that is, EncoderType.FIRST_LAST_AVG of SentenceModel, although its prediction quality differs little from that of EncoderType.MEAN.
- To reproduce the Chinese matching evaluation results, download the Chinese matching datasets to examples/data and run tests/test_model_spearman.py.
- QPS was measured on a Tesla V100 GPU with 32 GB of memory.
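
Since every score in the tables is a Spearman coefficient, a minimal NumPy sketch of that metric may help when reading them. It ranks the predicted similarities and the gold labels (tied values share their average rank) and takes the Pearson correlation of the ranks. The helper names are hypothetical, this is not the code in tests/test_model_spearman.py, and the table values are presumably the coefficient multiplied by 100.

```python
import numpy as np


def rank_with_ties(values):
    """Return 1-based ranks; tied values receive their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values equal to values[order[i]]
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = rank_with_ties(x), rank_with_ties(y)
    rx = rx - rx.mean()
    ry = ry - ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))


# Predicted cosine similarities versus gold labels for four sentence pairs
predicted = [0.82, 0.15, 0.60, 0.33]
gold = [1, 0, 1, 0]
print(spearman(predicted, gold))
```

The metric only looks at ordering, so a model is rewarded for ranking similar pairs above dissimilar ones regardless of the absolute similarity values it outputs.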

Demo

Official Demo: https://www.mulanai.com/product/short_text_sim/

HuggingFace Demo: https://huggingface.co/spaces/shibing624/text2vec

Run examples/gradio_demo.py to see the demo:

python examples/gradio_demo.py

Install

pip install torch # conda install pytorch
pip install -U text2vec

or install from source:

pip install torch # conda install pytorch
git clone https://github.com/shibing624/text2vec.git
cd text2vec
pip install -r requirements.txt
pip install --no-deps .

Usage

Text vector representation

Compute text vectors based on pretrained model:

>>> from text2vec import SentenceModel
>>> m = SentenceModel()
>>> m.encode("如何更换花呗绑定银行卡")
Embedding shape: (768,)
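
Once sentences are encoded, they are typically compared by cosine similarity. The sketch below shows that computation in plain NumPy on random stand-in vectors of the same width as the embedding above; it does not call text2vec, and the function name `cosine_similarity_matrix` is an assumption for illustration.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.atleast_2d(np.asarray(a, dtype=float))
    b = np.atleast_2d(np.asarray(b, dtype=float))
    # Normalize each row to unit length, guarding against zero vectors
    a_unit = a / np.clip(np.linalg.norm(a, axis=1, keepdims=True), 1e-12, None)
    b_unit = b / np.clip(np.linalg.norm(b, axis=1, keepdims=True), 1e-12, None)
    return a_unit @ b_unit.T


# Stand-in embeddings: three random 768-dimensional vectors,
# not real model output
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 768))
scores = cosine_similarity_matrix(embeddings, embeddings)
print(scores.shape)
```

With real embeddings, `scores[i, j]` would be the similarity between sentence i and sentence j, and each sentence has similarity 1 with itself.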

example: examples/computing_embeddings_demo.py

import sys

sys.path.append('..')
from text2vec import SentenceModel
from text2vec import Word2Vec


def compute_emb(model):
    # Embed a list of sentences
    sentences = [
        '卡',
        '银行卡',
        '如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡',
        'This framework generates embeddings for each input sentence',
        'Sentences are passed as a list of string.',
        'The quick brown fox jumps over the lazy dog.'
    ]
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)

    # The result is a numpy array with one embedding row per sentence
    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding shape:", embedding.shape)
        print("Embedding head:", embedding[:10])
        print()


if __name__ == "__main__":
    # Chinese sentence embedding model (CoSENT), recommended for Chinese semantic matching; supports further fine-tuning
    t2v_model = SentenceModel("shibing624/text2vec-base-chinese")
    compute_emb(t2v_model)

    # Multilingual sentence embedding model (Sentence-BERT), recommended for English semantic matching; supports further fine-tuning
    sbert_model = SentenceModel("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
    compute_emb(sbert_model)

    # Chinese word vector model (word2vec), suited to Chinese literal matching and cold-start scenarios
    w2v_model = Word2Vec("w2v-light-tencent-chinese")
    compute_emb(w2v_model)

output:

<class 'numpy.ndarray'> (7, 768)
Sentence: 卡
Embedding shape: (768,)

Sentence: 银行卡
Embedding shape: (768,)
 ...

The returned embeddings are a numpy.ndarray of shape (sentences_size, model_embedding_size). Any one of the three models can be used; the first is recommended.
The shibing624/text2vec-base-chinese model is trained with the CoSENT method on the Chinese STS-B dataset and has been uploaded to the HuggingFace model hub as shibing624/text2vec-base-chinese. It is the default model of text2vec.SentenceModel and can be called as in the example above or through the transformers library as shown below. The model is downloaded automatically to the local path ~/.cache/huggingface/transformers
The sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model is a multilingual Sentence-BERT sentence vector model, suited to paraphrase recognition and text matching. It can be called through text2vec.SentenceModel or the sentence-transformers library.
w2v-light-tencent-chinese is a Word2Vec model loaded with gensim. It uses the Tencent word vectors (Tencent_AILab_ChineseEmbedding.tar.gz) to look up the vector of each word and takes the average of the word vectors as the sentence vector. The model is downloaded automatically to the local path ~/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin
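
The word-vector averaging described for w2v-light-tencent-chinese can be sketched as follows. The three-dimensional toy vocabulary, the function name `average_word_vectors`, and the choice to skip unknown words are all assumptions for illustration; the real model looks words up in the Tencent vectors, and word segmentation is outside this sketch.

```python
import numpy as np

# Hypothetical toy vocabulary standing in for the Tencent word vectors
toy_vectors = {
    "银行卡": np.array([1.0, 0.0, 2.0]),
    "更换": np.array([0.0, 1.0, 0.0]),
    "花呗": np.array([2.0, 2.0, 1.0]),
}


def average_word_vectors(tokens, vectors, dim):
    """Sentence vector = mean of the vectors of the known tokens."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        # No known word: fall back to a zero vector (an assumed policy)
        return np.zeros(dim)
    return np.mean(found, axis=0)


# Tokens are assumed to be already segmented; "未知词" is not in the vocabulary
sentence_vec = average_word_vectors(["更换", "花呗", "银行卡", "未知词"], toy_vectors, dim=3)
print(sentence_vec)
```

Averaging ignores word order, which is consistent with the note that this model suits literal matching rather than deeper semantic matching.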

Usage (HuggingFace Transformers)

Without text2vec, you can use the model like this:

First, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

example: examples/use_origin_transformers_demo.py

import os
import torch
from transformers import AutoTokenizer, AutoModel

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = AutoModel.from_pretrained('shibing624/text2vec-base-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
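
The evaluation notes name first_last_avg as the best pooling for Chinese matching. One common reading of that name is to mean-pool the token vectors of the first and the last encoder layer and then average the two results. The NumPy sketch below shows this on stand-in arrays; the function names are hypothetical, and whether SentenceModel applies the attention mask or which layer it treats as "first" is not confirmed here.

```python
import numpy as np


def masked_mean(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(float)      # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)      # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # (batch, 1)
    return summed / counts


def first_last_avg(first_layer, last_layer, attention_mask):
    """Average of the mean-pooled first layer and mean-pooled last layer."""
    first = masked_mean(first_layer, attention_mask)
    last = masked_mean(last_layer, attention_mask)
    return (first + last) / 2.0


# Stand-in hidden states: 1 sentence, 3 token positions, hidden size 2.
# The third position is padding and carries a value that must be ignored.
first_layer = np.ones((1, 3, 2))
last_layer = 3.0 * np.ones((1, 3, 2))
first_layer[0, 2] = 100.0
attention_mask = np.array([[1, 1, 0]])
pooled = first_last_avg(first_layer, last_layer, attention_mask)
print(pooled)
```

With the transformers library, the per-layer hidden states needed for this kind of pooling are available when the model is called with `output_hidden_states=True`.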

Usage (sentence-transformers)

sentence-transformers is a popular library to compute dense vector representations for sentences.

Install sentence-transformers:

pip install -U sentence-transformers

Then load mo

Core symbols most depended-on inside this repo

text2vec/utils/stats_util.py

cos_sim

called by 16

text2vec/similarity.py

get_sentence_embeddings

called by 8

text2vec/sentence_model.py

encode

called by 7

text2vec/sentence_model.py

load_text_matching_test_data

called by 7

text2vec/text_matching_dataset.py

load_jsonl

called by 7

text2vec/utils/io_util.py

save_model

called by 6

text2vec/sentence_model.py

Shape

Method 173

Function 131

Class 50

Route 2

Languages

Python 100%

Modules by API surface

examples/data/build_zh_nli_dataset.py · 25 symbols

text2vec/text_matching_dataset.py · 23 symbols

text2vec/utils/distance.py · 21 symbols

text2vec/bertmatching_dataset.py · 20 symbols

text2vec/utils/rank_bm25.py · 19 symbols

tests/test_model_spearman.py · 18 symbols

text2vec/sentence_model.py · 17 symbols

text2vec/utils/get_file.py · 14 symbols

text2vec/cosent_dataset.py · 13 symbols

tests/test_qps.py · 13 symbols

text2vec/bertmatching_model.py · 12 symbols

text2vec/similarity.py · 9 symbols

Dependencies from manifests, versioned

jieba 0.39 · 1×

transformers 4.6.0 · 1×

For agents

$ claude mcp add text2vec \
  -- python -m otcore.mcp_server <graph>
