
github.com/shibing624/text2vec @1.2.9 sqlite

356 symbols 1,405 edges 63 files 117 documented · 33%
README


Text2vec: Text to Vector

PyPI version Downloads Contributions welcome License Apache 2.0 python_version GitHub issues Wechat Group

Text2vec: Text to Vector, Get Sentence Embeddings. Text vectorization, representing text (including words, sentences, paragraphs) as a vector matrix.

text2vec implements Word2Vec, RankBM25, BERT, Sentence-BERT, CoSENT and other text representation and text similarity calculation models, and compares the effects of each model on the text semantic matching (similarity calculation) task.

Guide - Feature - Evaluation - Install - Usage - Contact - Reference

Feature

Text vector representation models

  • Word2Vec: word-vector retrieval built on the lightweight open-source release of the Tencent AI Lab large-scale, high-quality Chinese word vectors (8 million Chinese words; file name: light_Tencent_AILab_ChineseEmbedding.bin, password: tawe). This project implements the word2vec representation of a sentence as the average of its word vectors.
  • SBERT (Sentence-BERT): a sentence vector representation model that balances performance and efficiency. Training supervises a classification function on top of the encoder, while text-matching prediction compares the sentence vectors directly by cosine similarity. This project reproduces the training and prediction of the Sentence-BERT model in PyTorch.
  • CoSENT (Cosine Sentence): the CoSENT model proposes a ranking loss function that brings the training process closer to prediction. Its convergence speed and results are better than those of Sentence-BERT. This project implements the training and prediction of the CoSENT model in PyTorch.
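The ranking loss mentioned for CoSENT can be illustrated with a small standard-library sketch. It assumes the commonly cited form of the loss, log(1 + Σ exp(λ·(cos_j − cos_i))) summed over all pairs where pair i is labelled more similar than pair j, with λ = 20. Both the formula and the scale are assumptions here and have not been checked against this repository's training code; `cosent_loss` is a hypothetical name.

```python
import math
from itertools import product


def cosent_loss(cos_sims, labels, scale=20.0):
    """Ranking loss over sentence pairs (illustrative sketch, not the repo's code).

    cos_sims[i] is the predicted cosine similarity of sentence pair i and
    labels[i] is its gold similarity. Whenever pair i is labelled more similar
    than pair j, the loss grows if cos_sims[j] exceeds cos_sims[i].
    """
    # One term scale * (cos_j - cos_i) for every ordered (i, j) with labels[i] > labels[j].
    terms = [
        scale * (cos_sims[j] - cos_sims[i])
        for i, j in product(range(len(labels)), repeat=2)
        if labels[i] > labels[j]
    ]
    # log(1 + sum(exp(t))) equals logsumexp over the terms plus an extra 0.0.
    terms = [0.0] + terms
    largest = max(terms)  # subtract the maximum for numerical stability
    return largest + math.log(sum(math.exp(t - largest) for t in terms))


if __name__ == "__main__":
    # The similar pair (label 1) scores higher than the dissimilar pair: tiny loss.
    print(cosent_loss([0.9, 0.1], [1, 0]))
    # The ranking is inverted: large loss.
    print(cosent_loss([0.1, 0.9], [1, 0]))
```

Because the loss only depends on the relative order of the cosine scores, it optimizes the same quantity that is used at prediction time, which is the sense in which training is "closer to prediction".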

Evaluation

Text matching

  • Evaluation results on the English matching dataset:

| Arch | Backbone | Model Name | English-STS-B |
|:--|:--|:--|:-:|
| GloVe | glove | Avg_word_embeddings_glove_6B_300d | 61.77 |
| BERT | bert-base-uncased | BERT-base-cls | 20.29 |
| BERT | bert-base-uncased | BERT-base-first_last_avg | 59.04 |
| BERT | bert-base-uncased | BERT-base-first_last_avg-whiten(NLI) | 63.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-cls | 73.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-first_last_avg | 77.96 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 84.42 |
| CoSENT | bert-base-uncased | CoSENT-base-first_last_avg | 69.93 |
| CoSENT | sentence-transformers/bert-base-nli-mean-tokens | CoSENT-base-nli-first_last_avg | 79.68 |

  • Evaluation results of Chinese matching dataset:

| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
|:--|:--|:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| CoSENT | hfl/chinese-macbert-base | CoSENT-macbert-base | 50.39 | 72.93 | 79.17 | 60.86 | 80.51 | 68.77 | 3008 |
| CoSENT | Langboat/mengzi-bert-base | CoSENT-mengzi-base | 50.52 | 72.27 | 78.69 | 12.89 | 80.15 | 58.90 | 2502 |
| CoSENT | bert-base-chinese | CoSENT-bert-base | 49.74 | 72.38 | 78.69 | 60.00 | 80.14 | 68.19 | 2653 |
| SBERT | bert-base-chinese | SBERT-bert-base | 46.36 | 70.36 | 78.72 | 46.86 | 66.41 | 61.74 | 3365 |
| SBERT | hfl/chinese-macbert-base | SBERT-macbert-base | 47.28 | 68.63 | 79.42 | 55.59 | 64.82 | 63.15 | 2948 |
| CoSENT | hfl/chinese-roberta-wwm-ext | CoSENT-roberta-ext | 50.81 | 71.45 | 79.31 | 61.56 | 81.13 | 68.85 | - |
| SBERT | hfl/chinese-roberta-wwm-ext | SBERT-roberta-ext | 48.29 | 69.99 | 79.22 | 44.10 | 72.42 | 62.80 | - |

  • Chinese matching evaluation results of the release model of this project:

| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
|:--|:--|:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 33.86 | 23769 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 41.99 | 3138 |
| CoSENT | hfl/chinese-macbert-base | shibing624/text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 48.25 | 3008 |
| CoSENT | hfl/chinese-lert-large | GanymedeNil/text2vec-large-chinese | 32.61 | 44.59 | 69.30 | 14.51 | 79.44 | 48.08 | 1046 |

Evaluation conclusion:

  • All result values are Spearman correlation coefficients.
  • Each model is trained only on the train split of a dataset and evaluated on its test split; no external data is used.
  • The shibing624/text2vec-base-chinese model is trained with the CoSENT method, starting from MacBERT, on the Chinese STS-B data, and reaches SOTA on the Chinese STS-B test set. Run examples/training_sup_text_matching_model.py to train the model. The model file has been uploaded to the HuggingFace model hub as shibing624/text2vec-base-chinese and is recommended for Chinese semantic matching tasks.
  • The SBERT-macbert-base model is trained with the SBERT method; run examples/training_sup_text_matching_model.py to train it.
  • paraphrase-multilingual-MiniLM-L12-v2 (full name sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) is trained with SBERT. It is the multilingual version of the paraphrase-MiniLM-L12-v2 model and supports Chinese, English, and other languages.
  • w2v-light-tencent-chinese is the Word2Vec model built on the Tencent word vectors. It is loaded and run on CPU and suits Chinese literal matching tasks and cold-start situations where data is lacking.
  • Each pre-trained model can be loaded through transformers, for example the MacBERT model with --model_name hfl/chinese-macbert-base or the RoBERTa model with --model_name uer/roberta-medium-wwm-chinese-cluecorpussmall.
  • For the Chinese matching datasets, see the download link below (#data set).
  • Experiments on the Chinese matching tasks show that the best pooling is first_last_avg, that is, EncoderType.FIRST_LAST_AVG of SentenceModel; its prediction quality differs little from that of EncoderType.MEAN.
  • The Chinese matching evaluation results are reproducible: download the Chinese matching datasets to examples/data and run tests/test_model_spearman.py.
  • The GPU test environment for QPS is a Tesla V100 with 32 GB of memory.
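Since every score above is a Spearman coefficient, a small worked example may help. The sketch below computes Spearman's rank correlation between predicted similarity scores and gold labels using only the standard library. It is not the project's compute_spearmanr function; the helper names are hypothetical. The tables appear to report the coefficient multiplied by 100, which is an inference from the value range rather than something the text states.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all items tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


def spearman(predicted, gold):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(predicted), average_ranks(gold))


if __name__ == "__main__":
    predicted_scores = [0.9, 0.2, 0.5, 0.7]  # e.g. cosine similarities
    gold_labels = [1, 0, 0, 1]               # e.g. binary match labels
    print(round(spearman(predicted_scores, gold_labels), 4))
```

Only the ordering of the predictions matters for this metric, so a model can score well without its cosine values being calibrated to the label scale.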

Demo

Official Demo: https://www.mulanai.com/product/short_text_sim/

HuggingFace Demo: https://huggingface.co/spaces/shibing624/text2vec

Run examples/gradio_demo.py to see the demo:

python examples/gradio_demo.py

Install

pip install torch # conda install pytorch
pip install -U text2vec

or install from source (clone first, since requirements.txt lives in the repository):

git clone https://github.com/shibing624/text2vec.git
cd text2vec
pip install torch # conda install pytorch
pip install -r requirements.txt
pip install --no-deps .

Usage

Text vector representation

Compute text vectors based on pretrained model:

>>> from text2vec import SentenceModel
>>> m = SentenceModel()
>>> m.encode("如何更换花呗绑定银行卡").shape
(768,)

example: examples/computing_embeddings_demo.py

import sys

sys.path.append('..')
from text2vec import SentenceModel
from text2vec import Word2Vec


def compute_emb(model):
    # Embed a list of sentences
    sentences = [
        '卡',
        '银行卡',
        '如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡',
        'This framework generates embeddings for each input sentence',
        'Sentences are passed as a list of string.',
        'The quick brown fox jumps over the lazy dog.'
    ]
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)

    # The result is a list of sentence embeddings as numpy arrays
    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding shape:", embedding.shape)
        print("Embedding head:", embedding[:10])
        print()


if __name__ == "__main__":
    # Chinese sentence embedding model (CoSENT), recommended for Chinese semantic matching tasks; supports further fine-tuning
    t2v_model = SentenceModel("shibing624/text2vec-base-chinese")
    compute_emb(t2v_model)

    # Multilingual sentence embedding model (Sentence-BERT), recommended for English semantic matching tasks; supports further fine-tuning
    sbert_model = SentenceModel("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
    compute_emb(sbert_model)

    # Chinese word vector model (word2vec), suitable for Chinese literal matching tasks and cold start
    w2v_model = Word2Vec("w2v-light-tencent-chinese")
    compute_emb(w2v_model)

output:

<class 'numpy.ndarray'> (7, 768)
Sentence: 卡
Embedding shape: (768,)

Sentence: 银行卡
Embedding shape: (768,)
 ... 
  • The returned embeddings are a numpy.ndarray of shape (sentences_size, model_embedding_size). You can choose any one of the three models; the first is recommended.
  • The shibing624/text2vec-base-chinese model is trained with the CoSENT method on the Chinese STS-B dataset and has been uploaded to the HuggingFace model hub as shibing624/text2vec-base-chinese. It is the default model of text2vec.SentenceModel and can be called as in the example above or through the transformers library as shown below. The model is downloaded automatically to the local path ~/.cache/huggingface/transformers
  • The sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model is a multilingual Sentence-BERT sentence vector model, suitable for paraphrase recognition and text matching. It can be called through text2vec.SentenceModel or the sentence-transformers library.
  • w2v-light-tencent-chinese is a Word2Vec model loaded with gensim. It uses the Tencent word vectors Tencent_AILab_ChineseEmbedding.tar.gz to look up the vector of each word, and the sentence vector is the average of those word vectors. The model is downloaded automatically to the local path ~/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin
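The word-vector averaging described for w2v-light-tencent-chinese can be sketched with a toy vocabulary. The function name, the three-dimensional toy vectors, the pre-tokenized input, and the choice to skip out-of-vocabulary words (returning a zero vector when nothing is known) are all assumptions made for illustration; the library's own tokenization and fallback behaviour may differ.

```python
def average_word_vectors(tokens, word_vectors, dim):
    """Average the vectors of the tokens found in `word_vectors`.

    Tokens missing from the vocabulary are skipped (an assumption of this
    sketch); if no token is known, a zero vector of length `dim` is returned.
    """
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over dimensions, so each `column` holds one coordinate.
    return [sum(column) / len(known) for column in zip(*known)]


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for the real Tencent embeddings.
    toy_vectors = {
        "银行卡": [1.0, 0.0, 2.0],
        "更换": [0.0, 2.0, 0.0],
    }
    # "花呗" is not in the toy vocabulary and is therefore ignored.
    print(average_word_vectors(["更换", "银行卡", "花呗"], toy_vectors, dim=3))
```

Averaging discards word order, which is consistent with the recommendation above to use this model for literal matching and cold-start cases rather than fine-grained semantic matching.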

Usage (HuggingFace Transformers)

Without text2vec, you can use the model like this:

First, pass your input through the transformer model; then apply the appropriate pooling operation on top of the contextualized word embeddings.

example: examples/use_origin_transformers_demo.py

import os
import torch
from transformers import AutoTokenizer, AutoModel

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = AutoModel.from_pretrained('shibing624/text2vec-base-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
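Once two sentence embeddings are available, text matching usually reduces to a cosine similarity between them. The following standard-library sketch shows that computation on plain Python lists. It is an illustration only and not the project's cos_sim function; `cosine_similarity` is a hypothetical helper, and returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors; real sentence embeddings would have hundreds of dimensions.
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # parallel vectors
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))             # orthogonal vectors
```

With the tensors from the script above, the two rows could be compared by converting them first, for example `cosine_similarity(sentence_embeddings[0].tolist(), sentence_embeddings[1].tolist())`.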

Usage (sentence-transformers)

sentence-transformers is a popular library to compute dense vector representations for sentences.

Install sentence-transformers:

pip install -U sentence-transformers

Then load mo

Core symbols most depended-on inside this repo

encode · called by 41 · text2vec/ngram.py
compute_spearmanr · called by 18 · text2vec/utils/stats_util.py
cos_sim · called by 16 · text2vec/similarity.py
get_sentence_embeddings · called by 8 · text2vec/sentence_model.py
encode · called by 7 · text2vec/sentence_model.py
load_text_matching_test_data · called by 7 · text2vec/text_matching_dataset.py
load_jsonl · called by 7 · text2vec/utils/io_util.py
save_model · called by 6 · text2vec/sentence_model.py

Shape

Method 173
Function 131
Class 50
Route 2

Languages

Python 100%

Modules by API surface

examples/data/build_zh_nli_dataset.py · 25 symbols
text2vec/text_matching_dataset.py · 23 symbols
text2vec/utils/distance.py · 21 symbols
text2vec/bertmatching_dataset.py · 20 symbols
text2vec/utils/rank_bm25.py · 19 symbols
tests/test_model_spearman.py · 18 symbols
text2vec/sentence_model.py · 17 symbols
text2vec/utils/get_file.py · 14 symbols
text2vec/cosent_dataset.py · 13 symbols
tests/test_qps.py · 13 symbols
text2vec/bertmatching_model.py · 12 symbols
text2vec/similarity.py · 9 symbols

Dependencies from manifests, versioned

jieba 0.39 · 1×
transformers 4.6.0 · 1×
