🇨🇳中文 | 🌐English | 📖文档/Docs | 🤖模型/Models
Text2vec: Text to Vector, Get Sentence Embeddings. Text vectorization, representing text (including words, sentences, paragraphs) as a vector matrix.
text2vec implements Word2Vec, RankBM25, BERT, Sentence-BERT, CoSENT and other text representation and text similarity calculation models, and compares the effects of each model on the text semantic matching (similarity calculation) task.
Guide - Feature - Evaluation - Install - Usage - Contact - Reference
| Arch | Backbone | Model Name | English-STS-B |
|---|---|---|---|
| GloVe | glove | Avg_word_embeddings_glove_6B_300d | 61.77 |
| BERT | bert-base-uncased | BERT-base-cls | 20.29 |
| BERT | bert-base-uncased | BERT-base-first_last_avg | 59.04 |
| BERT | bert-base-uncased | BERT-base-first_last_avg-whiten(NLI) | 63.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-cls | 73.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-first_last_avg | 77.96 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 84.42 |
| CoSENT | bert-base-uncased | CoSENT-base-first_last_avg | 69.93 |
| CoSENT | sentence-transformers/bert-base-nli-mean-tokens | CoSENT-base-nli-first_last_avg | 79.68 |
| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
|---|---|---|---|---|---|---|---|---|---|
| CoSENT | hfl/chinese-macbert-base | CoSENT-macbert-base | 50.39 | 72.93 | 79.17 | 60.86 | 80.51 | 68.77 | 3008 |
| CoSENT | Langboat/mengzi-bert-base | CoSENT-mengzi-base | 50.52 | 72.27 | 78.69 | 12.89 | 80.15 | 58.90 | 2502 |
| CoSENT | bert-base-chinese | CoSENT-bert-base | 49.74 | 72.38 | 78.69 | 60.00 | 80.14 | 68.19 | 2653 |
| SBERT | bert-base-chinese | SBERT-bert-base | 46.36 | 70.36 | 78.72 | 46.86 | 66.41 | 61.74 | 3365 |
| SBERT | hfl/chinese-macbert-base | SBERT-macbert-base | 47.28 | 68.63 | 79.42 | 55.59 | 64.82 | 63.15 | 2948 |
| CoSENT | hfl/chinese-roberta-wwm-ext | CoSENT-roberta-ext | 50.81 | 71.45 | 79.31 | 61.56 | 81.13 | 68.85 | - |
| SBERT | hfl/chinese-roberta-wwm-ext | SBERT-roberta-ext | 48.29 | 69.99 | 79.22 | 44.10 | 72.42 | 62.80 | - |
| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
|---|---|---|---|---|---|---|---|---|---|
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 33.86 | 23769 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 41.99 | 3138 |
| CoSENT | hfl/chinese-macbert-base | shibing624/text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 48.25 | 3008 |
| CoSENT | hfl/chinese-lert-large | GanymedeNil/text2vec-large-chinese | 32.61 | 44.59 | 69.30 | 14.51 | 79.44 | 48.08 | 1046 |
Evaluation conclusion:
- The result values are all using the spearman coefficient
- The results only use the train training of the data set, and evaluate the performance obtained on the test, without using external data
- shibing624/text2vec-base-chinese model is trained with CoSENT method, based on MacBERT in Chinese STS-B data training, and in Chinese STS -B test set evaluation reaches SOTA, run examples/training_sup_text_matching_model.py code to train the model, the model file has been uploaded to huggingface The model library shibing624/text2vec-base-chinese, recommended for Chinese semantic matching tasks
- The SBERT-macbert-base model is trained with the SBERT method, and the code can be trained by running examples/training_sup_text_matching_model.py Model
- paraphrase-multilingual-MiniLM-L12-v2 model name is sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, trained with SBERT, is a multilingual version of the paraphrase-MiniLM-L12-v2 model, supports Chinese, English, etc.
- w2v-light-tencent-chinese is the Word2Vec model of Tencent word vectors, which is loaded and used by CPU, and is suitable for Chinese literal matching tasks and cold start situations with lack of data
- Each pre-trained model can be called through transformers, such as MacBERT model: --model_name hfl/chinese-macbert-base or roberta model: --model_name uer/roberta-medium-wwm-chinese-cluecorpussmall
- Chinese matching data set download [link below] (#data set)
- The Chinese matching task experiment shows that the optimal pooling is first_last_avg, that is, EncoderType.FIRST_LAST_AVG of SentenceModel, which has little difference in prediction effect from the method of EncoderType.MEAN
- Chinese matching evaluation results are reappearing, you can download the Chinese matching dataset to examples/data, run tests/test_model_spearman.py code to reproduce the evaluation results
- The GPU test environment of QPS is Tesla V100 with 32GB of video memory
Official Demo: https://www.mulanai.com/product/short_text_sim/
HuggingFace Demo: https://huggingface.co/spaces/shibing624/text2vec

run example: examples/gradio_demo.py to see the demo:
python examples/gradio_demo.py
pip install torch # conda install pytorch
pip install -U text2vec
or
pip install torch # conda install pytorch
pip install -r requirements.txt
git clone https://github.com/shibing624/text2vec.git
cd text2vec
pip install --no-deps .
Compute text vectors based on pretrained model:
>>> from text2vec import SentenceModel
>>> m = SentenceModel()
>>> m.encode("如何更换花呗绑定银行卡")
Embedding shape: (768,)
example: examples/computing_embeddings_demo.py
import sys
sys.path.append('..')
from text2vec import SentenceModel
from text2vec import Word2Vec
def compute_emb(model):
# Embed a list of sentences
sentences = [
'卡',
'银行卡',
'如何更换花呗绑定银行卡',
'花呗更改绑定银行卡',
'This framework generates embeddings for each input sentence',
'Sentences are passed as a list of string.',
'The quick brown fox jumps over the lazy dog.'
]
sentence_embeddings = model.encode(sentences)
print(type(sentence_embeddings), sentence_embeddings.shape)
# The result is a list of sentence embeddings as numpy arrays
for sentence, embedding in zip(sentences, sentence_embeddings):
print("Sentence:", sentence)
print("Embedding shape:", embedding.shape)
print("Embedding head:", embedding[:10])
print()
if __name__ == "__main__":
# 中文句向量模型(CoSENT),中文语义匹配任务推荐,支持fine-tune继续训练
t2v_model = SentenceModel("shibing624/text2vec-base-chinese")
compute_emb(t2v_model)
# 支持多语言的句向量模型(Sentence-BERT),英文语义匹配任务推荐,支持fine-tune继续训练
sbert_model = SentenceModel("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
compute_emb(sbert_model)
# 中文词向量模型(word2vec),中文字面匹配任务和冷启动适用
w2v_model = Word2Vec("w2v-light-tencent-chinese")
compute_emb(w2v_model)
output:
<class 'numpy.ndarray'> (7, 768)
Sentence: 卡
Embedding shape: (768,)
Sentence: 银行卡
Embedding shape: (768,)
...
embeddings is of numpy.ndarray type, and the shape is (sentences_size, model_embedding_size). You can choose one of the three models, and the first one is recommended.shibing624/text2vec-base-chinese model is trained by the CoSENT method on the Chinese STS-B dataset, and the model has been uploaded to huggingface
Model library shibing624/text2vec-base-chinese,
It is the default model specified by text2vec.SentenceModel, which can be called by the above example, or by transformers library as shown below,
The model is automatically downloaded to the local path: ~/.cache/huggingface/transformerssentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model is a multilingual sentence vector model of Sentence-BERT,
Suitable for paraphrase recognition and text matching, the model can be called through text2vec.SentenceModel and sentence-transformers libraryw2v-light-tencent-chinese is a Word2Vec model loaded by gensim, using the Tencent word vector Tencent_AILab_ChineseEmbedding.tar.gz to calculate the word vector of each word, and the sentence vector through the word word
The average value of the vector is obtained, and the model is automatically downloaded to the local path: ~/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.binWithout text2vec, you can use the model like this:
First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
example: examples/use_origin_transformers_demo.py
import os
import torch
from transformers import AutoTokenizer, AutoModel
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
token_embeddings = model_output[0] # First element of model_output contains all token embeddings
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = AutoModel.from_pretrained('shibing624/text2vec-base-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
model_output = model(**encoded_input)
# Perform pooling. In this case, max pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
sentence-transformers is a popular library to compute dense vector representations for sentences.
Install sentence-transformers:
pip install -U sentence-transformers
Then load mo
$ claude mcp add text2vec \
-- python -m otcore.mcp_server <graph>