<a href="https://github.com/netease-youdao/BCEmbedding/raw/main/LICENSE">
<img src="https://img.shields.io/badge/license-Apache--2.0-yellow">
</a>
<a href="https://twitter.com/YDopensource">
<img src="https://img.shields.io/badge/follow-%40YDOpenSource-1DA1F2?logo=twitter&style={style}">
</a>
English | 简体中文
点击打开目录
transformers, sentence-transformers)langchain, llama_index)BCEmbedding是由网易有道开发的中英双语和跨语种语义表征算法模型库,其中包含 EmbeddingModel和 RerankerModel两类基础模型。EmbeddingModel专门用于生成语义向量,在语义搜索和问答中起着关键作用,而 RerankerModel擅长优化语义搜索结果和语义相关顺序精排。
BCEmbedding作为有道的检索增强生成式应用(RAG)的基石,特别是在QAnything [github]中发挥着重要作用。QAnything作为一个网易有道开源项目,在有道许多产品中有很好的应用实践,比如有道速读和有道翻译。
BCEmbedding以其出色的双语和跨语种能力而著称,在语义检索中消除中英语言之间的差异,从而实现:

给RAG社区一个可以直接拿来用,尽可能不需要用户finetune的中英双语和跨语种二阶段检索模型库,包含EmbeddingModel和RerankerModel。
EmbeddingModel覆盖 中英双语和中英跨语种 检索任务,尤其是其跨语种能力。RerankerModel支持 中英日韩 四个语种及其跨语种。EmbeddingModel和RerankerModel提供了LlamaIndex和LangChain 集成接口 ,用户可非常方便集成进现有产品中。RerankerModel支持 长passage(超过512 tokens,不超过32k tokens)rerank;RerankerModel可以给出有意义 相关性分数 ,帮助 过滤低质量召回;EmbeddingModel 不需要“精心设计”instruction ,尽可能召回有用片段。现有的单个语义表征模型在双语和跨语种场景中常常表现不佳,特别是在中文、英文及其跨语种任务中。BCEmbedding充分利用有道翻译引擎的优势,实现只需一个模型就可以在单语、双语和跨语种场景中表现出卓越的性能。
EmbeddingModel支持中文和英文(之后会支持更多语种);RerankerModel支持中文,英文,日文和韩文。
BCEmbedding实现强大的中英双语和跨语种语义表征能力。EmbeddingModel采用双编码器,可以在第一阶段实现高效的语义检索。RerankerModel采用交叉编码器,可以在第二阶段实现更高精度的语义顺序精排。RerankerModel可以提供有意义的语义相关性分数(不仅仅是排序),可以用于过滤无意义文本片段,提高大模型生成效果。BCEmbedding已经被有道众多产品检验。| 模型名称 | 模型类型 | 支持语种 | 参数量 | 开源权重 |
|---|---|---|---|---|
| bce-embedding-base_v1 | EmbeddingModel |
中英 | 279M | Huggingface(推荐), 国内通道, ModelScope, WiseModel |
| bce-reranker-base_v1 | RerankerModel |
中英日韩 | 279M | Huggingface(推荐), 国内通道, ModelScope, WiseModel |
首先创建一个conda环境并激活
conda create --name bce python=3.10 -y
conda activate bce
然后最简化安装 BCEmbedding(为了避免自动安装的torch cuda版本和本地不兼容,建议先手动安装本地cuda版本兼容的torch):
pip install BCEmbedding==0.1.5
也可以通过项目源码安装(推荐):
git clone git@github.com:netease-youdao/BCEmbedding.git
cd BCEmbedding
pip install -v -e .
BCEmbedding通过 BCEmbedding调用 EmbeddingModel。pooler默认是 cls。
from BCEmbedding import EmbeddingModel
# list of sentences
sentences = ['sentence_0', 'sentence_1']
# init embedding model
model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")
# extract embeddings
embeddings = model.encode(sentences)
通过 BCEmbedding调用 RerankerModel可以计算句子对的语义相关分数,也可以对候选检索结果进行排序。
from BCEmbedding import RerankerModel
# your query and corresponding passages
query = 'input_query'
passages = ['passage_0', 'passage_1']
# construct sentence pairs
sentence_pairs = [[query, passage] for passage in passages]
# init reranker model
model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")
# method 0: calculate scores of sentence pairs
scores = model.compute_score(sentence_pairs)
# method 1: rerank passages
rerank_results = model.rerank(query, passages)
注意:
RerankerModel.rerank方法中,我们提供一个query和passage的拼接方法(在实际生产服务中使用),可适用于passage很长的情况。transformersEmbeddingModel调用方法:
from transformers import AutoModel, AutoTokenizer
# list of sentences
sentences = ['sentence_0', 'sentence_1']
# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-embedding-base_v1')
model = AutoModel.from_pretrained('maidalun1020/bce-embedding-base_v1')
device = 'cuda' # if no GPU, set "cpu"
model.to(device)
# get inputs
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}
# get embeddings
outputs = model(**inputs_on_device, return_dict=True)
embeddings = outputs.last_hidden_state[:, 0] # cls pooler
embeddings = embeddings / embeddings.norm(dim=1, keepdim=True) # normalize
RerankerModel调用方法:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-reranker-base_v1')
model = AutoModelForSequenceClassification.from_pretrained('maidalun1020/bce-reranker-base_v1')
device = 'cuda' # if no GPU, set "cpu"
model.to(device)
# get inputs
inputs = tokenizer(sentence_pairs, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}
# calculate scores
scores = model(**inputs_on_device, return_dict=True).logits.view(-1,).float()
scores = torch.sigmoid(scores)
sentence_transformersEmbeddingModel调用方法:
from sentence_transformers import SentenceTransformer
# list of sentences
sentences = ['sentence_0', 'sentence_1', ...]
# init embedding model
## sentence-transformers支持有更新,请注意先删除本地模型缓存:"`SENTENCE_TRANSFORMERS_HOME`/maidalun1020_bce-embedding-base_v1"或“~/.cache/torch/sentence_transformers/maidalun1020_bce-embedding-base_v1”
model = SentenceTransformer("maidalun1020/bce-embedding-base_v1")
# extract embeddings
embeddings = model.encode(sentences, normalize_embeddings=True)
RerankerModel调用方法:
from sentence_transformers import CrossEncoder
# init reranker model
model = CrossEncoder('maidalun1020/bce-reranker-base_v1', max_length=512)
# calculate scores of sentence pairs
scores = model.predict(sentence_pairs)
langchain为了继承RerankerModel精细优化的rerank逻辑,我们提供BCERerank方法,可直接集成到langchain demo中。
pip install langchain==0.1.0
pip install langchain-community==0.0.9
pip install langchain-core==0.1.7
pip install langsmith==0.0.77
# 我们在`BCEmbedding`中提供langchain直接集成的接口。
from BCEmbedding.tools.langchain import BCERerank
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain.retrievers import ContextualCompressionRetriever
# init embedding model
embedding_model_name = 'maidalun1020/bce-embedding-base_v1'
embedding_model_kwargs = {'device': 'cuda:0'}
embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True, 'show_progress_bar': False}
embed_model = HuggingFaceEmbeddings(
model_name=embedding_model_name,
model_kwargs=embedding_model_kwargs,
encode_kwargs=embedding_encode_kwargs
)
reranker_args = {'model': 'maidalun1020/bce-reranker-base_v1', 'top_n': 5, 'device': 'cuda:1'}
reranker = BCERerank(**reranker_args)
# init documents
documents = PyPDFLoader("BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
# example 1. retrieval with embedding and reranker
retriever = FAISS.from_documents(texts, embed_model, distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT).as_retriever(search_type="similarity", search_kwargs={"score_threshold": 0.3, "k": 10})
compression_retriever = ContextualCompressionRetriever(
base_compressor=reranker, base_retriever=retriever
)
response = compression_retriever.get_relevant_documents("What is Llama 2?")
llama_index为了继承RerankerModel精细优化的rerank逻辑,我们提供BCERerank方法,可直接集成到LlamaIndex demo中。
pip install llama-index==0.9.42.post2
BCEmbedding中提供llama_index直接集成的接口。from BCEmbedding.tools.llama_index import BCERerank
import os from llama_index.embeddings import HuggingFaceEmbedding from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader from llama_index.node_parser import SimpleNodeParser from llama_index.llms import OpenAI from llama_index.retrievers import VectorIndexRetriever
embed_args = {'model_name': 'maidalun1020/bce-embedding-base_v1', 'max_length': 512, 'embed_batch_size': 32, 'device': 'cuda:0'} embed_model = HuggingFaceEmbedding(**embed_args)
reranker_args = {'model': 'maidalun1020/bce-reranker-base_v1', 'top_n': 5, 'device': 'cuda:1'} reranker_model = BCERerank(**reranker_args)
query = 'apples' passages = [ 'I like apples', 'I like oranges', 'Apples and oranges are fruits' ] query_embedding = embed_model.get_query_embedding(query) passages_embeddings = embed_model.get_text_embedding_batch(passages)
llm = OpenAI(model='gpt-3.5-turbo-0613', api_key=os.environ.get('OPENAI_API_KEY'), api_base=os.environ.get('OPENAI_BASE_URL')) service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
documents = SimpleDirectoryReader(input_files=["BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf"]).load_data() node_parser = SimpleNodeParser.from_defaults(chunk_size=400, chunk_overlap=80) nodes = node_parser.get_nodes_from_documents(documents[0
$ claude mcp add BCEmbedding \
-- python -m otcore.mcp_server <graph>