
github.com/netease-youdao/BCEmbedding @main sqlite

110 symbols 454 edges 30 files 17 documented · 15%
README

BCEmbedding: Bilingual and Crosslingual Embedding for RAG

<a href="https://github.com/netease-youdao/BCEmbedding/raw/main/LICENSE">
  <img src="https://img.shields.io/badge/license-Apache--2.0-yellow">
</a>
    
<a href="https://twitter.com/YDopensource">
  <img src="https://img.shields.io/badge/follow-%40YDOpenSource-1DA1F2?logo=twitter&style={style}">
</a>

English | 简体中文


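BCEmbedding is a library of Chinese-English bilingual and crosslingual semantic representation models developed by NetEase Youdao. It contains two kinds of base models: EmbeddingModel and RerankerModel. EmbeddingModel is dedicated to generating semantic vectors and plays a key role in semantic search and question answering, while RerankerModel excels at refining semantic search results and reranking them by semantic relevance.

BCEmbedding is the cornerstone of Youdao's retrieval-augmented generation (RAG) applications, most notably QAnything [github]. QAnything is an open-source NetEase Youdao project that has been put into practice in many Youdao products, such as Youdao Speed Reading and Youdao Translation.

BCEmbedding is known for its strong bilingual and crosslingual capabilities, bridging the gap between Chinese and English in semantic retrieval.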

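Our Goals

Give the RAG community a bilingual and crosslingual (Chinese-English) two-stage retrieval model library that can be used out of the box, with as little finetuning as possible. It includes EmbeddingModel and RerankerModel.

  • One model is all you need: EmbeddingModel covers Chinese-English bilingual and crosslingual retrieval tasks, and is particularly strong at crosslingual retrieval. RerankerModel supports Chinese, English, Japanese, and Korean, as well as crosslingual combinations of them.
  • One model is all you need: it covers common business domains (it has been optimized for many common RAG scenarios), such as education, medicine, law, finance, scientific papers, customer service (FAQ), books, encyclopedias, and general QA. No finetuning is required for these domains; the models can be used directly.
  • Easy to integrate: EmbeddingModel and RerankerModel provide integration interfaces for LlamaIndex and LangChain, so they can be plugged into existing products easily.
  • Other features:
      • RerankerModel supports reranking long passages (more than 512 tokens, up to 32k tokens).
      • RerankerModel produces meaningful relevance scores that help filter out low-quality recalls.
      • EmbeddingModel does not need a carefully designed instruction, and recalls useful passages as far as possible.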

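Typical Cases

🌐 Bilingual and Crosslingual Advantages

Existing single embedding models often perform poorly in bilingual and crosslingual scenarios, particularly in Chinese, English, and tasks that cross the two. BCEmbedding draws on the strengths of Youdao's translation engine so that a single model performs well in monolingual, bilingual, and crosslingual settings.

EmbeddingModel supports Chinese and English (more languages will be supported later); RerankerModel supports Chinese, English, Japanese, and Korean.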

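💡 Key Features

  • Bilingual and crosslingual capability: building on Youdao's translation engine, BCEmbedding delivers strong Chinese-English bilingual and crosslingual semantic representation.
  • RAG-optimized: tuned specifically for RAG and adaptable to most related tasks, such as translation, summarization, and question answering. Query understanding has also been specifically optimized. See RAG Evaluations in LlamaIndex for details.
  • Efficient and precise retrieval: EmbeddingModel uses a dual-encoder for efficient semantic retrieval in the first stage. RerankerModel uses a cross-encoder for higher-precision semantic reranking in the second stage.
  • Better domain generalization: to perform well in more scenarios, we collected data from a wide variety of domains.
  • User-friendly: semantic retrieval needs no special instruction prefix. In other words, you do not have to rack your brains designing instruction prefixes for different tasks.
  • Meaningful reranking scores: RerankerModel provides meaningful semantic relevance scores (not just an ordering), which can be used to filter out irrelevant text passages and improve LLM generation quality.
  • Proven in production: BCEmbedding has been validated by many Youdao products.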
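
The two-stage flow described in the feature list (a dual-encoder for fast first-stage recall, then a cross-encoder for more precise reranking) can be sketched in plain Python. This is a toy illustration under stated assumptions, not BCEmbedding's implementation: the vectors are hand-written stand-ins for EmbeddingModel output and are assumed to be L2-normalized already, and `toy_cross_score` is a hypothetical word-overlap scorer standing in for RerankerModel.

```python
def dot(a, b):
    """Inner product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))

def first_stage(query_vec, passage_vecs, k):
    """Stage 1: rank passage indices by embedding similarity and keep the top k."""
    ranked = sorted(range(len(passage_vecs)),
                    key=lambda i: dot(query_vec, passage_vecs[i]),
                    reverse=True)
    return ranked[:k]

def toy_cross_score(query, passage):
    """Hypothetical cross-encoder stand-in: fraction of query words present in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0

def second_stage(query, passages, candidate_ids, top_n):
    """Stage 2: rescore only the recalled candidates and keep the top_n."""
    scored = [(toy_cross_score(query, passages[i]), i) for i in candidate_ids]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_n]]

passages = ["I like apples", "I like oranges", "Apples and oranges are fruits"]
# Hand-written unit vectors (assumption); real ones come from the embedding model.
passage_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query, query_vec = "apples fruits", [0.8, 0.6]

candidates = first_stage(query_vec, passage_vecs, k=2)       # cheap, over all passages
final = second_stage(query, passages, candidates, top_n=1)   # costly, over k passages only
print(candidates, final)
```

The point of the split is cost: the first stage compares precomputed vectors for every passage, while the second stage runs the more expensive pairwise scorer only on the few candidates that survive.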

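🚀 Latest Updates

🍎 Model List

| Model Name | Model Type | Languages | Parameters | Weights |
|:--|:--|:--|:--|:--|
| bce-embedding-base_v1 | EmbeddingModel | Chinese, English | 279M | Huggingface (recommended), China mirror, ModelScope, WiseModel |
| bce-reranker-base_v1 | RerankerModel | Chinese, English, Japanese, Korean | 279M | Huggingface (recommended), China mirror, ModelScope, WiseModel |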

📖 Manual

Installation

First, create a conda environment and activate it:

conda create --name bce python=3.10 -y
conda activate bce

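Then install BCEmbedding with the minimal setup (to avoid an automatically installed torch build whose CUDA version is incompatible with your machine, we recommend first installing a torch build that matches your local CUDA version manually):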

pip install BCEmbedding==0.1.5

Or install from source (recommended):

git clone git@github.com:netease-youdao/BCEmbedding.git
cd BCEmbedding
pip install -v -e .
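
Because the installation notes recommend having a CUDA-compatible torch in place before installing BCEmbedding, a quick pre-flight check of what is already importable can be useful. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical, and the package list simply repeats names mentioned in this guide.

```python
import importlib.util

def check_packages(names):
    """Return {package_name: bool}, True when the top-level package can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(["torch", "transformers", "BCEmbedding"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

This only reports whether a package can be found; it does not verify that the installed torch build matches the local CUDA version.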

Quick Start

1. Based on BCEmbedding

Use EmbeddingModel through BCEmbedding. The default pooler is cls.

from BCEmbedding import EmbeddingModel

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init embedding model
model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences)
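
Once embeddings have been extracted, semantic closeness is commonly measured with cosine similarity. The sketch below ranks a few vectors against a query vector. The 2-d vectors are hand-written stand-ins for rows of `embeddings` (an assumption made so the example runs without downloading a model), and `rank_by_similarity` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, sentence_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(sentence_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption); real embeddings have many more dimensions.
toy_embeddings = [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]
ranking = rank_by_similarity([3.0, 4.0], toy_embeddings)
for index, score in ranking:
    print(index, round(score, 2))
```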

Use RerankerModel through BCEmbedding to compute semantic relevance scores for sentence pairs, or to rerank candidate retrieval results.

from BCEmbedding import RerankerModel

# your query and corresponding passages
query = 'input_query'
passages = ['passage_0', 'passage_1']

# construct sentence pairs
sentence_pairs = [[query, passage] for passage in passages]

# init reranker model
model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

# method 0: calculate scores of sentence pairs
scores = model.compute_score(sentence_pairs)

# method 1: rerank passages
rerank_results = model.rerank(query, passages)

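Note:

  • In the RerankerModel.rerank method, we provide a way of concatenating the query and passages (the one used in our production service) that also works when passages are very long.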
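
One common way to score a passage that exceeds a model's input limit is to split it into overlapping windows, score each window against the query, and keep the best score. The sketch below illustrates that general idea only; it is an assumption about the technique, not a description of what `RerankerModel.rerank` does internally. The whitespace tokenization, the window sizes, and `toy_score` are all hypothetical.

```python
def split_with_overlap(tokens, max_len, overlap):
    """Split a token list into windows of at most max_len tokens that share `overlap` tokens."""
    if max_len <= overlap:
        raise ValueError("max_len must be larger than overlap")
    if len(tokens) <= max_len:
        return [tokens]
    step = max_len - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the passage
    return windows

def toy_score(query_tokens, window):
    """Hypothetical scorer: fraction of query tokens that appear in the window."""
    q = set(query_tokens)
    return len(q & set(window)) / len(q) if q else 0.0

def score_long_passage(query, passage, max_len=8, overlap=2):
    """Score every window against the query and keep the maximum."""
    q_tokens = query.lower().split()
    windows = split_with_overlap(passage.lower().split(), max_len, overlap)
    return max(toy_score(q_tokens, w) for w in windows)

long_passage = "filler " * 10 + "llama 2 license terms"
print(score_long_passage("llama license", long_passage))
```

The overlap keeps a sentence that straddles a window boundary from being cut off in every window that contains part of it.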

2. Based on transformers

Using EmbeddingModel:

from transformers import AutoModel, AutoTokenizer

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-embedding-base_v1')
model = AutoModel.from_pretrained('maidalun1020/bce-embedding-base_v1')

device = 'cuda'  # if no GPU, set "cpu"
model.to(device)

# get inputs
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# get embeddings
outputs = model(**inputs_on_device, return_dict=True)
embeddings = outputs.last_hidden_state[:, 0]  # cls pooler
embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)  # normalize
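
The last two lines of the snippet above perform CLS pooling (`last_hidden_state[:, 0]`) followed by L2 normalization. The same two steps are shown below on nested Python lists, as a sketch for readers who want to see the shapes involved without loading a model. The hidden states are made-up numbers, and both helper names are hypothetical.

```python
import math

def cls_pool(last_hidden_state):
    """Keep the first token's vector of each sequence: [batch][seq][hidden] -> [batch][hidden]."""
    return [sequence[0] for sequence in last_hidden_state]

def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy hidden states (assumption): batch of 2 sequences, 3 tokens each, hidden size 2.
hidden = [
    [[3.0, 4.0], [1.0, 1.0], [0.0, 2.0]],
    [[0.0, 5.0], [2.0, 2.0], [1.0, 0.0]],
]
embeddings = [l2_normalize(v) for v in cls_pool(hidden)]
print(embeddings)
```

After normalization every embedding has length 1, so a plain inner product between two embeddings equals their cosine similarity.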

Using RerankerModel:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-reranker-base_v1')
model = AutoModelForSequenceClassification.from_pretrained('maidalun1020/bce-reranker-base_v1')

device = 'cuda'  # if no GPU, set "cpu"
model.to(device)

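# your query and corresponding passages (same placeholders as in the BCEmbedding example)
query = 'input_query'
passages = ['passage_0', 'passage_1']
sentence_pairs = [[query, passage] for passage in passages]

# get inputs
inputs = tokenizer(sentence_pairs, padding=True, truncation=True, max_length=512, return_tensors="pt")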
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# calculate scores
scores = model(**inputs_on_device, return_dict=True).logits.view(-1,).float()
scores = torch.sigmoid(scores)
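
The final `torch.sigmoid` call maps raw logits to scores between 0 and 1, which is what makes a fixed cut-off usable for dropping weak passages. A minimal pure-Python sketch of that mapping and a threshold filter follows. The logits are made-up numbers, and the 0.4 threshold and the `filter_by_score` helper are hypothetical choices for illustration, not values recommended by the project.

```python
import math

def sigmoid(x):
    """Logistic function: maps a real-valued logit into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def filter_by_score(passages, logits, threshold):
    """Keep (passage, score) pairs whose sigmoid score reaches the threshold, best first."""
    scored = [(p, sigmoid(z)) for p, z in zip(passages, logits)]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

passages = ["passage_0", "passage_1", "passage_2"]
toy_logits = [2.0, -3.0, 0.0]  # made-up raw model outputs
kept = filter_by_score(passages, toy_logits, threshold=0.4)
for passage, score in kept:
    print(passage, round(score, 3))
```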

3. Based on sentence_transformers

Using EmbeddingModel:

from sentence_transformers import SentenceTransformer

# list of sentences
sentences = ['sentence_0', 'sentence_1', ...]

# init embedding model
## sentence-transformers support has been updated; first delete the local model cache: "`SENTENCE_TRANSFORMERS_HOME`/maidalun1020_bce-embedding-base_v1" or "~/.cache/torch/sentence_transformers/maidalun1020_bce-embedding-base_v1"
model = SentenceTransformer("maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences, normalize_embeddings=True)
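
The comment in the snippet above asks you to delete a stale local model cache first. A small helper for that step is sketched below, using only the two locations named in that comment. The function names are hypothetical, and deletion is off by default (`dry_run=True`) so the sketch only reports what it finds unless you opt in.

```python
import os
import shutil
from pathlib import Path

MODEL_DIR_NAME = "maidalun1020_bce-embedding-base_v1"

def candidate_cache_dirs(env=None, home=None):
    """List the cache locations named in the comment above, custom root first."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    dirs = []
    custom_root = env.get("SENTENCE_TRANSFORMERS_HOME")
    if custom_root:
        dirs.append(Path(custom_root) / MODEL_DIR_NAME)
    dirs.append(home / ".cache" / "torch" / "sentence_transformers" / MODEL_DIR_NAME)
    return dirs

def clear_stale_cache(dirs, dry_run=True):
    """Return the cache dirs that exist; delete them only when dry_run is False."""
    existing = [d for d in dirs if d.is_dir()]
    if not dry_run:
        for d in existing:
            shutil.rmtree(d)
    return existing

# Dry run: report which cache directories would be removed.
stale = clear_stale_cache(candidate_cache_dirs())
print(stale)
```

Call `clear_stale_cache(candidate_cache_dirs(), dry_run=False)` to actually remove the directories.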

Using RerankerModel:

from sentence_transformers import CrossEncoder

# init reranker model
model = CrossEncoder('maidalun1020/bce-reranker-base_v1', max_length=512)

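# calculate scores of sentence pairs
# sentence_pairs is a list of [query, passage] pairs, e.g. [[query, passage] for passage in passages]
scores = model.predict(sentence_pairs)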

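Embedding and Reranker Integrations for Common RAG Frameworks

1. Used in langchain

To carry over the carefully optimized rerank logic of RerankerModel, we provide BCERerank, which can be plugged directly into a langchain demo.

  • Install langchain first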
pip install langchain==0.1.0
pip install langchain-community==0.0.9
pip install langchain-core==0.1.7
pip install langsmith==0.0.77
  • Demo
# We provide a direct langchain integration interface in `BCEmbedding`.
from BCEmbedding.tools.langchain import BCERerank

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS

from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain.retrievers import ContextualCompressionRetriever


# init embedding model
embedding_model_name = 'maidalun1020/bce-embedding-base_v1'
embedding_model_kwargs = {'device': 'cuda:0'}
embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True, 'show_progress_bar': False}

embed_model = HuggingFaceEmbeddings(
  model_name=embedding_model_name,
  model_kwargs=embedding_model_kwargs,
  encode_kwargs=embedding_encode_kwargs
)

reranker_args = {'model': 'maidalun1020/bce-reranker-base_v1', 'top_n': 5, 'device': 'cuda:1'}
reranker = BCERerank(**reranker_args)

# init documents
documents = PyPDFLoader("BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

# example 1. retrieval with embedding and reranker
retriever = FAISS.from_documents(texts, embed_model, distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT).as_retriever(search_type="similarity", search_kwargs={"score_threshold": 0.3, "k": 10})

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=retriever
)

response = compression_retriever.get_relevant_documents("What is Llama 2?")

2. Used in llama_index

To carry over the carefully optimized rerank logic of RerankerModel, we provide BCERerank, which can be plugged directly into a LlamaIndex demo.

  • Install llama_index first
pip install llama-index==0.9.42.post2
  • Demo
```python
# We provide a direct llama_index integration interface in `BCEmbedding`.
from BCEmbedding.tools.llama_index import BCERerank

import os
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.llms import OpenAI
from llama_index.retrievers import VectorIndexRetriever

# init embedding model and reranker model
embed_args = {'model_name': 'maidalun1020/bce-embedding-base_v1', 'max_length': 512, 'embed_batch_size': 32, 'device': 'cuda:0'}
embed_model = HuggingFaceEmbedding(**embed_args)

reranker_args = {'model': 'maidalun1020/bce-reranker-base_v1', 'top_n': 5, 'device': 'cuda:1'}
reranker_model = BCERerank(**reranker_args)

# example #1. extract embeddings
query = 'apples'
passages = [
    'I like apples',
    'I like oranges',
    'Apples and oranges are fruits'
]
query_embedding = embed_model.get_query_embedding(query)
passages_embeddings = embed_model.get_text_embedding_batch(passages)

# example #2. rag example
llm = OpenAI(model='gpt-3.5-turbo-0613', api_key=os.environ.get('OPENAI_API_KEY'), api_base=os.environ.get('OPENAI_BASE_URL'))
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

documents = SimpleDirectoryReader(input_files=["BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf"]).load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=400, chunk_overlap=80)
nodes = node_parser.get_nodes_from_documents(documents[0
```

Core symbols most depended-on inside this repo

logger_wrapper · called by 9 · BCEmbedding/utils/logger.py
encode · called by 4 · BCEmbedding/evaluation/c_mteb/yd_dres_model.py
process_items · called by 2 · Docs/RyzenAI/export_to_onnx.py
display_results · called by 2 · BCEmbedding/tools/eval_rag/utils.py
extract_data_from_pdf · called by 2 · BCEmbedding/tools/eval_rag/utils.py
_merge_inputs · called by 2 · BCEmbedding/models/utils.py
rerank · called by 2 · BCEmbedding/models/reranker.py
encode_queries · called by 2 · BCEmbedding/evaluation/c_mteb/yd_dres_model.py

Shape

Method 58
Class 27
Function 25

Languages

Python 100%

Modules by API surface

BCEmbedding/evaluation/c_mteb/Retrieval.py · 39 symbols
BCEmbedding/evaluation/c_mteb/Reranking.py · 13 symbols
Docs/RyzenAI/test_perf_accuray.py · 10 symbols
BCEmbedding/tools/eval_rag/utils.py · 9 symbols
Docs/RyzenAI/export_to_onnx.py · 5 symbols
BCEmbedding/tools/langchain/bce_rerank.py · 5 symbols
BCEmbedding/evaluation/c_mteb/yd_dres_model.py · 5 symbols
BCEmbedding/tools/llama_index/bce_rerank.py · 4 symbols
BCEmbedding/models/reranker.py · 4 symbols
BCEmbedding/tools/eval_rag/summarize_eval_results.py · 3 symbols
BCEmbedding/tools/eval_mteb/summarize_eval_results.py · 3 symbols
BCEmbedding/models/embedding.py · 3 symbols

For agents

$ claude mcp add BCEmbedding \
  -- python -m otcore.mcp_server <graph>
