github.com/jina-ai/clip-as-service @v1.10.0 sqlite

repository ↗ · DeepWiki ↗ · release v1.10.0 ↗
224 symbols 638 edges 30 files 70 documented · 31%
README

bert-as-service

Using BERT model as a sentence encoding service, i.e. mapping a variable-length sentence to a fixed-length vector.

<img src="https://github.com/jina-ai/clip-as-service/raw/v1.10.0/github/demo.gif?raw=true" width="700">
Made by Han Xiao • :globe_with_meridians: https://hanxiao.github.io
✨ Looking for X-as-service, or a more generic and cloud-native solution? Check out my new project GNES! GNES (Generic Neural Elastic Search) is a cloud-native semantic search system based on deep neural networks. It enables large-scale indexing and semantic search for text-to-text, image-to-image, video-to-video and any-to-any content.

What is it

BERT is an NLP model developed by Google for pre-training language representations. It leverages an enormous amount of plain text data publicly available on the web and is trained in an unsupervised manner. Pre-training a BERT model is a fairly expensive yet one-time procedure for each language. Fortunately, Google has released several pre-trained models that you can download (see the list under Getting Started).

Sentence encoding (or embedding) is an upstream task required in many NLP applications, e.g. sentiment analysis and text classification. The goal is to represent a variable-length sentence as a fixed-length vector, e.g. hello world as [0.1, 0.3, 0.9]. Each element of the vector should "encode" some semantics of the original sentence.

Finally, bert-as-service uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length representations in just two lines of code.
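
To make the idea of a fixed-length encoding concrete, here is a toy sketch in plain Python. It is not how BERT works: the 3-dimensional token vectors and the `encode_sentence` helper are made up for illustration, and averaging is just one simple way to collapse a variable number of tokens into one vector.

```python
# Toy illustration only: these vectors are invented, whereas a real encoder
# such as BERT produces learned, context-dependent representations.
TOY_TOKEN_VECTORS = {
    "hello": [0.0, 0.5, 1.0],
    "world": [0.5, 0.0, 1.0],
    "again": [1.0, 1.0, 0.0],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for tokens missing from the toy table


def encode_sentence(sentence):
    """Average the token vectors so any sentence length yields 3 numbers."""
    tokens = sentence.lower().split()
    vectors = [TOY_TOKEN_VECTORS.get(tok, UNKNOWN) for tok in tokens] or [UNKNOWN]
    dim = len(UNKNOWN)
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


short = encode_sentence("hello world")         # two tokens
longer = encode_sentence("hello world again")  # three tokens
print(len(short), len(longer))                 # both vectors have 3 elements
```

Both sentences end up as vectors of the same length even though they contain different numbers of tokens, which is the property downstream models rely on.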

Highlights

  • :telescope: State-of-the-art: built on the pretrained 12/24-layer BERT models released by Google AI, which are considered a milestone in the NLP community.
  • :hatching_chick: Easy-to-use: requires only two lines of code to get sentence/token-level encodes.
  • :zap: Fast: 900 sentences/s on a single Tesla M40 24GB. Low latency, optimized for speed. See benchmark.
  • :octopus: Scalable: scales smoothly across multiple GPUs and multiple clients without worrying about concurrency. See benchmark.
  • :gem: Reliable: tested on multi-billion sentences; runs for days without a break, OOM, or any nasty exceptions.

More features: XLA & FP16 support; mixed GPU-CPU workloads; optimized graph; tf.data friendly; customized tokenizer; flexible pooling strategy; built-in HTTP server and dashboard; async encoding; multicasting; etc.

Install

Install the server and client via pip. They can be installed separately or even on different machines:

pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`

Note that the server MUST run on Python >= 3.5 with TensorFlow >= 1.10 (one-point-ten). Again, the server does not support Python 2!

:point_up: The client can run on both Python 2 and Python 3.
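
The "one-point-ten" remark matters because version strings do not compare like numbers. The sketch below shows one way to check a version requirement numerically. The helpers `parse_version` and `meets_minimum` are hypothetical names, not part of bert-as-service, and the parser is deliberately simplistic (it ignores suffixes such as `rc1`).

```python
import sys


def parse_version(text):
    """Turn '1.10.0' into (1, 10, 0) so components compare numerically."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


# Comparing raw strings is misleading: '1.2.0' sorts after '1.10'.
print("1.2.0" > "1.10")                # True, although 1.2 is older than 1.10
print(meets_minimum("1.2.0", "1.10"))  # False, which is the intended answer

# The server-side interpreter requirement can be checked the same way.
server_python_ok = sys.version_info >= (3, 5)
```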

Getting Started

1. Download a Pre-trained BERT Model

Download a model listed below, then uncompress the zip file into some folder, say /tmp/english_L-12_H-768_A-12/

List of released pretrained BERT models:

  • BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Multilingual Cased (New): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Multilingual Cased (Old): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

Optional: fine-tuning the model on your downstream task. Why is it optional?

2. Start the BERT service

After installing the server, you should be able to use bert-serving-start CLI as follows:

bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4 

This will start a service with four workers, meaning that it can handle up to four concurrent requests. More concurrent requests will be queued in a load balancer. Details can be found in our FAQ and the benchmark on number of clients.
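
As a rough analogy for the worker behaviour described above, the sketch below uses a thread pool from the Python standard library: four workers run at once and additional requests wait their turn. This is not the server's actual ZeroMQ-based implementation, and `fake_encode` is a stand-in that only sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_WORKER = 4  # mirrors -num_worker=4 in the command above


def fake_encode(request_id):
    """Stand-in for one encoding request; it only simulates work."""
    time.sleep(0.05)
    return request_id


with ThreadPoolExecutor(max_workers=NUM_WORKER) as pool:
    start = time.perf_counter()
    # Eight requests, four workers: the second four wait until a worker is free.
    results = list(pool.map(fake_encode, range(8)))
    elapsed = time.perf_counter() - start

print(results)
print("elapsed: %.2fs" % elapsed)  # about two rounds of work, not one
```

With eight requests and four workers, the total time is roughly two rounds of work, which is the effect queueing has on a client that sends more concurrent requests than there are workers.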

(Screenshot omitted: the server console after a successful start.)

Alternatively, you can start the BERT service in a Docker container:

docker build -t bert-as-service -f ./docker/Dockerfile .
NUM_WORKER=1
PATH_MODEL=/PATH_TO/_YOUR_MODEL/
docker run --runtime nvidia -dit -p 5555:5555 -p 5556:5556 -v $PATH_MODEL:/model -t bert-as-service $NUM_WORKER

3. Use Client to Get Sentence Encodes

Now you can encode sentences simply as follows:

from bert_serving.client import BertClient
bc = BertClient()
bc.encode(['First do it', 'then do it right', 'then do it better'])

It will return an ndarray (or List[List[float]] if you wish), in which each row is a fixed-length vector representing a sentence. Have thousands of sentences? Just encode them! Don't even bother to batch; the server will take care of it.
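
Once each sentence is a row of numbers, a common next step is to compare rows. The sketch below ranks sentences by cosine similarity using plain Python lists. The three rows are made-up stand-ins so the example runs without a server; with a running service they would come from `bc.encode([...])`, and `cosine_similarity` is an illustrative helper, not part of the client.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Stand-in rows (invented values), one fixed-length vector per sentence.
rows = [
    [1.0, 0.0, 1.0],  # query sentence
    [1.0, 0.0, 0.5],  # points in a similar direction to the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
]

query = rows[0]
# Indices of the other rows, most similar to the query first.
ranked = sorted(
    range(1, len(rows)),
    key=lambda i: cosine_similarity(query, rows[i]),
    reverse=True,
)
print(ranked)
```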

As a feature of BERT, you may get encodes of a pair of sentences by concatenating them with ||| (with whitespace before and after), e.g.

bc.encode(['First do it ||| then do it right'])
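
When pairs come from two separate lists, a small helper keeps the separator consistent. This is a sketch: `make_pair` is a hypothetical convenience function, not part of the client API, and it only builds the input strings in the format described above.

```python
PAIR_SEPARATOR = " ||| "  # three bars with whitespace before and after


def make_pair(first, second):
    """Join two sentences into the single-string pair format."""
    return first.strip() + PAIR_SEPARATOR + second.strip()


firsts = ["First do it", "then do it right"]
seconds = ["then do it right", "then do it better"]
pairs = [make_pair(a, b) for a, b in zip(firsts, seconds)]
print(pairs)
# With a running server, the list could then be passed to bc.encode(pairs).
```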

(Screenshot omitted: the server console while encoding.)

Use BERT Service Remotely

One may also start the service on one (GPU) machine and call it from another (CPU) machine as follows:

# on another CPU machine
from bert_serving.client import BertClient
bc = BertClient(ip='xx.xx.xx.xx')  # ip address of the GPU machine
bc.encode(['First do it', 'then do it right', 'then do it better'])

Note that in this case you only need pip install -U bert-serving-client; the server side is not required on the client machine. You may also call the service via HTTP requests.
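
For the HTTP route, the sketch below builds (but does not send) a JSON POST request with the standard library. The `/encode` path, the payload keys `id`, `texts` and `is_tokenized`, and port 8125 are assumptions based on the project's HTTP tutorial, which is not reproduced here, so verify them against the documentation. The server presumably also needs to be started with `http_port` set, since that option defaults to None.

```python
import json
from urllib.request import Request


def build_encode_request(host, http_port, texts, request_id=1):
    """Build a POST request for the (assumed) JSON encode endpoint."""
    payload = {"id": request_id, "texts": texts, "is_tokenized": False}
    return Request(
        url="http://{}:{}/encode".format(host, http_port),  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_encode_request("127.0.0.1", 8125, ["First do it", "then do it right"])
print(req.get_method(), req.full_url)
# Sending it requires a running server, e.g. urllib.request.urlopen(req).
```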

:bulb: Want to learn more? Check out our tutorials:

  • Building a QA semantic search engine in 3 min.
  • Serving a fine-tuned BERT model
  • Getting ELMo-like contextual word embedding
  • Using your own tokenizer
  • Using BertClient with tf.data API
  • Training a text classifier using BERT features and tf.estimator API
  • Saving and loading with TFRecord data
  • Asynchronous encoding
  • Broadcasting to multiple clients
  • Monitoring the service status in a dashboard
  • Using bert-as-service to serve HTTP requests in JSON
  • Starting BertServer from Python

Server and Client API

The best way to learn the latest bert-as-service API is to read the documentation.

Server API

Please always refer to the latest server-side API in the documentation. You can also get the latest usage via:

bert-serving-start --help
bert-serving-terminate --help
bert-serving-benchmark --help

| Argument | Type | Default | Description |
|---|---|---|---|
| model_dir | str | Required | folder path of the pre-trained BERT model. |
| tuned_model_dir | str | (Optional) | folder path of a fine-tuned BERT model. |
| ckpt_name | str | bert_model.ckpt | filename of the checkpoint file. |
| config_name | str | bert_config.json | filename of the JSON config file for BERT model. |
| graph_tmp_dir | str | None | path to graph temp file |
| max_seq_len | int | 25 | maximum length of sequence, longer sequence will be trimmed on the right side. Set it to NONE for dynamically using the longest sequence in a (mini)batch. |
| cased_tokenization | bool | False | Whether tokenizer should skip the default lowercasing and accent removal. Should be used for e.g. the multilingual cased pretrained BERT model. |
| mask_cls_sep | bool | False | masking the embedding on [CLS] and [SEP] with zero. |
| num_worker | int | 1 | number of (GPU/CPU) worker runs BERT model, each works in a separate process. |
| max_batch_size | int | 256 | maximum number of sequences handled by each worker, larger batch will be partitioned into small batches. |
| priority_batch_size | int | 16 | batch smaller than this size will be labeled as high priority, and jumps forward in the job queue to get result faster |
| port | int | 5555 | port for pushing data from client to server |
| port_out | int | 5556 | port for publishing results from server to client |
| http_port | int | None | server port for receiving HTTP requests |
| cors | str | * | setting "Access-Control-Allow-Origin" for HTTP requests |
| pooling_strategy | str | REDUCE_MEAN | the pooling strategy for generating encoding vectors, valid values are NONE, REDUCE_MEAN, REDUCE_MAX, REDUCE_MEAN_MAX, CLS_TOKEN, FIRST_TOKEN, SEP_TOKEN, LAST_TOKEN. Explanation of these strategies can be found here. To get encoding for each token in the sequence, please set this to NONE. |
`pooling
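
The descriptions of max_batch_size and priority_batch_size in the table above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the server's code: `partition` and `is_high_priority` are hypothetical helpers, and how exactly the server applies the priority threshold is not specified here.

```python
MAX_BATCH_SIZE = 256      # default of max_batch_size in the table
PRIORITY_BATCH_SIZE = 16  # default of priority_batch_size in the table


def partition(sentences, max_batch_size=MAX_BATCH_SIZE):
    """Split a large request into chunks of at most max_batch_size."""
    return [
        sentences[i:i + max_batch_size]
        for i in range(0, len(sentences), max_batch_size)
    ]


def is_high_priority(batch, priority_batch_size=PRIORITY_BATCH_SIZE):
    """Batches smaller than the threshold are labeled high priority."""
    return len(batch) < priority_batch_size


jobs = partition(["sentence %d" % i for i in range(600)])
sizes = [len(job) for job in jobs]
print(sizes)                                  # chunk sizes for a 600-sentence request
print(is_high_priority(["hi", "there"]))      # a tiny request jumps the queue
```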

Core symbols most depended-on inside this repo

info · called by 34 · server/bert_serving/server/helper.py
encode · called by 16 · client/bert_serving/client/__init__.py
create_initializer · called by 10 · server/bert_serving/server/bert/modeling.py
get_shape_list · called by 9 · server/bert_serving/server/bert/modeling.py
set_logger · called by 8 · server/bert_serving/server/helper.py
warning · called by 7 · server/bert_serving/server/helper.py
error · called by 6 · server/bert_serving/server/helper.py
patch_dtype · called by 5 · server/bert_serving/server/graph.py

Shape

Method 135
Function 61
Class 25
Route 3

Languages

Python 100%

Modules by API surface

server/bert_serving/server/__init__.py · 43 symbols
client/bert_serving/client/__init__.py · 38 symbols
server/bert_serving/server/bert/modeling.py · 30 symbols
server/bert_serving/server/helper.py · 29 symbols
server/bert_serving/server/bert/tokenization.py · 27 symbols
server/bert_serving/server/http.py · 10 symbols
server/bert_serving/server/zmq_decor.py · 9 symbols
server/bert_serving/server/bert/extract_features.py · 8 symbols
server/bert_serving/server/graph.py · 6 symbols
server/bert_serving/server/bert/optimization.py · 6 symbols
server/bert_serving/server/benchmark.py · 4 symbols
server/bert_serving/server/cli/__init__.py · 3 symbols

Dependencies from manifests, versioned

GPUtil 1.3.0 · 1×
pyzmq 17.1.0 · 1×

For agents

$ claude mcp add clip-as-service \
  -- python -m otcore.mcp_server <graph>