Using BERT model as a sentence encoding service, i.e. mapping a variable-length sentence to a fixed-length vector.
<img src="https://github.com/jina-ai/clip-as-service/raw/v1.10.0/github/demo.gif?raw=true" width="700">
✨ Looking for X-as-service, or a more generic and cloud-native solution? Check out my new project GNES! GNES (Generic Neural Elastic Search) is a cloud-native semantic search system based on deep neural networks. It enables large-scale indexing and semantic search for text-to-text, image-to-image, video-to-video and any-to-any content.
BERT is an NLP model developed by Google for pre-training language representations. It leverages an enormous amount of plain text data publicly available on the web and is trained in an unsupervised manner. Pre-training a BERT model is a fairly expensive yet one-time procedure for each language. Fortunately, Google has released several pre-trained models that you can download.
Sentence Encoding/Embedding is an upstream task required in many NLP applications, e.g. sentiment analysis and text classification. The goal is to represent a variable-length sentence as a fixed-length vector, e.g. `hello world` as `[0.1, 0.3, 0.9]`. Each element of the vector should "encode" some semantics of the original sentence.
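To make the idea of a fixed-length encoding concrete, here is a toy sketch in plain Python. It is not BERT: `toy_token_vector` is a made-up stand-in that derives a tiny vector from a hash of each token, and averaging is just one possible way to collapse per-token vectors into a single sentence vector. The only point is that sentences of different lengths come out as vectors of the same size.

```python
import hashlib

DIM = 4  # toy vector size, chosen for readability only


def toy_token_vector(token, dim=DIM):
    """Hypothetical stand-in for a learned embedding: hash the token to `dim` floats in [0, 1]."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def encode_sentence(sentence, dim=DIM):
    """Average the token vectors so every sentence maps to exactly `dim` numbers."""
    tokens = sentence.lower().split()
    if not tokens:
        return [0.0] * dim
    vectors = [toy_token_vector(t, dim) for t in tokens]
    # zip(*vectors) walks the vectors column by column.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


short = encode_sentence("hello world")
longer = encode_sentence("a much longer sentence still yields the same number of values")
print(len(short), len(longer))  # 4 4
```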
Finally, bert-as-service uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length representations in just two lines of code.
More features: XLA & FP16 support; mix GPU-CPU workloads; optimized graph; tf.data friendly; customized tokenizer; flexible pooling strategy; built-in HTTP server and dashboard; async encoding; multicasting; etc.
Install the server and client via pip. They can be installed separately or even on different machines:
pip install bert-serving-server # server
pip install bert-serving-client # client, independent of `bert-serving-server`
Note that the server MUST be running on Python >= 3.5 with TensorFlow >= 1.10 (one-point-ten). Again, the server does not support Python 2!
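The "one-point-ten" remark matters because 1.10 is newer than 1.9, even though it looks smaller when read as a decimal number. A minimal sketch of a pre-flight check that compares versions component by component follows; `version_tuple` and `meets_minimum` are hypothetical helper names and are not part of bert-as-service.

```python
import sys


def version_tuple(version):
    """Turn '1.10.0' into (1, 10, 0); parsing stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version, minimum):
    """Compare two dotted versions numerically, padding the shorter one with zeros."""
    have, need = version_tuple(version), version_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


print(meets_minimum("1.10.0", "1.10"))  # True
print(meets_minimum("1.9.0", "1.10"))   # False: 1.9 is older than 1.10
print(float("1.10") >= float("1.9"))    # False: why a float comparison is misleading here
print(sys.version_info >= (3, 5))       # the Python requirement, checked on this interpreter
```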
:point_up: The client can run on both Python 2 and Python 3.
Download a model listed below, then uncompress the zip file into some folder, say /tmp/english_L-12_H-768_A-12/
List of released pretrained BERT models:

| Model | Description |
|---|---|
| BERT-Base, Uncased | 12-layer, 768-hidden, 12-heads, 110M parameters |
| BERT-Large, Uncased | 24-layer, 1024-hidden, 16-heads, 340M parameters |
| BERT-Base, Cased | 12-layer, 768-hidden, 12-heads, 110M parameters |
| BERT-Large, Cased | 24-layer, 1024-hidden, 16-heads, 340M parameters |
| BERT-Base, Multilingual Cased (New) | 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters |
| BERT-Base, Multilingual Cased (Old) | 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters |
| BERT-Base, Chinese | Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters |
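Once the archive is unpacked, a quick sanity check of the folder can save a failed server start. The sketch below is a hypothetical helper, not something shipped with bert-as-service. It assumes the file names match the server's `ckpt_name` and `config_name` defaults (`bert_model.ckpt` and `bert_config.json`) and that the checkpoint is stored as several files sharing that prefix, as TensorFlow checkpoints usually are.

```python
import os


def missing_model_files(model_dir, ckpt_name="bert_model.ckpt", config_name="bert_config.json"):
    """Return the expected entries that are absent from `model_dir`."""
    entries = os.listdir(model_dir) if os.path.isdir(model_dir) else []
    missing = []
    if config_name not in entries:
        missing.append(config_name)
    # Match on the prefix (e.g. bert_model.ckpt.index) rather than an exact name.
    if not any(name.startswith(ckpt_name) for name in entries):
        missing.append(ckpt_name + "*")
    return missing


# An empty list means the folder looks usable as -model_dir.
print(missing_model_files("/tmp/english_L-12_H-768_A-12/"))
```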
Optional: fine-tuning the model on your downstream task. Why is it optional?
After installing the server, you should be able to use the `bert-serving-start` CLI as follows:
bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4
This will start a service with four workers, meaning that it can handle up to four concurrent requests. More concurrent requests will be queued in a load balancer. Details can be found in our FAQ and the benchmark on number of clients.
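The queuing behaviour can be pictured with a small analogy in plain Python. This is only an illustration under stated assumptions: the real server runs each worker in a separate process and balances load over ZeroMQ, whereas the sketch uses a thread pool, and `handle` is a hypothetical stand-in for encoding one request. With four workers and ten requests, at most four run at once while the rest wait their turn.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

NUM_WORKER = 4
state = {"active": 0, "peak": 0}
lock = threading.Lock()


def handle(request_id):
    """Pretend to encode one request while tracking how many run at once."""
    with lock:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.02)  # stand-in for the model's forward pass
    with lock:
        state["active"] -= 1
    return request_id


# Ten requests arrive; the pool runs four at a time and queues the others.
with ThreadPoolExecutor(max_workers=NUM_WORKER) as pool:
    results = list(pool.map(handle, range(10)))

print(results)                      # every request completes, in submission order
print(state["peak"] <= NUM_WORKER)  # True
```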
Below shows what the server looks like when starting correctly:

Alternatively, one can start the BERT service in a Docker container:
docker build -t bert-as-service -f ./docker/Dockerfile .
NUM_WORKER=1
PATH_MODEL=/PATH_TO/_YOUR_MODEL/
docker run --runtime nvidia -dit -p 5555:5555 -p 5556:5556 -v $PATH_MODEL:/model -t bert-as-service $NUM_WORKER
Now you can encode sentences simply as follows:
from bert_serving.client import BertClient
bc = BertClient()
bc.encode(['First do it', 'then do it right', 'then do it better'])
It will return an `ndarray` (or `List[List[float]]` if you wish), in which each row is a fixed-length vector representing a sentence. Have thousands of sentences? Just encode them! Don't even bother to batch; the server will take care of it.
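For intuition about what "the server will take care of it" involves, here is a sketch of splitting a large request into smaller batches. It is not the server's scheduling code. `partition` is a hypothetical helper, and the value 256 simply mirrors the default of the `max_batch_size` server argument.

```python
def partition(sentences, max_batch_size=256):
    """Yield consecutive slices of at most `max_batch_size` sentences."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be positive")
    for start in range(0, len(sentences), max_batch_size):
        yield sentences[start:start + max_batch_size]


sentences = ["sentence %d" % i for i in range(1000)]
print([len(batch) for batch in partition(sentences)])  # [256, 256, 256, 232]
```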
As a feature of BERT, you may get the encoding of a pair of sentences by concatenating them with `|||` (with whitespace before and after), e.g.
bc.encode(['First do it ||| then do it right'])
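The separator convention can be illustrated with a small helper that splits such a string back into its two sentences. This is a hypothetical sketch of the idea, not the parsing code used by the server, and `split_pair` is an invented name.

```python
def split_pair(text, separator=" ||| "):
    """Split 'a ||| b' into ('a', 'b'); return (text, None) when no separator is present."""
    if separator in text:
        first, second = text.split(separator, 1)
        return first.strip(), second.strip()
    return text.strip(), None


print(split_pair("First do it ||| then do it right"))  # ('First do it', 'then do it right')
print(split_pair("First do it"))                       # ('First do it', None)
```

Because the separator includes the surrounding spaces, a string such as `a|||b` is treated as a single sentence, which matches the "whitespace before and after" requirement.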
Below shows what the server looks like while encoding:

One may also start the service on one (GPU) machine and call it from another (CPU) machine as follows:
# on another CPU machine
from bert_serving.client import BertClient
bc = BertClient(ip='xx.xx.xx.xx') # ip address of the GPU machine
bc.encode(['First do it', 'then do it right', 'then do it better'])
Note that you only need `pip install -U bert-serving-client` in this case; the server side is not required. You may also call the service via HTTP requests.
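A rough sketch of what such an HTTP call could look like from Python's standard library is given below. Several details are assumptions to verify against the project's HTTP tutorial: the `/encode` path, the JSON fields `id`, `texts` and `is_tokenized`, and the example port 8125 (the server would need to be started with `-http_port` set). `build_encode_request` is a hypothetical helper name. The request is only constructed here, not sent, since sending requires a running server.

```python
import json
import urllib.request


def build_encode_request(host, http_port, texts, request_id=1):
    """Build (but do not send) a JSON POST request for the assumed /encode endpoint."""
    payload = {"id": request_id, "texts": list(texts), "is_tokenized": False}
    return urllib.request.Request(
        url="http://%s:%d/encode" % (host, http_port),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_encode_request("xx.xx.xx.xx", 8125, ["First do it", "then do it right"])
print(req.get_method(), req.full_url)

# With a live server, the request could be sent like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```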
:bulb: Want to learn more? Check out our tutorials:
- Building a QA semantic search engine in 3 min.
- Serving a fine-tuned BERT model
- Getting ELMo-like contextual word embedding
- Using your own tokenizer
- Using `BertClient` with `tf.data` API
- Training a text classifier using BERT features and tf.estimator API
- Saving and loading with TFRecord data
- Asynchronous encoding
- Broadcasting to multiple clients
- Monitoring the service status in a dashboard
- Using `bert-as-service` to serve HTTP requests in JSON
- Starting `BertServer` from Python
The best way to learn the latest bert-as-service API is to read the documentation.
Always refer to the latest server-side API in the documentation. You can also print the current usage via:
bert-serving-start --help
bert-serving-terminate --help
bert-serving-benchmark --help
| Argument | Type | Default | Description |
|---|---|---|---|
| `model_dir` | str | Required | folder path of the pre-trained BERT model. |
| `tuned_model_dir` | str | (Optional) | folder path of a fine-tuned BERT model. |
| `ckpt_name` | str | `bert_model.ckpt` | filename of the checkpoint file. |
| `config_name` | str | `bert_config.json` | filename of the JSON config file for BERT model. |
| `graph_tmp_dir` | str | None | path to graph temp file |
| `max_seq_len` | int | 25 | maximum length of sequence, longer sequence will be trimmed on the right side. Set it to NONE for dynamically using the longest sequence in a (mini)batch. |
| `cased_tokenization` | bool | False | Whether tokenizer should skip the default lowercasing and accent removal. Should be used for e.g. the multilingual cased pretrained BERT model. |
| `mask_cls_sep` | bool | False | masking the embedding on [CLS] and [SEP] with zero. |
| `num_worker` | int | 1 | number of (GPU/CPU) worker runs BERT model, each works in a separate process. |
| `max_batch_size` | int | 256 | maximum number of sequences handled by each worker, larger batch will be partitioned into small batches. |
| `priority_batch_size` | int | 16 | batch smaller than this size will be labeled as high priority, and jumps forward in the job queue to get result faster |
| `port` | int | 5555 | port for pushing data from client to server |
| `port_out` | int | 5556 | port for publishing results from server to client |
| `http_port` | int | None | server port for receiving HTTP requests |
| `cors` | str | `*` | setting "Access-Control-Allow-Origin" for HTTP requests |
| `pooling_strategy` | str | REDUCE_MEAN | the pooling strategy for generating encoding vectors, valid values are NONE, REDUCE_MEAN, REDUCE_MAX, REDUCE_MEAN_MAX, CLS_TOKEN, FIRST_TOKEN, SEP_TOKEN, LAST_TOKEN. Explanation of these strategies can be found here. To get encoding for each token in the sequence, please set this to NONE. |
| `pooling |
$ claude mcp add clip-as-service \
-- python -m otcore.mcp_server <graph>