github.com/jina-ai/clip-as-service @v1.10.0 sqlite

224 symbols 638 edges 30 files 70 documented · 31%

README

bert-as-service

Using BERT model as a sentence encoding service, i.e. mapping a variable-length sentence to a fixed-length vector.

Highlights • What is it • Install • Getting Started • API • Tutorials • FAQ • Benchmark • Blog

<img src="https://github.com/jina-ai/clip-as-service/raw/v1.10.0/github/demo.gif?raw=true" width="700">

Made by Han Xiao • :globe_with_meridians: https://hanxiao.github.io

✨ Looking for X-as-service, or a more generic, cloud-native solution? Check out my new project GNES! GNES (Generic Neural Elastic Search) is a cloud-native semantic search system based on deep neural networks. GNES enables large-scale indexing and semantic search for text-to-text, image-to-image, video-to-video and any-to-any content forms.

What is it

BERT is an NLP model developed by Google for pre-training language representations. It leverages an enormous amount of plain text data publicly available on the web and is trained in an unsupervised manner. Pre-training a BERT model is a fairly expensive yet one-time procedure for each language. Fortunately, Google released several pre-trained models, which you can download from here.

Sentence encoding/embedding is an upstream task required in many NLP applications, e.g. sentiment analysis and text classification. The goal is to represent a variable-length sentence as a fixed-length vector, e.g. hello world as [0.1, 0.3, 0.9]. Each element of the vector should "encode" some semantics of the original sentence.

Finally, bert-as-service uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length representations in just two lines of code.

Highlights

:telescope: State-of-the-art: built on the pretrained 12/24-layer BERT models released by Google AI, which are considered a milestone in the NLP community.
:hatching_chick: Easy-to-use: requires only two lines of code to get sentence/token-level encodes.
:zap: Fast: 900 sentences/s on a single Tesla M40 24GB. Low latency, optimized for speed. See benchmark.
:octopus: Scalable: scales nicely and smoothly on multiple GPUs and multiple clients without worrying about concurrency. See benchmark.
:gem: Reliable: tested on multi-billion sentences; days of running without a break or OOM or any nasty exceptions.

More features: XLA & FP16 support; mixed GPU-CPU workloads; optimized graph; tf.data friendly; customized tokenizer; flexible pooling strategy; built-in HTTP server and dashboard; async encoding; multicasting; etc.

Install

Install the server and client via pip. They can be installed separately or even on different machines:

pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`

Note that the server MUST run on Python >= 3.5 with TensorFlow >= 1.10 (one-point-ten). Again, the server does not support Python 2!

:point_up: The client can run on both Python 2 and 3, for the following consideration.
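Before installing the server package, it can help to confirm that the interpreter meets the stated Python requirement. The snippet below is a minimal sketch; `server_supported` is a hypothetical helper name, not part of bert-as-service, and it checks only the Python version, not the TensorFlow version.

```python
import sys


def server_supported(version_info=sys.version_info):
    """Return True if the interpreter satisfies the server's Python >= 3.5 rule.

    `server_supported` is an illustrative helper, not a bert-as-service API.
    """
    # Compare only (major, minor); tuple comparison handles 3.10 > 3.5 correctly.
    return tuple(version_info[:2]) >= (3, 5)


if __name__ == "__main__":
    print("server supported on this interpreter:", server_supported())
```

The client has no such restriction, so this check is only relevant on the machine that will run `bert-serving-start`.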

Getting Started

1. Download a Pre-trained BERT Model

Download a model listed below, then uncompress the zip file into some folder, say /tmp/english_L-12_H-768_A-12/

List of released pretrained BERT models:

BERT-Base, Uncased	12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Uncased	24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Cased	12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Cased	24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Multilingual Cased (New)	104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Multilingual Cased (Old)	102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Chinese	Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

Optional: fine-tuning the model on your downstream task. Why is it optional?

2. Start the BERT service

After installing the server, you should be able to use the bert-serving-start CLI as follows:

bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4

This will start a service with four workers, meaning that it can handle up to four concurrent requests. More concurrent requests will be queued in a load balancer. Details can be found in our FAQ and the benchmark on number of clients.
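The queueing behaviour described above can be illustrated with a generic worker pool. This is only an analogy built on Python's standard `concurrent.futures`; it is not how bert-as-service is implemented (the real server uses ZeroMQ and separate worker processes), and `handle_request` and `run_requests` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

NUM_WORKER = 4  # mirrors -num_worker=4 in the command above

_lock = threading.Lock()
_active = 0  # requests being processed right now
_peak = 0    # highest concurrency observed so far


def handle_request(request_id):
    """Pretend to encode one request while tracking how many run at once."""
    global _active, _peak
    with _lock:
        _active += 1
        _peak = max(_peak, _active)
    time.sleep(0.01)  # stand-in for model inference
    with _lock:
        _active -= 1
    return request_id


def run_requests(num_requests):
    """Submit every request at once; the pool queues whatever exceeds NUM_WORKER."""
    with ThreadPoolExecutor(max_workers=NUM_WORKER) as pool:
        results = list(pool.map(handle_request, range(num_requests)))
    return results, _peak


results, peak = run_requests(10)
print(len(results), "requests served, peak concurrency", peak)
```

All ten requests complete, but no more than four are ever in flight at the same time; the remaining ones wait in the pool's queue, which is the role the load balancer plays in the service.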

Below shows what the server looks like when starting correctly:

Alternatively, you can start the BERT service in a Docker container:

docker build -t bert-as-service -f ./docker/Dockerfile .
NUM_WORKER=1
PATH_MODEL=/PATH_TO/_YOUR_MODEL/
docker run --runtime nvidia -dit -p 5555:5555 -p 5556:5556 -v $PATH_MODEL:/model -t bert-as-service $NUM_WORKER

3. Use Client to Get Sentence Encodes

Now you can encode sentences simply as follows:

from bert_serving.client import BertClient
bc = BertClient()
bc.encode(['First do it', 'then do it right', 'then do it better'])

It returns an ndarray (or List[List[float]] if you wish), in which each row is a fixed-length vector representing a sentence. Have thousands of sentences? Just encode! Don't even bother to batch; the server will take care of it.
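Because each row is a fixed-length vector, rows can be compared directly. The following is a minimal sketch of cosine similarity over the List[List[float]] form of the result; `fake_encodes` is made-up placeholder data standing in for the output of `bc.encode(...)`, and both function names are hypothetical helpers rather than part of the client.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, vectors):
    """Index of the row most similar to vectors[query_index], excluding itself."""
    scores = [
        (cosine_similarity(vectors[query_index], row), index)
        for index, row in enumerate(vectors)
        if index != query_index
    ]
    return max(scores)[1]


# Placeholder rows; real vectors would come from bc.encode([...]).
fake_encodes = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 0.1, 0.0]]
print("row closest to row 0:", most_similar(0, fake_encodes))
```

With real encodes the vectors are much longer (768 values per row for a BERT-Base model), but the comparison works the same way.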

As a feature of BERT, you may get encodes of a pair of sentences by concatenating them with ||| (with whitespace before and after), e.g.

bc.encode(['First do it ||| then do it right'])
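When sentence pairs come from data rather than hand-written strings, a small helper keeps the separator consistent. This is a sketch only; `join_pair` and `PAIR_SEPARATOR` are hypothetical names, and rejecting inputs that already contain `|||` is a defensive assumption of this example, not documented server behaviour.

```python
PAIR_SEPARATOR = " ||| "  # whitespace before and after, as described above


def join_pair(first, second):
    """Join two sentences into the single-string pair format."""
    if "|||" in first or "|||" in second:
        # Assumption of this sketch: an embedded separator would be ambiguous.
        raise ValueError("sentences must not already contain '|||'")
    return first.strip() + PAIR_SEPARATOR + second.strip()


pairs = [
    ("First do it", "then do it right"),
    ("then do it right", "then do it better"),
]
inputs = [join_pair(first, second) for first, second in pairs]
print(inputs)
# The resulting list can then be passed to bc.encode(inputs).
```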

Below shows what the server looks like while encoding:

Use BERT Service Remotely

One may also start the service on one (GPU) machine and call it from another (CPU) machine as follows:

# on another CPU machine
from bert_serving.client import BertClient
bc = BertClient(ip='xx.xx.xx.xx')  # ip address of the GPU machine
bc.encode(['First do it', 'then do it right', 'then do it better'])

Note that you only need pip install -U bert-serving-client in this case; the server side is not required. You may also call the service via HTTP requests.
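As a rough sketch of the HTTP route, the snippet below builds a JSON POST request with the standard library and does not send it. The `/encode` path and the `id`, `texts` and `is_tokenized` fields reflect my understanding of the project's HTTP tutorial and should be treated as assumptions to verify against the documentation for your version. The host `127.0.0.1`, the port `8125` and the helper name `build_encode_request` are illustrative; the server only listens for HTTP when it is started with an `http_port`.

```python
import json
import urllib.request


def build_encode_request(host, http_port, texts, request_id=0):
    """Build (but do not send) a JSON POST request for the HTTP encode route.

    The route and field names are assumptions taken from the project's
    HTTP tutorial; check them against the docs for your server version.
    """
    payload = {"id": request_id, "texts": list(texts), "is_tokenized": False}
    return urllib.request.Request(
        url="http://{}:{}/encode".format(host, http_port),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_encode_request(
    "127.0.0.1", 8125, ["First do it", "then do it right"], request_id=123
)
print(request.get_method(), request.full_url)

# To actually send it (requires a running server with an HTTP port enabled):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Going through HTTP avoids installing the Python client at all, which can be convenient when the caller is not written in Python.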

:bulb: Want to learn more? Check out our tutorials:
- Building a QA semantic search engine in 3 min.
- Serving a fine-tuned BERT model
- Getting ELMo-like contextual word embedding
- Using your own tokenizer
- Using BertClient with tf.data API
- Training a text classifier using BERT features and tf.estimator API
- Saving and loading with TFRecord data
- Asynchronous encoding
- Broadcasting to multiple clients
- Monitoring the service status in a dashboard
- Using bert-as-service to serve HTTP requests in JSON
- Starting BertServer from Python

Server and Client API

The best way to learn the latest bert-as-service API is to read the documentation.

Server API

Please always refer to the latest server-side API documented here. You can get the latest usage via:

bert-serving-start --help
bert-serving-terminate --help
bert-serving-benchmark --help

Argument	Type	Default	Description
`model_dir`	str	Required	folder path of the pre-trained BERT model.
`tuned_model_dir`	str	(Optional)	folder path of a fine-tuned BERT model.
`ckpt_name`	str	`bert_model.ckpt`	filename of the checkpoint file.
`config_name`	str	`bert_config.json`	filename of the JSON config file for BERT model.
`graph_tmp_dir`	str	None	path to graph temp file
`max_seq_len`	int	`25`	maximum length of a sequence; longer sequences will be trimmed on the right side. Set it to NONE to dynamically use the longest sequence in a (mini)batch.
`cased_tokenization`	bool	False	Whether tokenizer should skip the default lowercasing and accent removal. Should be used for e.g. the multilingual cased pretrained BERT model.
`mask_cls_sep`	bool	False	masking the embedding on [CLS] and [SEP] with zero.
`num_worker`	int	`1`	number of (GPU/CPU) workers running the BERT model, each in a separate process.
`max_batch_size`	int	`256`	maximum number of sequences handled by each worker; a larger batch will be partitioned into smaller batches.
`priority_batch_size`	int	`16`	a batch smaller than this size is labeled as high priority and jumps forward in the job queue to get its result faster
`port`	int	`5555`	port for pushing data from client to server
`port_out`	int	`5556`	port for publishing results from server to client
`http_port`	int	None	server port for receiving HTTP requests
`cors`	str	`*`	setting "Access-Control-Allow-Origin" for HTTP requests
`pooling_strategy`	str	`REDUCE_MEAN`	the pooling strategy for generating encoding vectors, valid values are `NONE`, `REDUCE_MEAN`, `REDUCE_MAX`, `REDUCE_MEAN_MAX`, `CLS_TOKEN`, `FIRST_TOKEN`, `SEP_TOKEN`, `LAST_TOKEN`. Explanation of these strategies can be found here. To get encoding for each token in the sequence, please set this to `NONE`.
`pooling
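To make the `pooling_strategy` values listed above more concrete, here is a simplified sketch of how per-token vectors could be collapsed into one sentence vector. It is not the server's implementation: it uses plain Python lists, assumes the input holds only real tokens (so it ignores padding and masking), assumes [CLS] is the first token and [SEP] the last, and treats `REDUCE_MEAN_MAX` as the mean vector followed by the max vector. The function name `pool` is hypothetical.

```python
def pool(token_vectors, strategy="REDUCE_MEAN"):
    """Collapse a list of per-token vectors according to a pooling strategy name.

    Simplified illustration: no padding mask, [CLS] assumed first, [SEP] assumed last.
    """
    if strategy == "NONE":
        # No pooling: keep one vector per token.
        return [list(vector) for vector in token_vectors]

    columns = list(zip(*token_vectors))  # one tuple per embedding dimension
    mean = [sum(column) / len(column) for column in columns]
    peak = [max(column) for column in columns]

    if strategy == "REDUCE_MEAN":
        return mean
    if strategy == "REDUCE_MAX":
        return peak
    if strategy == "REDUCE_MEAN_MAX":
        return mean + peak  # concatenation doubles the vector length
    if strategy in ("CLS_TOKEN", "FIRST_TOKEN"):
        return list(token_vectors[0])
    if strategy in ("SEP_TOKEN", "LAST_TOKEN"):
        return list(token_vectors[-1])
    raise ValueError("unknown pooling strategy: {}".format(strategy))


# Three made-up token vectors with two dimensions each.
tokens = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]
for name in ("REDUCE_MEAN", "REDUCE_MAX", "REDUCE_MEAN_MAX", "CLS_TOKEN", "SEP_TOKEN"):
    print(name, pool(tokens, name))
```

Note how `NONE` returns one vector per token rather than a single sentence vector, which is why it is the setting to use for token-level encodes.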

Core symbols most depended-on inside this repo

Symbol	Called by	File
info	34	server/bert_serving/server/helper.py
encode	16	client/bert_serving/client/__init__.py
create_initializer	10	server/bert_serving/server/bert/modeling.py
get_shape_list	9	server/bert_serving/server/bert/modeling.py
set_logger	8	server/bert_serving/server/helper.py
warning	7	server/bert_serving/server/helper.py
error	6	server/bert_serving/server/helper.py
patch_dtype	5	server/bert_serving/server/graph.py

Shape

Method 135

Function 61

Class 25

Route 3

Languages

Python 100%

Modules by API surface

Module	Symbols
server/bert_serving/server/__init__.py	43
client/bert_serving/client/__init__.py	38
server/bert_serving/server/bert/modeling.py	30
server/bert_serving/server/helper.py	29
server/bert_serving/server/bert/tokenization.py	27
server/bert_serving/server/http.py	10
server/bert_serving/server/zmq_decor.py	9
server/bert_serving/server/bert/extract_features.py	8
server/bert_serving/server/graph.py	6
server/bert_serving/server/bert/optimization.py	6
server/bert_serving/server/benchmark.py	4
server/bert_serving/server/cli/__init__.py	3

Dependencies from manifests, versioned

GPUtil 1.3.0 · 1×

pyzmq 17.1.0 · 1×
