hub / github.com/InternLM/lmdeploy / TurbomindEngineConfig

Class TurbomindEngineConfig

lmdeploy/messages.py:206–330

TurboMind Engine config.

Source

@pydantic_dataclass
class TurbomindEngineConfig:
    """TurboMind Engine config.

    Args:
        dtype: data type for model weights and activations. It can be
            one of the following values, ['auto', 'float16', 'bfloat16']
            The `auto` option will use FP16 precision for FP32 and FP16
            models, and BF16 precision for BF16 models.
        model_format: the layout of the deployed model. It can be one
            of the following values [hf, awq, gptq, compressed-tensors,
            fp8, mxfp4]. `hf` means a Hugging Face model (.bin,
            .safetensors), `awq` and `gptq` mean grouped 4-bit
            weight-only checkpoints, `compressed-tensors` means
            pack-quantized grouped int4 checkpoints and is usually
            auto-detected from the input model config, `fp8` means
            blocked fp8 checkpoints, and `mxfp4` means MXFP4 expert
            weights. If it is not specified, i.e. None, it will be
            extracted from the input model
        tp: the number of GPU cards used in tensor parallelism,
            default to 1
        session_len: the max session length of a sequence, default to
            None
        max_batch_size: the max batch size during inference. If it is
            not specified, the engine will automatically set it according to
            the device
        cache_max_entry_count: the percentage of gpu memory occupied
            by the k/v cache.
            For versions of lmdeploy between `v0.2.0` and `v0.2.1`, it
            defaults to 0.5, depicting the percentage of TOTAL GPU memory to
            be allocated to the k/v cache.
            For lmdeploy versions greater than `v0.2.1`, it defaults to 0.8,
            signifying the percentage of FREE GPU memory to be reserved for
            the k/v cache.
            When it's an integer > 0, it represents the total number of k/v
            blocks.
        cache_chunk_size: The policy to apply for KV block from
            the block manager, default to -1.
        cache_block_seq_len: the length of the token sequence in
            a k/v block, default to 64
        enable_prefix_caching: enable cache prompts for block reuse,
            default to False
        quant_policy: default to 0. For TurboMind, when k/v is quantized
            into int4 or int8, set it to 4 or 8, respectively
        rope_scaling_factor: scaling factor used for dynamic ntk,
            default to 0. TurboMind follows the implementation of transformer
            LlamaAttention
        use_logn_attn: whether or not to use log attn: default to False
        download_dir: Directory to download and load the weights,
            default to the default cache directory of huggingface.
        revision: The specific model version to use. It can be a branch
            name, a tag name, or a commit id. If unspecified, will use the
            default version.
        max_prefill_token_num: the number of tokens each iteration during
            prefill, default to 8192
        num_tokens_per_iter: the number of tokens processed in each
            forward pass. Working with `max_prefill_iters` enables the
            "Dynamic SplitFuse"-like scheduling
        max_prefill_iters: the max number of forward pass during prefill
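
The argument semantics documented in the excerpt can be sketched with a small stand-in. This is not lmdeploy's implementation: `EngineConfigSketch` and `kv_cache_blocks` are hypothetical names, the validation rules are only inferred from the value lists in the docstring, and the `dtype` default of `'auto'` is an assumption (the excerpt does not state it). The other defaults (`tp=1`, `cache_max_entry_count=0.8`, `cache_block_seq_len=64`, `quant_policy=0`, `max_prefill_token_num=8192`) are taken from the docstring. The helper illustrates the post-`v0.2.1` reading of `cache_max_entry_count`: a float is a fraction of FREE GPU memory, an integer is a total block count.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class EngineConfigSketch:
    """Hypothetical stand-in mirroring a subset of the documented fields."""

    dtype: str = 'auto'  # assumed default; not stated in the excerpt
    model_format: Optional[str] = None  # None: extracted from the input model
    tp: int = 1
    session_len: Optional[int] = None
    cache_max_entry_count: Union[int, float] = 0.8
    cache_block_seq_len: int = 64
    enable_prefix_caching: bool = False
    quant_policy: int = 0
    max_prefill_token_num: int = 8192

    def __post_init__(self):
        # Checks inferred from the docstring; the real class may validate
        # differently or accept more values.
        if self.dtype not in ('auto', 'float16', 'bfloat16'):
            raise ValueError(f'unsupported dtype: {self.dtype!r}')
        if self.tp < 1:
            raise ValueError('tp must be >= 1')
        if self.quant_policy not in (0, 4, 8):
            raise ValueError('quant_policy must be 0, 4 or 8')
        if self.cache_max_entry_count <= 0:
            raise ValueError('cache_max_entry_count must be > 0')


def kv_cache_blocks(config, free_gpu_bytes, bytes_per_block):
    """Return the number of k/v blocks implied by cache_max_entry_count.

    An integer is taken as the total block count. A float is taken as the
    fraction of free GPU memory reserved for the k/v cache.
    """
    count = config.cache_max_entry_count
    if isinstance(count, int):
        return count
    return int(free_gpu_bytes * count) // bytes_per_block


if __name__ == '__main__':
    # Illustrative numbers only: 1000 free bytes, 100 bytes per block.
    cfg = EngineConfigSketch(tp=2, quant_policy=8, cache_max_entry_count=0.5)
    print(kv_cache_blocks(cfg, free_gpu_bytes=1000, bytes_per_block=100))  # prints 5
```

The byte sizes in the usage lines are placeholders; in practice the per-block size would depend on the model, `cache_block_seq_len`, and `quant_policy`, none of which this sketch models.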

Callers 15

__init__ · Method · 0.90
__init__ · Method · 0.90
build_pipe · Function · 0.90
api_server · Method · 0.90
run_pipeline_chat_test · Function · 0.90
run_pipeline_mllm_test · Function · 0.90
passkey_retrival_worker · Function · 0.90
main · Function · 0.90
main · Function · 0.90
main · Function · 0.90
run_smoke_infer · Function · 0.90
main · Function · 0.90

Calls

no outgoing calls

Tested by 5

passkey_retrival_worker · Function · 0.72
run_smoke_infer · Function · 0.72
main · Function · 0.72
backend_config · Method · 0.72