@pydantic_dataclass
class TurbomindEngineConfig:
    """TurboMind Engine config.

    Args:
        dtype: data type for model weights and activations. It can be
            one of the following values: ['auto', 'float16', 'bfloat16'].
            The `auto` option uses FP16 precision for FP32 and FP16
            models, and BF16 precision for BF16 models.
        model_format: the layout of the deployed model. It can be one
            of the following values: [hf, awq, gptq, compressed-tensors,
            fp8, mxfp4]. `hf` means a Hugging Face model (.bin,
            .safetensors), `awq` and `gptq` mean grouped 4-bit
            weight-only checkpoints, `compressed-tensors` means
            pack-quantized grouped int4 checkpoints and is usually
            auto-detected from the input model config, `fp8` means
            blocked fp8 checkpoints, and `mxfp4` means MXFP4 expert
            weights. If it is not specified, i.e. None, it will be
            extracted from the input model.
        tp: the number of GPU cards used in tensor parallelism,
            defaults to 1
        session_len: the max session length of a sequence, defaults to
            None
        max_batch_size: the max batch size during inference. If it is
            not specified, the engine will set it automatically
            according to the device
        cache_max_entry_count: the fraction of GPU memory occupied by
            the k/v cache.
            For lmdeploy versions between `v0.2.0` and `v0.2.1`, it
            defaults to 0.5, denoting the fraction of TOTAL GPU memory
            allocated to the k/v cache.
            For lmdeploy versions greater than `v0.2.1`, it defaults to
            0.8, denoting the fraction of FREE GPU memory reserved for
            the k/v cache.
            When it is an integer > 0, it represents the total number of
            k/v blocks.
        cache_chunk_size: the policy for requesting k/v blocks from
            the block manager, defaults to -1
        cache_block_seq_len: the length of the token sequence in
            a k/v block, defaults to 64
        enable_prefix_caching: enable caching of prompts for block
            reuse, defaults to False
        quant_policy: defaults to 0. For TurboMind, set it to 4 or 8
            when the k/v cache is quantized to int4 or int8,
            respectively
        rope_scaling_factor: scaling factor used for dynamic NTK,
            defaults to 0. TurboMind follows the implementation of
            LlamaAttention in `transformers`
        use_logn_attn: whether to use logn attention, defaults to False
        download_dir: directory in which to download and load the
            weights, defaults to the default Hugging Face cache
            directory
        revision: the specific model version to use. It can be a branch
            name, a tag name, or a commit id. If unspecified, the
            default version is used.
        max_prefill_token_num: the number of tokens processed in each
            iteration during prefill, defaults to 8192
        num_tokens_per_iter: the number of tokens processed in each
            forward pass. Together with `max_prefill_iters`, it enables
            "Dynamic SplitFuse"-like scheduling
        max_prefill_iters: the max number of forward passes during
            prefill
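
The two readings of `cache_max_entry_count` described in the listing above (a fraction of memory versus an absolute block count) can be illustrated with a small standalone sketch. This is not lmdeploy's implementation: `kv_block_count`, `free_bytes`, and `block_bytes` are hypothetical names introduced only for illustration. The sketch assumes the behavior documented for versions after `v0.2.1`, where a fractional value refers to FREE GPU memory, and it assumes that fractional values lie strictly between 0 and 1.

```python
def kv_block_count(cache_max_entry_count, free_bytes, block_bytes):
    """Hypothetical helper: map cache_max_entry_count to a k/v block count.

    free_bytes  -- free GPU memory in bytes (assumed to be known by the caller)
    block_bytes -- size of one k/v block in bytes (assumed to be known)
    """
    if isinstance(cache_max_entry_count, bool) or cache_max_entry_count <= 0:
        raise ValueError("cache_max_entry_count must be a positive number")

    # An integer > 0 is read as the total number of k/v blocks.
    if isinstance(cache_max_entry_count, int):
        return cache_max_entry_count

    # A float in (0, 1) is read as the fraction of FREE GPU memory to
    # reserve; the reserved budget is then divided into whole blocks.
    if cache_max_entry_count < 1:
        reserved = int(free_bytes * cache_max_entry_count)
        return reserved // block_bytes

    # Assumption of this sketch: a float >= 1 is rejected.
    raise ValueError("a fractional value must lie in (0, 1)")


# Half of 1024 free bytes, with 64-byte blocks, gives 8 blocks.
blocks_from_fraction = kv_block_count(0.5, free_bytes=1024, block_bytes=64)
# An integer is passed through unchanged as the block count.
blocks_from_integer = kv_block_count(128, free_bytes=1024, block_bytes=64)
```

Keeping the integer and fractional cases in separate branches mirrors the docstring, which describes them as two distinct meanings of the same field rather than one continuous scale.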