@dataclass
class PytorchEngineConfig:
    """PyTorch Engine Config.

    Args:
        dtype: Data type for model weights and activations. It can be
            one of the following values: ['auto', 'float16', 'bfloat16'].
            The `auto` option will use FP16 precision for FP32 and FP16
            models, and BF16 precision for BF16 models.
        tp: Tensor Parallelism. Default 1.
        dp: Data Parallelism. Default 1.
        dp_rank: Rank of dp.
        ep: Expert Parallelism. Default 1.
        session_len: Max session length. Default None.
        max_batch_size: Max batch size. If it is not specified,
            the engine will automatically set it according to the device.
        attn_tp_size: tp size for attention, only works for dp>1.
        mlp_tp_size: tp size for mlp, only works for dp>1.
        moe_tp_size: tp size for moe, only works for dp>1.
        cache_max_entry_count: The percentage of GPU memory occupied
            by the k/v cache. For lmdeploy versions greater than `v0.2.1`,
            it defaults to 0.8, signifying the percentage of FREE GPU memory
            to be reserved for the k/v cache.
        prefill_interval: Interval to perform prefill.
            Default 16.
        block_size: Paging cache block size. Default 64.
        num_cpu_blocks: Number of CPU blocks. If it is 0, the cache
            will be allocated according to the current environment.
        num_gpu_blocks: Number of GPU blocks. If it is 0, the cache
            will be allocated according to the current environment.
        adapters: The path configs to LoRA adapters.
        max_prefill_token_num: Max number of tokens per prefill iteration.
        thread_safe: Whether to use a thread-safe engine instance.
        enable_prefix_caching: Enable token matching and cache sharing.
        device_type: The inference device type, options: ['cuda'].
        eager_mode: Enable "eager" mode or not.
        custom_module_map: nn module map customized by users. Once
            provided, the original nn modules of the model will be
            substituted by the mapped ones.
        download_dir: Directory to download and load the weights.
            Defaults to the default cache directory of huggingface.
        revision: The specific model version to use.
            It can be a branch name, a tag name, or a commit id.
            If unspecified, the default version will be used.
        quant_policy: Defaults to 0. When k/v is quantized into int4,
            int8, fp8, or fp8_e5m2, set it to 4, 8, 16, or 17,
            respectively.
        distributed_executor_backend: Backend of the distributed executor,
            options: ['uni', 'mp', 'ray'].
        empty_init: Whether to initialize the model without loading the
            weights. Set it to True if you want to update the weights
            after creating the pipeline.
        enable_microbatch: Enable microbatch for the specified model.
        enable_eplb: Enable eplb for the specified model.
        enable_metrics: Enable the metrics system.
        role: Role of the engine, options: ['Hybrid', 'Prefill',
            'Decode']. Defaults to `EngineRole.Hybrid`.
        migration_backend: Migration backend, options: ['DLSlime'].
            Defaults to `MigrationBackend.DLSlime`.
        enable_mp_engine: Run the engine in multi-process mode.
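Two of the documented fields involve a small amount of arithmetic or lookup: `cache_max_entry_count` (a fraction of free GPU memory) and `quant_policy` (an integer code per k/v format). The sketch below illustrates both without importing lmdeploy. The helper names `kv_cache_budget_bytes` and `build_engine_kwargs` are hypothetical and not part of the library, the treatment of `quant_policy=0` as "no k/v quantization" is an assumption based on it being the default, and the open range check on the fraction is an assumption of this sketch rather than documented behavior.

```python
# Hypothetical helpers for illustration only; they are not part of lmdeploy.

# Mapping taken from the docstring: k/v cache format -> quant_policy value.
# "none" -> 0 is an assumption (0 is the documented default).
KV_QUANT_POLICY = {"none": 0, "int4": 4, "int8": 8, "fp8": 16, "fp8_e5m2": 17}


def kv_cache_budget_bytes(free_gpu_bytes: int,
                          cache_max_entry_count: float = 0.8) -> int:
    """Estimate the bytes reserved for the k/v cache.

    Uses the meaning documented for versions after v0.2.1: the value is a
    fraction of FREE GPU memory, not of total GPU memory.
    """
    # Assumption of this sketch: the fraction lies strictly between 0 and 1.
    if not 0.0 < cache_max_entry_count < 1.0:
        raise ValueError("cache_max_entry_count must be between 0 and 1")
    return int(free_gpu_bytes * cache_max_entry_count)


def build_engine_kwargs(kv_format: str = "none",
                        tp: int = 1,
                        cache_max_entry_count: float = 0.8) -> dict:
    """Collect keyword arguments named after the documented config fields."""
    if kv_format not in KV_QUANT_POLICY:
        raise ValueError(f"unknown k/v format: {kv_format!r}")
    return {
        "tp": tp,
        "cache_max_entry_count": cache_max_entry_count,
        "quant_policy": KV_QUANT_POLICY[kv_format],
    }


if __name__ == "__main__":
    GIB = 1024 ** 3
    # With 20 GiB free and a fraction of 0.5, half of it goes to the cache.
    print(kv_cache_budget_bytes(20 * GIB, 0.5) // GIB, "GiB for k/v cache")
    print(build_engine_kwargs("int8", tp=2, cache_max_entry_count=0.5))
```

Because the dictionary keys match the field names in the docstring, the result could plausibly be unpacked into the config as `PytorchEngineConfig(**kwargs)`. Treat that as a starting point to verify against the installed lmdeploy version.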