hub / github.com/InternLM/lmdeploy / PytorchEngineConfig

Class PytorchEngineConfig

lmdeploy/messages.py:334–491

@dataclass
class PytorchEngineConfig:
    """PyTorch Engine Config.

    Args:
        dtype: data type for model weights and activations. It can be
            one of the following values, ['auto', 'float16', 'bfloat16']
            The `auto` option will use FP16 precision for FP32 and FP16
            models, and BF16 precision for BF16 models.
        tp: Tensor Parallelism. default 1.
        dp: Data Parallelism. default 1.
        dp_rank: rank of dp.
        ep: Expert Parallelism. default 1.
        session_len: Max session length. Default None.
        max_batch_size: Max batch size. If it is not specified,
            the engine will automatically set it according to the device
        attn_tp_size: tp size for attention, only works for dp>1
        mlp_tp_size: tp size for mlp, only works for dp>1
        moe_tp_size: tp size for moe, only works for dp>1
        cache_max_entry_count: the percentage of gpu memory occupied
            by the k/v cache. For lmdeploy versions greater than `v0.2.1`,
            it defaults to 0.8, signifying the percentage of FREE GPU memory
            to be reserved for the k/v cache
        prefill_interval: Interval to perform prefill,
            Default 16.
        block_size: paging cache block size, default 64.
        num_cpu_blocks: Num cpu blocks. If num is 0, cache
            would be allocated according to current environment.
        num_gpu_blocks: Num gpu blocks. If num is 0, cache
            would be allocated according to current environment.
        adapters: The path configs to lora adapters.
        max_prefill_token_num: tokens per iteration.
        thread_safe: thread safe engine instance.
        enable_prefix_caching: Enable token match and sharing caches.
        device_type: The inference device type, options ['cuda']
        eager_mode: Enable "eager" mode or not
        custom_module_map: nn module map customized by users. Once
            provided, the original nn modules of the model will be
            substituted by the mapping ones
        download_dir: Directory to download and load the weights,
            default to the default cache directory of huggingface.
        revision: The specific model version to use.
            It can be a branch name, a tag name, or a commit id.
            If unspecified, will use the default version.
        quant_policy: default to 0. When k/v is quantized into int4,
            int8, fp8, or fp8_e5m2, set it to 4, 8, 16, or 17,
            respectively
        distributed_executor_backend: backend of distributed backend,
            options: ['uni', 'mp', 'ray']
        empty_init: Whether to load the model weights, you should set
            it to True if you want to update weights after creating the pipeline
        enable_microbatch: enable microbatch for specified model
        enable_eplb: enable eplb for specified model
        enable_metrics: enable metrics system
        role: role of engine, options: ['Hybrid', 'Prefill',
            'Decode']. Default to `EngineRole.Hybrid`.
        migration_backend: migration backend. options: ['DLSlime'].
            Default to `MigrationBackend.DLSlime`.
        enable_mp_engine: run engine in multi-process mode.
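
The `quant_policy` codes and the `cache_max_entry_count` fraction described in the docstring can be worked through with a small, self-contained sketch. It does not import lmdeploy: `KV_QUANT_POLICY`, `quant_policy_for`, and `kv_cache_budget_bytes` are hypothetical helper names introduced here for illustration only. The budget arithmetic assumes the fraction is applied directly to the free GPU memory, as the docstring states for versions after `v0.2.1`; the engine's real accounting may differ.

```python
# Hypothetical helpers illustrating two documented fields; not part of lmdeploy.

# Codes copied from the quant_policy docstring entry. The default 0 is taken
# here to mean "no k/v quantization" (an assumption of this sketch).
KV_QUANT_POLICY = {"none": 0, "int4": 4, "int8": 8, "fp8": 16, "fp8_e5m2": 17}


def quant_policy_for(kv_format: str) -> int:
    """Return the documented quant_policy code for a k/v cache format name."""
    try:
        return KV_QUANT_POLICY[kv_format]
    except KeyError:
        raise ValueError(f"unsupported k/v cache format: {kv_format!r}") from None


def kv_cache_budget_bytes(free_gpu_bytes: int,
                          cache_max_entry_count: float = 0.8) -> int:
    """Bytes reserved for the k/v cache as a fraction of free GPU memory."""
    # Range check chosen for this sketch; the library's own validation
    # is not shown in the excerpt above.
    if not 0.0 < cache_max_entry_count <= 1.0:
        raise ValueError("cache_max_entry_count must be in (0, 1]")
    return int(free_gpu_bytes * cache_max_entry_count)


if __name__ == "__main__":
    # int8 k/v cache with half of 40 GiB of free memory reserved for it.
    print(quant_policy_for("int8"))
    print(kv_cache_budget_bytes(40 * 1024**3, 0.5))
```

The two results correspond to the values one would pass as `quant_policy` and `cache_max_entry_count` when constructing a `PytorchEngineConfig`.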

Callers 15

__init__ · Method · 0.90
build_pipe · Function · 0.90
api_server · Method · 0.90
update_engine_config · Method · 0.90
_init_ray · Method · 0.90
run_pipeline_chat_test · Function · 0.90
run_pipeline_mllm_test · Function · 0.90
passkey_retrival_worker · Function · 0.90
main · Function · 0.90
main · Function · 0.90
main · Function · 0.90

Calls

no outgoing calls