hub / github.com/InternLM/lmdeploy / PytorchEngineConfig

Class PytorchEngineConfig

lmdeploy/messages.py:334–491



@dataclass
class PytorchEngineConfig:
    """PyTorch Engine Config.

    Args:
        dtype: data type for model weights and activations. It can be
            one of the following values: ['auto', 'float16', 'bfloat16'].
            The `auto` option will use FP16 precision for FP32 and FP16
            models, and BF16 precision for BF16 models.
        tp: Tensor Parallelism. Default 1.
        dp: Data Parallelism. Default 1.
        dp_rank: rank of dp.
        ep: Expert Parallelism. Default 1.
        session_len: Max session length. Default None.
        max_batch_size: Max batch size. If it is not specified,
            the engine will automatically set it according to the device.
        attn_tp_size: tp size for attention, only works for dp>1.
        mlp_tp_size: tp size for mlp, only works for dp>1.
        moe_tp_size: tp size for moe, only works for dp>1.
        cache_max_entry_count: the percentage of gpu memory occupied
            by the k/v cache. For lmdeploy versions greater than `v0.2.1`,
            it defaults to 0.8, signifying the percentage of FREE GPU memory
            to be reserved for the k/v cache.
        prefill_interval: Interval to perform prefill.
            Default 16.
        block_size: paging cache block size. Default 64.
        num_cpu_blocks: Number of cpu blocks. If it is 0, the cache
            will be allocated according to the current environment.
        num_gpu_blocks: Number of gpu blocks. If it is 0, the cache
            will be allocated according to the current environment.
        adapters: The path configs to lora adapters.
        max_prefill_token_num: tokens per iteration.
        thread_safe: thread safe engine instance.
        enable_prefix_caching: Enable token match and sharing caches.
        device_type: The inference device type, options ['cuda'].
        eager_mode: Enable "eager" mode or not.
        custom_module_map: nn module map customized by users. Once
            provided, the original nn modules of the model will be
            substituted by the mapped ones.
        download_dir: Directory to download and load the weights.
            Defaults to the default cache directory of huggingface.
        revision: The specific model version to use.
            It can be a branch name, a tag name, or a commit id.
            If unspecified, will use the default version.
        quant_policy: defaults to 0. When k/v is quantized into int4,
            int8, fp8, or fp8_e5m2, set it to 4, 8, 16, or 17,
            respectively.
        distributed_executor_backend: backend of the distributed executor,
            options: ['uni', 'mp', 'ray'].
        empty_init: Whether to load the model weights. You should set
            it to True if you want to update weights after creating the
            pipeline.
        enable_microbatch: enable microbatch for specified model.
        enable_eplb: enable eplb for specified model.
        enable_metrics: enable metrics system.
        role: role of engine, options: ['Hybrid', 'Prefill',
            'Decode']. Defaults to `EngineRole.Hybrid`.
        migration_backend: migration backend. options: ['DLSlime'].
            Defaults to `MigrationBackend.DLSlime`.
        enable_mp_engine: run engine in multi-process mode.
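Three of the documented arguments encode small rules that are easy to misread: the `auto` dtype resolution, the integer codes for `quant_policy`, and the fact that `cache_max_entry_count` is a fraction of free GPU memory rather than total memory. The sketch below restates those rules as plain Python so they can be checked in isolation. All three helpers (`resolve_auto_dtype`, `quant_policy_for`, `kv_cache_budget_bytes`) are hypothetical names made up for illustration and are not part of lmdeploy. Reading `quant_policy=0` as "no k/v quantization" is an assumption based on it being the default.

```python
# Illustrative helpers only: none of these functions exist in lmdeploy.
# They restate rules from the PytorchEngineConfig docstring.

# Codes from the `quant_policy` entry. Mapping 0 to "none" is an assumption:
# the docstring only says that 0 is the default.
KV_QUANT_POLICY = {"none": 0, "int4": 4, "int8": 8, "fp8": 16, "fp8_e5m2": 17}


def quant_policy_for(kv_format: str) -> int:
    """Return the documented quant_policy code for a k/v cache format."""
    if kv_format not in KV_QUANT_POLICY:
        raise ValueError(f"unsupported k/v cache format: {kv_format!r}")
    return KV_QUANT_POLICY[kv_format]


def resolve_auto_dtype(checkpoint_dtype: str) -> str:
    """Apply the documented `auto` rule to a checkpoint's dtype name."""
    if checkpoint_dtype in ("float32", "float16"):
        return "float16"  # FP32 and FP16 models run in FP16
    if checkpoint_dtype == "bfloat16":
        return "bfloat16"  # BF16 models stay in BF16
    raise ValueError(f"unexpected checkpoint dtype: {checkpoint_dtype!r}")


def kv_cache_budget_bytes(free_gpu_bytes: int,
                          cache_max_entry_count: float = 0.8) -> int:
    """Bytes reserved for the k/v cache, as a fraction of FREE GPU memory."""
    return int(free_gpu_bytes * cache_max_entry_count)
```

In practice the resulting values would be passed as the keyword arguments named in the docstring, for example `quant_policy=quant_policy_for("int8")` when constructing a `PytorchEngineConfig`.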

Callers 15

__init__ · Method · 0.90
build_pipe · Function · 0.90
api_server · Method · 0.90
update_engine_config · Method · 0.90
_init_ray · Method · 0.90
run_pipeline_chat_test · Function · 0.90
run_pipeline_mllm_test · Function · 0.90
passkey_retrival_worker · Function · 0.90
main · Function · 0.90
test_engine_checker_rejects_split_kernel_blocks_for_pd_migration · Function · 0.90

Calls

no outgoing calls

Tested by 13

passkey_retrival_worker · Function · 0.72
test_engine_checker_rejects_split_kernel_blocks_for_pd_migration · Function · 0.72
test_engine_checker_allows_split_kernel_blocks_for_hybrid_engine · Function · 0.72
pipe_no_quant · Function · 0.72
pipe_quant_42 · Function · 0.72
pipe · Method · 0.72
pipe_no_quant · Method · 0.72
pipe_quant_fp8 · Method · 0.72
backend_config · Method · 0.72
test_pytorch_config_accepts_fp8_quant_policies · Function · 0.72
test_pytorch_config_normalizes_quant_policy_value · Function · 0.72