hub / github.com/InternLM/lmdeploy / TurbomindEngineConfig

Class TurbomindEngineConfig

lmdeploy/messages.py:206–330

TurboMind Engine config.


@pydantic_dataclass
class TurbomindEngineConfig:
    """TurboMind Engine config.

    Args:
        dtype: data type for model weights and activations. It can be
            one of the following values, ['auto', 'float16', 'bfloat16']
            The `auto` option will use FP16 precision for FP32 and FP16
            models, and BF16 precision for BF16 models.
        model_format: the layout of the deployed model. It can be one
            of the following values [hf, awq, gptq, compressed-tensors,
            fp8, mxfp4]. `hf` means a Hugging Face model (.bin,
            .safetensors), `awq` and `gptq` mean grouped 4-bit
            weight-only checkpoints, `compressed-tensors` means
            pack-quantized grouped int4 checkpoints and is usually
            auto-detected from the input model config, `fp8` means
            blocked fp8 checkpoints, and `mxfp4` means MXFP4 expert
            weights. If it is not specified, i.e. None, it will be
            extracted from the input model
        tp: the number of GPU cards used in tensor parallelism,
            default to 1
        session_len: the max session length of a sequence, default to
            None
        max_batch_size: the max batch size during inference. If it is
            not specified, the engine will automatically set it according to
            the device
        cache_max_entry_count: the percentage of gpu memory occupied
            by the k/v cache.
            For versions of lmdeploy between `v0.2.0` and `v0.2.1`, it
            defaults to 0.5, depicting the percentage of TOTAL GPU memory to
            be allocated to the k/v cache.
            For lmdeploy versions greater than `v0.2.1`, it defaults to 0.8,
            signifying the percentage of FREE GPU memory to be reserved for
            the k/v cache.
            When it's an integer > 0, it represents the total number of k/v
            blocks.
        cache_chunk_size: The policy to apply for KV block from
            the block manager, default to -1.
        cache_block_seq_len: the length of the token sequence in
            a k/v block, default to 64
        enable_prefix_caching: enable cache prompts for block reuse,
            default to False
        quant_policy: default to 0. For TurboMind, when k/v is quantized
            into int4 or int8, set it to 4 or 8, respectively
        rope_scaling_factor: scaling factor used for dynamic ntk,
            default to 0. TurboMind follows the implementation of transformer
            LlamaAttention
        use_logn_attn: whether or not to use log attn: default to False
        download_dir: Directory to download and load the weights,
            default to the default cache directory of huggingface.
        revision: The specific model version to use. It can be a branch
            name, a tag name, or a commit id. If unspecified, will use the
            default version.
        max_prefill_token_num: the number of tokens each iteration during
            prefill, default to 8192
        num_tokens_per_iter: the number of tokens processed in each
            forward pass. Working with `max_prefill_iters` enables the
            "Dynamic SplitFuse"-like scheduling
        max_prefill_iters: the max number of forward pass during prefill
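
The `cache_max_entry_count` rules described in the docstring can be sketched roughly as follows. This is a minimal, self-contained illustration and not lmdeploy's own code: the helpers `kv_cache_blocks` and `max_cached_tokens`, the parameters `free_gpu_bytes` and `bytes_per_block`, and the sample sizes are all hypothetical. It models only the behaviour documented for versions after `v0.2.1` (a fraction of free GPU memory, or a positive integer block count), and it assumes a float must lie strictly between 0 and 1.

```python
from typing import Union


def kv_cache_blocks(cache_max_entry_count: Union[int, float],
                    free_gpu_bytes: int,
                    bytes_per_block: int) -> int:
    """Hypothetical helper: number of k/v blocks implied by the setting."""
    if isinstance(cache_max_entry_count, bool):
        raise TypeError("cache_max_entry_count must be an int or a float")
    # Per the docstring, an integer > 0 is the total number of k/v blocks.
    if isinstance(cache_max_entry_count, int):
        if cache_max_entry_count <= 0:
            raise ValueError("an integer cache_max_entry_count must be > 0")
        return cache_max_entry_count
    # Assumption of this sketch: a float is a fraction in (0, 1) of FREE memory.
    if not 0.0 < cache_max_entry_count < 1.0:
        raise ValueError("a float cache_max_entry_count must be in (0, 1)")
    budget_bytes = int(free_gpu_bytes * cache_max_entry_count)
    return budget_bytes // bytes_per_block


def max_cached_tokens(num_blocks: int, cache_block_seq_len: int = 64) -> int:
    """Tokens that fit in the cache; 64 is the documented block length."""
    return num_blocks * cache_block_seq_len


# Hypothetical numbers: 40 GiB free, 32 MiB needed per k/v block.
FREE_BYTES = 40 * 1024**3
BLOCK_BYTES = 32 * 1024**2

blocks_from_fraction = kv_cache_blocks(0.5, FREE_BYTES, BLOCK_BYTES)
blocks_from_count = kv_cache_blocks(1000, FREE_BYTES, BLOCK_BYTES)
```

With these made-up sizes, a fraction of 0.5 reserves 20 GiB, which is 640 blocks, or 640 * 64 = 40960 cached tokens at the default `cache_block_seq_len`. Passing the integer 1000 bypasses the memory calculation and requests 1000 blocks directly.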

Callers 15

__init__ · Method · 0.90

build_pipe · Function · 0.90

api_server · Method · 0.90

run_pipeline_chat_test · Function · 0.90

run_pipeline_mllm_test · Function · 0.90

passkey_retrival_worker · Function · 0.90

main · Function · 0.90

run_smoke_infer · Function · 0.90

main · Function · 0.90

Calls

no outgoing calls

Tested by 5

passkey_retrival_worker · Function · 0.72

run_smoke_infer · Function · 0.72

main · Function · 0.72

backend_config · Method · 0.72

test_turbomind_config_rejects_fp8_quant_policies · Function · 0.72