Class InferenceConfig

github.com/hpcaitech/ColossalAI · colossalai/inference/config.py:151–395

The inference configuration.


from dataclasses import dataclass
from typing import Optional, Union

import torch


@dataclass
class InferenceConfig(RPC_PARAM):
    """The inference configuration.

    Args:
        max_batch_size (int): Maximum batch size, defaults to 8.
        max_output_len (int): Maximum output length, defaults to 256.
        max_input_len (int): Maximum input length, defaults to 256.
        dtype (Union[str, torch.dtype]): The data type for weights and activations.
        kv_cache_dtype (Optional[str]): The data type of kv_cache, defaults to None.
        prompt_template (Optional[str]): The prompt template for generation, defaults to None.
        do_sample (bool): Whether to use sampling for generation, defaults to False.
        beam_width (int): The maximum beam width used to initialize KV Cache, defaults to 1.
            During generation, the beam width provided as a sampling parameter should be less than or equal to this value.
        prefill_ratio (Optional[float]): A controlling ratio for prefill and decoding in the running list, defaults to 1.2.
            A prefill step is performed when the actual value exceeds this ratio.
        pad_input: Whether to pad all inputs to the max length.
        early_stopping (Optional[bool]): Whether to stop the generation when all beam hypotheses have finished, defaults to False.
        top_k (Optional[int]): The number of highest-probability vocabulary tokens to keep for top-k filtering, defaults to None.
        top_p (Optional[float]): The cumulative probability threshold for retaining tokens with a total probability above it, defaults to None.
        temperature (Optional[float]): Sampling temperature that controls randomness, defaults to 1.0.
        no_repeat_ngram_size (Optional[int]): If no_repeat_ngram_size > 0, any n-gram of that size can appear only once in the generated text.
        repetition_penalty (Optional[float]): The parameter that influences the model's treatment of new tokens in relation to their appearance in the prompt and the generated text. Values greater than 1 incentivize the model to introduce new tokens, whereas values less than 1 incentivize token repetition. Defaults to 1.0.
        ignore_eos (bool): Whether to ignore the EOS token and continue generating tokens after encountering it.
        use_spec_dec (bool): Whether to use speculative decoding, defaults to False.
        max_n_spec_tokens (int): The maximum number of speculative tokens, defaults to None.
        glimpse_large_kv (bool): Whether to use large KV in the drafter model, defaults to False.
        block_size (int): The number of tokens in a logical block, defaults to 16.
        tp_size (int): Tensor parallel size, defaults to 1.
        pp_size (int): Pipeline parallel size, defaults to 1.
        micro_batch_size (int): The micro batch size, defaults to 1. Only useful when `pp_size` > 1.
        micro_batch_buffer_size (int): The buffer size for micro batches. Normally, it should equal the number of pipeline stages.
        use_cuda_kernel (bool): Whether to use CUDA kernels, which are faster but may occasionally lose some precision.
        high_precision (Optional[bool]): Whether to use float32 for underlying calculations on float16 data to achieve higher precision, defaults to False.
        use_cuda_graph (bool): Whether to enforce CUDA graph execution. If False, CUDA graph is disabled and the model always executes in eager mode. If True, CUDA graph and eager execution are used in hybrid.
        max_context_len_to_capture (int): Maximum context length, per sequence, that can be captured by CUDA Graph.
        enable_streamingllm (bool): Whether to use StreamingLLM; see the paper at https://arxiv.org/pdf/2309.17453 for the algorithm.
        start_token_size (int): The size of the start tokens when using StreamingLLM.
        generated_token_size (int): The size of the generated tokens when using StreamingLLM.
        patched_parallelism_size (int): Patched parallelism size when using Distrifusion.
    """

    # NOTE: arrange configs according to their importance and frequency of usage

    # runtime limit
    max_batch_size: int = 8
    max_output_len: int = 256
    max_input_len: int = 256

    # general configs
    dtype: Union[str, torch.dtype] = torch.float16  # use fp16 by default
    kv_cache_dtype: Optional[str] = None

    # generation configs
    prompt_template: Optional[str] = None
    do_sample: bool = False
    beam_width: int = 1  # TODO: beam search is not supported for now
    prefill_ratio: Optional[float] = (
        1.2  # the ratio of prefill sequences to decoding sequences; a prefill step runs once the actual value exceeds this ratio
    )
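
The prefill_ratio and beam_width descriptions above can be illustrated with a small standalone sketch. It does not import ColossalAI: SketchInferenceConfig is a hypothetical stand-in that copies a few defaults from the excerpt, and should_prefill and check_beam_width are illustrative helpers, not functions from the library. How None and an empty decoding list are handled is an assumption made for the example only.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class SketchInferenceConfig:
    """Hypothetical stand-in mirroring a few defaults from the excerpt."""

    max_batch_size: int = 8
    max_output_len: int = 256
    max_input_len: int = 256
    beam_width: int = 1
    prefill_ratio: Optional[float] = 1.2


def should_prefill(num_prefill: int, num_decoding: int, config: SketchInferenceConfig) -> bool:
    """Return True when the prefill/decoding ratio exceeds config.prefill_ratio.

    Assumption: with nothing decoding, or with no ratio configured,
    any waiting prefill sequences are scheduled immediately.
    """
    if num_prefill == 0:
        return False
    if num_decoding == 0 or config.prefill_ratio is None:
        return True
    return num_prefill / num_decoding > config.prefill_ratio


def check_beam_width(requested: int, config: SketchInferenceConfig) -> None:
    """Reject a sampling-time beam width larger than the configured maximum."""
    if requested > config.beam_width:
        raise ValueError(
            f"beam width {requested} exceeds configured maximum {config.beam_width}"
        )


cfg = SketchInferenceConfig()
print(should_prefill(5, 4, cfg))  # ratio 1.25 is above the default 1.2
print(should_prefill(4, 4, cfg))  # ratio 1.0 is not

# dataclasses.replace returns a modified copy and leaves cfg unchanged.
eager_cfg = replace(cfg, prefill_ratio=0.5)
print(should_prefill(4, 4, eager_cfg))
```

Because the config is a plain dataclass, variants for experiments can be derived with dataclasses.replace instead of mutating a shared instance.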

Callers 15

test_bucket · Function · 0.90

check_inference_engine · Function · 0.90

check_config_and_inference · Function · 0.90

check_request_handler · Function · 0.90

check_inference_engine · Function · 0.90

check_streamingllm · Function · 0.90

check_cache_manager · Function · 0.90

check_inference_engine · Function · 0.90

check_spec_dec · Function · 0.90

_run_engine · Function · 0.90

check_inference_engine · Function · 0.90

Calls

no outgoing calls

Tested by 12

test_bucket · Function · 0.72

check_inference_engine · Function · 0.72

check_config_and_inference · Function · 0.72

check_request_handler · Function · 0.72

check_inference_engine · Function · 0.72

check_streamingllm · Function · 0.72

check_cache_manager · Function · 0.72

check_inference_engine · Function · 0.72

check_spec_dec · Function · 0.72

_run_engine · Function · 0.72

check_inference_engine · Function · 0.72
