hub / github.com/hpcaitech/ColossalAI / InferenceConfig

Class InferenceConfig

colossalai/inference/config.py:151–395

Source from the content-addressed store, hash-verified

@dataclass
class InferenceConfig(RPC_PARAM):
    """The inference configuration.

    Args:
        max_batch_size (int): Maximum batch size, defaults to 8.
        max_output_len (int): Maximum output length, defaults to 256.
        max_input_len (int): Maximum input length, defaults to 256.
        dtype (Union[str, torch.dtype]): The data type for weights and activations.
        kv_cache_dtype (Optional[str]): The data type of kv_cache, defaults to None.
        prompt_template (Optional[str]): The prompt template for generation, defaults to None.
        do_sample (bool): Whether to use sampling for generation, defaults to False.
        beam_width (int): The maximum beam width used to initialize KV Cache, defaults to 1.
            During generation, the beam width provided as a sampling parameter should be less than or equal to this value.
        prefill_ratio (Optional[float]): A controlling ratio for prefill and decoding in the running list, defaults to 1.2. We will do a step of prefill
            when the actual value exceeds this ratio.
        pad_input: Whether to pad all inputs to the max length.
        early_stopping (Optional[bool]): Whether to stop the generation when all beam hypotheses have finished or not, defaults to False.
        top_k (Optional[int]): The number of highest probability vocabulary tokens to keep for top-k-filtering, defaults to None.
        top_p (Optional[float]): The cumulative probability threshold for retaining tokens with a total probability above it, defaults to None.
        temperature (Optional[float]): Randomness used to control randomization, defaults to 1.0.
        no_repeat_ngram_size (Optional[int]): If no_repeat_ngram_size > 0, the consecutive tokens of ngram size can only appear once in inference sentences.
        repetition_penalty (Optional[float]): The parameter that influences the model's treatment of new tokens in relation to their appearance in the prompt and the generated text. Values greater than 1 incentivize the model to introduce new tokens, whereas values less than 1 incentivize token repetition. Defaults to 1.0.
        ignore_eos (bool): Whether to ignore the EOS token and continue generating tokens when encountering the EOS token.
        use_spec_dec (bool): Indicate whether to use speculative decoding, defaults to False.
        max_n_spec_tokens (int): The maximum number of speculating tokens, defaults to None.
        glimpse_large_kv (bool): Whether to use large KV in drafter model, defaults to False.
        block_size (int): The number of blocks in a logical block, defaults to 16.
        tp_size (int): Tensor parallel size, defaults to 1.
        pp_size (int): Pipeline parallel size, defaults to 1.
        micro_batch_size (int): The micro batch size, defaults to 1. Only useful when `pp_size` > 1.
        micro_batch_buffer_size (int): The buffer size for micro batch. Normally, it should be the same as the number of pipeline stages.
        use_cuda_kernel (bool): Whether to use CUDA kernels; faster, but occasionally loses some precision.
        high_precision (Optional[bool]): Whether to use float32 for underlying calculations of float16 data to achieve higher precision, defaults to False.
        use_cuda_graph (bool): Whether to enforce CUDA graph execution. If False, we will disable CUDA graph and always execute the model in eager mode. If True, we will use eager execution in hybrid.
        max_context_len_to_capture (int): Max context len that could be captured by CUDA Graph, per sequence.
        enable_streamingllm (bool): Whether to use StreamingLLM; see the paper at https://arxiv.org/pdf/2309.17453 for the relevant algorithms.
        start_token_size (int): The size of the start tokens, when using StreamingLLM.
        generated_token_size (int): The size of the generated tokens, when using StreamingLLM.
        patched_parallelism_size (int): Patched parallelism size, when using Distrifusion.
    """

    # NOTE: arrange configs according to their importance and frequency of usage

    # runtime limit
    max_batch_size: int = 8
    max_output_len: int = 256
    max_input_len: int = 256

    # general configs
    dtype: Union[str, torch.dtype] = torch.float16  # use fp16 by default
    kv_cache_dtype: Optional[str] = None

    # generation configs
    prompt_template: Optional[str] = None
    do_sample: bool = False
    beam_width: int = 1  # TODO: beam search is not supported for now
    prefill_ratio: Optional[float] = (
        1.2  # the ratio of prefill sequences to decoding sequences, we do prefill step once the actual value exceeds ratio
    )
Callers 15

test_bucket · Function · 0.90
check_inference_engine · Function · 0.90
check_request_handler · Function · 0.90
check_inference_engine · Function · 0.90
check_inference_engine · Function · 0.90
check_streamingllm · Function · 0.90
check_cache_manager · Function · 0.90
check_inference_engine · Function · 0.90
check_spec_dec · Function · 0.90
_run_engine · Function · 0.90
check_inference_engine · Function · 0.90

Calls

no outgoing calls

Tested by 12

test_bucket · Function · 0.72
check_inference_engine · Function · 0.72
check_request_handler · Function · 0.72
check_inference_engine · Function · 0.72
check_inference_engine · Function · 0.72
check_streamingllm · Function · 0.72
check_cache_manager · Function · 0.72
check_inference_engine · Function · 0.72
check_spec_dec · Function · 0.72
_run_engine · Function · 0.72
check_inference_engine · Function · 0.72
