| 149 | |
| 150 | @dataclass |
| 151 | class InferenceConfig(RPC_PARAM): |
| 152 | """The inference configuration. |
| 153 | |
| 154 | Args: |
| 155 | max_batch_size (int): Maximum batch size, defaults to 8. |
| 156 | max_output_len (int): Maximum output length, defaults to 256. |
| 157 | max_input_len (int): Maximum input length, defaults to 256. |
| 158 | dtype (Union[str, torch.dtype]): The data type for weights and activations, defaults to torch.float16. |
| 159 | kv_cache_dtype (Optional[str]): The data type of kv_cache, defaults to None. |
| 160 | prompt_template (Optional[str]): The prompt template for generation, defaults to None. |
| 161 | do_sample (bool): Whether to use sampling for generation, defaults to False. |
| 162 | beam_width (int): The maximum beam width used to initialize KV Cache, defaults to 1. |
| 163 | During generation, the beam width provided as a sampling parameter should be less than or equal to this value. |
| 164 | prefill_ratio (Optional[float]): The ratio controlling prefill versus decoding in the running list, defaults to 1.2. A prefill step is run |
| 165 | once the actual value exceeds this ratio. |
| 166 | pad_input (bool): Whether to pad all inputs to the max length. |
| 167 | early_stopping (Optional[bool]): Whether to stop the generation when all beam hypotheses have finished or not, defaults to False. |
| 168 | top_k (Optional[int]): The number of highest probability vocabulary tokens to keep for top-k-filtering, defaults to None. |
| 169 | top_p (Optional[float]): The cumulative probability threshold for retaining tokens with a total probability above it, defaults to None. |
| 170 | temperature (Optional[float]): The sampling temperature that controls the randomness of generation, defaults to 1.0. |
| 171 | no_repeat_ngram_size (Optional[int]): If no_repeat_ngram_size > 0, any n-gram of that size can appear only once in the generated text. |
| 172 | repetition_penalty (Optional[float]): The parameter that influences the model's treatment of new tokens in relation to their appearance in the prompt and the generated text. Values greater than 1 incentivize the model to introduce new tokens, whereas values less than 1 incentivize token repetition. Defaults to 1.0. |
| 173 | ignore_eos (bool): Whether to ignore the EOS token and continue generating tokens when encountering the EOS token. |
| 174 | use_spec_dec (bool): Indicate whether to use speculative decoding, defaults to False. |
| 175 | max_n_spec_tokens (int): The maximum number of speculative tokens, defaults to None. |
| 176 | glimpse_large_kv (bool): Whether to use large KV in drafter model, defaults to False. |
| 177 | block_size (int): The number of tokens in a logical block, defaults to 16. |
| 178 | tp_size (int): Tensor parallel size, defaults to 1. |
| 179 | pp_size (int): Pipeline parallel size, defaults to 1. |
| 180 | micro_batch_size (int): The micro batch size, defaults to 1. Only useful when `pp_size` > 1. |
| 181 | micro_batch_buffer_size (int): The buffer size for micro batches. Normally, it should be the same as the number of pipeline stages. |
| 182 | use_cuda_kernel (bool): Whether to use CUDA kernels, which are faster but may occasionally lose some precision. |
| 183 | high_precision (Optional[bool]): Whether to use float32 for underlying calculations of float16 data to achieve higher precision, defaults to False. |
| 184 | use_cuda_graph (bool): Whether to enforce CUDA graph execution. If False, we will disable CUDA graph and always execute the model in eager mode. If True, we will use CUDA graph and eager execution in hybrid. |
| 185 | max_context_len_to_capture (int): The maximum context length that can be captured by CUDA graph, per sequence. |
| 186 | enable_streamingllm (bool): Whether to use StreamingLLM; see the paper at https://arxiv.org/pdf/2309.17453 for the algorithm. |
| 187 | start_token_size (int): The size of the start tokens when using StreamingLLM. |
| 188 | generated_token_size (int): The size of the generated tokens when using StreamingLLM. |
| 189 | patched_parallelism_size (int): The patched parallelism size when using DistriFusion. |
| 190 | """ |
| 191 | |
| 192 | # NOTE: arrange configs according to their importance and frequency of usage |
| 193 | |
| 194 | # runtime limit |
| 195 | max_batch_size: int = 8 |
| 196 | max_output_len: int = 256 |
| 197 | max_input_len: int = 256 |
| 198 | |
| 199 | # general configs |
| 200 | dtype: Union[str, torch.dtype] = torch.float16 # use fp16 by default |
| 201 | kv_cache_dtype: Optional[str] = None |
| 202 | |
| 203 | # generation configs |
| 204 | prompt_template: Optional[str] = None |
| 205 | do_sample: bool = False |
| 206 | beam_width: int = 1 # TODO: beam search is not supported for now |
| 207 | prefill_ratio: Optional[float] = ( |
| 208 | 1.2 # the ratio of prefill sequences to decoding sequences; we do a prefill step once the actual value exceeds this ratio |
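The `prefill_ratio` comment in the listing describes a threshold on the ratio of prefill sequences to decoding sequences. The sketch below illustrates one possible reading of that rule with a simplified, torch-free stand-in for the config. `SketchInferenceConfig`, `should_prefill`, the validation in `__post_init__`, and the handling of an empty decoding list are all assumptions made for illustration; they are not taken from the library and the real scheduler may behave differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchInferenceConfig:
    """Hypothetical stand-in that mirrors a few defaults from the listing above."""

    max_batch_size: int = 8
    max_output_len: int = 256
    max_input_len: int = 256
    beam_width: int = 1
    prefill_ratio: Optional[float] = 1.2

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real class may validate differently.
        for name in ("max_batch_size", "max_output_len", "max_input_len", "beam_width"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")


def should_prefill(n_prefill: int, n_decoding: int, config: SketchInferenceConfig) -> bool:
    """Assumed rule: run a prefill step once prefill/decoding exceeds prefill_ratio."""
    if config.prefill_ratio is None or n_decoding == 0:
        # Assumption: with no ratio set, or nothing decoding yet,
        # any pending prefill sequences are processed right away.
        return n_prefill > 0
    return n_prefill / n_decoding > config.prefill_ratio


config = SketchInferenceConfig(max_batch_size=4)
print(should_prefill(n_prefill=5, n_decoding=4, config=config))  # 5/4 = 1.25 exceeds 1.2
print(should_prefill(n_prefill=4, n_decoding=4, config=config))  # 4/4 = 1.0 does not
```

Keeping the threshold check in a small pure function makes the rule easy to unit test without loading a model or importing torch.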