Sampling parameters for text generation.
@dataclass(slots=True, kw_only=True)
class SamplingParams:
    """Sampling parameters for text generation.

    Usage Examples:

        use_beam_search is False:
            - best_of is None: sample n responses (top-p/top-k) and return n generations
            - best_of is not None: sample best_of responses (top-p/top-k) and return n generations (best_of >= n must hold)
        use_beam_search is True:
            - best_of is None: beam search with beam width of n, return n generations
            - best_of is not None: beam search with beam width of best_of, return n generations (best_of >= n must hold)

    Args:
        end_id (int, optional): The end token id. Defaults to None.
        pad_id (int, optional): The pad token id. Defaults to None.
        max_tokens (int): The maximum number of tokens to generate. Defaults to 32.
        bad (str, List[str], optional): A string or a list of strings that redirect the generation when they are generated, so that the bad strings are excluded from the returned output. Defaults to None.
        bad_token_ids (List[int], optional): A list of token ids that redirect the generation when they are generated, so that the bad ids are excluded from the returned output. Defaults to None.
        stop (str, List[str], optional): A string or a list of strings that stop the generation when they are generated. The returned output will not contain the stop strings unless include_stop_str_in_output is True. Defaults to None.
        stop_token_ids (List[int], optional): A list of token ids that stop the generation when they are generated. Defaults to None.
        include_stop_str_in_output (bool): Whether to include the stop strings in the output text. Defaults to False.
        embedding_bias (torch.Tensor, optional): The embedding bias tensor. Expected type is kFP32 and shape is [vocab_size]. Defaults to None.
        logits_processor (tensorrt_llm.sampling_params.LogitsProcessor, List[tensorrt_llm.sampling_params.LogitsProcessor], optional): The logits postprocessor callback(s). Defaults to None.
            If a list, each processor is applied in order during generation (supported in PyTorch backend only).
        apply_batched_logits_processor (bool): Whether to apply the batched logits postprocessor callback. Defaults to False.
            The BatchedLogitsProcessor class is recommended for callback creation. The callback must be provided when initializing LLM.

        n (int): Number of sequences to generate. Defaults to 1.
        best_of (int, optional): Number of sequences to consider for best output. Defaults to None.
        use_beam_search (bool): Whether to use beam search. Defaults to False.

        top_k (int, optional): Controls the number of logits to sample from. Can assume non-negative values, where 0 means 'all logits'. Defaults to None.
            The value None is treated as "not specified" in the following.
            If none of temperature, top_p, or top_k is specified, sampling is greedy.
            If temperature > 0 and/or top_p < 1 are specified, sampling will proceed accordingly and top_k will default to top_k = 0.
            Setting top_k = 1 results in greedy sampling.
        top_p (float, optional): Controls the top-P probability to sample from. Can have values between 0 and 1. Defaults to None.
            The value None is treated as "not specified" in the following.
            If none of temperature, top_p, or top_k is specified, sampling is greedy.
            If temperature > 0 and/or top_k > 1 are specified, sampling will proceed accordingly and top_p will default to top_p = 1.
            Setting top_p = 0 should result in greedy sampling, but is currently disallowed in the backend.
        top_p_min (float, optional): Controls decay in the top-P algorithm. topPMin is the lower bound. None means using C++ runtime default 1.e-6. Defaults to None.
        top_p_reset_ids (int, optional): Controls decay in the top-P algorithm. Indicates where to reset the decay. None means using C++ runtime default 1. Defaults to None.
        top_p_decay (float, optional): Controls decay in the top-P algorithm. The decay value. None means using C++ runtime default 1.f. Defaults to None.
        seed (int, optional): Controls the random seed used by the random number generator in sampling. None means using C++ runtime default 0. Defaults to None.
        temperature (float, optional): Controls the modulation of logits when sampling new tokens. It can have values >= 0.f. Defaults to None.
            The value None is treated as "not specified" in the following.
            If none of temperature, top_p, or top_k is specified, sampling is greedy.
            If top_p < 1 and/or top_k > 1 are specified, sampling will proceed accordingly and temperature will default to temperature = 1.
            Setting temperature = 0 results in greedy sampling.
        min_tokens (int, optional): Lower bound on the number of tokens to generate. Values < 1 have no effect. None means using C++ runtime default 1. Defaults to None.
        beam_search_diversity_rate (float, optional): Controls the diversity in beam search. None means using the C++ runtime default. Defaults to None.
        repetition_penalty (float, optional): Used to penalize tokens based on how often they appear in the sequence. It can have any value > 0.f. Values < 1.f encourage repetition, values > 1.f discourage it. None means using C++ runtime default 1.f. Defaults to None.
        presence_penalty (float, optional): Used to penalize tokens already present in the sequence (irrespective of the number of appearances). It can have any value. Values < 0.f encourage repetition, values > 0.f discourage it. None means using C++ runtime default 0.f. Defaults to None.
        frequency_penalty (float, optional): Used to penalize tokens already present in the sequence (dependent on the number of appearances). It can have any value. Values < 0.f encourage repetition, values > 0.f discourage it. None means using C++ runtime default 0.f. Defaults to None.
        prompt_ignore_length (int, optional): Controls how many tokens to ignore from the prompt for presence and frequency penalties. Values <= 0 have no effect. Values > input (prompt) length will be clamped. None means using C++ runtime default 0. Defaults to None.
        length_penalty (float, optional): Controls how to penalize longer sequences in beam search. None means using C++ runtime default 0.f. Defaults to None.
        early_stopping (int, optional): Controls whether the generation process finishes once beamWidth sentences are generated (each ending with end_token). None means using C++ runtime default 1. Defaults to None.
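The four cases listed under "Usage Examples" reduce to one rule: the number of candidates explored is best_of when it is given and n otherwise, and n of them are returned. The sketch below restates that rule in plain Python. The helper name `resolve_candidates` and its return shape are hypothetical, chosen for illustration only; this is not the library's validation code.

```python
from typing import Optional, Tuple


def resolve_candidates(n: int = 1,
                       best_of: Optional[int] = None,
                       use_beam_search: bool = False) -> Tuple[str, int, int]:
    """Return (strategy, candidates_explored, generations_returned).

    Hypothetical helper that mirrors the documented usage table;
    it is not part of SamplingParams.
    """
    # best_of falls back to n when it is not specified.
    candidates = n if best_of is None else best_of
    # The docstring requires best_of >= n whenever best_of is given.
    if candidates < n:
        raise ValueError(f"best_of ({candidates}) must be >= n ({n})")
    # With beam search, the candidate count acts as the beam width.
    strategy = "beam_search" if use_beam_search else "sampling"
    return strategy, candidates, n


print(resolve_candidates(n=2))
print(resolve_candidates(n=2, best_of=4, use_beam_search=True))
```

Under this reading, `n=2, best_of=4, use_beam_search=True` means a beam width of 4 with the 2 generations returned, while `n=3, best_of=2` is rejected.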
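The defaulting rules for temperature, top_p, and top_k are spread across three docstring entries, so a compact restatement may help. The following is a minimal sketch of one reading of those rules, assuming that any explicitly passed value counts as "specified". The names `ResolvedSampling` and `resolve_sampling` are hypothetical, and the library's actual resolution may differ in edge cases.

```python
from typing import NamedTuple, Optional


class ResolvedSampling(NamedTuple):
    greedy: bool        # True when decoding is greedy
    temperature: float  # only meaningful when greedy is False
    top_p: float        # only meaningful when greedy is False
    top_k: int          # only meaningful when greedy is False


def resolve_sampling(temperature: Optional[float] = None,
                     top_p: Optional[float] = None,
                     top_k: Optional[int] = None) -> ResolvedSampling:
    """Apply the documented defaults; None means "not specified"."""
    if temperature is not None and temperature < 0:
        raise ValueError("temperature must be >= 0")
    # top_p = 0 is documented as currently disallowed in the backend.
    if top_p is not None and not (0 < top_p <= 1):
        raise ValueError("top_p must be in (0, 1]")
    if top_k is not None and top_k < 0:
        raise ValueError("top_k must be non-negative")

    nothing_specified = temperature is None and top_p is None and top_k is None
    # Greedy if nothing is specified, or if temperature = 0 or top_k = 1.
    greedy = nothing_specified or temperature == 0 or top_k == 1

    # Unspecified values take the documented neutral defaults:
    # temperature = 1, top_p = 1, top_k = 0 (all logits).
    return ResolvedSampling(
        greedy=greedy,
        temperature=1.0 if temperature is None else temperature,
        top_p=1.0 if top_p is None else top_p,
        top_k=0 if top_k is None else top_k,
    )


print(resolve_sampling())                 # nothing specified
print(resolve_sampling(temperature=0.7))  # top_p and top_k take defaults
print(resolve_sampling(top_k=1))          # explicit greedy setting
```

Returning a flag together with the filled-in values keeps the greedy decision visible instead of encoding it indirectly through top_k = 1.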