hub / github.com/NVIDIA/TensorRT-LLM / SamplingParams

Class SamplingParams

tensorrt_llm/sampling_params.py:113–552

Sampling parameters for text generation.
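The `n` / `best_of` / `use_beam_search` combinations listed in the class docstring can be summarized in a few lines. The following is a minimal sketch of those documented rules only, not the library's implementation; `resolve_generation_plan` and its return tuple are hypothetical names introduced here for illustration.

```python
from typing import Optional, Tuple


def resolve_generation_plan(n: int = 1,
                            best_of: Optional[int] = None,
                            use_beam_search: bool = False) -> Tuple[str, int, int]:
    """Return (strategy, candidates, returned) following the docstring rules.

    Hypothetical helper for illustration; it is not part of tensorrt_llm.
    """
    # best_of=None means "consider exactly n candidates".
    candidates = n if best_of is None else best_of
    # The docstring requires best_of >= n whenever best_of is given.
    if candidates < n:
        raise ValueError(f"best_of ({best_of}) must be >= n ({n})")
    # With beam search, `candidates` is the beam width; otherwise it is the
    # number of sampled responses. In both cases n generations are returned.
    strategy = "beam_search" if use_beam_search else "sampling"
    return strategy, candidates, n
```

For example, `resolve_generation_plan(n=2, best_of=4, use_beam_search=True)` describes a beam search of width 4 that returns 2 generations, while `resolve_generation_plan(n=3, best_of=2)` raises because `best_of >= n` does not hold.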

Source

@dataclass(slots=True, kw_only=True)
class SamplingParams:
    """Sampling parameters for text generation.

    Usage Examples:

        use_beam_search is False:
            - best_of is None: (top-p/top-k) sampling n responses and return n generations
            - best_of is not None: (top-p/top-k) sampling best_of responses and return n generations (best_of >= n must hold)
        use_beam_search is True:
            - best_of is None: beam search with beam width of n, return n generations
            - best_of is not None: beam search with beam width of best_of, return n generations (best_of >= n must hold)

    Args:
        end_id (int, optional): The end token id. Defaults to None.
        pad_id (int, optional): The pad token id. Defaults to None.
        max_tokens (int): The maximum number of tokens to generate. Defaults to 32.
        bad (str, List[str], optional): A string or a list of strings that redirect the generation when they are generated, so that the bad strings are excluded from the returned output. Defaults to None.
        bad_token_ids (List[int], optional): A list of token ids that redirect the generation when they are generated, so that the bad ids are excluded from the returned output. Defaults to None.
        stop (str, List[str], optional): A string or a list of strings that stop the generation when they are generated. The returned output will not contain the stop strings unless include_stop_str_in_output is True. Defaults to None.
        stop_token_ids (List[int], optional): A list of token ids that stop the generation when they are generated. Defaults to None.
        include_stop_str_in_output (bool): Whether to include the stop strings in output text. Defaults to False.
        embedding_bias (torch.Tensor, optional): The embedding bias tensor. Expected type is kFP32 and shape is [vocab_size]. Defaults to None.
        logits_processor (tensorrt_llm.sampling_params.LogitsProcessor, List[tensorrt_llm.sampling_params.LogitsProcessor], optional): The logits postprocessor callback(s). Defaults to None.
            If a list, each processor is applied in order during generation (supported in PyTorch backend only).
        apply_batched_logits_processor (bool): Whether to apply batched logits postprocessor callback. Defaults to False.
            The BatchedLogitsProcessor class is recommended for callback creation. The callback must be provided when initializing LLM.

        n (int): Number of sequences to generate. Defaults to 1.
        best_of (int, optional): Number of sequences to consider for best output. Defaults to None.
        use_beam_search (bool): Whether to use beam search. Defaults to False.

        top_k (int, optional): Controls number of logits to sample from. Can assume non-negative values, where 0 means 'all logits'. Defaults to None.
            The value None is treated as "not specified" in the following.
            If neither temperature, top_p, nor top_k are specified, sampling is greedy.
            If temperature > 0 and/or top_p < 1 are specified, sampling will proceed accordingly and top_k will default to top_k = 0.
            Setting top_k = 1 results in greedy sampling.
        top_p (float, optional): Controls the top-P probability to sample from. Can have values between 0 and 1. Defaults to None.
            The value None is treated as "not specified" in the following.
            If neither temperature, top_p, nor top_k are specified, sampling is greedy.
            If temperature > 0 and/or top_k > 1 are specified, sampling will proceed accordingly and top_p will default to top_p = 1.
            Setting top_p = 0 should result in greedy sampling, but is currently disallowed in the backend.
        top_p_min (float, optional): Controls decay in the top-P algorithm. topPMin is lower-bound. None means using C++ runtime default 1.e-6. Defaults to None.
        top_p_reset_ids (int, optional): Controls decay in the top-P algorithm. Indicates where to reset the decay. None means using C++ runtime default 1. Defaults to None.
        top_p_decay (float, optional): Controls decay in the top-P algorithm. The decay value. None means using C++ runtime default 1.f. Defaults to None.
        seed (int, optional): Controls the random seed used by the random number generator in sampling. None means using C++ runtime default 0. Defaults to None.
        temperature (float, optional): Controls the modulation of logits when sampling new tokens. It can have values >= 0.f. Defaults to None.
            The value None is treated as "not specified" in the following.
            If neither temperature, top_p, nor top_k are specified, sampling is greedy.
            If top_p < 1 and/or top_k > 1 are specified, sampling will proceed accordingly and temperature will default to temperature = 1.
            Setting temperature = 0 results in greedy sampling.
        min_tokens (int, optional): Lower bound on the number of tokens to generate. Values < 1 have no effect. None means using C++ runtime default 1. Defaults to None.
        beam_search_diversity_rate (float, optional): Used to penalize tokens based on how often they appear in the sequence. It can have any value > 0.f. Values < 1.f encourages repetition, values > 1.f discourages it. None means using C++ runtime default 1.f. Defaults to None.
        repetition_penalty (float, optional): Used to penalize tokens based on how often they appear in the sequence. It can have any value > 0.f. Values < 1.f encourages repetition, values > 1.f discourages it. None means using C++ runtime default 1.f. Defaults to None.
        presence_penalty (float, optional): Used to penalize tokens already present in the sequence (irrespective of the number of appearances). It can have any values. Values < 0.f encourage repetition, values > 0.f discourage it. None means using C++ runtime default 0.f. Defaults to None.
        frequency_penalty (float, optional): Used to penalize tokens already present in the sequence (dependent on the number of appearances). It can have any values. Values < 0.f encourage repetition, values > 0.f discourage it. None means using C++ runtime default 0.f. Defaults to None.
        prompt_ignore_length (int, optional): Controls how many tokens to ignore from the prompt for presence and frequency penalties. Values <= 0 have no effect. Values > input (prompt) length will be clamped. None means using C++ runtime default 0. Defaults to None.
        length_penalty (float, optional): Controls how to penalize longer sequences in beam search. None means using C++ runtime default 0.f. Defaults to None.
        early_stopping (int, optional): Controls whether the generation process finishes once beamWidth sentences are generated (ends with end_token). None means using C++ runtime default 1. Defaults to None.
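The docstring entries for `temperature`, `top_p`, and `top_k` together describe when decoding is greedy and which defaults are filled in otherwise. The sketch below restates those documented rules as plain Python; it is an illustration under stated assumptions, not the library's code. `resolve_sampling` is a hypothetical name, and combinations the docstring does not spell out (for example `top_p = 1` given on its own) are handled here by simply filling in the documented defaults, which is an assumption.

```python
from typing import Optional, Tuple


def resolve_sampling(temperature: Optional[float] = None,
                     top_p: Optional[float] = None,
                     top_k: Optional[int] = None
                     ) -> Optional[Tuple[float, float, int]]:
    """Return None for greedy decoding, else (temperature, top_p, top_k).

    Hypothetical helper for illustration; it is not part of tensorrt_llm.
    """
    # top_p = 0 is documented as currently disallowed in the backend.
    if top_p is not None and top_p == 0:
        raise ValueError("top_p = 0 is disallowed; use temperature = 0 or top_k = 1")
    # Nothing specified: greedy.
    if temperature is None and top_p is None and top_k is None:
        return None
    # temperature = 0 or top_k = 1 are documented as greedy.
    if temperature == 0 or top_k == 1:
        return None
    # Otherwise sample, filling unspecified values with the documented
    # defaults: temperature = 1, top_p = 1, top_k = 0 (all logits).
    return (1.0 if temperature is None else temperature,
            1.0 if top_p is None else top_p,
            0 if top_k is None else top_k)
```

Under these assumptions, `resolve_sampling(top_p=0.9)` yields `(1.0, 0.9, 0)`, and `resolve_sampling(temperature=0.0, top_k=50)` yields `None` because a temperature of 0 selects greedy decoding.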

Callers 15

test_llm_torch_multi_lora_support · Function · 0.90
test_ptp_quickstart_bert · Function · 0.90
test_llmapi_generation_logits · Function · 0.90
test_eagle3_output_consistency_4gpus · Function · 0.90
test_connector_simple · Function · 0.90
test_connector_async_onboard · Function · 0.90
test_connector_async_save · Function · 0.90
test_connector_scheduler_output · Function · 0.90
test_connector_scheduler_output_chunked_context · Function · 0.90
test_connector_disagg_prefill · Function · 0.90
test_connector_multi_request · Function · 0.90
verify_disaggregated · Function · 0.90

Calls

no outgoing calls

Tested by 15

test_llm_torch_multi_lora_support · Function · 0.72
test_ptp_quickstart_bert · Function · 0.72
test_llmapi_generation_logits · Function · 0.72
test_eagle3_output_consistency_4gpus · Function · 0.72
test_connector_simple · Function · 0.72
test_connector_async_onboard · Function · 0.72
test_connector_async_save · Function · 0.72
test_connector_scheduler_output · Function · 0.72
test_connector_scheduler_output_chunked_context · Function · 0.72
test_connector_disagg_prefill · Function · 0.72
test_connector_multi_request · Function · 0.72
verify_disaggregated · Function · 0.72