hub / github.com/hpcaitech/ColossalAI / RequestHandler

Class RequestHandler

colossalai/inference/core/request_handler.py:140–402

RequestHandler is the core component for handling existing requests and updating the current batch. During generation, the schedule function is called on each iteration to update the current batch. It is constructed from an inference_config (configuration for initializing and managing the KV cache) and a model_config (configuration for the model).
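
To illustrate the per-iteration scheduling pattern described above, here is a minimal, self-contained sketch. It is not ColossalAI's implementation: `ToySeq`, `ToyHandler`, and the methods `add`, `schedule`, and `step` are hypothetical names, and the admission rule (fill the batch in arrival order up to `max_batch_size`) is an assumption chosen for simplicity.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class ToySeq:
    """Hypothetical stand-in for a request: an id and a token budget."""
    request_id: int
    max_new_tokens: int
    generated: int = 0

    @property
    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens


class ToyHandler:
    """Hypothetical scheduler; NOT the ColossalAI RequestHandler."""

    def __init__(self, max_batch_size: int) -> None:
        self.max_batch_size = max_batch_size
        self.waiting: Deque[ToySeq] = deque()
        self.running: List[ToySeq] = []
        self.done: List[ToySeq] = []

    def add(self, seq: ToySeq) -> None:
        self.waiting.append(seq)

    def schedule(self) -> List[ToySeq]:
        # Admit waiting requests, in arrival order, while the batch has room.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        return self.running

    def step(self) -> None:
        # Pretend the model produced one token for every running request,
        # then retire the requests that have reached their budget.
        for seq in self.running:
            seq.generated += 1
        self.done.extend(s for s in self.running if s.finished)
        self.running = [s for s in self.running if not s.finished]


def run(handler: ToyHandler) -> List[int]:
    """Drive the loop until every request finishes; return ids in finish order."""
    while handler.waiting or handler.running:
        handler.schedule()  # update the current batch on each iteration
        handler.step()
    return [s.request_id for s in handler.done]


handler = ToyHandler(max_batch_size=2)
for rid, budget in [(0, 3), (1, 1), (2, 2)]:
    handler.add(ToySeq(rid, budget))
finish_order = run(handler)
print(finish_order)
```

Because `schedule` runs before every step, request 2 joins the batch as soon as request 1 retires, rather than waiting for the whole first batch to drain.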


class RequestHandler(NaiveRequestHandler):
    """
    RequestHandler is the core component for handling existing requests and updating the current batch.
    During generation, the schedule function is called on each iteration to update the current batch.

    Args:
        inference_config: Configuration for initializing and managing the KV cache.
        model_config: Configuration for the model.
        dtype (torch.dtype): The data type for weights and activations.
    """

    def __init__(self, inference_config: InferenceConfig, model_config: PretrainedConfig) -> None:
        self.inference_config = inference_config
        self.running_list: RunningList = RunningList(inference_config.prefill_ratio)
        self.waiting_list: List[List] = [[], [], []]
        self.done_list: List[Sequence] = []
        self.dtype = inference_config.dtype
        self.max_batch_size = inference_config.max_batch_size

        # initialize cache
        self._init_cache(model_config)

        # initialize batch
        device = torch.cuda.current_device()
        # Ceiling division: number of KV blocks needed to cover the longest sequence
        kv_max_split_num = (
            inference_config.max_input_len + inference_config.max_output_len + inference_config.block_size - 1
        ) // inference_config.block_size
        head_dim = model_config.hidden_size // model_config.num_attention_heads

        fd_inter_tensor = FDIntermTensors()

        if fd_inter_tensor._tensors_initialized:
            fd_inter_tensor._reset()

        # For Spec-Dec, process the speculated tokens plus the token in the last step for each seq
        max_n_tokens = self.max_batch_size
        max_n_tokens *= self.inference_config.max_n_spec_tokens + 1

        fd_inter_tensor.initialize(
            max_batch_size=max_n_tokens,
            num_attn_heads=model_config.num_attention_heads // inference_config.tp_size,
            kv_max_split_num=kv_max_split_num,
            head_dim=head_dim,
            dtype=self.dtype,
            device=device,
        )

        # TODO: In the continuous batching scenario, the batch size may be greater than max_batch_size,
        # which may cause bugs; this issue should be fixed later.
        self.running_bb = BatchBucket(
            num_heads=model_config.num_attention_heads // inference_config.tp_size,
            head_dim=head_dim,
            max_batch_size=self.max_batch_size,
            max_length=inference_config.max_input_len + inference_config.max_output_len,
            block_size=inference_config.block_size,
            kv_max_split_num=kv_max_split_num,
            fd_interm_tensor=fd_inter_tensor,
            dtype=self.dtype,
            # ... (listing truncated)
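
The sizing arithmetic in the constructor above can be checked in isolation. The sketch below reproduces only those integer formulas; `ToyInferenceConfig`, `sizing`, and all numeric values are hypothetical stand-ins chosen for illustration, not ColossalAI defaults.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToyInferenceConfig:
    """Hypothetical stand-in holding only the fields the sizing math reads."""
    max_batch_size: int = 8
    max_input_len: int = 250
    max_output_len: int = 256
    block_size: int = 16
    max_n_spec_tokens: int = 5
    tp_size: int = 2


def sizing(cfg: ToyInferenceConfig, hidden_size: int, num_attention_heads: int) -> Dict[str, int]:
    total_len = cfg.max_input_len + cfg.max_output_len
    return {
        # ceil(total_len / block_size) using integer arithmetic only
        "kv_max_split_num": (total_len + cfg.block_size - 1) // cfg.block_size,
        # per-head width of the hidden state
        "head_dim": hidden_size // num_attention_heads,
        # attention heads held by each tensor-parallel rank
        "heads_per_rank": num_attention_heads // cfg.tp_size,
        # speculated tokens plus the last-step token, for every sequence
        "max_n_tokens": cfg.max_batch_size * (cfg.max_n_spec_tokens + 1),
    }


# Assumed model shape for illustration: hidden size 4096 split over 32 heads.
result = sizing(ToyInferenceConfig(), hidden_size=4096, num_attention_heads=32)
print(result)
```

With these assumed values the maximum sequence length is 506 tokens, which is not a multiple of the block size, so the `+ block_size - 1` term rounds the block count up to 32 instead of truncating it to 31.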

Callers (2)

check_request_handler (Function · 0.90)

__init__ (Method · 0.85)

Calls

no outgoing calls

Tested by (1)

check_request_handler (Function · 0.72)