hub / github.com/mudler/LocalAI / PredictStream

Method PredictStream

backend/python/mlx/backend.py:291–392

Generates text from the given prompt and sampling parameters and streams the results using MLX. Uses a thread-safe LRU prompt cache for efficient prefix reuse across requests.

Args:
    request: The predict stream request.
    context: The gRPC context.
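The prefix reuse described above can be illustrated with a small, self-contained sketch. This is not the backend's cache implementation, which is not shown on this page: the class name `PrefixLRUCache`, the `max_entries` parameter, and the stored string values are all hypothetical. Only the method names `fetch_nearest_cache` and `insert_cache`, and the `(cache, remaining_tokens)` return shape, are taken from the listing. The sketch assumes that "nearest" means the longest stored token sequence that is a prefix of the request.

```python
import threading
from collections import OrderedDict


class PrefixLRUCache:
    """Hypothetical stand-in for an LRU prompt cache keyed by token prefix."""

    def __init__(self, max_entries=4):
        self._lock = threading.Lock()
        # (model_key, tuple(tokens)) -> cache object, oldest entry first
        self._entries = OrderedDict()
        self._max_entries = max_entries

    def insert_cache(self, model_key, tokens, cache):
        key = (model_key, tuple(tokens))
        with self._lock:
            self._entries[key] = cache
            self._entries.move_to_end(key)
            while len(self._entries) > self._max_entries:
                self._entries.popitem(last=False)  # evict least recently used

    def fetch_nearest_cache(self, model_key, tokens):
        """Return (cache, remaining_tokens), or (None, tokens) on a miss."""
        tokens = list(tokens)
        with self._lock:
            best_key = None
            for key in self._entries:
                key_model, key_tokens = key
                if key_model != model_key:
                    continue
                n = len(key_tokens)
                # Only an entry that is a prefix of the request can be reused.
                if n <= len(tokens) and tuple(tokens[:n]) == key_tokens:
                    if best_key is None or n > len(best_key[1]):
                        best_key = key
            if best_key is None:
                return None, tokens
            self._entries.move_to_end(best_key)  # mark as recently used
            return self._entries[best_key], tokens[len(best_key[1]):]


cache = PrefixLRUCache(max_entries=2)
cache.insert_cache("m", [1, 2, 3], "kv-123")
hit, remaining = cache.fetch_nearest_cache("m", [1, 2, 3, 4, 5])
print(hit, remaining)  # kv-123 [4, 5]
```

A single lock around both operations keeps lookup and eviction consistent when several requests arrive concurrently. On an exact match the remaining token list is empty, which is presumably why the listing falls back to the full token list when `remaining_tokens` is empty.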

(self, request, context)


            return backend_pb2.Result(success=False, message=str(e))

    async def PredictStream(self, request, context):
        """
        Generates text based on the given prompt and sampling parameters, and streams the results using MLX.

        Uses thread-safe LRU prompt cache for efficient prefix reuse across requests.

        Args:
            request: The predict stream request.
            context: The gRPC context.

        Yields:
            backend_pb2.Reply: Streaming predict results.
        """
        prompt_cache = None
        cache_key = None

        try:
            # Prepare the prompt and tokenize for cache key
            prompt_text = self._prepare_prompt(request)
            cache_key = self._get_tokens_from_prompt(prompt_text)

            # Fetch nearest cache (exact, shorter prefix, or create new)
            prompt_cache, remaining_tokens = self.lru_cache.fetch_nearest_cache(
                self.model_key, cache_key
            )
            if prompt_cache is None:
                prompt_cache = make_prompt_cache(self.model, self.max_kv_size)
                remaining_tokens = cache_key

            # Build generation parameters using request attributes and options
            max_tokens, sampler_params, logits_params, stop_words = self._build_generation_params(
                request, default_max_tokens=512
            )

            print(
                f"Streaming text with MLX - max_tokens: {max_tokens}, "
                f"cache_hit: {len(remaining_tokens) < len(cache_key)}",
                file=sys.stderr,
            )

            # Create sampler and optional logits processors (penalties)
            sampler = make_sampler(**sampler_params)
            logits_processors = make_logits_processors(**logits_params) if logits_params else None

            accumulated = []
            last_response = None
            for response in stream_generate(
                self.model,
                self.tokenizer,
                prompt=remaining_tokens if remaining_tokens else cache_key,
                max_tokens=max_tokens,
                sampler=sampler,
                logits_processors=logits_processors,
                prompt_cache=prompt_cache,
            ):
                cache_key.append(response.token)
                accumulated.append(response.text)
                last_response = response
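The loop above accumulates streamed text, and the call list for this method includes `_truncate_at_stop`, but the rest of the loop body is not shown in this excerpt. As a rough illustration of how accumulated chunks can be cut at a stop word while streaming, here is a minimal sketch. Both `truncate_at_stop` and `stream_with_stops` are hypothetical helpers written for this example and should not be read as the backend's code.

```python
def truncate_at_stop(text, stop_words):
    """Cut text at the earliest stop word; return (text, stopped)."""
    cut = -1
    for stop in stop_words:
        idx = text.find(stop)
        if idx != -1 and (cut == -1 or idx < cut):
            cut = idx
    if cut == -1:
        return text, False
    return text[:cut], True


def stream_with_stops(chunks, stop_words):
    """Yield text chunks until the accumulated text contains a stop word."""
    accumulated = []
    emitted = 0  # number of characters already yielded
    for chunk in chunks:
        accumulated.append(chunk)
        text, stopped = truncate_at_stop("".join(accumulated), stop_words)
        if len(text) > emitted:
            # Yield only the part of the text that has not been sent yet.
            yield text[emitted:]
            emitted = len(text)
        if stopped:
            break


pieces = list(stream_with_stops(["Hello ", "world", "\nUser:", " hi"], ["\nUser:"]))
print(pieces)  # ['Hello ', 'world']
```

One limitation of this simple version: if a stop word is split across two chunks, its leading characters may already have been yielded before the match is detected. Holding back a tail as long as the longest stop word would avoid that, at the cost of slightly delayed output.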

Callers

nothing calls this directly

Calls 10

_prepare_prompt · Method · 0.95

_get_tokens_from_prompt · Method · 0.95

_build_generation_params · Method · 0.95

_truncate_at_stop · Method · 0.95

_finalize_output · Method · 0.95

append · Method · 0.80

fetch_nearest_cache · Method · 0.45

set_code · Method · 0.45

set_details · Method · 0.45

insert_cache · Method · 0.45

Tested by

no test coverage detected