hub / github.com/zai-org/GLM-4 / generate_stream

Method generate_stream

demo/composite_demo/src/clients/vllm.py:30–51
generate_stream(self, tools: list[dict], history: list[Conversation], **parameters)

Source from the content-addressed store, hash-verified

        self.engine = LLMEngine.from_engine_args(self.engine_args)

    def generate_stream(
        self, tools: list[dict], history: list[Conversation], **parameters
    ) -> Generator[tuple[str | dict, list[dict]]]:
        # Convert the conversation and tool definitions into chat messages,
        # then render them as a prompt string (tokenize=False).
        chat_history = process_input(history, tools)
        model_inputs = self.tokenizer.apply_chat_template(chat_history, add_generation_prompt=True, tokenize=False)
        # Rename the caller's "max_new_tokens" to "max_tokens".
        # pop() has no default here, so a missing key raises KeyError.
        parameters["max_tokens"] = parameters.pop("max_new_tokens")
        # Defaults; any caller-supplied parameters override them below.
        params_dict = {
            "n": 1,
            "best_of": 1,
            "top_p": 1,
            "top_k": -1,
            "length_penalty": 1,
            "stop_token_ids": [151329, 151336, 151338],
        }
        params_dict.update(parameters)
        sampling_params = SamplingParams(**params_dict)

        # Queue a single request, using the current timestamp as its ID.
        self.engine.add_request(request_id=str(time.time()), inputs=model_inputs, params=sampling_params)
        # Step the engine until the request finishes, yielding the
        # post-processed text of the first completion after each step.
        while self.engine.has_unfinished_requests():
            request_outputs = self.engine.step()
            for request_output in request_outputs:
                yield process_response(request_output.outputs[0].text, chat_history)
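
The parameter handling in the method above can be looked at on its own. The following is a minimal sketch, not code from the repository: `build_sampling_kwargs` and `DEFAULT_SAMPLING_KWARGS` are hypothetical names introduced here, and the sketch only builds the keyword dictionary rather than constructing `SamplingParams`, so it runs without vLLM installed.

```python
# Hypothetical helper, not part of the GLM-4 repository: it isolates the
# keyword-argument merging that generate_stream performs inline.
DEFAULT_SAMPLING_KWARGS = {
    "n": 1,
    "best_of": 1,
    "top_p": 1,
    "top_k": -1,
    "length_penalty": 1,
    "stop_token_ids": [151329, 151336, 151338],
}


def build_sampling_kwargs(parameters: dict) -> dict:
    """Return the keyword arguments that would be passed to SamplingParams."""
    merged = dict(parameters)  # copy so the caller's dict is not mutated
    # The source calls pop("max_new_tokens") with no default, so a missing
    # key raises KeyError; this sketch keeps that behaviour.
    merged["max_tokens"] = merged.pop("max_new_tokens")
    kwargs = dict(DEFAULT_SAMPLING_KWARGS)
    kwargs.update(merged)  # caller-supplied values override the defaults
    return kwargs


example = build_sampling_kwargs({"max_new_tokens": 256, "top_p": 0.8})
```

Because `update` runs after the defaults are set, a caller-supplied `top_p` replaces the default of 1, while keys the caller omits (such as `top_k`) keep their default values.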

Callers

nothing calls this directly

Calls 2

process_input  (Function · 0.90)
process_response  (Function · 0.90)

Tested by

no test coverage detected
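
One possible way to exercise the streaming loop without loading a model is to drive it with a stub engine. This is only a sketch under stated assumptions: `FakeEngine`, `FakeRequestOutput`, `FakeCompletion`, and `stream_texts` are hypothetical names, the stub exposes only the three engine methods the loop calls, and the snapshot strings are made-up sample data. The tokenizer, `process_input`, and `process_response` are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class FakeCompletion:
    text: str


@dataclass
class FakeRequestOutput:
    outputs: list


class FakeEngine:
    """Hypothetical stand-in exposing only the methods the loop uses."""

    def __init__(self, snapshots):
        self._snapshots = list(snapshots)
        self._pending = []
        self.requests = []

    def add_request(self, request_id, inputs, params):
        # Record the request and queue one text snapshot per future step.
        self.requests.append((request_id, inputs, params))
        self._pending = list(self._snapshots)

    def has_unfinished_requests(self):
        return bool(self._pending)

    def step(self):
        # Each step returns one request output holding the next snapshot.
        text = self._pending.pop(0)
        return [FakeRequestOutput(outputs=[FakeCompletion(text)])]


def stream_texts(engine, prompt):
    """Same loop shape as generate_stream, minus tokenizer and post-processing."""
    engine.add_request(request_id="req-0", inputs=prompt, params=None)
    while engine.has_unfinished_requests():
        for request_output in engine.step():
            yield request_output.outputs[0].text


engine = FakeEngine(["Hel", "Hello", "Hello!"])
chunks = list(stream_texts(engine, "hi"))
```

A real test would additionally need to patch the tokenizer and the two helper functions, which this sketch does not attempt.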