Engine output from turbomind/pytorch engine. Args: status: the response type. token_ids: the newly generated token ids in each iteration. logprobs: the top logprobs for each output position. cache_block_ids: send cache blocks back for migration in

@dataclass
class EngineOutput:
    """Engine output from the turbomind/pytorch engine.

    Args:
        status: the response type.
        token_ids: the newly generated token ids in each iteration.
        logprobs: the top logprobs for each output position.
        cache_block_ids: cache block ids sent back for migration in
            disaggregated LLM serving once the prefill engine is done.
        req_metrics: request metrics information.
    """
    status: ResponseType
    token_ids: list[int]
    logprobs: list[dict[int, float]] | None = None
    logits: torch.Tensor | None = None
    last_hidden_state: torch.Tensor | None = None
    cache_block_ids: list[int] | None = None
    req_metrics: RequestMetrics | None = None
    routed_experts: torch.Tensor | None = None


@dataclass
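Because `token_ids` holds only the tokens produced in the current iteration, a consumer has to concatenate them across outputs to recover the full sequence. The sketch below illustrates that pattern under stated assumptions: `ResponseType` and `EngineOutput` here are trimmed stand-ins defined locally (tensor fields omitted, and the member names `SUCCESS` and `FINISH` are assumed), and `collect` is a hypothetical helper, not part of the engine.

```python
from __future__ import annotations

import enum
from dataclasses import dataclass


class ResponseType(enum.Enum):
    """Stand-in for the real enum; the member names are assumptions."""
    SUCCESS = enum.auto()
    FINISH = enum.auto()


@dataclass
class EngineOutput:
    """Trimmed mirror of the dataclass shown above (tensor fields omitted)."""
    status: ResponseType
    token_ids: list[int]
    logprobs: list[dict[int, float]] | None = None


def collect(outputs):
    """Hypothetical helper: join per-iteration token ids and, for each
    position that carries logprobs, record the highest-scoring candidate."""
    token_ids: list[int] = []
    best: list[int] = []
    for out in outputs:
        # Each output carries only the tokens generated in that iteration.
        token_ids.extend(out.token_ids)
        # logprobs may be None; each entry maps token id -> logprob.
        for candidates in out.logprobs or []:
            best.append(max(candidates, key=candidates.get))
        if out.status is ResponseType.FINISH:
            break
    return token_ids, best


stream = [
    EngineOutput(ResponseType.SUCCESS, [11, 12],
                 [{11: -0.1, 7: -2.5}, {12: -1.3, 9: -0.4}]),
    EngineOutput(ResponseType.FINISH, [13], [{13: -0.2, 4: -3.0}]),
]
tokens, best = collect(stream)
print(tokens)  # [11, 12, 13]
print(best)    # [11, 9, 13]
```

At the second position the generated token (12) differs from the highest-scoring candidate (9), which is why the two lists are tracked separately in this sketch.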