class InferenceClientModel(ApiModel):
    """A class to interact with language models through Hugging Face's Inference Providers.

    This model allows you to communicate with Hugging Face's models using Inference Providers. It can be used in serverless mode, with a dedicated endpoint, or even with a local URL, and supports features like stop sequences and grammar customization.

    Providers include Cerebras, Cohere, Fal, Fireworks, HF-Inference, Hyperbolic, Nebius, Novita, Replicate, SambaNova, Together, and more.

    Parameters:
        model_id (`str`, *optional*, defaults to `"Qwen/Qwen3-Next-80B-A3B-Thinking"`):
            The Hugging Face model ID to be used for inference.
            This can be a model identifier from the Hugging Face model hub or a URL to a deployed Inference Endpoint.
            Currently, it defaults to `"Qwen/Qwen3-Next-80B-A3B-Thinking"`, but this may change in the future.
        provider (`str`, *optional*):
            Name of the provider to use for inference. A list of supported providers can be found in the [Inference Providers documentation](https://huggingface.co/docs/inference-providers/index#partners).
            Defaults to "auto", i.e. the first provider available for the model, following the order the user set in their [Inference Providers settings](https://hf.co/settings/inference-providers).
            If `base_url` is passed, then `provider` is not used.
        token (`str`, *optional*):
            Token used by the Hugging Face API for authentication. This token needs the 'Make calls to the serverless Inference Providers' permission.
            If the model is gated (like Llama-3 models), the token also needs 'Read access to contents of all public gated repos you can access'.
            If not provided, the class tries the `HF_TOKEN` environment variable, then falls back to the token stored in the Hugging Face CLI configuration.
        timeout (`int`, *optional*, defaults to 120):
            Timeout for the API request, in seconds.
        client_kwargs (`dict[str, Any]`, *optional*):
            Additional keyword arguments to pass to the Hugging Face InferenceClient.
        custom_role_conversions (`dict[str, str]`, *optional*):
            Custom mapping used to convert message roles into other roles.
            Useful for models that do not support certain message roles, such as "system".
        api_key (`str`, *optional*):
            Token to use for authentication. This argument duplicates `token` so that [`InferenceClientModel`]
            follows the same pattern as the `openai.OpenAI` client. Cannot be used if `token` is set. Defaults to None.
        bill_to (`str`, *optional*):
            The billing account to use for the requests. By default, requests are billed to the user's account. Requests can only be billed to
            an organization the user is a member of, and which has subscribed to Enterprise Hub.
        base_url (`str`, *optional*):
            Base URL to run inference. This argument duplicates `model` so that [`InferenceClientModel`]
            follows the same pattern as the `openai.OpenAI` client. Cannot be used if `model` is set. Defaults to None.
        **kwargs:
            Additional keyword arguments to forward to the underlying Hugging Face InferenceClient completion call.

    Raises:
        ValueError:
            If the model name is not provided.

    Example:
    ```python
    >>> engine = InferenceClientModel(
    ...     model_id="Qwen/Qwen3-Next-80B-A3B-Thinking",
    ...     provider="hyperbolic",
    ...     token="your_hf_token_here",
    ...     max_tokens=5000,
    ... )
    >>> messages = [{"role": "user", "content": "Explain quantum mechanics in simple terms."}]
    >>> response = engine(messages, stop_sequences=["END"])
    >>> print(response)
    "Quantum mechanics is the branch of physics that studies..."
    ```
    """
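The docstring describes two argument rules that are easier to see in code: `token` and `api_key` are mutually exclusive with a fallback to the `HF_TOKEN` environment variable, and `custom_role_conversions` remaps message roles. The sketch below illustrates both in plain Python. The helpers `resolve_token` and `convert_roles` are hypothetical names introduced here for illustration; they are not part of the class and may differ from how the library actually implements these rules. The lookup of the token stored in the Hugging Face CLI configuration is deliberately left out.

```python
import os
from typing import Optional


def resolve_token(token: Optional[str] = None, api_key: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: choose a token following the documented rules."""
    # `api_key` duplicates `token`, so passing both is treated as an error.
    if token is not None and api_key is not None:
        raise ValueError("Pass either `token` or `api_key`, not both.")
    explicit = token if token is not None else api_key
    if explicit is not None:
        return explicit
    # Assumption: fall back to the HF_TOKEN environment variable only.
    # The CLI-stored token mentioned in the docstring is not handled here.
    return os.environ.get("HF_TOKEN")


def convert_roles(messages: list[dict], role_conversions: dict[str, str]) -> list[dict]:
    """Hypothetical helper: return copies of the messages with roles remapped."""
    return [
        # Roles missing from the mapping are kept unchanged.
        {**message, "role": role_conversions.get(message["role"], message["role"])}
        for message in messages
    ]


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum mechanics in simple terms."},
]

# Map "system" to "user" for a model that does not accept system messages.
converted = convert_roles(messages, {"system": "user"})
print([m["role"] for m in converted])  # prints ['user', 'user']
```

Returning new dictionaries instead of mutating the input keeps the caller's message list intact, which matters when the same history is reused across several models.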