github.com/huggingface/smolagents

Class InferenceClientModel

src/smolagents/models.py:1456–1643

A class to interact with language models through Hugging Face's Inference Providers. It can be used in serverless mode, with a dedicated endpoint, or with a local URL, and supports features such as stop sequences and grammar customization.
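The three usage modes in this summary correspond to the `model_id`, `provider` and `base_url` arguments documented in the class docstring. The sketch below uses a hypothetical helper, `describe_target`, to illustrate only those documented rules: `base_url` excludes a model and makes `provider` unused, `model_id` may be a Hub ID or an Inference Endpoint URL, and `provider` defaults to `"auto"`. It is not part of smolagents. The mode labels and the pass-through of `provider` for endpoint URLs are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InferenceTarget:
    """Illustrative record of where a request would go (hypothetical, not a smolagents type)."""

    mode: str  # "serverless", "endpoint" or "base_url" (labels chosen for this sketch)
    target: str  # Hub model ID or URL
    provider: Optional[str]  # None when no provider is used


def describe_target(
    model_id: Optional[str],
    provider: Optional[str] = None,
    base_url: Optional[str] = None,
) -> InferenceTarget:
    """Hypothetical helper that classifies the usage modes described in the docstring."""
    if base_url is not None:
        # Documented: base_url cannot be combined with a model, and provider is not used.
        if model_id is not None:
            raise ValueError("Pass either a model or base_url, not both.")
        return InferenceTarget("base_url", base_url, None)
    if model_id is None:
        raise ValueError("A model ID, an endpoint URL, or base_url is required.")
    if model_id.startswith(("http://", "https://")):
        # Documented: model_id may be the URL of a deployed Inference Endpoint.
        # Assumption: provider is passed through unchanged, since the docstring is silent.
        return InferenceTarget("endpoint", model_id, provider)
    # Documented: a Hub model ID goes through Inference Providers, and "auto" selects
    # the first available provider in the user's preference order.
    return InferenceTarget("serverless", model_id, provider or "auto")
```

For example, `describe_target(None, base_url="http://localhost:8080/v1")` reports the `base_url` mode with no provider. This matches the local-URL case in the summary.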


````python
class InferenceClientModel(ApiModel):
    """A class to interact with Hugging Face's Inference Providers for language model interaction.

    This model allows you to communicate with Hugging Face's models using Inference Providers. It can be used in serverless mode, with a dedicated endpoint, or with a local URL, supporting features like stop sequences and grammar customization.

    Providers include Cerebras, Cohere, Fal, Fireworks, HF-Inference, Hyperbolic, Nebius, Novita, Replicate, SambaNova, Together, and more.

    Parameters:
        model_id (`str`, *optional*, defaults to `"Qwen/Qwen3-Next-80B-A3B-Thinking"`):
            The Hugging Face model ID to be used for inference.
            This can be a model identifier from the Hugging Face model hub or a URL to a deployed Inference Endpoint.
            Currently, it defaults to `"Qwen/Qwen3-Next-80B-A3B-Thinking"`, but this may change in the future.
        provider (`str`, *optional*):
            Name of the provider to use for inference. A list of supported providers can be found in the [Inference Providers documentation](https://huggingface.co/docs/inference-providers/index#partners).
            Defaults to "auto", i.e. the first of the providers available for the model, sorted by the user's order [here](https://hf.co/settings/inference-providers).
            If `base_url` is passed, then `provider` is not used.
        token (`str`, *optional*):
            Token used by the Hugging Face API for authentication. This token needs the 'Make calls to the serverless Inference Providers' permission.
            If the model is gated (like Llama-3 models), the token also needs 'Read access to contents of all public gated repos you can access'.
            If not provided, the class will try to use the environment variable `HF_TOKEN`, else the token stored in the Hugging Face CLI configuration.
        timeout (`int`, *optional*, defaults to 120):
            Timeout for the API request, in seconds.
        client_kwargs (`dict[str, Any]`, *optional*):
            Additional keyword arguments to pass to the Hugging Face InferenceClient.
        custom_role_conversions (`dict[str, str]`, *optional*):
            Mapping used to convert message roles into other roles.
            Useful for models that do not support certain message roles, such as "system".
        api_key (`str`, *optional*):
            Token to use for authentication. This is a duplicated argument from `token` to make [`InferenceClientModel`]
            follow the same pattern as the `openai.OpenAI` client. Cannot be used if `token` is set. Defaults to None.
        bill_to (`str`, *optional*):
            The billing account to use for the requests. By default, requests are billed to the user's account. Requests can only be billed to
            an organization the user is a member of and which has subscribed to Enterprise Hub.
        base_url (`str`, *optional*):
            Base URL to run inference. This is a duplicated argument from `model` to make [`InferenceClientModel`]
            follow the same pattern as the `openai.OpenAI` client. Cannot be used if `model` is set. Defaults to None.
        **kwargs:
            Additional keyword arguments to forward to the underlying Hugging Face InferenceClient completion call.

    Raises:
        ValueError:
            If the model name is not provided.

    Example:
    ```python
    >>> engine = InferenceClientModel(
    ...     model_id="Qwen/Qwen3-Next-80B-A3B-Thinking",
    ...     provider="hyperbolic",
    ...     token="your_hf_token_here",
    ...     max_tokens=5000,
    ... )
    >>> messages = [{"role": "user", "content": "Explain quantum mechanics in simple terms."}]
    >>> response = engine(messages, stop_sequences=["END"])
    >>> print(response.content)
    Quantum mechanics is the branch of physics that studies...
    ```
    """
````
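The `token` and `api_key` parameters describe a lookup order. First comes an explicit `token` or its OpenAI-style alias `api_key` (never both). Next is the `HF_TOKEN` environment variable, and last the token saved by the Hugging Face CLI. A minimal sketch of that order follows. `resolve_token` is a hypothetical helper, and `cli_token` stands in for however the CLI-stored token is actually read, which this sketch does not reproduce. Treating an empty `HF_TOKEN` as unset is a choice made for this sketch.

```python
import os
from collections.abc import Mapping
from typing import Optional


def resolve_token(
    token: Optional[str] = None,
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    cli_token: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical sketch of the documented token lookup order.

    `env` defaults to os.environ. `cli_token` is a stand-in for the token saved
    by the Hugging Face CLI; reading that file is out of scope here.
    """
    if token is not None and api_key is not None:
        # api_key is only an OpenAI-style alias for token, so both together is an error.
        raise ValueError("Pass either `token` or `api_key`, not both.")
    explicit = token if token is not None else api_key
    if explicit is not None:
        return explicit
    environment = os.environ if env is None else env
    # Fall back to HF_TOKEN (an empty value counts as unset here), then to the CLI token.
    return environment.get("HF_TOKEN") or cli_token
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.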

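`custom_role_conversions` is documented only as a mapping from one message role to another, for models that do not support roles such as "system". Assuming it is applied as a plain rename of each message's `role` field, with unmapped roles left alone, the idea looks like the sketch below. `convert_roles` is a hypothetical name, not the library's own function.

```python
from typing import Any, Dict, List, Optional


def convert_roles(
    messages: List[Dict[str, Any]],
    custom_role_conversions: Optional[Dict[str, str]] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical sketch: rename each message's role via a conversion mapping.

    Roles missing from the mapping are kept; the input messages are not mutated.
    """
    conversions = custom_role_conversions or {}
    return [
        # Copy the message and swap in the converted role (or keep the original one).
        {**message, "role": conversions.get(message["role"], message["role"])}
        for message in messages
    ]
```

With `{"system": "user"}`, a leading system prompt is sent as a user message, which is the kind of model the parameter description has in mind.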