
Class InferenceClientModel

src/smolagents/models.py:1456–1643

A class to interact with Hugging Face's Inference Providers for language model interaction. This model allows you to communicate with Hugging Face's models using Inference Providers. It can be used in serverless mode, with a dedicated endpoint, or even with a local URL, and supports features like stop sequences and grammar customization.


class InferenceClientModel(ApiModel):
    """A class to interact with Hugging Face's Inference Providers for language model interaction.

    This model allows you to communicate with Hugging Face's models using Inference Providers. It can be used in serverless mode, with a dedicated endpoint, or even with a local URL, supporting features like stop sequences and grammar customization.

    Providers include Cerebras, Cohere, Fal, Fireworks, HF-Inference, Hyperbolic, Nebius, Novita, Replicate, SambaNova, Together, and more.

    Parameters:
        model_id (`str`, optional, default `"Qwen/Qwen3-Next-80B-A3B-Thinking"`):
            The Hugging Face model ID to be used for inference.
            This can be a model identifier from the Hugging Face model hub or a URL to a deployed Inference Endpoint.
            Currently, it defaults to `"Qwen/Qwen3-Next-80B-A3B-Thinking"`, but this may change in the future.
        provider (`str`, optional):
            Name of the provider to use for inference. A list of supported providers can be found in the [Inference Providers documentation](https://huggingface.co/docs/inference-providers/index#partners).
            Defaults to "auto", i.e. the first of the providers available for the model, sorted by the user's order [here](https://hf.co/settings/inference-providers).
            If `base_url` is passed, then `provider` is not used.
        token (`str`, optional):
            Token used by the Hugging Face API for authentication. This token needs to be authorized for 'Make calls to the serverless Inference Providers'.
            If the model is gated (like Llama-3 models), the token also needs 'Read access to contents of all public gated repos you can access'.
            If not provided, the class will try to use the environment variable 'HF_TOKEN', else use the token stored in the Hugging Face CLI configuration.
        timeout (`int`, optional, defaults to 120):
            Timeout for the API request, in seconds.
        client_kwargs (`dict[str, Any]`, optional):
            Additional keyword arguments to pass to the Hugging Face InferenceClient.
        custom_role_conversions (`dict[str, str]`, optional):
            Custom role conversion mapping to convert message roles into others.
            Useful for specific models that do not support specific message roles like "system".
        api_key (`str`, optional):
            Token to use for authentication. This is a duplicated argument from `token` to make [`InferenceClientModel`]
            follow the same pattern as `openai.OpenAI` client. Cannot be used if `token` is set. Defaults to None.
        bill_to (`str`, optional):
            The billing account to use for the requests. By default the requests are billed on the user's account. Requests can only be billed to
            an organization the user is a member of, and which has subscribed to Enterprise Hub.
        base_url (`str`, optional):
            Base URL to run inference. This is a duplicated argument from `model` to make [`InferenceClientModel`]
            follow the same pattern as `openai.OpenAI` client. Cannot be used if `model` is set. Defaults to None.
        **kwargs:
            Additional keyword arguments to forward to the underlying Hugging Face InferenceClient completion call.

    Raises:
        ValueError:
            If the model name is not provided.

    Example:
    ```python
    >>> engine = InferenceClientModel(
    ...     model_id="Qwen/Qwen3-Next-80B-A3B-Thinking",
    ...     provider="hyperbolic",
    ...     token="your_hf_token_here",
    ...     max_tokens=5000,
    ... )
    >>> messages = [{"role": "user", "content": "Explain quantum mechanics in simple terms."}]
    >>> response = engine(messages, stop_sequences=["END"])
    >>> print(response)
    "Quantum mechanics is the branch of physics that studies..."
    ```
    """
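
The credential rules in the docstring (`token` and `api_key` are aliases that cannot both be set, with `HF_TOKEN` as the fallback) can be illustrated with a small sketch. `resolve_token` is a hypothetical helper written for this page, not a smolagents function; the use of `ValueError` for the conflict and the `env` parameter are assumptions made so the example is self-contained.

```python
import os


def resolve_token(token=None, api_key=None, env=None):
    """Hypothetical helper mirroring the documented precedence (not smolagents code)."""
    # Allow a custom mapping for testing; default to the real process environment.
    env = os.environ if env is None else env

    # The docstring says `api_key` cannot be used if `token` is set.
    # Raising ValueError here is an assumption about the error type.
    if token is not None and api_key is not None:
        raise ValueError("Pass either `token` or `api_key`, not both.")

    # An explicitly passed credential wins over the environment.
    explicit = token if token is not None else api_key
    if explicit is not None:
        return explicit

    # Fall back to HF_TOKEN. Returning None stands in for "defer to the
    # token stored in the Hugging Face CLI configuration".
    return env.get("HF_TOKEN")
```

Under these assumptions, `resolve_token(api_key="abc", env={"HF_TOKEN": "xyz"})` returns `"abc"`, and calling it with no arguments and an empty environment returns `None`.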

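To make the `custom_role_conversions` parameter more concrete, here is a minimal sketch of what a role mapping does to a list of chat messages. `convert_roles` is a hypothetical function for illustration only; the library's actual message handling may differ, for example in how messages are represented.

```python
def convert_roles(messages, custom_role_conversions):
    """Hypothetical illustration: return copies of messages with roles remapped."""
    return [
        # Roles missing from the mapping pass through unchanged.
        {**message, "role": custom_role_conversions.get(message["role"], message["role"])}
        for message in messages
    ]


messages = [
    {"role": "system", "content": "Answer briefly."},
    {"role": "user", "content": "Explain quantum mechanics in simple terms."},
]

# For a model that does not accept the "system" role, send that message as "user".
converted = convert_roles(messages, {"system": "user"})
```

The sketch builds new dictionaries instead of editing the input, so the original `messages` list keeps its `"system"` entry.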
Callers 15

load_model · Function · 0.90

test_call_with_custom_role_conversions · Method · 0.90

test_init_model_with_tokens · Method · 0.90

test_structured_outputs_with_unsupported_provider · Method · 0.90

test_get_hfapi_message_no_tool · Method · 0.90

test_get_hfapi_message_no_tool_external_provider · Method · 0.90

test_get_hfapi_message_stream_no_tool · Method · 0.90

test_get_hfapi_message_stream_no_tool_external_provider · Method · 0.90

test_toolcalling_agent_api · Method · 0.90

test_toolcalling_agent_api_misformatted_output · Method · 0.90

test_multiagents_save · Method · 0.90

test_model · Method · 0.90

Calls

no outgoing calls

Tested by 11

test_call_with_custom_role_conversions · Method · 0.72

test_init_model_with_tokens · Method · 0.72

test_structured_outputs_with_unsupported_provider · Method · 0.72

test_get_hfapi_message_no_tool · Method · 0.72

test_get_hfapi_message_no_tool_external_provider · Method · 0.72

test_get_hfapi_message_stream_no_tool · Method · 0.72

test_get_hfapi_message_stream_no_tool_external_provider · Method · 0.72

test_toolcalling_agent_api · Method · 0.72

test_toolcalling_agent_api_misformatted_output · Method · 0.72

test_multiagents_save · Method · 0.72

test_model · Method · 0.72
