
Vision Agents give you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.
create response), Gemini (generate), and Claude (create message), so you can always access the latest LLM capabilities.
Step 1: Install via uv
uv add vision-agents
Step 2: (Optional) Install with extra integrations
uv add "vision-agents[getstream, openai, elevenlabs, deepgram]"
Step 3: Obtain your Stream API credentials
Get a free API key from Stream. Developers receive 333,000 participant minutes per month, plus extra credits via the Maker Program.
Follow the quickstart guide to build your first agent.
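
Before running an agent, it can help to confirm that the credentials from Step 3 are visible to the process. The sketch below assumes they are exposed as environment variables named `STREAM_API_KEY` and `STREAM_API_SECRET`. Those names are an assumption made for illustration, so check the quickstart guide for the ones your setup expects. `missing_credentials` is a hypothetical helper, not part of Vision Agents.

```python
import os

# Assumed variable names; confirm them against the quickstart guide.
REQUIRED_VARS = ("STREAM_API_KEY", "STREAM_API_SECRET")


def missing_credentials(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Report what is missing from the current environment, if anything.
missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All required credentials are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests without touching the real environment.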
https://github.com/user-attachments/assets/d1258ac2-ca98-4019-80e4-41ec5530117e
This example shows how to build a golf coaching AI with YOLO and Gemini Live. Combining a fast object detection model (like YOLO) with a full realtime AI is useful for many video AI use cases, for example drone fire detection, sports or video game coaching, physical therapy, workout coaching, and Just Dance-style games.
# partial example, full example: examples/02_golf_coach_example/golf_coach_example.py
agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="Read @golf_coach.md",
    llm=gemini.Realtime(fps=10),
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt", device="cuda")],
)
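
One way to picture the combination in the partial example above: a fast processor annotates incoming frames, and only a throttled subset is forwarded to the realtime model (compare `fps=10`). The standalone sketch below illustrates that flow without using the Vision Agents API. `Frame`, `FakePoseProcessor`, and `forward_at_fps` are hypothetical names invented here, and the detector is a stub rather than YOLO.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Hypothetical stand-in for a decoded video frame."""
    timestamp_ms: int  # milliseconds since the start of the stream
    annotations: dict = field(default_factory=dict)


class FakePoseProcessor:
    """Stub detector; a real pipeline would run a pose model here."""

    def process(self, frame: Frame) -> Frame:
        frame.annotations["pose"] = "keypoints-placeholder"
        return frame


def forward_at_fps(frames, processors, fps):
    """Yield processed frames, keeping at most `fps` frames per second."""
    min_gap_ms = 1000 / fps
    last_sent = None
    for frame in frames:
        if last_sent is not None and frame.timestamp_ms - last_sent < min_gap_ms:
            continue  # too soon after the previously forwarded frame
        for processor in processors:
            frame = processor.process(frame)
        last_sent = frame.timestamp_ms
        yield frame


# One second of a 30 fps camera feed, forwarded at 10 fps.
camera = [Frame(timestamp_ms=round(i * 1000 / 30)) for i in range(30)]
sent = list(forward_at_fps(camera, [FakePoseProcessor()], fps=10))
```

Skipped frames are never passed to the processors in this sketch, which keeps the detector cost proportional to the forwarded rate. A pipeline that needs per-frame tracking would instead process every frame and throttle only the forwarding step.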
| Feature | Description |
|---|---|
| Real-time WebRTC | Stream video directly to model providers for instant visual understanding. |
| Video Processing | Pluggable processor pipeline for YOLO, Roboflow, or custom PyTorch/ONNX models before/after LLM calls. |
| Turn Detection | Natural conversation flow with VAD, diarization, and smart turn-taking. |
| Tool Calling & MCP | Execute code and APIs mid-conversation — Linear issues, weather, telephony, or any MCP server. |
| Phone Integration | Inbound and outbound voice calls via Twilio or Telnyx with bidirectional audio streaming. |
| RAG | Retrieval-augmented generation with TurboPuffer/Qdrant vector search or Gemini FileSearch. |
| Memory | Agents recall context across turns and sessions via Stream Chat. |
| Text Back-channel | Message the agent silently during a call — coaching overlays, silent instructions, etc. |
| Production Ready | Built-in HTTP server, Prometheus metrics, horizontal scaling, and Kubernetes deployment. |
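
Tool calling, as listed in the table above, generally works by describing functions to the model, letting it emit a call with a name and JSON arguments, and then running the matching function and returning the result. The sketch below shows that dispatch step in plain Python. `ToolRegistry`, `get_weather`, and the shape of the call payload are hypothetical and are not the Vision Agents or MCP API.

```python
import json


class ToolRegistry:
    """Hypothetical registry mapping tool names to Python callables."""

    def __init__(self):
        self._tools = {}

    def register(self, func):
        """Decorator that exposes `func` under its own name."""
        self._tools[func.__name__] = func
        return func

    def dispatch(self, call_json: str) -> str:
        """Run the tool named in a JSON call and return a JSON result."""
        call = json.loads(call_json)
        func = self._tools.get(call["name"])
        if func is None:
            return json.dumps({"error": f"unknown tool: {call['name']}"})
        result = func(**call.get("arguments", {}))
        return json.dumps({"result": result})


registry = ToolRegistry()


@registry.register
def get_weather(city: str) -> str:
    # Canned response; a real tool would call a weather API here.
    return f"Sunny in {city}"


# Simulated model output requesting a tool call mid-conversation.
reply = registry.dispatch('{"name": "get_weather", "arguments": {"city": "Boulder"}}')
```

Returning an error payload for unknown tools, rather than raising, lets the result be passed back to the model so the conversation can continue.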
LLMs: OpenAI · Gemini · xAI · OpenRouter · Hugging Face · Kimi AI · MiniMax
Realtime: OpenAI Realtime · Gemini Live · AWS Nova Sonic · Qwen · Inworld
STT: Deepgram · AssemblyAI · Fast-Whisper · Fish Audio · Wizper · Mistral Voxtral
TTS: ElevenLabs · Cartesia · Deepgram · AWS Polly · Pocket · Kokoro · Inworld · Fish Audio
Vision: Ultralytics · Roboflow · Moondream · TwelveLabs · NVIDIA Cosmos · Decart
Avatars: LemonSlice
Turn Detection: Vogent · Smart Turn
Other: Twilio · Telnyx · TurboPuffer
Check out the full docs at VisionAgents.ai.
Quickstart: Voice AI · Video AI
Guides: MCP & Function Calling · Video Processors · Phone Calling · RAG · Testing
Production: HTTP Server · Deployment · Kubernetes · Horizontal Scaling · Prometheus Metrics
| 🔮 Demo Applications | |
|---|---|
| Build fast voice agents that can reason over knowledge, search files, and respond in real time.<br>• Low-latency voice interactions<br>• Retrieval-augmented responses<br>• File and knowledge search<br>>Source Code and tutorial | |
| Power interactive coaching flows with live pose tracking and processor pipelines for frame-by-frame understanding.<br>• Real-time pose tracking<br>• Actionable coaching feedback<br>• Video processor pipeline support<br>>Source Code and tutorial | |
| Use models like Decart Lucy to build virtual try-ons, stylized scenes, or give your agents a visual identity.<br>• Real-time video restyling<br>• Virtual try-on experiences<br>• Avatar-like visual presence<br>>Source Code and tutorial | |
| Train and run custom computer vision models for security monitoring, moderation, and other domain-specific workflows.<br>• Bring your own CV models<br>• Real-time moderation pipelines<br>• Security and detection use cases<br>>Source Code and tutorial | |
| Connect external APIs and services so agents can validate data and take real-world actions during live conversations.<br>• MCP and function calling support<br>• Twilio and Telnyx phone workflows<br>• Real-time fraud response automation<br>>Phone + RAG example · >Telnyx phone examples · >Fraud workflow example | |
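
The retrieval step behind the retrieval-augmented responses mentioned in the first demo can be sketched in a few lines. This is a toy illustration only: it ranks documents by word-count cosine similarity instead of calling an embedding model or a vector store such as TurboPuffer or Qdrant, and `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical helper names, not library functions.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would use a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the `top_k` documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, documents: list[str]) -> str:
    """Prepend the retrieved context to the user question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = [
    "Refunds are processed within five business days.",
    "Support is available by phone on weekdays.",
]
prompt = build_prompt("How long do refunds take?", docs)
```

In a voice agent, the resulting prompt would be handed to the LLM before it answers, so the spoken reply is grounded in the retrieved text.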
More involved demos built by the community and the Stream team: full applications that go beyond the in-repo examples and show what's possible with Vision Agents in production.
Got a demo you'd like featured? Open a PR or reach out on Discord.
See DEVELOPMENT.md
Want to add your platform or provider? See Create Your Own Plugin or reach out to nash@getstream.io.