
github.com/GetStream/Vision-Agents @v0.6.6 sqlite


4,953 symbols · 22,268 edges · 531 files · 2,379 documented (48%)

README

VisionAgents

Open Vision Agents by Stream


Multi-modal AI agents that watch, listen, and understand video.

Vision Agents give you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.

Key Highlights

Video AI: Built for real-time video AI. Combine YOLO, Roboflow, and other vision models with Gemini/OpenAI in real time.
Low Latency: Join quickly (500ms) and maintain audio/video latency under 30ms using Stream's edge network.
Open: Built by Stream, but works with any video edge network.
Native APIs: Native SDK methods from OpenAI (create response), Gemini (generate), and Claude (create message), so you always have access to the latest LLM capabilities.
SDKs: SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency network.

Getting Started

Step 1: Install via uv

uv add vision-agents

Step 2: (Optional) Install with extra integrations

uv add "vision-agents[getstream, openai, elevenlabs, deepgram]"

Step 3: Obtain your Stream API credentials

Get a free API key from Stream. Developers receive 333,000 participant minutes per month, plus extra credits via the Maker Program.

Follow the quickstart guide to build your first agent.
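
Before running an agent, it can help to confirm that the credentials from Step 3 are visible to the process. The sketch below is a minimal check using only the standard library. The variable names `STREAM_API_KEY` and `STREAM_API_SECRET` are assumptions made for illustration, so confirm the exact names in the quickstart guide.

```python
import os

# Assumed variable names; check the quickstart guide for the ones your setup uses.
REQUIRED_VARS = ("STREAM_API_KEY", "STREAM_API_SECRET")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("Stream credentials found.")
```

Passing a plain dictionary as `env` makes the check easy to exercise in tests without touching the real environment.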

See It In Action

https://github.com/user-attachments/assets/d1258ac2-ca98-4019-80e4-41ec5530117e

This example shows how to build a golf coaching AI with YOLO and Gemini Live. Combining a fast object detection model (like YOLO) with a full realtime AI model is useful for many video AI use cases, for example drone fire detection, sports and video game coaching, physical therapy, workout coaching, and Just Dance-style games.

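# partial example, full example: examples/02_golf_coach_example/golf_coach_example.py
# The import paths below are assumed from the package layout; the full example has the exact ones.
from vision_agents.core import Agent
from vision_agents.plugins import gemini, getstream, ultralytics

agent = Agent(
    edge=getstream.Edge(),  # video transport over Stream's edge network
    agent_user=agent_user,  # the agent's user object, defined in the full example
    instructions="Read @golf_coach.md",
    llm=gemini.Realtime(fps=10),  # realtime model receiving video at 10 fps
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt", device="cuda")],
)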
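
The `fps=10` argument above suggests that only a limited number of frames per second is passed to the realtime model. The general idea of frame throttling can be sketched as follows. `FrameThrottle` is a hypothetical helper written for illustration, not part of Vision Agents, and the library's own frame sampling may work differently.

```python
class FrameThrottle:
    """Forward at most `fps` frames per second and drop the rest."""

    def __init__(self, fps):
        if fps <= 0:
            raise ValueError("fps must be positive")
        self.min_interval = 1.0 / fps  # minimum gap between forwarded frames, in seconds
        self._last_sent = None

    def should_forward(self, timestamp):
        """Return True if a frame captured at `timestamp` (seconds) should be sent."""
        if self._last_sent is None or timestamp - self._last_sent >= self.min_interval:
            self._last_sent = timestamp
            return True
        return False


throttle = FrameThrottle(fps=10)
decisions = [throttle.should_forward(t) for t in (0.0, 0.05, 0.12, 0.15, 0.25)]
```

A lower frame rate reduces cost and bandwidth, while the pose processor can still run on every frame if the coaching logic needs it.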

Features

Feature	Description
Real-time WebRTC	Stream video directly to model providers for instant visual understanding.
Video Processing	Pluggable processor pipeline for YOLO, Roboflow, or custom PyTorch/ONNX models before/after LLM calls.
Turn Detection	Natural conversation flow with VAD, diarization, and smart turn-taking.
Tool Calling & MCP	Execute code and APIs mid-conversation — Linear issues, weather, telephony, or any MCP server.
Phone Integration	Inbound and outbound voice calls via Twilio or Telnyx with bidirectional audio streaming.
RAG	Retrieval-augmented generation with TurboPuffer/Qdrant vector search or Gemini FileSearch.
Memory	Agents recall context across turns and sessions via Stream Chat.
Text Back-channel	Message the agent silently during a call — coaching overlays, silent instructions, etc.
Production Ready	Built-in HTTP server, Prometheus metrics, horizontal scaling, and Kubernetes deployment.
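
To make the tool-calling row more concrete, here is a minimal sketch of how a model's tool request (a function name plus JSON arguments) can be dispatched to ordinary Python functions. `ToolRegistry` and `get_weather` are hypothetical names used for illustration only. This is not the Vision Agents function-calling API, which is covered in the MCP & Function Calling guide.

```python
import json


class ToolRegistry:
    """Map tool names to Python callables and dispatch JSON-encoded calls."""

    def __init__(self):
        self._tools = {}

    def register(self, func):
        """Decorator that registers `func` under its own name."""
        self._tools[func.__name__] = func
        return func

    def call(self, name, arguments_json):
        """Decode the JSON arguments and invoke the named tool."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        kwargs = json.loads(arguments_json)
        return self._tools[name](**kwargs)


registry = ToolRegistry()


@registry.register
def get_weather(city):
    # Placeholder result; a real tool would query a weather service here.
    return {"city": city, "forecast": "unknown"}


result = registry.call("get_weather", '{"city": "Boulder"}')
```

Keeping the registry separate from the model client means the same tools can be exposed to different LLM providers.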

Out-of-the-Box Integrations

LLMs: OpenAI · Gemini · xAI · OpenRouter · Hugging Face · Kimi AI · MiniMax

Realtime: OpenAI Realtime · Gemini Live · AWS Nova Sonic · Qwen · Inworld

STT: Deepgram · AssemblyAI · Fast-Whisper · Fish Audio · Wizper · Mistral Voxtral

TTS: ElevenLabs · Cartesia · Deepgram · AWS Polly · Pocket · Kokoro · Inworld · Fish Audio

Vision: Ultralytics · Roboflow · Moondream · TwelveLabs · NVIDIA Cosmos · Decart

Avatars: LemonSlice

Turn Detection: Vogent · Smart Turn

Other: Twilio · Telnyx · TurboPuffer

Documentation

Check out the full docs at VisionAgents.ai.

Quickstart: Voice AI · Video AI

Guides: MCP & Function Calling · Video Processors · Phone Calling · RAG · Testing

Production: HTTP Server · Deployment · Kubernetes · Horizontal Scaling · Prometheus Metrics

Examples

🔮 Demo Applications

Voice Agents (Low Latency + RAG + File Search)

Build fast voice agents that can reason over knowledge, search files, and respond in real time.

• Low-latency voice interactions

• Retrieval-augmented responses

• File and knowledge search

Source code and tutorial · Voice Agent Demo

Realtime Coaching and Video Understanding

Power interactive coaching flows with live pose tracking and processor pipelines for frame-by-frame understanding.

• Real-time pose tracking

• Actionable coaching feedback

• Video processor pipeline support

Source code and tutorial · Realtime Coaching Demo

Video Restyling and Avatars

Use models like Decart Lucy to build virtual try-ons, stylized scenes, or give your agents a visual identity.

• Real-time video restyling

• Virtual try-on experiences

• Avatar-like visual presence

Source code and tutorial · Video Restyling Demo

Custom Video Models (Roboflow, YOLO, and More)

Train and run custom computer vision models for security monitoring, moderation, and other domain-specific workflows.

• Bring your own CV models

• Real-time moderation pipelines

• Security and detection use cases

Source code and tutorial · Custom Video Models Demo

Tools, MCP, and Phone Calling

Connect external APIs and services so agents can validate data and take real-world actions during live conversations.

• MCP and function calling support

• Twilio and Telnyx phone workflows

• Real-time fraud response automation

Phone + RAG example · Telnyx phone examples · Fraud workflow example · Tools and Phone Demo

Community Highlights

More involved demos built by the community and the Stream team - full applications that go beyond the in-repo examples and show what's possible with Vision Agents in production.

Got a demo you'd like featured? Open a PR or reach out on Discord.

Sales Assistant Demo - a real-time AI meeting coach that lives on your desktop as a translucent macOS overlay. Built on Vision Agents and Flutter.
Crashout Buddy - an emotionally aware voice agent demo built on Vision Agents and Stream Video.
Cricket DRS AI - an AI-powered Decision Review System for 🏏 Women's Cricket using Gemini Live vision, YOLO pose detection, and real-time voice verdicts, by @jaya6400.

Development

See DEVELOPMENT.md

Want to add your platform or provider? See Create Your Own Plugin or reach out to nash@getstream.io.

Current Limitations

Video AI struggles with small text — models may hallucinate scores, signs, etc.
Context degrades on longer sessions (~30s+) for continuous video understanding
Most use cases need a mix of specialized models (YOLO, Roboflow) with larger LLMs
Real-time models require audio/text to trigger responses — video alone won't prompt output
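
Because video alone does not prompt output, one common workaround is to turn detector events into short text prompts for the realtime model. The sketch below shows only the debouncing logic. `DetectionTrigger` is a hypothetical helper, and how the resulting text is delivered to the model depends on the agent API, which is not shown here.

```python
class DetectionTrigger:
    """Turn detector events into text prompts, at most once per cooldown period."""

    def __init__(self, cooldown_seconds=5.0):
        self.cooldown_seconds = cooldown_seconds
        self._last_prompt_at = None

    def prompt_for(self, label, timestamp):
        """Return a text prompt for `label`, or None while still cooling down."""
        cooling_down = (
            self._last_prompt_at is not None
            and timestamp - self._last_prompt_at < self.cooldown_seconds
        )
        if cooling_down:
            return None
        self._last_prompt_at = timestamp
        return f"The detector just saw: {label}. Describe what is happening."


trigger = DetectionTrigger(cooldown_seconds=5.0)
first = trigger.prompt_for("golf swing", timestamp=0.0)
second = trigger.prompt_for("golf swing", timestamp=2.0)
third = trigger.prompt_for("golf swing", timestamp=6.0)
```

The cooldown keeps a detector that fires on every frame from flooding the model with prompts.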


Core symbols (most depended-on inside this repo)

get · called by 404 · plugins/twilio/vision_agents/plugins/twilio/call_registry.py

send · called by 210 · agents-core/vision_agents/core/utils/stream.py

get · called by 134 · agents-core/vision_agents/core/utils/stream.py

update · called by 124 · agents-core/vision_agents/core/utils/tokenizer.py

join · called by 121 · plugins/local/vision_agents/plugins/local/edge.py

peek · called by 84 · agents-core/vision_agents/core/utils/stream.py

send_nowait · called by 82 · agents-core/vision_agents/core/utils/stream.py

set · called by 81 · agents-core/vision_agents/core/agents/session_registry/store.py

Shape

Method 3,492

Function 714

Class 687

Route 60

Languages

Python 100%

Modules by API surface

plugins/getstream/vision_agents/plugins/getstream/sfu_events.py · 295 symbols

tests/test_agents/test_agents.py · 82 symbols

tests/test_agents/test_inference/test_transcribing_flow.py · 68 symbols

tests/test_function_calling.py · 62 symbols

agents-core/vision_agents/core/agents/agents.py · 58 symbols

plugins/gemini/tests/test_gemini_realtime.py · 57 symbols

examples/05_security_camera_example/security_camera_processor.py · 50 symbols

tests/test_events.py · 48 symbols

tests/test_utils/test_stream.py · 47 symbols

tests/test_agents/test_runner.py · 46 symbols

tests/test_agents/test_agent_launcher.py · 45 symbols

tests/test_tts_base.py · 39 symbols

Dependencies (from manifests, versioned)

aiohttp 3.13.3 · 1×

aiortc 1.9 · 1×

anthropic 0.66.0 · 1×

av 12.0.0 · 1×

cryptography 44.0.0 · 1×

elevenlabs 2.38.1 · 1×

face-recognition 1.3.0 · 1×

fal-client 0.13.1 · 1×

fastapi 0.135.1 · 1×

getstream · 1×

google-genai 1.33.0 · 1×

httpx 0.28.1 · 1×

For agents

$ claude mcp add Vision-Agents \
  -- python -m otcore.mcp_server <graph>
