github.com/DepthAnything/Video-Depth-Anything @v1.3.1 sqlite

209 symbols 651 edges 35 files 25 documented · 12%

README

Video Depth Anything

Sili Chen · Hengkai Guo^† · Shengnan Zhu · Feihu Zhang

Zilong Huang · Jiashi Feng · Bingyi Kang^†

ByteDance

†Corresponding author

This work presents Video Depth Anything, built on Depth Anything V2, which can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Compared with diffusion-based models, it offers faster inference, fewer parameters, and more accurate and consistent depth.

News

2025-09-12: Support streaming mode for metric depth models.
2025-08-28: Release ViT-base model for relative depth and ViT-small/base models for video metric depth.
2025-07-03: 🚀🚀🚀 Release an experimental version of training-free streaming video depth estimation.
2025-07-03: Release our implementation of training loss.
2025-04-25: 🌟🌟🌟 Release metric depth model based on Video-Depth-Anything-Large.
2025-04-05: Our paper has been accepted for a highlight presentation at CVPR 2025 (13.5% of the accepted papers).
2025-03-11: Add full dataset inference and evaluation scripts.
2025-02-08: Enable autocast inference. Support grayscale video, NPZ and EXR output formats.
2025-01-21: Paper, project page, code, models, and demo are all released.

Release Notes

2025-08-28: 🚀🚀🚀 Metric depth models released

Metric	Dataset	MoGe-2-L	UniDepthV2-L	DepthPro	VDA-S-Metric	VDA-B-Metric	VDA-L-Metric
δ1	KITTI	0.415	0.982	0.822	0.877	0.887	0.910
δ1	NYUv2	0.967	0.989	0.953	0.850	0.883	0.908
TAE	ScanNet	2.56	1.41	2.73	1.48	1.26	1.09
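
For readers unfamiliar with δ1: it is conventionally defined as the fraction of valid pixels whose predicted depth is within a factor of 1.25 of the ground truth. The sketch below illustrates that standard definition in plain Python. It is not the repository's evaluation code; the function name `delta1`, the flat-list inputs, and the sample values are assumptions made for this example.

```python
def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold."""
    hits = 0
    valid = 0
    for p, g in zip(pred, gt):
        # Skip pixels without a usable prediction or ground-truth depth.
        if p <= 0 or g <= 0:
            continue
        valid += 1
        if max(p / g, g / p) < threshold:
            hits += 1
    return hits / valid if valid else float("nan")


# Made-up depths in metres, for illustration only.
pred = [1.0, 2.0, 3.0, 10.0]
gt = [1.1, 2.0, 4.0, 10.5]
print(delta1(pred, gt))
```

Three of the four sample pixels fall within the 1.25 ratio, so the score for this toy input is 0.75.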

2025-02-08: 🚀🚀🚀 Inference speed and memory usage improvement

Model	Latency FP32 (ms)	Latency FP16 (ms)	GPU VRAM FP32 (GB)	GPU VRAM FP16 (GB)
Video-Depth-Anything-Small	9.1	7.5	7.3	6.8
Video-Depth-Anything-Large	67	14	26.7	23.6

The latency and GPU VRAM results are obtained on a single A100 GPU with an input of shape 1 × 32 × 518 × 518.
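
To make the FP16 gains in these figures explicit, the short sketch below derives the speedup and VRAM reduction from the reported numbers. The dictionary layout and the helper name `fp16_gains` are illustrative assumptions; only the numeric values come from the table.

```python
# Figures copied from the table (single A100, input 1 x 32 x 518 x 518).
# Each tuple is (FP32, FP16).
results = {
    "Video-Depth-Anything-Small": {"latency_ms": (9.1, 7.5), "vram_gb": (7.3, 6.8)},
    "Video-Depth-Anything-Large": {"latency_ms": (67.0, 14.0), "vram_gb": (26.7, 23.6)},
}


def fp16_gains(entry):
    """Return (latency speedup factor, VRAM saved in percent) for FP16 vs FP32."""
    fp32_ms, fp16_ms = entry["latency_ms"]
    fp32_gb, fp16_gb = entry["vram_gb"]
    speedup = fp32_ms / fp16_ms
    vram_saved_pct = 100.0 * (fp32_gb - fp16_gb) / fp32_gb
    return round(speedup, 2), round(vram_saved_pct, 1)


for name, entry in results.items():
    speedup, saved = fp16_gains(entry)
    print(f"{name}: {speedup}x faster, {saved}% less VRAM with FP16")
```

On these numbers the Large model benefits far more from FP16 (roughly 4.8× lower latency) than the Small model (roughly 1.2×), while the VRAM reduction is modest for both.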

Pre-trained Models

We provide several models of varying scales for robust and consistent video depth estimation.

Relative Depth Model	Params	Checkpoint
Video-Depth-Anything-Small	28.4M	Download
Video-Depth-Anything-Base	113.1M	Download
Video-Depth-Anything-Large	381.8M	Download

Metric Depth Model	Params	Checkpoint
Metric-Video-Depth-Anything-Small	28.4M	Download
Metric-Video-Depth-Anything-Base	113.1M	Download
Metric-Video-Depth-Anything-Large	381.8M	Download

Usage

Preparation

git clone https://github.com/DepthAnything/Video-Depth-Anything
cd Video-Depth-Anything
pip install -r requirements.txt

Download the checkpoints listed under Pre-trained Models above and put them under the checkpoints directory, for example with the provided script:

bash get_weights.sh

Run inference on a video

We support both relative depth and metric depth:

# For relative depth
python3 run.py --input_video ./assets/example_videos/davis_rollercoaster.mp4 --output_dir ./outputs --encoder vitl

# For metric depth
python3 run.py --input_video ./assets/example_videos/davis_rollercoaster.mp4 --output_dir ./outputs --encoder vitl --metric

Options:
- --input_video: path of input video
- --output_dir: path to save the output results
- --input_size (optional): By default, we use input size 518 for model inference.
- --max_res (optional): By default, we use maximum resolution 1280 for model inference.
- --encoder (optional): vits for Video-Depth-Anything-Small, vitb for Video-Depth-Anything-Base, vitl for Video-Depth-Anything-Large.
- --max_len (optional): maximum length of the input video, -1 means no limit
- --target_fps (optional): target fps of the input video, -1 means the original fps
- --metric (optional): use metric depth models trained on Virtual KITTI and IRS datasets
- --fp32 (optional): Use fp32 precision for inference. By default, we use fp16.
- --grayscale (optional): Save the grayscale depth map, without applying color palette.
- --save_npz (optional): Save the depth map in npz format.
- --save_exr (optional): Save the depth map in exr format.
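
When many videos need processing, the documented options can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helpers `build_command` and `run_folder` are hypothetical, and the sketch assumes that `--metric`, `--fp32`, and `--save_npz` are valueless switches, as the `--metric` example above suggests.

```python
import subprocess
from pathlib import Path

ENCODERS = ("vits", "vitb", "vitl")  # Small, Base, Large


def build_command(video, output_dir, encoder="vitl", metric=False,
                  fp32=False, save_npz=False, script="run.py"):
    """Assemble a command line for run.py from the documented options."""
    if encoder not in ENCODERS:
        raise ValueError(f"unknown encoder: {encoder}")
    cmd = ["python3", script,
           "--input_video", str(video),
           "--output_dir", str(output_dir),
           "--encoder", encoder]
    # Boolean switches are appended only when enabled (assumed to take no value).
    if metric:
        cmd.append("--metric")
    if fp32:
        cmd.append("--fp32")
    if save_npz:
        cmd.append("--save_npz")
    return cmd


def run_folder(video_dir, output_dir, **options):
    """Run inference on every .mp4 in video_dir, one process per video."""
    for video in sorted(Path(video_dir).glob("*.mp4")):
        subprocess.run(build_command(video, output_dir, **options), check=True)


# Print the command instead of running it, so the sketch has no side effects.
print(" ".join(build_command(
    "./assets/example_videos/davis_rollercoaster.mp4", "./outputs", metric=True)))
```

Passing `script="run_streaming.py"` would build the equivalent streaming-mode command, since both scripts are invoked with the same flags in this README.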

Run inference on a video using streaming mode (experimental feature)

We implement an experimental streaming mode that requires no additional training. In detail, we save the hidden states of the temporal attention layers for each frame in caches, and during inference send only a single frame into the video depth model, reusing these past hidden states in the temporal attention layers. We adapt the pipeline to match the original inference setting of the offline mode. Due to the inevitable gap between training and testing, we observe a performance drop between the streaming model and the offline model (e.g. the δ1 on ScanNet drops from 0.926 to 0.836). Fine-tuning the model in streaming mode should greatly improve performance; we leave this for future work.
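
The caching idea described above can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not the repository's implementation: the class name `TemporalCache` and the window size are hypothetical, each hidden state is a plain list of floats for a single token position, and the learned query/key/value projections of a real attention layer are omitted.

```python
import math
from collections import deque


class TemporalCache:
    """Keeps hidden states of the most recent frames for one token position."""

    def __init__(self, window=32):
        # deque drops the oldest state automatically once the window is full.
        self.states = deque(maxlen=window)

    def step(self, hidden):
        """Cache the new frame's hidden state and attend over the cached window."""
        self.states.append(list(hidden))
        scale = 1.0 / math.sqrt(len(hidden))
        # Scaled dot-product scores: new frame (query) vs. every cached frame (keys).
        scores = [scale * sum(q * k for q, k in zip(hidden, past))
                  for past in self.states]
        # Numerically stable softmax over the temporal axis.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the cached states (values) is the new frame's output.
        return [sum(w * past[i] for w, past in zip(weights, self.states)) / total
                for i in range(len(hidden))]


cache = TemporalCache(window=3)
for frame_state in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]):
    out = cache.step(frame_state)  # only the newest frame is processed per step
print(len(cache.states), out)
```

Each call handles one frame and reuses what earlier frames left in the cache, which is what lets a streaming pipeline avoid re-running the whole clip for every new frame.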

To run the streaming model:

# For relative depth
python3 run_streaming.py --input_video ./assets/example_videos/davis_rollercoaster.mp4 --output_dir ./outputs_streaming --encoder vitl

# For metric depth
python3 run_streaming.py --input_video ./assets/example_videos/davis_rollercoaster.mp4 --output_dir ./outputs_streaming --encoder vitl --metric

Training Loss

Our training loss is in the loss/ directory. See loss/test_loss.py for usage.

Benchmark

Please refer to Benchmark.

Citation

If you find this project useful, please consider citing:

@article{video_depth_anything,
  title={Video Depth Anything: Consistent Depth Estimation for Super-Long Videos},
  author={Chen, Sili and Guo, Hengkai and Zhu, Shengnan and Zhang, Feihu and Huang, Zilong and Feng, Jiashi and Kang, Bingyi},
  journal={arXiv:2501.12375},
  year={2025}
}

LICENSE

The Video-Depth-Anything-Small model is under the Apache-2.0 license. The Video-Depth-Anything-Base and Video-Depth-Anything-Large models are under the CC-BY-NC-4.0 license. For business cooperation, please send an email to Hengkai Guo at guohengkaighk@gmail.com.

Core symbols most depended-on inside this repo

reshape_heads_to_batch_dim

called by 10

video_depth_anything/motion_module/attention.py

gen_json

called by 8

benchmark/dataset_extract/eval_utils.py

get_sorted_files

called by 6

benchmark/dataset_extract/eval_utils.py

constrain_to_multiple_of

called by 6

video_depth_anything/util/transform.py

benchmark/dataset_extract/eval_utils.py

_make_fusion_block

called by 4

video_depth_anything/dpt.py

prepare_tokens_with_masks

called by 4

video_depth_anything/dinov2.py

Shape

Method 104

Function 68

Class 37

Languages

Python 100%

Modules by API surface

video_depth_anything/motion_module/attention.py · 29 symbols

loss/loss.py · 22 symbols

video_depth_anything/dinov2.py · 21 symbols

video_depth_anything/motion_module/motion_module.py · 16 symbols

video_depth_anything/dinov2_layers/block.py · 15 symbols

video_depth_anything/util/transform.py · 11 symbols

benchmark/eval/metric.py · 11 symbols

video_depth_anything/util/blocks.py · 7 symbols

video_depth_anything/dpt.py · 7 symbols

video_depth_anything/video_depth_stream.py · 6 symbols

benchmark/eval/eval_tae.py · 6 symbols

benchmark/eval/eval.py · 6 symbols

Dependencies from manifests, versioned

OpenEXR 3.3.1 · 1×

einops 0.4.1 · 1×

imageio 2.37.0 · 1×

imageio-ffmpeg 0.4.7 · 1×

numpy 1.24.0 · 1×

torch 2.1.1 · 1×

torchvision 0.16.1 · 1×

xformers 0.0.23 · 1×
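
To compare a local environment against the versions listed above, a small check along the following lines may help. It is a sketch only: the helper names `installed_version` and `report` are hypothetical, and whether the manifest pins these versions exactly or treats them as minimums is not stated here, so a mismatch is reported rather than treated as an error.

```python
from importlib import metadata

# Versions as listed above; exact-pin vs. minimum semantics are not stated.
LISTED = {
    "OpenEXR": "3.3.1",
    "einops": "0.4.1",
    "imageio": "2.37.0",
    "imageio-ffmpeg": "0.4.7",
    "numpy": "1.24.0",
    "torch": "2.1.1",
    "torchvision": "0.16.1",
    "xformers": "0.0.23",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def report(listed=LISTED, lookup=installed_version):
    """Map each package name to (listed version, installed version, matches)."""
    out = {}
    for name, wanted in listed.items():
        found = lookup(name)
        # Local build tags such as "2.1.1+cu121" are compared on the public part only.
        matches = found is not None and found.split("+")[0] == wanted
        out[name] = (wanted, found, matches)
    return out


for name, (wanted, found, ok) in report().items():
    status = "ok" if ok else "differs"
    print(f"{name:15} listed {wanted:8} installed {found or 'missing':14} {status}")
```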

For agents

$ claude mcp add Video-Depth-Anything \
  -- python -m otcore.mcp_server <graph>