VOID: Video Object and Interaction Deletion

🎉 Accepted at ECCV 2026

[Saman Motamed](https://sam-motamed.github.io/)¹,², [William Harvey](https://scholar.google.com/citations?user=kDd7nBkAAAAJ&hl=en)¹, [Benjamin Klein](https://scholar.google.com/citations?user=xkX9W9QAAAAJ&hl=en)¹, [Luc Van Gool](https://scholar.google.com/citations?user=TwMib_QAAAAJ&hl=en)², [Zhuoning Yuan](https://zhuoning.cc/)¹, [Ta-Ying Cheng](https://ttchengab.github.io/)¹

¹Netflix ²INSAIT, Sofia University "St. Kliment Ohridski"

VOID removes objects from videos along with all interactions they induce on the scene — not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed. It is built on top of CogVideoX and fine-tuned for video inpainting with interaction-aware mask conditioning.

Example: If a person holding a guitar is removed, VOID also removes the person's effect on the guitar — causing it to fall naturally.

TODO 📋

[ ] 🤗 Diffusers pipeline support

🤖 Models

VOID uses two transformer checkpoints, trained sequentially. You can run inference with Pass 1 alone or chain both passes for higher temporal consistency.

Model	Description	HuggingFace
VOID Pass 1	Base inpainting model	Download
VOID Pass 2	Warped-noise refinement model	Download

Place checkpoints anywhere and pass the path via --config.video_model.transformer_path (Pass 1) or --model_checkpoint (Pass 2).

▶️ Quick Start

The fastest way to try VOID is the included notebook, which handles setup, downloads the models, runs inference on a sample video, and displays the result.

Note: Requires a GPU with 40GB+ VRAM (e.g., A100).

For more control over the pipeline (custom videos, Pass 2 refinement, mask generation), see the full setup and instructions below.

⚙️ Setup

pip install -r requirements.txt

Stage 1 (mask generation) uses Gemini via the Google AI API for its VLM analysis step. Set your API key:

export GEMINI_API_KEY=your_key_here

Also install SAM 2 and SAM 3 separately (both are required for mask generation):

# Subshells keep the working directory unchanged, so sam3 is not cloned inside sam2/.
git clone https://github.com/facebookresearch/sam2.git
(cd sam2 && pip install -e .)

git clone https://github.com/facebookresearch/sam3.git
(cd sam3 && pip install -e .)

Download the pretrained base inpainting model from HuggingFace:

hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
    --local-dir ./CogVideoX-Fun-V1.5-5b-InP

The inference and training scripts expect it at ./CogVideoX-Fun-V1.5-5b-InP relative to the repo root by default.

If ffmpeg is not available on your system, you can use the binary bundled with imageio-ffmpeg:

mkdir -p ~/.local/bin
ln -sf "$(python -c "import imageio_ffmpeg; print(imageio_ffmpeg.get_ffmpeg_exe())")" ~/.local/bin/ffmpeg
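Before moving on, it can help to confirm that the setup steps above took effect. The following is a minimal sketch, not part of the repository: `setup_problems` is a hypothetical helper that checks only the three things this section names (the `GEMINI_API_KEY` variable, an `ffmpeg` binary on `PATH`, and the base model directory at its default location).

```python
import os
import shutil
from pathlib import Path

# Default location of the base model, as stated in the setup section.
BASE_MODEL_DIR = "CogVideoX-Fun-V1.5-5b-InP"


def setup_problems(repo_root=".", env=None):
    """Return a list of setup problems (empty if everything checked is present).

    Hypothetical helper: it only inspects the environment, it installs nothing.
    """
    env = os.environ if env is None else env
    problems = []
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    if shutil.which("ffmpeg", path=env.get("PATH")) is None:
        problems.append("ffmpeg not found on PATH")
    if not (Path(repo_root) / BASE_MODEL_DIR).is_dir():
        problems.append(f"{BASE_MODEL_DIR}/ not found under {repo_root}")
    return problems


if __name__ == "__main__":
    for problem in setup_problems():
        print("setup issue:", problem)
```

An empty result only means these three checks passed; it says nothing about GPU memory or the SAM installations.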

📁 Expected directory structure

After cloning the repo and downloading all assets, your directory should look like this:

VOID/
├── config/
├── datasets/
│   └── void_train_data.json
├── inference/
├── sample/                         # included sample sequences for inference
├── scripts/
├── videox_fun/
├── VLM-MASK-REASONER/
├── README.md
├── requirements.txt
│
├── CogVideoX-Fun-V1.5-5b-InP/     # hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP
├── void_pass1.safetensors          # download from huggingface.co/void-model (see Models above)
├── void_pass2.safetensors          # download from huggingface.co/void-model (see Models above)
├── training_data/                  # generated via data_generation/ pipeline (see Training section)
└── data_generation/                # data generation code (HUMOTO + Kubric pipelines)

📂 Input Format

Each video sequence lives in its own folder under a root data directory:

data_rootdir/
└── my-video/
    ├── input_video.mp4      # source video
    ├── quadmask_0.mp4       # quadmask (4-value mask video, see below)
    └── prompt.json          # {"bg": "background description"}

The prompt.json contains a single "bg" key describing the scene after the object has been removed — i.e. what you want the background to look like. Do not describe the object being removed; describe what remains.

{ "bg": "A table with a cup on it." }         // ✅ describes the clean background
{ "bg": "A person being removed from scene." } // ❌ don't describe the removal

A few examples from the included samples:

Sequence	Removed object	`bg` prompt
`lime`	the glass	`"A lime falls on the table."`
`moving_ball`	the rubber duckie	`"A ball rolls off the table."`
`pillow`	the kettlebell being placed on the pillow	`"Two pillows are on the table."`
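A sequence folder can be checked against this layout before starting a long inference run. The sketch below assumes only the file names and the `"bg"` key described above; `check_sequence` is a hypothetical helper, not a function from the repository.

```python
import json
from pathlib import Path

# File names taken from the input layout described above.
REQUIRED_FILES = ("input_video.mp4", "quadmask_0.mp4", "prompt.json")


def check_sequence(seq_dir):
    """Return a list of problems found in one sequence folder (empty if none)."""
    seq_dir = Path(seq_dir)
    problems = [
        f"missing {name}" for name in REQUIRED_FILES
        if not (seq_dir / name).is_file()
    ]
    prompt_path = seq_dir / "prompt.json"
    if prompt_path.is_file():
        try:
            prompt = json.loads(prompt_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"prompt.json is not valid JSON: {exc}")
        else:
            bg = prompt.get("bg") if isinstance(prompt, dict) else None
            if not isinstance(bg, str) or not bg.strip():
                problems.append('prompt.json needs a non-empty "bg" string')
    return problems
```

The check is structural only: it cannot tell whether the `bg` text describes the background or, incorrectly, the removal itself.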

The quadmask encodes four semantic regions per pixel:

Value	Meaning
`0`	Primary object to remove
`63`	Overlap of primary + affected regions
`127`	Affected region (interactions: falling objects, displaced items, etc.)
`255`	Background (keep)
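To make the encoding concrete, here is a small sketch that splits one grayscale quadmask frame into its four regions with NumPy. The level values come from the table above. Snapping each pixel to the nearest level is an assumption added here to tolerate slight value drift from video compression, and may differ from how the repository reads masks. `split_quadmask` and the region names are hypothetical.

```python
import numpy as np

# Quadmask levels from the table above, in the same order as REGION_NAMES.
QUADMASK_LEVELS = np.array([0, 63, 127, 255], dtype=np.int16)
REGION_NAMES = ("primary", "overlap", "affected", "background")


def split_quadmask(frame):
    """Split one grayscale quadmask frame of shape (H, W) into boolean masks.

    Each pixel is assigned to the nearest of the four levels (an assumption
    made for this sketch, not necessarily the repository's behavior).
    """
    frame = np.asarray(frame, dtype=np.int16)
    nearest = np.abs(frame[..., None] - QUADMASK_LEVELS).argmin(axis=-1)
    return {name: nearest == i for i, name in enumerate(REGION_NAMES)}
```

Under this encoding, everything the model is expected to change is the union of the `primary`, `overlap`, and `affected` masks; `background` is the region to keep.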

🚀 Pipeline

🎭 Stage 1 — Generate Masks

The VLM-MASK-REASONER/ pipeline generates quadmasks from raw videos using SAM2 segmentation and a VLM (Gemini) for reasoning about interaction-affected regions.

🖱️ Step 0 — Select points (GUI)

python VLM-MASK-REASONER/point_selector_gui.py

Load a JSON config listing your videos and instructions, then click on the objects to remove. The GUI saves a *_points.json file with the selected points.

Config format:

{
  "videos": [
    {
      "video_path": "path/to/video.mp4",
      "output_dir": "path/to/output/folder",
      "instruction": "remove the person"
    }
  ]
}
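For more than a handful of videos, the config can be generated rather than written by hand. This sketch uses only the three keys shown above; `build_config` and `write_config` are hypothetical helper names.

```python
import json
from pathlib import Path


def build_config(entries):
    """Build the config dict from (video_path, output_dir, instruction) tuples."""
    return {
        "videos": [
            {
                "video_path": str(video_path),
                "output_dir": str(output_dir),
                "instruction": instruction,
            }
            for video_path, output_dir, instruction in entries
        ]
    }


def write_config(entries, config_path):
    """Write the config as JSON and return the path it was written to."""
    config_path = Path(config_path)
    config_path.write_text(
        json.dumps(build_config(entries), indent=2), encoding="utf-8"
    )
    return config_path
```

The resulting file is the input to the point selector GUI; the `*_points.json` file that the later steps consume is still produced by the GUI itself.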

⚡ Steps 1–4 — Run the full pipeline

After saving the points config, run all remaining stages automatically:

bash VLM-MASK-REASONER/run_pipeline.sh my_config_points.json

Optional flags:

bash VLM-MASK-REASONER/run_pipeline.sh my_config_points.json \
    --sam2-checkpoint path/to/sam2_hiera_large.pt \
    --device cuda

This runs the following stages in order:

Stage	Script	Output
1 — SAM2 segmentation	`stage1_sam2_segmentation.py`	`black_mask.mp4`
2 — VLM analysis	`stage2_vlm_analysis.py`	`vlm_analysis.json`
3 — Grey mask generation	`stage3a_generate_grey_masks_v2.py`	`grey_mask.mp4`
4 — Combine into quadmask	`stage4_combine_masks.py`	`quadmask_0.mp4`

The final quadmask_0.mp4 in each video's output_dir is ready to use for inference.
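If the mask pipeline's `output_dir` is not already laid out as a sequence folder, the pieces can be gathered into the input format described earlier. The sketch below is one way to do that under stated assumptions: the source video can be copied as-is to `input_video.mp4`, and the `bg` prompt is supplied by hand. `stage_sequence` is a hypothetical helper.

```python
import json
import shutil
from pathlib import Path


def stage_sequence(video_path, mask_output_dir, data_rootdir, name, bg_prompt):
    """Create <data_rootdir>/<name>/ with the three files inference expects.

    Assumes the quadmask was written to <mask_output_dir>/quadmask_0.mp4,
    as stated for the final pipeline stage.
    """
    seq_dir = Path(data_rootdir) / name
    seq_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(video_path, seq_dir / "input_video.mp4")
    shutil.copyfile(
        Path(mask_output_dir) / "quadmask_0.mp4", seq_dir / "quadmask_0.mp4"
    )
    # prompt.json holds a single "bg" key describing the scene after removal.
    (seq_dir / "prompt.json").write_text(
        json.dumps({"bg": bg_prompt}), encoding="utf-8"
    )
    return seq_dir
```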

🎬 Stage 2 — Inference

VOID inference runs in two passes. Pass 1 is sufficient for most videos; Pass 2 adds a warped-noise refinement step for better temporal consistency on longer clips.

✨ Pass 1 — Base inference

python inference/cogvideox_fun/predict_v2v.py \
    --config config/quadmask_cogvideox.py \
    --config.data.data_rootdir="path/to/data_rootdir" \
    --config.experiment.run_seqs="my-video" \
    --config.experiment.save_path="path/to/output" \
    --config.video_model.model_name="path/to/CogVideoX-Fun-V1.5-5b-InP" \
    --config.video_model.transformer_path="path/to/void_pass1.safetensors"

To run multiple sequences at once, pass a comma-separated list:

--config.experiment.run_seqs="video1,video2,video3"

Key config options:

Flag	Default	Description
`--config.data.sample_size`	`384x672`	Output resolution (HxW)
`--config.data.max_video_length`	`197`	Max frames to process
`--config.video_model.temporal_window_size`	`85`	Temporal window for multidiffusion
`--config.video_model.num_inference_steps`	`50`	Denoising steps
`--config.video_model.guidance_scale`	`1.0`	Classifier-free guidance scale
`--config.system.gpu_memory_mode`	`model_cpu_offload_and_qfloat8`	Memory mode; alternatives: `model_full_load`, `model_cpu_offload`, `sequential_cpu_offload`

The output is saved as <save_path>/<sequence_name>.mp4, along with a *_tuple.mp4 side-by-side comparison.
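When Pass 1 is driven from a script, the command can be assembled programmatically and the expected outputs checked afterwards. This is a sketch that reuses only the script path, flags, and output naming given above; `pass1_command` and `missing_outputs` are hypothetical helpers, and the paths in the usage lines are placeholders.

```python
import shlex
from pathlib import Path


def pass1_command(data_rootdir, sequences, save_path, model_name, transformer_path):
    """Assemble the Pass 1 invocation shown above as an argument list."""
    return [
        "python", "inference/cogvideox_fun/predict_v2v.py",
        "--config", "config/quadmask_cogvideox.py",
        f"--config.data.data_rootdir={data_rootdir}",
        # Multiple sequences are passed as one comma-separated value.
        f"--config.experiment.run_seqs={','.join(sequences)}",
        f"--config.experiment.save_path={save_path}",
        f"--config.video_model.model_name={model_name}",
        f"--config.video_model.transformer_path={transformer_path}",
    ]


def missing_outputs(save_path, sequences):
    """Sequences whose <save_path>/<sequence_name>.mp4 does not exist yet."""
    return [s for s in sequences if not (Path(save_path) / f"{s}.mp4").is_file()]


if __name__ == "__main__":
    cmd = pass1_command(
        "path/to/data_rootdir", ["video1", "video2"], "path/to/output",
        "path/to/CogVideoX-Fun-V1.5-5b-InP", "path/to/void_pass1.safetensors",
    )
    print(shlex.join(cmd))  # inspect the command before running it
```

The list can be handed to `subprocess.run(cmd, check=True)` from the repo root; calling `missing_outputs` afterwards shows which sequences did not produce a result.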

🔁 Pass 2 — Warped noise refinement

Pass 2 uses optical-flow-warped latents derived from the Pass 1 output to initialize a second inference pass, which improves temporal consistency.

Single video:

python inference/cogvideox_fun/inference_with_pass1_warped_noise.py \
    --video_name my-video \
    --data_rootdir path/to/data_rootdir \
    --pass1_dir path/to/pass1_outputs \
    --output_dir path/to/pass2_outputs \
    --model_checkpoint path/to/void_pass2.safetensors \
    --model_name path/to/CogVideoX-Fun-V1.5-5b-InP

Batch: Edit the video list and paths in inference/pass_2_refine.sh, then run:

bash inference/pass_2_refine.sh

Key arguments:

Argument	Default	Description
`--pass1_dir`	—	Directory containing Pass 1 output videos
`--output_dir`	`./inference_with_warped_noise`	Where to save Pass 2 results
`--warped_noise_cache_dir`	`./pass1_warped_noise_cache`	Cache for precomputed warped latents
`--temporal_window_size`	`85`	Temporal window size
`--height` / `--width`	`384` / `672`	Output resolution
`--guidance_scale`	`6.0`	CFG scale
`--num_inference_steps`	`50`	Denoising steps
`--use_quadmask`	`True`	Use quadmask conditioning
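As an alternative to editing the shell script, a batch can be planned in Python. The sketch below yields one single-video command per sequence, using only the arguments from the single-video example. It assumes that Pass 1 results are stored as `<pass1_dir>/<video_name>.mp4`, which matches the Pass 1 output naming if `pass1_dir` is the Pass 1 `save_path`; `pass2_commands` is a hypothetical helper.

```python
from pathlib import Path

PASS2_SCRIPT = "inference/cogvideox_fun/inference_with_pass1_warped_noise.py"


def pass2_commands(video_names, data_rootdir, pass1_dir, output_dir,
                   model_checkpoint, model_name):
    """Yield (video_name, argv) for every video that has a Pass 1 result.

    Assumption: a Pass 1 result is the file <pass1_dir>/<video_name>.mp4.
    """
    for name in video_names:
        if not (Path(pass1_dir) / f"{name}.mp4").is_file():
            continue  # no Pass 1 output yet, so there is nothing to refine
        yield name, [
            "python", PASS2_SCRIPT,
            "--video_name", name,
            "--data_rootdir", str(data_rootdir),
            "--pass1_dir", str(pass1_dir),
            "--output_dir", str(output_dir),
            "--model_checkpoint", str(model_checkpoint),
            "--model_name", str(model_name),
        ]
```

Each `argv` can be run with `subprocess.run(argv, check=True)`; videos that are skipped here need a Pass 1 run first.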

✏️ Stage 3 — Manual Mask Refinement (Optional)

If the auto-generated quadmask does not accurately capture the object or its interaction region, use the included GUI editor to refine it before running inference.

python VLM-MASK-REASONER/edit_quadmask.py

Open a sequence folder containing input_video.mp4 (or rgb_full.mp4) and quadmask_0.mp4. The editor shows the original video and the editable mask side by side.

Tools:
- Grid Toggle: click a grid cell to toggle the interaction region (127 ↔ 255)
- Grid Black Toggle: click a grid cell to toggle the primary object region (0 ↔ 255)
- Brush (Add / Erase): freehand paint or erase mask regions at pixel level
- Copy from Previous Frame: propagate the black or grey mask from the previous frame

Keyboard shortcuts: ← / → navigate frames, Ctrl+Z / Ctrl+Y undo/redo.

Save overwrites quadmask_0.mp4 in place. Rerun inference from Pass 1 after saving.

🏋️ Training

Training Data Generation

Due to licensing constraints on the underlying datasets, we release the data generation code instead of the pre-built training data. The code produces paired counterfactual videos (with and without the object, plus quadmasks) from two sources:

Source 1: HUMOTO (Human-Object Interaction)

Generates counterfactual videos from the HUMOTO motion capture dataset using Blender. A human (Remy/Sophie character) interacts with objects; removing the human causes objects to fall via physics simulation.

Prerequisites:
1. HUMOTO dataset: Request access from the authors at adobe-research/humoto. Once approved, download and place under data_generation/humoto_release/
2. Blender: Install Blender (tested with 3.x and 4.x). Also install opencv-python-headless in Blender's Python (see data_generation/README.md)
3. Remy & Sophie characters: Download from Mixamo (free Adobe account). Search for "Remy" and "Sophie", downl
