github.com/Netflix/void-model @main sqlite

751 symbols · 2,538 edges · 69 files · 322 documented (43%) · 19 cross-repo links
README

VOID: Video Object and Interaction Deletion

🎉 Accepted at ECCV 2026

[Saman Motamed](https://sam-motamed.github.io/)¹,², [William Harvey](https://scholar.google.com/citations?user=kDd7nBkAAAAJ&hl=en)¹, [Benjamin Klein](https://scholar.google.com/citations?user=xkX9W9QAAAAJ&hl=en)¹, [Luc Van Gool](https://scholar.google.com/citations?user=TwMib_QAAAAJ&hl=en)², [Zhuoning Yuan](https://zhuoning.cc/)¹, [Ta-Ying Cheng](https://ttchengab.github.io/)¹

¹Netflix · ²INSAIT, Sofia University "St. Kliment Ohridski"


VOID removes objects from videos along with all interactions they induce on the scene — not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed. It is built on top of CogVideoX and fine-tuned for video inpainting with interaction-aware mask conditioning.

Example: If a person holding a guitar is removed, VOID also removes the person's effect on the guitar — causing it to fall naturally.


TODO 📋

  • [ ] 🤗 Diffusers pipeline support

🤖 Models

VOID uses two transformer checkpoints, trained sequentially. You can run inference with Pass 1 alone or chain both passes for higher temporal consistency.

| Model | Description | HuggingFace |
| --- | --- | --- |
| VOID Pass 1 | Base inpainting model | Download |
| VOID Pass 2 | Warped-noise refinement model | Download |

Place checkpoints anywhere and pass the path via --config.video_model.transformer_path (Pass 1) or --model_checkpoint (Pass 2).


▶️ Quick Start

The fastest way to try VOID is the included notebook — it handles setup, downloads the models, runs inference on a sample video, and displays the result:

Open in Colab

Note: Requires a GPU with 40GB+ VRAM (e.g., A100).

For more control over the pipeline (custom videos, Pass 2 refinement, mask generation), see the full setup and instructions below.


⚙️ Setup

pip install -r requirements.txt

Stage 1 of the mask pipeline uses Gemini via the Google AI API. Set your API key:

export GEMINI_API_KEY=your_key_here

Also install SAM2 and SAM3 separately (both are required for mask generation):

git clone https://github.com/facebookresearch/sam2.git
cd sam2 && pip install -e .

git clone https://github.com/facebookresearch/sam3.git
cd sam3 && pip install -e .

Download the pretrained base inpainting model from HuggingFace:

hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
    --local-dir ./CogVideoX-Fun-V1.5-5b-InP

The inference and training scripts expect it at ./CogVideoX-Fun-V1.5-5b-InP relative to the repo root by default.

If ffmpeg is not available on your system, you can use the binary bundled with imageio-ffmpeg:

ln -sf $(python -c "import imageio_ffmpeg; print(imageio_ffmpeg.get_ffmpeg_exe())") ~/.local/bin/ffmpeg

📁 Expected directory structure

After cloning the repo and downloading all assets, your directory should look like this:

VOID/
├── config/
├── datasets/
│   └── void_train_data.json
├── inference/
├── sample/                         # included sample sequences for inference
├── scripts/
├── videox_fun/
├── VLM-MASK-REASONER/
├── README.md
├── requirements.txt
│
├── CogVideoX-Fun-V1.5-5b-InP/     # hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP
├── void_pass1.safetensors          # download from huggingface.co/void-model (see Models above)
├── void_pass2.safetensors          # download from huggingface.co/void-model (see Models above)
├── training_data/                  # generated via data_generation/ pipeline (see Training section)
└── data_generation/                # data generation code (HUMOTO + Kubric pipelines)

📂 Input Format

Each video sequence lives in its own folder under a root data directory:

data_rootdir/
└── my-video/
    ├── input_video.mp4      # source video
    ├── quadmask_0.mp4       # quadmask (4-value mask video, see below)
    └── prompt.json          # {"bg": "background description"}

The prompt.json contains a single "bg" key describing the scene after the object has been removed — i.e. what you want the background to look like. Do not describe the object being removed; describe what remains.

{ "bg": "A table with a cup on it." }         // ✅ describes the clean background
{ "bg": "A person being removed from scene." } // ❌ don't describe the removal

A few examples from the included samples:

| Sequence | Removed object | bg prompt |
| --- | --- | --- |
| lime | the glass | "A lime falls on the table." |
| moving_ball | the rubber duckie | "A ball rolls off the table." |
| pillow | the kettlebell being placed on the pillow | "Two pillows are on the table." |
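As a minimal sketch of the layout above, the following Python helpers write a `prompt.json` and report which expected files are still missing from a sequence folder. The function names `write_prompt` and `missing_inputs` are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def write_prompt(sequence_dir, background_description):
    """Write prompt.json containing only the "bg" key (hypothetical helper)."""
    sequence_dir = Path(sequence_dir)
    sequence_dir.mkdir(parents=True, exist_ok=True)
    prompt_path = sequence_dir / "prompt.json"
    # Strict JSON has no comments, so only the {"bg": ...} object is written.
    prompt_path.write_text(json.dumps({"bg": background_description}) + "\n")
    return prompt_path


def missing_inputs(sequence_dir):
    """Return the expected input files that are absent from a sequence folder."""
    expected = ("input_video.mp4", "quadmask_0.mp4", "prompt.json")
    return [name for name in expected if not (Path(sequence_dir) / name).exists()]
```

For example, `write_prompt("data_rootdir/my-video", "A table with a cup on it.")` creates the folder if needed and describes what remains in the scene rather than the removal itself.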

The quadmask encodes four semantic regions per pixel:

| Value | Meaning |
| --- | --- |
| 0 | Primary object to remove |
| 63 | Overlap of primary + affected regions |
| 127 | Affected region (interactions: falling objects, displaced items, etc.) |
| 255 | Background (keep) |
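To make the encoding concrete, here is a small pure-Python sketch that combines a primary-object mask and an affected-region mask into the four values listed above. It is illustrative only: the helper names are hypothetical, frames are plain lists of rows rather than arrays, and the `snap_to_level` step assumes that decoded mask videos may contain values slightly off the four levels because of compression, which the source does not state.

```python
PRIMARY, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255
LEVELS = (PRIMARY, OVERLAP, AFFECTED, BACKGROUND)


def quadmask_value(is_primary, is_affected):
    """Map the two per-pixel flags to one of the four quadmask values."""
    if is_primary and is_affected:
        return OVERLAP
    if is_primary:
        return PRIMARY
    if is_affected:
        return AFFECTED
    return BACKGROUND


def compose_quadmask(primary, affected):
    """Combine two same-sized boolean frames (lists of rows) into one quadmask frame."""
    return [
        [quadmask_value(p, a) for p, a in zip(primary_row, affected_row)]
        for primary_row, affected_row in zip(primary, affected)
    ]


def snap_to_level(value):
    """Snap a noisy grayscale value to the nearest of the four levels (assumed cleanup step)."""
    return min(LEVELS, key=lambda level: abs(level - value))


frame = compose_quadmask(
    primary=[[True, True, False, False]],
    affected=[[False, True, True, False]],
)
print(frame)  # [[0, 63, 127, 255]]
```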

🚀 Pipeline

🎭 Stage 1 — Generate Masks

The VLM-MASK-REASONER/ pipeline generates quadmasks from raw videos using SAM2 segmentation and a VLM (Gemini) for reasoning about interaction-affected regions.

🖱️ Step 0 — Select points (GUI)

python VLM-MASK-REASONER/point_selector_gui.py

Load a JSON config listing your videos and instructions, then click on the objects to remove. Saves a *_points.json with the selected points.

Config format:

{
  "videos": [
    {
      "video_path": "path/to/video.mp4",
      "output_dir": "path/to/output/folder",
      "instruction": "remove the person"
    }
  ]
}
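If you have many videos, the config can be generated rather than written by hand. The sketch below builds a dictionary with the three keys shown above and saves it as JSON; `build_config` and `save_config` are hypothetical helper names, not part of the repository.

```python
import json
from pathlib import Path


def build_config(entries):
    """Build the config dict from (video_path, output_dir, instruction) tuples."""
    return {
        "videos": [
            {
                "video_path": str(video_path),
                "output_dir": str(output_dir),
                "instruction": instruction,
            }
            for video_path, output_dir, instruction in entries
        ]
    }


def save_config(config, path):
    """Write the config as indented JSON so it stays easy to review."""
    Path(path).write_text(json.dumps(config, indent=2) + "\n")


config = build_config([
    ("path/to/video.mp4", "path/to/output/folder", "remove the person"),
])
```

Calling `save_config(config, "my_config.json")` then produces a file that can be loaded in the point selector GUI.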

⚡ Steps 1–4 — Run the full pipeline

After saving the points config, run all remaining stages automatically:

bash VLM-MASK-REASONER/run_pipeline.sh my_config_points.json

Optional flags:

bash VLM-MASK-REASONER/run_pipeline.sh my_config_points.json \
    --sam2-checkpoint path/to/sam2_hiera_large.pt \
    --device cuda

This runs the following stages in order:

| Stage | Script | Output |
| --- | --- | --- |
| 1: SAM2 segmentation | stage1_sam2_segmentation.py | black_mask.mp4 |
| 2: VLM analysis | stage2_vlm_analysis.py | vlm_analysis.json |
| 3: Grey mask generation | stage3a_generate_grey_masks_v2.py | grey_mask.mp4 |
| 4: Combine into quadmask | stage4_combine_masks.py | quadmask_0.mp4 |

The final quadmask_0.mp4 in each video's output_dir is ready to use for inference.
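One way to move a finished mask into the inference input layout is sketched below. It assumes the source video has to be copied and renamed to `input_video.mp4` next to the quadmask; the helper name `stage_sequence` is hypothetical, and the repository may already arrange files this way for you.

```python
import shutil
from pathlib import Path


def stage_sequence(video_path, output_dir, data_rootdir, name):
    """Copy one processed video into data_rootdir/<name>/ using the expected filenames."""
    quadmask = Path(output_dir) / "quadmask_0.mp4"
    if not quadmask.is_file():
        raise FileNotFoundError(f"mask pipeline has not produced {quadmask}")
    sequence_dir = Path(data_rootdir) / name
    sequence_dir.mkdir(parents=True, exist_ok=True)
    # The inference scripts look for these exact filenames (see Input Format).
    shutil.copy2(video_path, sequence_dir / "input_video.mp4")
    shutil.copy2(quadmask, sequence_dir / "quadmask_0.mp4")
    return sequence_dir
```

The sequence folder still needs a `prompt.json` describing the background before inference can run.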


🎬 Stage 2 — Inference

VOID inference runs in two passes. Pass 1 is sufficient for most videos; Pass 2 adds a warped-noise refinement step for better temporal consistency on longer clips.

✨ Pass 1 — Base inference

python inference/cogvideox_fun/predict_v2v.py \
    --config config/quadmask_cogvideox.py \
    --config.data.data_rootdir="path/to/data_rootdir" \
    --config.experiment.run_seqs="my-video" \
    --config.experiment.save_path="path/to/output" \
    --config.video_model.model_name="path/to/CogVideoX-Fun-V1.5-5b-InP" \
    --config.video_model.transformer_path="path/to/void_pass1.safetensors"

To run multiple sequences at once, pass a comma-separated list:

--config.experiment.run_seqs="video1,video2,video3"
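When driving Pass 1 from a script, the command can be assembled as an argument list. The sketch below only builds and prints the command using the flags documented above; `pass1_command` is a hypothetical helper, and nothing is executed.

```python
import shlex


def pass1_command(data_rootdir, sequences, save_path, model_name, transformer_path):
    """Return the Pass 1 command as an argument list (nothing is executed here)."""
    return [
        "python", "inference/cogvideox_fun/predict_v2v.py",
        "--config", "config/quadmask_cogvideox.py",
        f"--config.data.data_rootdir={data_rootdir}",
        # Several sequences are joined into one comma-separated value.
        f"--config.experiment.run_seqs={','.join(sequences)}",
        f"--config.experiment.save_path={save_path}",
        f"--config.video_model.model_name={model_name}",
        f"--config.video_model.transformer_path={transformer_path}",
    ]


command = pass1_command(
    data_rootdir="path/to/data_rootdir",
    sequences=["video1", "video2", "video3"],
    save_path="path/to/output",
    model_name="path/to/CogVideoX-Fun-V1.5-5b-InP",
    transformer_path="path/to/void_pass1.safetensors",
)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` from the repo root would launch the run; building a list avoids shell quoting problems with paths that contain spaces.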

Key config options:

| Flag | Default | Description |
| --- | --- | --- |
| --config.data.sample_size | 384x672 | Output resolution (HxW) |
| --config.data.max_video_length | 197 | Max frames to process |
| --config.video_model.temporal_window_size | 85 | Temporal window for multidiffusion |
| --config.video_model.num_inference_steps | 50 | Denoising steps |
| --config.video_model.guidance_scale | 1.0 | Classifier-free guidance scale |
| --config.system.gpu_memory_mode | model_cpu_offload_and_qfloat8 | Memory mode (model_full_load, model_cpu_offload, sequential_cpu_offload) |

The output is saved as <save_path>/<sequence_name>.mp4, along with a *_tuple.mp4 side-by-side comparison.

🔁 Pass 2 — Warped noise refinement

Uses optical flow-warped latents from the Pass 1 output to initialize a second inference pass, improving temporal consistency.

Single video:

python inference/cogvideox_fun/inference_with_pass1_warped_noise.py \
    --video_name my-video \
    --data_rootdir path/to/data_rootdir \
    --pass1_dir path/to/pass1_outputs \
    --output_dir path/to/pass2_outputs \
    --model_checkpoint path/to/void_pass2.safetensors \
    --model_name path/to/CogVideoX-Fun-V1.5-5b-InP

Batch: Edit the video list and paths in inference/pass_2_refine.sh, then run:

bash inference/pass_2_refine.sh
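As an alternative to editing the shell script, the single-video command can be generated per sequence from Python. This sketch assumes Pass 1 wrote `<pass1_dir>/<video_name>.mp4`, which matches the Pass 1 output naming described earlier but is not confirmed for the Pass 2 script; `pass2_commands` is a hypothetical helper and nothing is executed.

```python
from pathlib import Path


def pass2_commands(video_names, data_rootdir, pass1_dir, output_dir, checkpoint, model_name):
    """Build one Pass 2 command per video; skip videos without a Pass 1 output."""
    commands, skipped = [], []
    for name in video_names:
        # Assumption: Pass 1 saved its result as <pass1_dir>/<name>.mp4.
        if not (Path(pass1_dir) / f"{name}.mp4").is_file():
            skipped.append(name)
            continue
        commands.append([
            "python", "inference/cogvideox_fun/inference_with_pass1_warped_noise.py",
            "--video_name", name,
            "--data_rootdir", str(data_rootdir),
            "--pass1_dir", str(pass1_dir),
            "--output_dir", str(output_dir),
            "--model_checkpoint", str(checkpoint),
            "--model_name", str(model_name),
        ])
    return commands, skipped
```

The `skipped` list makes it easy to see which sequences still need a Pass 1 run before refinement.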

Key arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --pass1_dir | | Directory containing Pass 1 output videos |
| --output_dir | ./inference_with_warped_noise | Where to save Pass 2 results |
| --warped_noise_cache_dir | ./pass1_warped_noise_cache | Cache for precomputed warped latents |
| --temporal_window_size | 85 | Temporal window size |
| --height / --width | 384 / 672 | Output resolution |
| --guidance_scale | 6.0 | CFG scale |
| --num_inference_steps | 50 | Denoising steps |
| --use_quadmask | True | Use quadmask conditioning |

✏️ Stage 3 — Manual Mask Refinement (Optional)

If the auto-generated quadmask does not accurately capture the object or its interaction region, use the included GUI editor to refine it before running inference.

python VLM-MASK-REASONER/edit_quadmask.py

Open a sequence folder containing input_video.mp4 (or rgb_full.mp4) and quadmask_0.mp4. The editor shows the original video and the editable mask side by side.

Tools:

- Grid Toggle: click a grid cell to toggle the interaction region (127 ↔ 255)
- Grid Black Toggle: click a grid cell to toggle the primary object region (0 ↔ 255)
- Brush (Add / Erase): freehand paint or erase mask regions at pixel level
- Copy from Previous Frame: propagate the black or grey mask from the previous frame
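The Grid Toggle tool can be pictured as flipping every pixel of one grid cell between the affected value and the background value. The sketch below is not the editor's code: the cell size, the rule for deciding the toggle direction, and the choice to leave other values (0 and 63) untouched are all assumptions made for illustration.

```python
AFFECTED, BACKGROUND = 127, 255


def toggle_cell(mask, row, col, cell=2):
    """Toggle one grid cell of a mask frame (list of rows) in place.

    Assumed rule: if the cell already contains affected pixels, clear them to
    background; otherwise mark its background pixels as affected.
    """
    pixels = [
        (y, x)
        for y in range(row * cell, min((row + 1) * cell, len(mask)))
        for x in range(col * cell, min((col + 1) * cell, len(mask[0])))
    ]
    marked = any(mask[y][x] == AFFECTED for y, x in pixels)
    old, new = (AFFECTED, BACKGROUND) if marked else (BACKGROUND, AFFECTED)
    for y, x in pixels:
        if mask[y][x] == old:
            mask[y][x] = new
    return mask


frame = [
    [0, 255, 255, 255],
    [255, 255, 255, 255],
]
toggle_cell(frame, row=0, col=0)
print(frame)  # [[0, 127, 255, 255], [127, 127, 255, 255]]
```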

Keyboard shortcuts: ← / → navigate frames, Ctrl+Z / Ctrl+Y undo/redo.

Save overwrites quadmask_0.mp4 in place. Rerun inference from Pass 1 after saving.


🏋️ Training

Training Data Generation

Due to licensing constraints on the underlying datasets, we release the data generation code instead of the pre-built training data. The code produces paired counterfactual videos (with/without object, plus quad-masks) from two sources:

Source 1: HUMOTO (Human-Object Interaction)

Generates counterfactual videos from the HUMOTO motion capture dataset using Blender. A human (Remy/Sophie character) interacts with objects; removing the human causes objects to fall via physics simulation.

Prerequisites:

1. HUMOTO dataset: Request access from the authors at adobe-research/humoto. Once approved, download and place under data_generation/humoto_release/
2. Blender: Install Blender (tested with 3.x and 4.x). Also install opencv-python-headless in Blender's Python (see data_generation/README.md)
3. Remy & Sophie characters: Download from Mixamo (free Adobe account). Search for "Remy" and "Sophie", downl

Core symbols most depended-on inside this repo

| Symbol | Called by | File |
| --- | --- | --- |
| from_pretrained | 46 | videox_fun/models/cogvideox_vae.py |
| update_display | 16 | VLM-MASK-REASONER/edit_quadmask.py |
| save_videos_grid | 11 | videox_fun/utils/utils.py |
| unwrap_model | 9 | scripts/cogvideox_fun/train.py |
| unwrap_model | 9 | scripts/cogvideox_fun/train_warped_noise.py |
| save_state | 9 | VLM-MASK-REASONER/edit_quadmask.py |
| resize_frame | 9 | videox_fun/data/dataset_image_video.py |
| resize_frame | 9 | videox_fun/data/dataset_image_video_warped.py |

Shape

Function 345
Method 328
Class 74
Route 4

Languages

Python 100%

Modules by API surface

videox_fun/models/cogvideox_vae.py · 61 symbols
data_generation/render_paired_videos_blender_quadmask.py · 33 symbols
VLM-MASK-REASONER/edit_quadmask.py · 32 symbols
videox_fun/data/dataset_image_video_warped.py · 30 symbols
videox_fun/data/dataset_image_video.py · 30 symbols
VLM-MASK-REASONER/point_selector_gui.py · 30 symbols
videox_fun/pipeline/pipeline_cogvideox_fun_inpaint.py · 26 symbols
videox_fun/reward/MPS/trainer/models/cross_modeling.py · 25 symbols
videox_fun/utils/lora_utils.py · 21 symbols
videox_fun/pipeline/pipeline_cogvideox_fun.py · 21 symbols
videox_fun/models/cogvideox_transformer3d.py · 21 symbols
videox_fun/utils/utils.py · 18 symbols

Dependencies from manifests, versioned

Pillow 11.3.0 · 1×
absl-py 2.3.1 · 1×
accelerate 1.12.0 · 1×
albumentations 2.0.8 · 1×
beautifulsoup4 4.13.5 · 1×
came-pytorch 0.1.3 · 1×
datasets 4.0.0 · 1×
decord 0.6.0 · 1×
deepspeed 0.17.6 · 1×
diffusers 0.33.1 · 1×
einops 0.8.0 · 1×
ftfy 6.1.1 · 1×

For agents

$ claude mcp add void-model \
  -- python -m otcore.mcp_server <graph>
