Vidi2: Large Multimodal Models for
Video Understanding and Creation

Demo visualization showing Spatio-temporal Grounding and Retrieval capabilities.

Abstract

We introduce Vidi, a family of Large Multimodal Models (LMMs) designed for a wide range of video understanding and editing (VUE) scenarios.

The project has evolved through significant milestones:

  • First Release (Vidi 1.0): Focused on temporal retrieval (TR), identifying the specific time ranges in an input video that correspond to a given text query.
  • Second Release (Vidi2): Evolves toward a comprehensive video foundation model, achieving state-of-the-art performance in spatio-temporal grounding (STG) and temporal retrieval while maintaining robust open-ended video QA performance.

Latest Updates

• 2025-11-25: 🔥 Vidi2 Released. Tech report, GitHub code, and updated Demo are now available.
• 2025-08-29: Vidi1.5-9B Demo. Released with a completely new UI design.
• 2025-06-06: Vidi-7B Demo. Initial demo release for the 7B model.
• 2025-04-21: First Release. Vidi tech report and VUE-TR evaluation benchmark released.

Core Capabilities

Spatio-Temporal Grounding

Input a text query describing an object; Vidi finds the matching clips and draws bounding boxes around the object throughout the video.
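
To make the input/output shape concrete, here is a minimal, hypothetical sketch of what a grounding result could look like in Python. The field names, timestamps, and coordinates are illustrative assumptions, not Vidi's actual output schema; refer to the tech report and demo for the real format.

# A hypothetical spatio-temporal grounding result (illustrative only).
# Each entry is one retrieved segment; "boxes" maps a timestamp in seconds
# to a bounding box [x1, y1, x2, y2] in pixel coordinates.
stg_result = [
    {
        "query": "person slicing an onion",
        "start": 12.4,
        "end": 18.9,
        "boxes": {
            12.4: [320, 140, 510, 460],
            13.4: [318, 142, 508, 458],
            # ...one box per sampled timestamp up to "end"
        },
    },
]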

Temporal Retrieval

Search within videos using natural language. The model identifies precise time ranges corresponding to your query.
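
As a small, generic illustration of how retrieved time ranges might be used downstream (for example, to place cuts on an editing timeline), the sketch below formats hypothetical (start, end) pairs in seconds as HH:MM:SS.mmm strings. It is plain Python, not part of the Vidi API, and the example ranges are made up.

def to_timestamp(seconds: float) -> str:
    # Format a time in seconds as HH:MM:SS.mmm.
    whole = int(seconds)
    ms = int(round((seconds - whole) * 1000))
    return f"{whole // 3600:02d}:{(whole % 3600) // 60:02d}:{whole % 60:02d}.{ms:03d}"

# Hypothetical time ranges returned for a query, in seconds.
retrieved = [(12.4, 18.9), (95.0, 101.5)]
for start, end in retrieved:
    print(f"{to_timestamp(start)} -> {to_timestamp(end)}")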

Video QA (VQA)

Open-ended question answering. Ask complex questions about the video content and get detailed, context-aware answers.

Highlight Generation

Automatically generates a set of highlight clips with titles, summarizing the most important parts of the video without requiring a user query.

Quick Start & Evaluation

Installation

# Clone the repository
git clone https://github.com/bytedance/vidi
cd vidi

# Run installation script
bash install.sh

Inference Example (7B Model)

python3 -u inference.py \
    --video-path ./example_video.mp4 \
    --query "slicing onion" \
    --model-path ./checkpoints/Vidi-7B
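
To run the same video against several queries, one simple option is to wrap the CLI above in a loop. The sketch below uses only the flags shown in the example (--video-path, --query, --model-path); the query strings are placeholders.

import subprocess

VIDEO = "./example_video.mp4"
MODEL = "./checkpoints/Vidi-7B"
QUERIES = ["slicing onion", "pouring oil into the pan"]  # placeholder queries

for query in QUERIES:
    # One inference.py invocation per query; raises if the script fails.
    subprocess.run(
        ["python3", "-u", "inference.py",
         "--video-path", VIDEO,
         "--query", query,
         "--model-path", MODEL],
        check=True,
    )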

Evaluation Data

We release benchmarks for both Spatio-Temporal Grounding and Temporal Retrieval.
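
Benchmarks of this kind are typically scored by comparing predicted time ranges against ground truth with temporal intersection-over-union (IoU). The exact metrics and protocol for our benchmarks are specified in the tech report; the function below is only a generic sketch of the idea, not the official evaluation code.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    # Intersection-over-union of two (start, end) time ranges in seconds.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction covering most of a ground-truth range.
print(temporal_iou((12.0, 19.0), (12.4, 18.9)))  # ≈ 0.93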