Demo visualization showing Spatio-temporal Grounding and Retrieval capabilities.
We introduce Vidi, a family of Large Multimodal Models (LMMs) designed for a wide range of video understanding and editing (VUE) scenarios.
Project milestones:
2025-11-25: Tech report, GitHub code, and updated demo are now available.
2025-08-29: Demo updated with a completely new UI design.
2025-06-06: Initial demo release for the 7B model.
2025-04-21: Vidi tech report and VUE-TR evaluation benchmark released.
Spatio-Temporal Grounding: input a text query describing an object, and Vidi finds the relevant clips and draws bounding boxes around the object throughout the video.
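The exact format of the grounding output is not specified here; as a minimal sketch, assuming the model returns per-frame boxes as (frame_index, x1, y1, x2, y2) pixel tuples, they could be overlaid on the source video with OpenCV:

# Sketch only: overlay a hypothetical grounding output on a video with OpenCV.
# Assumes boxes is a list of (frame_index, x1, y1, x2, y2) tuples in pixels.
import cv2

def draw_grounding(video_path, boxes, out_path="grounded.mp4"):
    by_frame = {}
    for frame_idx, x1, y1, x2, y2 in boxes:
        by_frame.setdefault(frame_idx, []).append((x1, y1, x2, y2))
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Draw every box predicted for this frame index.
        for x1, y1, x2, y2 in by_frame.get(idx, []):
            cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        writer.write(frame)
        idx += 1
    cap.release()
    writer.release()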
Temporal Retrieval: search within videos using natural language; the model identifies the precise time ranges corresponding to your query.
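The retrieval output format may differ across releases; as a minimal sketch, assuming it yields a list of (start, end) time ranges in seconds, the matching segments could be exported as clips with ffmpeg:

# Sketch only: cut hypothetical retrieved (start, end) ranges into clips with ffmpeg.
import subprocess

def export_clips(video_path, ranges, prefix="clip"):
    for i, (start, end) in enumerate(ranges):
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path,
             "-ss", str(start), "-to", str(end),
             "-c", "copy", f"{prefix}_{i:02d}.mp4"],
            check=True,
        )

# Example: export_clips("example_video.mp4", [(12.5, 18.0), (42.0, 47.5)])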
Video Question Answering: ask complex, open-ended questions about the video content and get detailed, context-aware answers.
Highlight Detection: automatically produce a set of highlight clips with titles, summarizing the most important parts of the video without any user query.
# Clone the repository
git clone https://github.com/bytedance/vidi
cd vidi
# Run installation script
bash install.sh
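# Run inference on the example video with a text query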
python3 -u inference.py \
    --video-path ./example_video.mp4 \
    --query "slicing onion" \
    --model-path ./checkpoints/Vidi-7B
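To run several queries against the same video, the command above can be looped from Python; the queries below are illustrative only:

# Sketch only: loop the inference CLI shown above over illustrative queries.
import subprocess

queries = ["slicing onion", "stirring the pan", "plating the dish"]  # hypothetical examples
for q in queries:
    subprocess.run(
        ["python3", "-u", "inference.py",
         "--video-path", "./example_video.mp4",
         "--query", q,
         "--model-path", "./checkpoints/Vidi-7B"],
        check=True,
    )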
We release benchmarks for both Spatio-Temporal Grounding and Temporal Retrieval.
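Temporal retrieval predictions are commonly scored with temporal IoU between predicted and ground-truth time ranges; the sketch below shows that standard metric and may differ in detail from the exact VUE-TR protocol:

# Sketch only: standard temporal IoU between a predicted and a ground-truth range.
def temporal_iou(pred, gt):
    """pred and gt are (start, end) tuples in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: temporal_iou((10.0, 20.0), (12.0, 22.0)) == 8.0 / 12.0 ≈ 0.667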