FlashVID

FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging

Ziyang Fan [1], Keyu Chen [1], Ruilong Xing [1], Yulin Li [1], Li Jiang [2, 3], Zhuotao Tian [1, 3]*
[1] Harbin Institute of Technology (Shenzhen)
[2] The Chinese University of Hong Kong (Shenzhen)
[3] Shenzhen Loop Area Institute
* Corresponding Author
🔥News

📋Todo List

✨Highlights

[Figure: FlashVID teaser]

💡Motivation

[Figure: Motivation]

In this work, we identify two key observations about spatiotemporal redundancy in videos:

To compress spatiotemporal redundancy more effectively, we present a simple yet effective solution: Tree-based Spatiotemporal Token Merging (TSTM), which performs fine-grained spatiotemporal redundancy compression, complemented by the Attention and Diversity-based Token Selection (ADTS) module, which selects informative tokens.

🌈Method

[Figure: Method overview]

Illustration of FlashVID. We compress visual tokens by two synergistic modules.
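
The merging idea can be illustrated with a toy sketch. This is not the repository's implementation. It assumes that each token links to its most similar token in the previous frame when their cosine similarity reaches a threshold, and that each resulting tree is collapsed into the mean of its members. The function name merge_tokens_across_frames and the plain-list token representation are hypothetical, and the ADTS selection step is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens_across_frames(frames, threshold=0.8):
    """Toy cross-frame merging (an assumption-laden sketch, not FlashVID's code).

    frames: list of frames, each a list of token vectors (lists of floats).
    Returns one averaged vector per tree, in order of tree creation.
    """
    trees = []      # trees[k] holds all member vectors of tree k
    prev_ids = []   # tree index of each token in the previous frame
    for t, frame in enumerate(frames):
        cur_ids = []
        for tok in frame:
            best_id, best_sim = None, threshold
            if t > 0:
                # Look for the most similar token in the previous frame.
                for j, prev_tok in enumerate(frames[t - 1]):
                    sim = cosine(tok, prev_tok)
                    if sim >= best_sim:
                        best_id, best_sim = prev_ids[j], sim
            if best_id is None:
                # No sufficiently similar parent: this token starts a new tree.
                trees.append([tok])
                best_id = len(trees) - 1
            else:
                # Join the tree of the best-matching parent token.
                trees[best_id].append(tok)
            cur_ids.append(best_id)
        prev_ids = cur_ids
    # Collapse each tree into the element-wise mean of its members.
    return [[sum(col) / len(members) for col in zip(*members)] for members in trees]


frames = [
    [[1.0, 0.0], [0.0, 1.0]],   # frame 0: two distinct tokens
    [[0.9, 0.1], [-1.0, 0.0]],  # frame 1: one near-duplicate, one new token
]
merged = merge_tokens_across_frames(frames, threshold=0.8)
print(len(merged))  # four input tokens collapse into three merged tokens
```

In this toy input, the first token of frame 1 is nearly identical to the first token of frame 0, so the two are averaged into one token, while the remaining tokens stay separate.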

📦Installation

In this project, we use uv for package management.

git clone https://github.com/Fanziyang-v/FlashVID.git
cd FlashVID

uv sync

🚀Quickstart

FlashVID's code is easy to use and works out of the box. Just wrap the model with the flashvid() function. Currently, FlashVID supports LLaVA-OneVision, LLaVA-Video, Qwen2.5-VL, and Qwen3-VL.

from flashvid import flashvid

model = flashvid(
    model,
    retention_ratio=0.1,
    alpha=0.7,
    temporal_threshold=0.8,
)

📝Note: You can override the default parameters (e.g., retention ratio) in the flashvid() wrapper function.
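
To get a feel for what a retention ratio implies, the sketch below estimates how many visual tokens would remain. It assumes the ratio is the fraction of all visual tokens kept and that the result is rounded down. The helper retained_token_budget and the figure of 196 tokens per frame are illustrative assumptions, not values taken from the FlashVID code.

```python
def retained_token_budget(num_frames, tokens_per_frame, retention_ratio):
    """Estimate the number of visual tokens kept (hypothetical helper).

    Assumes retention_ratio is the kept fraction of all visual tokens and
    that the budget is floored, with at least one token kept.
    """
    if not 0.0 < retention_ratio <= 1.0:
        raise ValueError("retention_ratio must be in (0, 1]")
    total = num_frames * tokens_per_frame
    return max(1, int(total * retention_ratio))


# 32 frames at an assumed 196 tokens per frame, keeping 10% of the tokens.
budget = retained_token_budget(32, 196, 0.1)
print(budget)  # 627 of 6272 tokens
```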

Inference demos are provided in playground/. Here is an example run:

python playground/llava_ov_infer.py \
    --video-path assets/Qgr4dcsY-60.mp4 \
    --question "Describe the video in detail." \
    --num-frames 32 \
    --enable-flashvid

📊Evaluation

All experiments are conducted with LMMs-Eval. We provide FlashVID evaluation scripts in scripts/ for LLaVA-OneVision, LLaVA-Video, Qwen2.5-VL, and Qwen3-VL. Run them to reproduce our experimental results:

bash scripts/llava_ov.sh

📝Note: To integrate FlashVID into LMMs-Eval, add the FlashVID parameters to the model's __init__() and wrap the loaded model with the flashvid() function (see lmms_eval/models/simple/llava_onevision.py).
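
The integration pattern might look roughly like the sketch below. The class EvalModelSketch is a hypothetical stand-in and does not reproduce the actual LMMs-Eval model class. The enable_flashvid flag and the wrap_fn argument are assumptions added so the sketch runs without the flashvid package installed. The keyword arguments passed to the wrapper are the ones shown in the Quickstart.

```python
class EvalModelSketch:
    """Hypothetical stand-in for an evaluation model class (not LMMs-Eval's)."""

    def __init__(
        self,
        model,
        enable_flashvid=False,      # assumed flag name
        retention_ratio=0.1,
        alpha=0.7,
        temporal_threshold=0.8,
        wrap_fn=None,               # injection point for illustration only
    ):
        if enable_flashvid:
            if wrap_fn is None:
                # Fall back to the real wrapper when the package is available.
                from flashvid import flashvid as wrap_fn
            # Wrap the already-loaded model, forwarding the FlashVID parameters.
            model = wrap_fn(
                model,
                retention_ratio=retention_ratio,
                alpha=alpha,
                temporal_threshold=temporal_threshold,
            )
        self.model = model


# Usage with a fake wrapper that records what it receives.
received = {}

def fake_wrap(model, **kwargs):
    received.update(kwargs)
    return ("wrapped", model)

adapter = EvalModelSketch("loaded-model", enable_flashvid=True, wrap_fn=fake_wrap)
print(adapter.model)
```

Keeping the wrapper call inside __init__() means the rest of the evaluation code sees a single model object, whether or not FlashVID is enabled.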

👏Acknowledgement

This project builds on recent open-source work: FastV, VisionZip, PruneVID, FastVID, LLaVA-NeXT, Qwen2.5-VL/Qwen3-VL, and LMMs-Eval. Thanks to the authors for their excellent work!

📜Citation

If you find this project useful in your research, please consider citing:

@inproceedings{fan2026flashvid,
    title={Flash{VID}: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging},
    author={Ziyang Fan and Keyu Chen and Ruilong Xing and Yulin Li and Li Jiang and Zhuotao Tian},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=H6rDX4w6Al}
}
