LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding


¹Meta AI  ²King Abdullah University of Science and Technology  ³Korea University
*Work done at Meta  †Project lead

Abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the limited context length of LLMs. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is based on leveraging cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. We then utilize a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a limited context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Built on a lightweight LLM, LongVU also scales effectively to a smaller model size while maintaining state-of-the-art video understanding performance.
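To make the temporal reduction step concrete, the following is a minimal PyTorch sketch (not the released implementation) of pruning redundant frames by DINOv2 feature similarity; the pooled per-frame features, the function name, and the 0.9 threshold are illustrative assumptions.

import torch
import torch.nn.functional as F

def prune_redundant_frames(dino_feats: torch.Tensor, sim_threshold: float = 0.9):
    # dino_feats: (T, D) per-frame DINOv2 features (e.g., pooled CLS tokens).
    # Returns the indices of the frames to keep.
    keep = [0]  # always keep the first frame
    for t in range(1, dino_feats.size(0)):
        # Compare each frame against the most recently kept frame.
        sim = F.cosine_similarity(dino_feats[t], dino_feats[keep[-1]], dim=0).item()
        if sim < sim_threshold:  # sufficiently different from the last kept frame
            keep.append(t)
    return keep

Frames whose features are nearly identical to an already-kept frame contribute little new information, so dropping them shrinks the token budget before any spatial compression is applied.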

Examples


Long Video Examples


LongVU Architecture

Architecture of LongVU. Given densely sampled video frames, we first use DINOv2 as a prior to remove redundant frames, then fuse the remaining frame features from both SigLIP and DINOv2. Next, we selectively reduce visual tokens via a cross-modal query. Finally, we conduct spatial token compression based on temporal dependencies to further fit within the limited context length of LLMs.
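The selective reduction and spatial compression stages can be sketched in the same spirit. Under assumed shapes, frame_tokens of shape (T, N, D) holds the fused SigLIP/DINOv2 tokens of the kept frames and text_emb is a pooled (D,) embedding of the text query; query-relevant frames retain their full token grid while the rest are spatially pooled. The budget, pooling factor, and all names here are hypothetical, not the released API.

import torch
import torch.nn.functional as F

def cross_modal_reduce(frame_tokens, text_emb, full_res_budget=8, pool=2):
    T, N, D = frame_tokens.shape
    # Score each frame by its strongest token response to the text query.
    scores = torch.einsum("tnd,d->tn", frame_tokens, text_emb).amax(dim=1)  # (T,)
    topk = set(torch.topk(scores, k=min(full_res_budget, T)).indices.tolist())
    side = int(N ** 0.5)  # assume a square token grid, N == side * side
    out = []
    for t in range(T):
        if t in topk:
            out.append(frame_tokens[t])  # query-relevant frames keep all tokens
        else:
            grid = frame_tokens[t].transpose(0, 1).view(1, D, side, side)
            pooled = F.avg_pool2d(grid, pool)  # spatially downsample the rest
            out.append(pooled.flatten(2).squeeze(0).transpose(0, 1))
    return out  # per-frame token sets with mixed spatial resolutions

In the model itself, the spatial compression is further guided by temporal dependencies across frames rather than the fixed top-k budget used in this sketch.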

Video Understanding Results

Edge Model Results

Citation

@article{shen2024longvu,
  title={LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding},
  author={Shen, Xiaoqian and Xiong, Yunyang and Zhao, Changsheng and Wu, Lemeng and Chen, Jun and Zhu, Chenchen and Liu, Zechun and Xiao, Fanyi and Varadarajan, Balakrishnan and Bordes, Florian and Liu, Zhuang and Xu, Hu and Kim, Hyunwoo J. and Soran, Bilge and Krishnamoorthi, Raghuraman and Elhoseiny, Mohamed and Chandra, Vikas},
  journal={arXiv preprint arXiv:2410.17434},
  year={2024}
}