LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen¹,²,*, Yunyang Xiong¹,☨, Changsheng Zhao¹, Lemeng Wu¹, Jun Chen², Chenchen Zhu¹, Zechun Liu¹, Fanyi Xiao¹, Balakrishnan Varadarajan¹, Florian Bordes¹, Zhuang Liu¹, Hu Xu¹, Hyunwoo J. Kim³, Bilge Soran¹, Raghuraman Krishnamoorthi¹, Mohamed Elhoseiny²,☨, Vikas Chandra¹,☨

¹Meta AI ²King Abdullah University of Science and Technology ³Korea University

*Work done at Meta ☨Project lead

Abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge because of the limited context length of LLMs. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is to leverage cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we use DINOv2 features to remove redundant frames that exhibit high similarity. We then use a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy processes a large number of frames with little visual information loss within a limited context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. With a lightweight LLM, LongVU also scales down effectively to a smaller size with state-of-the-art video understanding performance.
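As a rough illustration of the first step (removing frames that exhibit high similarity), the sketch below keeps a frame only when its feature vector differs enough from the most recently kept frame. This is a minimal sketch under stated assumptions, not the LongVU implementation: the function names, the 0.9 threshold, and the choice to compare against the last kept frame are all hypothetical, and the plain Python lists stand in for per-frame DINOv2 features.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant_frames(
    features: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[int]:
    """Return indices of frames to keep.

    Assumption (hypothetical rule): a frame is redundant when its similarity
    to the last *kept* frame is at or above `threshold`.
    """
    kept: List[int] = []
    for idx, feat in enumerate(features):
        if not kept or cosine_similarity(features[kept[-1]], feat) < threshold:
            kept.append(idx)
    return kept


# Toy stand-ins for per-frame features: frames 1 and 3 nearly repeat 0 and 2.
toy_features = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99], [1.0, 0.0]]
print(drop_redundant_frames(toy_features))  # [0, 2, 4]
```

Comparing against the last kept frame rather than the immediately preceding one is a design choice of this sketch: it prevents a slow drift across many near-identical frames from passing through unnoticed.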

Examples

User: Please provide a detailed description of the video.

LongVU: The video begins with a scene set in a futuristic cityscape at night, where a humanoid robot stands in the foreground, its back to the camera, against a backdrop of illuminated buildings and a helicopter in flight. The setting is characterized by a blend of modern and cyberpunk elements, with neon signs and a wet street reflecting the city lights. As the video progresses, the robot remains stationary, with the helicopter moving closer, eventually hovering above the street, indicating a possible interaction or event unfolding between the robot and the helicopter.

The scene then shifts to an interior setting, likely a workshop or garage, filled with various electronic equipment and machinery. The robot is now in the foreground, facing the camera, with a person in the background, suggesting a narrative progression. The environment is cluttered, with the lighting casting a blue hue over the scene, emphasizing the technological theme.

As the video continues, the camera angle changes to show the robot from behind, with the person in the background now more visible, indicating movement. The workshop setting remains consistent, with the clutter and electronic equipment unchanged. The lighting and color scheme continue to highlight the technological ambiance.

The perspective then shifts to a side view of the robot, revealing more of its structure and the cluttered workshop environment. The person in the background is now closer to the robot, suggesting interaction or movement towards it. The lighting and color scheme remain consistent, maintaining the technological atmosphere.

Finally, the camera zooms in on the robot, focusing on its upper body and the intricate details of its internal components. The cluttered workshop environment becomes less visible, with the lighting casting a blue hue over the scene, emphasizing the technological theme. The camera's close-up shot highlights the robot's design and the complexity of its internal mechanisms, suggesting a narrative focus on the robot's internal workings or its role within the technological setting.

User: How many toys did the dog play with?

LongVU: The dog played with a total of two toys in the video.

User: Can you identify the direction of the yellow sphere's movement in the video?

LongVU: The yellow sphere is moving away from the red cube and towards the right side of the frame.

User: What is the name of the store?

LongVU: The name of the store is "Animate".

User: Describe this video in detail.

LongVU: The video begins with a scene featuring two animated characters in a fantastical setting, suggesting a narrative of adventure or conflict. The first character, dressed in a yellow and red martial arts uniform with a mask, is seen in a defensive or preparatory stance, while the second character, an older man with a white beard and a blue robe, appears surprised or concerned. The background is filled with green, leaf-like structures and a mountainous landscape, indicating a natural, possibly magical environment.

User: What happened after the person held the dish?

LongVU: After holding the dish, the person placed it in the cabinet.

Long Video Examples

User: In this video, how many times does the scene of the 'making jewelry' action appear in total?

LongVU: The scene of the 'making jewelry' action appears twice in the video.

(For your reference, the relevant sections start around 00:50 and 03:15)

User: Who did I interact with when I skated?

LongVU: During your skating, you interacted with a lady wearing a black dress.

(For your reference, the relevant section starts around 05:00)

User: What is the condition of the highway where the SUV is parked?

LongVU: The highway where the SUV is parked is empty and surrounded by trees.

(For your reference, the relevant section starts around 17:53)

User: What is the chef doing with the lobster in the dinner preparation?

LongVU: The chef is cutting the lobster in half.

(For your reference, the relevant section starts around 05:37)

LongVU Architecture

Architecture of LongVU. Given densely sampled video frames, we first use a DINOv2 prior to remove redundant frames, then fuse the remaining frame features from both SigLIP and DINOv2. Next, we selectively reduce visual tokens via a cross-modal query. Finally, we apply spatial token compression based on temporal dependencies to fit within the limited context length of LLMs.
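The last two stages of the caption can be sketched in a few lines. The code below is a hedged illustration, not the authors' code. It assumes that query relevance is the cosine similarity between a frame's mean-pooled tokens and a text embedding, that less relevant frames are collapsed to a single pooled token, and that spatial compression drops a token when it closely matches the token at the same position in the first frame of a window. All function names, the `keep_full` parameter, and the 0.9 threshold are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Frame = List[Vector]  # one frame = a list of token vectors


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_pool(tokens: Frame) -> List[float]:
    """Average all token vectors of a frame into one vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def query_guided_reduction(frames: List[Frame], text_feat: Vector, keep_full: int) -> List[Frame]:
    """Keep all tokens for the `keep_full` most query-relevant frames;
    collapse every other frame to a single pooled token (assumed scheme)."""
    scores = [cosine(mean_pool(f), text_feat) for f in frames]
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    full = set(ranked[:keep_full])
    return [list(f) if i in full else [mean_pool(f)] for i, f in enumerate(frames)]


def prune_spatial_tokens(frames: List[Frame], threshold: float = 0.9) -> List[Frame]:
    """Within one window of frames sharing a token grid, keep the first frame
    whole and drop later tokens that nearly repeat the same-position anchor token."""
    anchor = frames[0]
    pruned = [list(anchor)]
    for frame in frames[1:]:
        pruned.append([tok for ref, tok in zip(anchor, frame) if cosine(ref, tok) < threshold])
    return pruned


# Toy window: three frames, two tokens each, 2-D features.
window = [[[1, 0], [0, 1]], [[1, 0], [1, 0]], [[0, 1], [0, 1]]]
print([len(f) for f in query_guided_reduction(window, [0, 1], keep_full=1)])  # [1, 1, 2]
print([len(f) for f in prune_spatial_tokens(window)])  # [2, 1, 1]
```

The two functions are shown independently because `prune_spatial_tokens` assumes every frame in the window has the same token layout, which no longer holds once some frames have been pooled.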

Video Understanding Results

Edge Model Results

Citation

@article{shen2024longvu,
    title={LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding},
    author={Shen, Xiaoqian and Xiong, Yunyang and Zhao, Changsheng and Wu, Lemeng and Chen, Jun and Zhu, Chenchen and Liu, Zechun and Xiao, Fanyi and Varadarajan, Balakrishnan and Bordes, Florian and Liu, Zhuang and Xu, Hu and Kim, Hyunwoo J. and Soran, Bilge and Krishnamoorthi, Raghuraman and Elhoseiny, Mohamed and Chandra, Vikas},
    journal={arXiv:2410.17434},
    year={2024}
}