InfiniBench

A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding


King Abdullah University of Science and Technology
Monash University
Rice University

Abstract

Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench, a comprehensive benchmark for very long video understanding, which presents: 1) the longest video duration, averaging 76.34 minutes; 2) the largest number of question-answer pairs, 108.2K; 3) diversity in questions, examining nine different skills and including both multiple-choice and open-ended questions; 4) a human-centric design, as the video sources come from movies and daily TV shows, with specific human-level question designs such as Movie Spoiler Questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including the commercial model Gemini 1.5 Flash and open-source models. The evaluation reveals significant challenges in our benchmark. Our results show that even the best AI models, such as Gemini, struggle to perform well, with 42.72% average accuracy and a 2.71 out of 5 average score. We hope this benchmark will stimulate the LMMs community towards long video and human-level understanding.


Benchmark skills

InfiniBench introduces a total of nine skills. The figure shows two question examples for two distinct skills: the left example illustrates the Global Appearance skill, and the right example illustrates the Scene Transition skill.



Comparison between InfiniBench and existing video understanding benchmarks.

InfiniBench has the largest number of QA pairs, the most videos, and the longest average video duration. (Note: Global Q indicates whether the benchmark includes challenging questions that require reasoning over the whole video; VS denotes the video script, and VSum the video summary.)



Benchmark statistics.

Left: distribution of the number of questions for each skill. Right: number of videos for each skill.

Data statistics. On the left, we report the number of videos and their length in hours for each data source: the TVQA and MovieNet datasets. In the middle, we report the number of questions. On the right, we show histograms of the question and answer lengths.



Full annotation pipeline.

Full annotation pipeline for the InfiniBench skill set. The upper section depicts the global appearance pipeline, while the lower section illustrates question generation using GPT-4. The gates for the video summary and video transcript indicate that some skills use only the summary, others use only the transcript, and some use both.
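To make the gating concrete, below is a minimal Python sketch of the GPT-4 question-generation step, assuming an OpenAI-style client. The skill-to-source mapping, function names, and prompt wording are illustrative assumptions, not the authors' exact pipeline.

    # Sketch of GPT-4 question generation with summary/transcript gating.
    # SKILL_SOURCES and the prompt text are hypothetical placeholders.
    from openai import OpenAI

    client = OpenAI()

    # Hypothetical gating: which textual sources each skill consumes.
    SKILL_SOURCES = {
        "deep_context_understanding": ("summary", "transcript"),
        "movie_spoiler":              ("summary",),
        "temporal_order_of_events":   ("transcript",),
    }

    def generate_qa(skill: str, summary: str, transcript: str, n_questions: int = 5) -> str:
        """Ask GPT-4 to draft question-answer pairs for one video and one skill."""
        sources = SKILL_SOURCES.get(skill, ("summary", "transcript"))
        context = []
        if "summary" in sources:
            context.append(f"Video summary:\n{summary}")
        if "transcript" in sources:
            context.append(f"Video transcript:\n{transcript}")

        prompt = (
            f"Generate {n_questions} question-answer pairs that test the "
            f"'{skill}' skill for the video described below.\n\n"
            + "\n\n".join(context)
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content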



Results

Overall performance. The overall performance of different models on InfiniBench is shown in Table 2 (j). Three findings can be observed: (1) All models perform relatively poorly compared to other benchmarks (e.g., the MovieChat benchmark), highlighting the unique challenges of our benchmark, such as its longer duration. (2) Gemini-Flash 1.5 achieves the best performance on both multiple-choice and open-ended questions, with 47.72 accuracy (0-100) and a 2.70 GPT4-score (0-5). There is also a large performance gap between Gemini and the open-source models. (3) Among open-source models, LLama-VID achieves the best result, with 17.15 accuracy and a 1.7 GPT4-score. One reason may be that LLama-VID is pre-trained with longer-duration QA pairs, which helps it handle longer sequences.
Performance on specific skills. Table 2 (a)-(i) shows the performance of SOTA long video understanding models on each skill. The performance varies significantly among skills, highlighting the unique challenges introduced by each one. Observations from the results: (1) Scene Transition is the most difficult MCQ question type, with Gemini achieving only 29.48% accuracy. A likely reason is that these questions require global reasoning across the entire hour-long video rather than a single clip. (2) All models struggle with Movie Spoiler questions among the open-ended questions. The difficulty lies in the deeper understanding and reasoning needed to reach the correct answer. Since Movie Spoiler questions are meaningful for human-centric video understanding, current model capabilities need improvement. (3) All open-source models score below random choice on MCQs, except for the Local visual + context questions. This shows that the main challenge for existing models is long-sequence global reasoning.
Performance on four types of questions. As introduced in Section 3.1 of the main paper, questions for each skill in InfiniBench can be identified as one of four high-level types: Global visual, Global contextual, Global vision + text, and Local vision + context. The results for each type of question are provided in Table 3. Among these SOTA models, only two, Gemini Flash 1.5 and LLama-VID, accept both video and video subtitles. The table clearly shows that LLama-VID outperforms the other two open-source models on questions requiring context understanding. The main reason for the poor performance of LWM and MovieChat is that these two models make predictions from video only, missing important text information. This highlights the importance of long video understanding models handling both modalities. Additionally, Global contextual questions are challenging for all models, as they require complex reasoning.
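For concreteness, here is a minimal sketch of the two scoring protocols referenced above: exact-match accuracy for multiple-choice questions and a GPT-4-as-judge score (0-5) for open-ended answers, assuming an OpenAI-style client. The judge prompt, score parsing, and function names are illustrative assumptions rather than the benchmark's released evaluation script.

    # Sketch of MCQ accuracy and GPT-4 judge scoring for open-ended answers.
    # Prompt wording and parsing are hypothetical, for illustration only.
    from openai import OpenAI

    client = OpenAI()

    def mcq_accuracy(predictions: list[str], answers: list[str]) -> float:
        """Percentage of multiple-choice predictions matching the ground-truth option."""
        correct = sum(p.strip().lower() == a.strip().lower()
                      for p, a in zip(predictions, answers))
        return 100.0 * correct / len(answers)

    def gpt4_score(question: str, reference: str, prediction: str) -> int:
        """Ask GPT-4 to rate an open-ended answer on a 0-5 scale."""
        prompt = (
            "Rate how well the predicted answer matches the reference answer "
            "on a scale from 0 (wrong) to 5 (fully correct). "
            "Reply with a single integer.\n\n"
            f"Question: {question}\nReference: {reference}\nPrediction: {prediction}"
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return int(response.choices[0].message.content.strip())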


High-level aggregated skills.

Results for the high-level aggregated skills.


Examples

Question examples for the following skills: Linking Multiple Events, Temporal Order of Events, Local Questions, Deep Context Understanding, and Summarization.

BibTeX


@misc{ataallah2024infinibenchcomprehensivebenchmarklarge,
  title={InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding},
  author={Kirolos Ataallah and Chenhui Gou and Eslam Abdelrahman and Khushbu Pahwa and Jian Ding and Mohamed Elhoseiny},
  year={2024},
  eprint={2406.19875},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2406.19875},
}

Acknowledgement

Video-ChatGPT

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.