Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench, a comprehensive benchmark for very long video understanding, which presents: 1) the longest video duration, averaging 76.34 minutes; 2) the largest number of question-answer pairs, 108.2K; 3) diversity in questions, which examine nine different skills and include both multiple-choice and open-ended questions; and 4) human-centric design, as the video sources come from movies and daily TV shows, with specific human-level question types such as Movie Spoiler Questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including the commercial model Gemini 1.5 Flash and open-source models. The evaluation reveals significant challenges in our benchmark: even the best models, such as Gemini, struggle to perform well, reaching only 42.72% average accuracy and a 2.71 out of 5 average score. We hope this benchmark will stimulate the LMMs community toward long video and human-level understanding.
InfiniBench introduces a set of nine skills. The figure shows example questions for two distinct skills: the left example illustrates the Global Appearance skill, and the right example illustrates the Scene Transition skill.
InfiniBench has the largest number of QA pairs, the most videos, and the longest average duration. (Note: Global Q indicates whether the benchmark includes challenging questions that span the whole video; VS is the video's script, and VSum is the summary of the video.)
Left: distribution of the number of questions per skill. Right: number of videos per skill.
Data statistics. On the left, we report the number of videos and their total length in hours from each data source: the TVQA and MovieNet datasets. In the middle, we show the number of questions. On the right, we show histograms of question and answer lengths.
Full annotation pipeline for the InfiniBench skill set. The upper section depicts the global appearance pipeline, while the lower section illustrates question generation using GPT-4. The gates for the video summary and video transcript indicate that some skills use only the summary, others use only the transcript, and some use both.
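To make the summary/transcript gating concrete, below is a minimal, hypothetical Python sketch of how a skill-dependent gate might assemble a GPT-4 prompt from the admitted sources; the skill names, the `SKILL_INPUTS` table, and the `build_prompt` helper are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the summary/transcript gating described above.
# Skill names and input choices are illustrative assumptions.
from typing import Optional

# Which textual inputs each (example) skill's gate admits.
SKILL_INPUTS = {
    "deep_context_understanding": ("summary", "transcript"),
    "spoiler_questions": ("summary",),
    "temporal_order_of_events": ("transcript",),
}

def build_prompt(skill: str, summary: Optional[str], transcript: Optional[str]) -> str:
    """Assemble the question-generation prompt from the sources the skill's gate admits."""
    sources = SKILL_INPUTS[skill]
    parts = [f"Generate question-answer pairs testing the skill: {skill}."]
    if "summary" in sources and summary:
        parts.append(f"Video summary:\n{summary}")
    if "transcript" in sources and transcript:
        parts.append(f"Video transcript:\n{transcript}")
    return "\n\n".join(parts)

# Example: a summary-only skill ignores the transcript even if it is available.
print(build_prompt("spoiler_questions", summary="...", transcript="..."))
```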
Overall performance. The overall performance of different models on InfiniBench is shown in Table 2 (j). Three findings can be observed: (1) All models perform relatively poorly compared to other benchmarks (e.g., the MovieChat benchmark), highlighting the unique challenges of our benchmark, such as longer duration. (2) Gemini 1.5 Flash achieves the best performance on both multiple-choice and open-ended questions, with 47.72 accuracy (0-100) and 2.70 GPT4-score (0-5). There is also a large performance gap between Gemini and the open-source models. (3) Among open-source models, LLaMA-VID achieves the best result, with 17.15 accuracy and 1.7 GPT4-score. One reason may be that LLaMA-VID is pre-trained with longer-duration QA pairs, which helps it handle longer sequences.
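For readers unfamiliar with the GPT4-score metric, the following is a minimal sketch of LLM-as-judge scoring of open-ended answers on the 0-5 scale, assuming the OpenAI Python client; the model name, prompt wording, and `gpt4_score` helper are assumptions for illustration, not the benchmark's exact evaluation protocol.

```python
# Minimal LLM-as-judge sketch for 0-5 scoring of open-ended answers.
# Assumes the OpenAI Python client; prompt and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt4_score(question: str, reference: str, prediction: str) -> float:
    prompt = (
        "Rate how well the predicted answer matches the reference answer "
        "on a scale from 0 (completely wrong) to 5 (perfect match). "
        "Reply with a single number only.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model, not necessarily the paper's
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # The judge is instructed to return only a number, so parse it directly.
    return float(resp.choices[0].message.content.strip())
```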
Performance on specific skills. Table 2 (a)-(i) shows the performance of SOTA long video understanding models on each skill. Performance varies significantly across skills, highlighting the unique challenges introduced by each one. Observations from the results: (1) Scene Transition is the most difficult MCQ question type, with Gemini achieving only 29.48% accuracy. A likely reason for the low performance is that these questions require global reasoning across the entire hour-long video rather than a single clip. (2) All models struggle with Movie Spoiler questions among the open-ended questions. The difficulty lies in the deeper understanding and reasoning needed to reach the correct answer. Since Movie Spoiler questions are meaningful for human-centric video understanding, current model capabilities need improvement. (3) All open-source models score below random choice on MCQs, except for the Local vision + context questions. This shows that the main challenge for existing models is long-sequence global reasoning.
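As a concrete reference for the random-choice comparison, here is a small sketch that computes MCQ accuracy and checks it against the 1/#options baseline; the four-option count and the helper name are assumed for illustration.

```python
# Compare MCQ accuracy against the random-choice baseline (1 / #options).
# The option count of 4 is an assumed example, not the benchmark's spec.
def below_random(predictions, answers, num_options=4):
    correct = sum(p == a for p, a in zip(predictions, answers))
    accuracy = correct / len(answers)
    return accuracy, accuracy < 1.0 / num_options

acc, is_below = below_random(["A", "C", "B"], ["A", "B", "B"], num_options=4)
print(f"accuracy={acc:.2%}, below random-choice baseline: {is_below}")
```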
Performance on four types of questions. As introduced in Section 3.1 of the main paper, questions for each skill in InfiniBench can be identified as one of four high-level types: Global visual, Global contextual, Global vision + text, and Local vision + context. The results for each question type are provided in Table 3. Among these SOTA models, only two, Gemini 1.5 Flash and LLaMA-VID, accept both video and video subtitles. The table clearly shows that LLaMA-VID outperforms the other two open-source models on questions requiring context understanding. The main reason for the poor performance of LWM and MovieChat is that these two models make predictions from video only, missing important text information. This highlights the importance of long video understanding models handling both modalities. Additionally, Global contextual questions are challenging for all models, as they require complex reasoning.
@misc{ataallah2024infinibenchcomprehensivebenchmarklarge,
title={InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding},
author={Kirolos Ataallah and Chenhui Gou and Eslam Abdelrahman and Khushbu Pahwa and Jian Ding and Mohamed Elhoseiny},
year={2024},
eprint={2406.19875},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.19875},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.