InfiniBench

A Benchmark for Large Multi-Modal Models
in Long-Form Movies & TV Shows

Rigorously evaluating the capabilities of multimodal models across 8 key skills with over 1,000 hours of video content

King Abdullah University of Science and Technology (KAUST) · Monash University · Rice University

Overview of the InfiniBench skill set, comprising eight skills. The right side shows the skill categories and question types, while the left side gives examples of both multiple-choice (MCQ) and open-ended questions.

Abstract

Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a major challenge for multi-modal models. Existing benchmarks often fall short in testing the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. We introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) over 1,000 hours of video content, with an average video length of 52.59 minutes; (2) the largest set of question-answer pairs for long video comprehension, totaling around 91K; (3) eight diverse skills that span both grounding-based (e.g., scene transitions, character actions) and reasoning-based (e.g., deep context, multi-event linking) understanding; and (4) rich annotation formats, including both multiple-choice and open-ended questions. We conduct an in-depth evaluation across both commercial (GPT-4o, Gemini 1.5 Flash) and open-source (Qwen2.5-VL, InternVL2.5) vision-language models. Results reveal that current models remain far from solving long video understanding: on grounding-based skills, the top open-source model (Qwen2.5-VL) and GPT-4o achieve only 39.4% and 48.1% accuracy, respectively. Interestingly, several models achieve non-trivial performance using only the movie or episode title, without watching the video, revealing a reliance on pre-trained world knowledge that partially compensates for the absence of visual or temporal understanding. These findings highlight critical gaps in current approaches and underscore the need for models that truly engage with long visual narratives.

🏆 Model Performance Results

Comprehensive evaluation results for more than 10 models across all 8 core skills, with detailed performance metrics, side-by-side comparisons, and downloadable datasets.
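As a rough illustration of how the multiple-choice metrics are computed, the sketch below scores MCQ predictions against ground truth. The file layout and field names (`question_id`, `choice_idx`, `answer_idx`) are illustrative assumptions, not the official InfiniBench schema.

```python
import json

def mcq_accuracy(pred_path: str, gt_path: str) -> float:
    """Fraction of multiple-choice questions answered correctly.

    Assumes one JSON list per file; the field names below are
    illustrative placeholders, not the official InfiniBench schema.
    """
    with open(pred_path) as f:
        preds = {p["question_id"]: p["choice_idx"] for p in json.load(f)}
    with open(gt_path) as f:
        gold = {g["question_id"]: g["answer_idx"] for g in json.load(f)}
    # A prediction counts as correct only if it matches the gold option index.
    correct = sum(preds.get(qid) == ans for qid, ans in gold.items())
    return correct / len(gold)

print(f"MCQ accuracy: {mcq_accuracy('preds.json', 'gt.json'):.1%}")
```

Open-ended questions would need a different scorer (for example, a judge model), since exact matching does not apply to free-form answers.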

InfiniBench vs. Existing Video Understanding Benchmarks

Comprehensive comparison of video question-answering datasets and benchmarks

| Category | Benchmark | Questions | Videos | Avg Duration (min) | Total Duration (h) |
|---|---|---|---|---|---|
| Short | TGIF-QA | 8.5K | 9,575 | 0.05 | 7.98 |
| Short | MSRVTT-QA | 72.8K | 2,990 | 0.25 | 12.45 |
| Short | MV-Bench | 4.0K | 3,641 | 0.27 | 16.38 |
| Long | Activity-QA | 8.0K | 800 | 1.85 | 24.67 |
| Long | TVQA | 15.2K | 2,179 | 1.86 | 67.55 |
| Long | EgoSchema | 5.0K | 5,063 | 3.00 | 253.15 |
| Long | LongVideoBench | 6.7K | 3,763 | 7.88 | 494.21 |
| Long | MovieChat | 13.0K | 1,000 | 9.40 | 156.67 |
| Long | MLVU | 3.1K | 1,730 | 15.50 | 446.92 |
| Long | MoVQA | 21.9K | 100 | 16.53 | 27.55 |
| Long | Video-MME | 2.7K | 900 | 16.97 | 254.55 |
| Very Long | LVBench | 1.6K | 103 | 68.35 | 117.33 |
| Very Long | ★ InfiniBench (Ours) | 91K | 1,219 | 52.59 | 1,068.45 |

The original table also marks each benchmark's question types (MCQ/Open), QA sources (Video/Transcript/Summary), and annotation methods (Auto/Human); see the legend below.
InfiniBench at a glance: 91K total questions, 1,068 total video hours, a 52.6-minute average video duration, and 100% feature coverage (both question types, all three QA sources, and both annotation methods).

Legend & Key Features

📊 MCQ: Multiple Choice Questions
💬 Open: Open-ended Questions
🎥 Video: Video-based QA Source
📝 Transcript: Transcript-based QA
📋 Summary: Summary-based QA
🤖 Auto: Automated Annotations
👥 Human: Human Annotations

Comparison between InfiniBench and existing video understanding benchmarks. InfiniBench has the largest number of QA pairs and the longest total video duration.
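These headline numbers are internally consistent; multiplying the video count by the average duration recovers the reported total:

```python
videos = 1_219        # InfiniBench video count (from the table above)
avg_minutes = 52.59   # average video length in minutes

total_hours = videos * avg_minutes / 60
print(f"{total_hours:.2f} hours")  # 1068.45 hours, matching the table
```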

Benchmark Statistics

Comprehensive analysis across skills and video types, showcasing the breadth and depth of InfiniBench

InfiniBench Skills Statistics

(A) Number of questions per skill. (B) Number of videos per skill. (C) Average video duration per skill.

TV Shows vs. Movies

TV Shows: episodic content with recurring characters and settings.

Movies: self-contained narratives with complete story arcs.

Annotation Pipeline


1 Data Collection

Diverse sources of movies and TV shows carefully selected for quality and representation

2 Question Generation

Rigorous process to create challenging questions that test all 8 core skills

3 Human Verification

Multi-stage quality control to ensure accuracy and relevance of all benchmark questions
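Conceptually, the three stages compose into a simple generate-then-filter pipeline. The sketch below is only a schematic of that flow under assumed names: every function body is a placeholder, and none of the identifiers correspond to InfiniBench's actual tooling.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    video_id: str
    skill: str      # one of the 8 core skills
    question: str
    answer: str

def collect_videos(sources):
    """Stage 1: gather video ids from curated movie/TV sources (placeholder)."""
    return [f"{src}_001" for src in sources]

def generate_questions(video_id, skill):
    """Stage 2: draft candidate QA pairs targeting one skill (placeholder)."""
    return [QAPair(video_id, skill, f"<{skill} question>", "<answer>")]

def human_verify(qa):
    """Stage 3: human quality control; keep only accurate, relevant pairs."""
    return True  # placeholder: the real check is manual and multi-stage

SKILLS = ["global_appearance", "scene_transitions", "character_actions",
          "chronological_understanding", "summarization",
          "deep_context_understanding", "spoiler_understanding",
          "linking_events"]

# Only QA pairs that survive human verification enter the benchmark.
benchmark = [qa
             for vid in collect_videos(["movies", "tv_shows"])
             for skill in SKILLS
             for qa in generate_questions(vid, skill)
             if human_verify(qa)]
print(len(benchmark), "verified QA pairs")
```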

Skills Examples

Representative examples of the 8 key skills evaluated in InfiniBench.

Global Appearance

Ability to identify and track visual elements across the entire video duration

Scene Transitions

Understanding how scenes change and transition throughout the video narrative

Character Actions

Recognition and interpretation of character behaviors and actions over time

Chronological Understanding

Comprehension of temporal sequences and time-based relationships in narratives

Summarization

Ability to create concise and accurate summaries of complex video content

Deep Context Understanding

Advanced comprehension of implicit meanings and contextual relationships

Spoiler Understanding

Recognition of plot-revealing information and story elements that affect narrative

Linking Events

Ability to connect and relate different events within the video timeline

Grounding vs. Reasoning Skills

Grounding-Based Skills

  • Global Appearance
  • Scene Transitions
  • Character Actions
  • Chronological Understanding

Reasoning-Based Skills

  • Summarization
  • Deep Context Understanding
  • Spoiler Understanding
  • Linking Events
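When aggregating scores per category, the grouping above maps naturally onto a small lookup table. The snake_case skill identifiers below are illustrative names for the eight skills listed above, not keys from the released dataset.

```python
# Skill-to-category mapping, encoding the two lists above.
SKILL_CATEGORY = {
    "global_appearance": "grounding",
    "scene_transitions": "grounding",
    "character_actions": "grounding",
    "chronological_understanding": "grounding",
    "summarization": "reasoning",
    "deep_context_understanding": "reasoning",
    "spoiler_understanding": "reasoning",
    "linking_events": "reasoning",
}

def category_mean(per_skill_scores: dict[str, float], category: str) -> float:
    """Average the per-skill scores that belong to one category."""
    vals = [score for skill, score in per_skill_scores.items()
            if SKILL_CATEGORY[skill] == category]
    return sum(vals) / len(vals)

# Example: mean over the four grounding-based skills (scores are made up).
print(category_mean({"global_appearance": 41.0, "scene_transitions": 35.2,
                     "character_actions": 44.8,
                     "chronological_understanding": 36.6}, "grounding"))
```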