InfiniBench

A Benchmark for Large Multi-Modal Models
in Long-Form Movies & TV Shows

Rigorously evaluating the capabilities of multimodal models across 8 key skills with over 1,000 hours of video content

King Abdullah University of Science and Technology (KAUST) · Monash University · Rice University

Overview of the InfiniBench skill set, comprising eight skills. The right side shows the skill categories and question types, while the left side provides examples of both multiple-choice (MCQ) and open-ended questions.

🏆 Live Challenge Running! 🚀

Join the InfiniBench Challenge and test your models against our comprehensive benchmark!

📅 Duration: August 12, 2025 to October 10, 2025

🎯 Platform: Hosted on CodaBench

🚀 Participate Now: a limited-time opportunity to showcase your model's capabilities!

Abstract

Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a significant challenge for multi-modal models. Existing benchmarks often fail to test the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. We therefore introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) over 1,000 hours of video content, with an average video length of 53 minutes; (2) the largest set of question-answer pairs for long video comprehension, totaling around 91K; (3) eight diverse skills that span both grounding-based abilities (e.g., scene transitions, character actions) and reasoning-based abilities (e.g., deep context understanding, multi-event linking); and (4) rich annotation formats, including both multiple-choice and open-ended questions. We conducted an in-depth evaluation across both commercial models (GPT-4o, Gemini 2.0 Flash) and recent open-source vision-language models (e.g., Qwen2.5-VL, InternVL3.0). The results reveal: (1) Models struggle across the board: even the best model, GPT-4o, achieves only 47.1% on grounding-based skills, with most models performing near or just above random chance. (2) Strong reliance on world knowledge: models achieve surprisingly high scores using only metadata (e.g., video titles), highlighting a tendency to rely on pre-trained knowledge rather than actual visual or temporal understanding. (3) Multi-modal input matters: when provided with full video and subtitle context, models show substantial improvements, confirming the critical role of multimodal input in video understanding. Our findings underscore the inherent challenges of long-video comprehension and point to the need for substantial advances in both the grounding and reasoning capabilities of MLLMs.
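To make the scoring protocol concrete, below is a minimal sketch of evaluating a model on InfiniBench-style multiple-choice items. The JSONL layout (`question`, `options`, `answer_idx`) and the `model.generate()` interface are hypothetical placeholders; the benchmark's actual schema and harness may differ.

```python
import json
import re

def evaluate_mcq(model, qa_path):
    """Score a model on multiple-choice questions.

    Assumes one JSON object per line with hypothetical fields
    'question', 'options' (list of strings), and 'answer_idx'
    (index of the correct option).
    """
    correct = total = 0
    letters = "ABCD"  # assumes at most four options per question
    with open(qa_path) as f:
        for line in f:
            item = json.loads(line)
            # Present the options as lettered choices: A) ..., B) ...
            prompt = item["question"] + "\n" + "\n".join(
                f"{letters[i]}) {opt}" for i, opt in enumerate(item["options"])
            )
            reply = model.generate(prompt)  # hypothetical model interface
            # Take the first standalone letter in the reply as the answer.
            match = re.search(r"\b([A-D])\b", reply)
            if match and letters.index(match.group(1)) == item["answer_idx"]:
                correct += 1
            total += 1
    return correct / total if total else 0.0
```

Open-ended answers would instead need a judge model or a reference-based metric, as is common for benchmarks of this kind.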

🏆 Model Performance Results

Explore comprehensive evaluation results across all 8 skills with detailed performance metrics, interactive comparisons, and downloadable datasets.

10+ Evaluated Models
8 Core Skills

Interactive Comparisons: side-by-side model performance analysis with dynamic filtering

InfiniBench vs. Existing Video Understanding Benchmarks

Comprehensive comparison of video question-answering datasets and benchmarks

Category  | Benchmark            | Questions | Videos | Avg Duration (min) | Total Duration (hours)
----------|----------------------|-----------|--------|--------------------|-----------------------
Short     | TGIF-QA              | 8.5K      | 9,575  | 0.05               | 7.98
Short     | MSRVTT-QA            | 72.8K     | 2,990  | 0.25               | 12.45
Short     | MV-Bench             | 4.0K      | 3,641  | 0.27               | 16.38
Long      | Activity-QA          | 8.0K      | 800    | 1.85               | 24.67
Long      | TVQA                 | 15.2K     | 2,179  | 1.86               | 67.55
Long      | Egoschema            | 5.0K      | 5,063  | 3.00               | 253.15
Long      | LongVideoBench       | 6.7K      | 3,763  | 7.88               | 494.21
Long      | Moviechat            | 13.0K     | 1,000  | 9.40               | 156.67
Long      | MLVU                 | 3.1K      | 1,730  | 15.50              | 446.92
Long      | MoVQA                | 21.9K     | 100    | 16.53              | 27.55
Long      | Video-MME            | 2.7K      | 900    | 16.97              | 254.55
Very Long | LVBench              | 1.6K      | 103    | 68.35              | 117.33
Very Long | ★ InfiniBench (Ours) | 91K       | 1,219  | 52.59              | 1,068.45

Each benchmark is additionally compared on question type (MCQ, Open), QA source (Video, Transcript, Summary), and annotation method (Auto, Human); see the legend below.
91K Total Questions in InfiniBench
1,068 Total Video Hours
52.6 min Average Video Duration
100% Feature Coverage

Legend & Key Features

📊 MCQ: Multiple Choice Questions
💬 Open: Open-ended Questions
🎥 Video: Video-based QA Source
📝 Transcript: Transcript-based QA
📋 Summary: Summary-based QA
🤖 Auto: Automated Annotations
👥 Human: Human Annotations

Comparison between InfiniBench and existing video understanding benchmarks. InfiniBench has the largest number of QA pairs and the longest total video duration.

Benchmark Statistics

Comprehensive analysis across skills and video types, showcasing the breadth and depth of InfiniBench

InfiniBench Skills Statistics

(A) Number of questions per skill. (B) Number of videos per skill. (C) Average video duration per skill.

TV Shows vs. Movies

TV Shows: episodic content with recurring characters and settings.

Movies: self-contained narratives with complete story arcs.

Annotation Pipeline

1 Data Collection

Diverse sources of movies and TV shows carefully selected for quality and representation

2 Question Generation

Rigorous process to create challenging questions that test all 8 core skills

3 Human Verification

Multi-stage quality control to ensure accuracy and relevance of all benchmark questions
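As a rough illustration of how these three stages compose, here is a small sketch; the callables `generate_questions` and `verify` are stand-ins for whatever generator (e.g., an LLM over transcripts) and human-review step are used, not the authors' actual tooling.

```python
def build_benchmark(videos, generate_questions, verify):
    """Illustrative three-stage pipeline: collect -> generate -> verify.

    videos: iterable of collected video records (stage 1).
    generate_questions: callable producing candidate QA pairs (stage 2).
    verify: human-review callback that accepts or rejects a pair (stage 3).
    """
    dataset = []
    for video in videos:
        candidates = generate_questions(video)
        # Only human-verified questions make it into the benchmark.
        dataset.extend(q for q in candidates if verify(q))
    return dataset
```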

Skills Examples

Explore the 8 key skills evaluated in InfiniBench through interactive examples

Global Appearance

Ability to identify and track visual elements across the entire video duration

Scene Transitions

Understanding how scenes change and transition throughout the video narrative

Character Actions

Recognition and interpretation of character behaviors and actions over time

Chronological Understanding

Comprehension of temporal sequences and time-based relationships in narratives

Summarization

Ability to create concise and accurate summaries of complex video content

Deep Context Understanding

Advanced comprehension of implicit meanings and contextual relationships

Spoiler Understanding

Recognition of plot-revealing information and story elements that affect narrative

Linking Events

Ability to connect and relate different events within the video timeline

Grounding vs. Reasoning Skills

Grounding-Based Skills

  • Global Appearance
  • Scene Transitions
  • Character Actions
  • Chronological Understanding

Reasoning-Based Skills

  • Summarization
  • Deep Context Understanding
  • Spoiler Understanding
  • Linking Events
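Per-category scores can then be reported by averaging per-skill accuracies within each group. Below is a minimal sketch, assuming an unweighted mean over a dict of per-skill scores; the benchmark's own aggregation may weight skills differently.

```python
GROUNDING = ("Global Appearance", "Scene Transitions",
             "Character Actions", "Chronological Understanding")
REASONING = ("Summarization", "Deep Context Understanding",
             "Spoiler Understanding", "Linking Events")

def category_scores(per_skill):
    """Average per-skill accuracies into grounding and reasoning scores.

    per_skill: dict mapping skill name -> accuracy in [0, 1].
    Uses an unweighted mean, which is an assumption here.
    """
    def avg(names):
        return sum(per_skill[s] for s in names) / len(names)
    return {"grounding": avg(GROUNDING), "reasoning": avg(REASONING)}
```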