InfiniBench

A Benchmark for Large Multi-Modal Models
in Long-Form Movies & TV Shows

Rigorously evaluating the capabilities of multimodal models across 8 key skills with over 1,000 hours of video content

King Abdullah University of Science and Technology (KAUST) · Monash University · Rice University

Overview of the InfiniBench skill set, comprising eight skills. The right side shows the skill categories and question types, while the left side provides examples of both multiple-choice (MCQ) and open-ended questions.

🏆 Live Challenge Running! 🚀

Join the InfiniBench Challenge and test your models against our comprehensive benchmark!

📅 Duration: August 12, 2025 to October 10, 2025

🎯 Platform: Hosted on CodaBench

🚀 Participate Now: a limited-time opportunity to showcase your model's capabilities!

Abstract

Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a significant challenge for multi-modal models. Existing benchmarks often fail to test the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. We therefore introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) over 1,000 hours of video content, with an average video length of 53 minutes; (2) the largest set of question-answer pairs for long video comprehension, totaling around 91K; (3) eight diverse skills that span both grounding-based abilities (e.g., scene transitions, character actions) and reasoning-based abilities (e.g., deep context understanding, multi-event linking); and (4) rich annotation formats, including both multiple-choice and open-ended questions. We conducted an in-depth evaluation across both commercial models (GPT-4o, Gemini 2.0 Flash) and recent open-source vision-language models (e.g., Qwen2.5-VL, InternVL3.0). The results reveal: (1) Models struggle across the board: even the best model, GPT-4o, achieves only 47.1% on grounding-based skills, with most models performing near or just above random chance. (2) Strong reliance on world knowledge: models achieve surprisingly high scores using only metadata (e.g., video titles), highlighting a tendency to rely on pre-trained knowledge rather than actual visual or temporal understanding. (3) Multi-modal input matters: when provided with full video and subtitle context, models show substantial improvements, confirming the critical role of multimodal input in video understanding. Our findings underscore the inherent challenges of long-video comprehension and point to the need for substantial advances in both the grounding and reasoning capabilities of MLLMs.
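To make the scoring protocol concrete, below is a minimal sketch of evaluating a model on InfiniBench-style multiple-choice items. The JSONL layout (`question`, `options`, `answer_idx`) and the `model.generate()` interface are hypothetical placeholders; the benchmark's actual schema and harness may differ.

```python
import json
import re

def evaluate_mcq(model, qa_path):
    """Score a model on multiple-choice questions.

    Assumes one JSON object per line with hypothetical fields
    'question', 'options' (list of strings), and 'answer_idx'
    (index of the correct option).
    """
    correct = total = 0
    letters = "ABCD"  # assumes at most four options per question
    with open(qa_path) as f:
        for line in f:
            item = json.loads(line)
            # Present the options as lettered choices: A) ..., B) ...
            prompt = item["question"] + "\n" + "\n".join(
                f"{letters[i]}) {opt}" for i, opt in enumerate(item["options"])
            )
            reply = model.generate(prompt)  # hypothetical model interface
            # Take the first standalone letter in the reply as the answer.
            match = re.search(r"\b([A-D])\b", reply)
            if match and letters.index(match.group(1)) == item["answer_idx"]:
                correct += 1
            total += 1
    return correct / total if total else 0.0
```

Open-ended answers would instead need a judge model or a reference-based metric, as is common for benchmarks of this kind.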

🏆 Model Performance Results

Explore comprehensive evaluation results across all 8 skills with detailed performance metrics, interactive comparisons, and downloadable datasets.

10+ Evaluated Models
8 Core Skills

Interactive Comparisons: side-by-side model performance analysis with dynamic filtering

InfiniBench vs. Existing Video Understanding Benchmarks

Comprehensive comparison of video question-answering datasets and benchmarks

Category  | Benchmark            | Questions | Videos | Avg Duration (min) | Total Duration (hours)
----------|----------------------|-----------|--------|--------------------|-----------------------
Short     | TGIF-QA              | 8.5K      | 9,575  | 0.05               | 7.98
Short     | MSRVTT-QA            | 72.8K     | 2,990  | 0.25               | 12.45
Short     | MV-Bench             | 4.0K      | 3,641  | 0.27               | 16.38
Long      | Activity-QA          | 8.0K      | 800    | 1.85               | 24.67
Long      | TVQA                 | 15.2K     | 2,179  | 1.86               | 67.55
Long      | Egoschema            | 5.0K      | 5,063  | 3.00               | 253.15
Long      | LongVideoBench       | 6.7K      | 3,763  | 7.88               | 494.21
Long      | Moviechat            | 13.0K     | 1,000  | 9.40               | 156.67
Long      | MLVU                 | 3.1K      | 1,730  | 15.50              | 446.92
Long      | MoVQA                | 21.9K     | 100    | 16.53              | 27.55
Long      | Video-MME            | 2.7K      | 900    | 16.97              | 254.55
Very Long | LVBench              | 1.6K      | 103    | 68.35              | 117.33
Very Long | ★ InfiniBench (Ours) | 91K       | 1,219  | 52.59              | 1,068.45

Each benchmark is additionally compared on question type (MCQ, Open), QA source (Video, Transcript, Summary), and annotation method (Auto, Human); see the legend below.
91K Total Questions in InfiniBench
1,068 Total Video Hours
52.6 min Average Video Duration
100% Feature Coverage

Legend & Key Features

📊 MCQ: Multiple Choice Questions
💬 Open: Open-ended Questions
🎥 Video: Video-based QA Source
📝 Transcript: Transcript-based QA
📋 Summary: Summary-based QA
🤖 Auto: Automated Annotations
👥 Human: Human Annotations

Comparison between InfiniBench and existing video understanding benchmarks. InfiniBench has the largest number of QA pairs and the longest total video duration.

Benchmark Statistics

Comprehensive analysis across skills and video types, showcasing the breadth and depth of InfiniBench

InfiniBench Skills Statistics

(A) Number of questions per skill. (B) Number of videos per skill. (C) Average video duration per skill.

TV Shows vs. Movies

TV Shows: episodic content with recurring characters and settings.

Movies: self-contained narratives with complete story arcs.

Annotation Pipeline

1 Data Collection

Diverse sources of movies and TV shows carefully selected for quality and representation

2 Question Generation

Rigorous process to create challenging questions that test all 8 core skills

3 Human Verification

Multi-stage quality control to ensure accuracy and relevance of all benchmark questions
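As a rough illustration of how these three stages compose, here is a small sketch; the callables `generate_questions` and `verify` are stand-ins for whatever generator (e.g., an LLM over transcripts) and human-review step are used, not the authors' actual tooling.

```python
def build_benchmark(videos, generate_questions, verify):
    """Illustrative three-stage pipeline: collect -> generate -> verify.

    videos: iterable of collected video records (stage 1).
    generate_questions: callable producing candidate QA pairs (stage 2).
    verify: human-review callback that accepts or rejects a pair (stage 3).
    """
    dataset = []
    for video in videos:
        candidates = generate_questions(video)
        # Only human-verified questions make it into the benchmark.
        dataset.extend(q for q in candidates if verify(q))
    return dataset
```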

Skills Examples

Explore the 8 key skills evaluated in InfiniBench through interactive examples

Global Appearance

Ability to identify and track visual elements across the entire video duration

Scene Transitions

Understanding how scenes change and transition throughout the video narrative

Character Actions

Recognition and interpretation of character behaviors and actions over time

Chronological Understanding

Comprehension of temporal sequences and time-based relationships in narratives

Summarization

Ability to create concise and accurate summaries of complex video content

Deep Context Understanding

Advanced comprehension of implicit meanings and contextual relationships

Spoiler Understanding

Recognition of plot-revealing information and story elements that affect narrative

Linking Events

Ability to connect and relate different events within the video timeline

Grounding vs. Reasoning Skills

Grounding-Based Skills

  • Global Appearance
  • Scene Transitions
  • Character Actions
  • Chronological Understanding

Reasoning-Based Skills

  • Summarization
  • Deep Context Understanding
  • Spoiler Understanding
  • Linking Events
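Per-category scores can then be reported by averaging per-skill accuracies within each group. Below is a minimal sketch, assuming an unweighted mean over a dict of per-skill scores; the benchmark's own aggregation may weight skills differently.

```python
GROUNDING = ("Global Appearance", "Scene Transitions",
             "Character Actions", "Chronological Understanding")
REASONING = ("Summarization", "Deep Context Understanding",
             "Spoiler Understanding", "Linking Events")

def category_scores(per_skill):
    """Average per-skill accuracies into grounding and reasoning scores.

    per_skill: dict mapping skill name -> accuracy in [0, 1].
    Uses an unweighted mean, which is an assumption here.
    """
    def avg(names):
        return sum(per_skill[s] for s in names) / len(names)
    return {"grounding": avg(GROUNDING), "reasoning": avg(REASONING)}
```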