Rigorously evaluating the capabilities of multimodal models across 8 key skills with over 1,000 hours of video content
The InfiniBench skill set, comprising eight skills. The right side shows the skill categories and question types, while the left side provides examples of both multiple-choice (MCQ) and open-ended questions.
Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a significant challenge for multimodal models. Existing benchmarks often fail to test the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. We therefore introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate models' capabilities in long-video understanding. InfiniBench offers: (1) over 1,000 hours of video content, with an average video length of 53 minutes; (2) the largest set of question-answer pairs for long-video comprehension, totaling around 91K; (3) eight diverse skills spanning both grounding-based abilities (e.g., scene transitions, character actions) and reasoning-based abilities (e.g., deep context understanding, multi-event linking); and (4) rich annotation formats, including both multiple-choice and open-ended questions. We conducted an in-depth evaluation of both commercial models (GPT-4o, Gemini 2.0 Flash) and recent open-source vision-language models (Qwen2.5-VL, InternVL3.0). Results reveal that: (1) Models struggle across the board: even the best model, GPT-4o, achieves only 47.1% on grounding-based skills, with most models performing near or just above random chance. (2) Strong reliance on world knowledge: models achieve surprisingly high scores from metadata alone (e.g., video titles), highlighting a tendency to rely on pre-trained knowledge rather than actual visual or temporal understanding. (3) Multimodal input matters: when provided with full video and subtitle context, models improve substantially, confirming the critical role of multimodal input in video understanding. Our findings underscore the inherent challenges of long-video comprehension and point to the need for substantial advances in both the grounding and reasoning capabilities of MLLMs.
Explore comprehensive evaluation results across all 8 skills with detailed performance metrics, interactive comparisons, and downloadable datasets.
Side-by-side model performance analysis with dynamic filtering
Comprehensive comparison of video question-answering datasets and benchmarks
Question Type: MCQ / Open · QA Source: Video / Transcript / Summary · Annotations: Auto / Human

| Category | Benchmark | Questions | Videos | Avg Duration (min) | Total Duration (hrs) | MCQ | Open | Video | Transcript | Summary | Auto | Human |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Short | TGIF-QA | 8.5K | 9,575 | 0.05 | 7.98 | | | | | | | |
| Short | MSRVTT-QA | 72.8K | 2,990 | 0.25 | 12.45 | | | | | | | |
| Short | MVBench | 4.0K | 3,641 | 0.27 | 16.38 | ✗ | ✗ | | | | | |
| Long | ActivityNet-QA | 8.0K | 800 | 1.85 | 24.67 | | | | | | | |
| Long | TVQA | 15.2K | 2,179 | 1.86 | 67.55 | | | | | | | |
| Long | EgoSchema | 5.0K | 5,063 | 3.00 | 253.15 | ✓ | ✗ | ✗ | | | | |
| Long | LongVideoBench | 6.7K | 3,763 | 7.88 | 494.21 | ✗ | ✗ | | | | | |
| Long | MovieChat | 13.0K | 1,000 | 9.40 | 156.67 | | | | | | | |
| Long | MLVU | 3.1K | 1,730 | 15.50 | 446.92 | | | | | | | |
| Long | MoVQA | 21.9K | 100 | 16.53 | 27.55 | ✓ | ✗ | ✗ | ✗ | | | |
| Long | Video-MME | 2.7K | 900 | 16.97 | 254.55 | ✓ | ✗ | | | | | |
| Very Long | LVBench | 1.6K | 103 | 68.35 | 117.33 | ✗ | ✗ | | | | | |
| Very Long | ★ InfiniBench (Ours) | 91K | 1,219 | 52.59 | 1,068.45 | | | | | | | |
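The headline numbers in the last row are internally consistent: multiplying the video count by the average duration and converting minutes to hours reproduces the stated total. A minimal sketch of that arithmetic (values taken from the table above):

```python
# Sanity check on the InfiniBench row of the comparison table:
# total duration (hours) = videos * average duration (minutes) / 60
videos = 1219        # number of videos in InfiniBench
avg_minutes = 52.59  # average video length in minutes

total_hours = videos * avg_minutes / 60
print(round(total_hours, 2))  # 1068.45, matching the table
```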
Comprehensive analysis across skills and video types, showcasing the breadth and depth of InfiniBench
Number of questions per skill
Number of videos per skill
Average video duration per skill
Episodic content with recurring characters and settings
Self-contained narratives with complete story arcs
Diverse sources of movies and TV shows carefully selected for quality and representation
Rigorous process to create challenging questions that test all 8 core skills
Multi-stage quality control to ensure accuracy and relevance of all benchmark questions
Explore the 8 key skills evaluated in InfiniBench through interactive examples
Ability to identify and track visual elements across the entire video duration
Understanding how scenes change and transition throughout the video narrative
Recognition and interpretation of character behaviors and actions over time
Comprehension of temporal sequences and time-based relationships in narratives
Ability to create concise and accurate summaries of complex video content
Advanced comprehension of implicit meanings and contextual relationships
Recognition of plot-revealing information and story elements that affect the narrative
Ability to connect and relate different events within the video timeline