Rigorously evaluating the capabilities of multimodal models across 8 key skills with over 1,000 hours of video content
InfiniBench skill set comprising eight skills. The right side represents skill categories and question types, while the left side provides examples of both multiple-choice (MCQ) and open-ended questions.
Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a major challenge for multimodal models. Existing benchmarks often fall short in testing the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. We introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) over 1,000 hours of video content, with an average video length of 52.59 minutes; (2) the largest set of question-answer pairs for long video comprehension, totaling around 91K; (3) eight diverse skills that span both grounding-based (e.g., scene transitions, character actions) and reasoning-based (e.g., deep context, multi-event linking) understanding; and (4) rich annotation formats, including both multiple-choice and open-ended questions. We conduct an in-depth evaluation of both commercial (GPT-4o, Gemini 1.5 Flash) and open-source (Qwen2.5-VL, InternVL2.5) vision-language models. Results reveal that current models remain far from solving long video understanding: on grounding-based skills, the top open-source model (Qwen2.5-VL) and GPT-4o achieve only 39.4% and 48.1% accuracy, respectively. Interestingly, several models achieve non-trivial performance using only the movie or episode title, without watching the video, revealing a reliance on pre-trained world knowledge that partially compensates for the absence of visual or temporal understanding. These findings highlight critical gaps in current approaches and underscore the need for models that truly engage with long visual narratives.
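As a rough illustration of how per-skill multiple-choice accuracy of the kind quoted above could be tallied, here is a minimal sketch. The record layout (`skill`, `prediction`, `answer`) and the toy records are assumptions made for this example; they are not InfiniBench's actual data format or evaluation code.

```python
from collections import defaultdict


def accuracy_by_skill(records):
    """Return {skill: fraction of records whose prediction equals the answer}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["skill"]] += 1
        correct[record["skill"]] += int(record["prediction"] == record["answer"])
    return {skill: correct[skill] / total[skill] for skill in total}


# Toy records, invented for illustration only (not benchmark data).
records = [
    {"skill": "scene transitions", "prediction": "B", "answer": "B"},
    {"skill": "scene transitions", "prediction": "A", "answer": "C"},
    {"skill": "character actions", "prediction": "D", "answer": "D"},
]

per_skill = accuracy_by_skill(records)
for skill, acc in per_skill.items():
    print(f"{skill}: {acc:.1%}")
```

Exact string matching only makes sense for multiple-choice answers; open-ended answers would need a different scoring scheme, which this sketch does not attempt.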
Explore comprehensive evaluation results across all 8 skills with detailed performance metrics, interactive comparisons, and downloadable datasets.
Side-by-side model performance analysis with dynamic filtering
Comprehensive comparison of video question-answering datasets and benchmarks
Category | Benchmark | Questions | Videos | Avg Duration (minutes) | Total Duration (hours)
---|---|---|---|---|---
Short | TGIF-QA | 8.5K | 9,575 | 0.05 | 7.98
Short | MSRVTT-QA | 72.8K | 2,990 | 0.25 | 12.45
Short | MV-Bench | 4.0K | 3,641 | 0.27 | 16.38
Long | Activity-QA | 8.0K | 800 | 1.85 | 24.67
Long | TVQA | 15.2K | 2,179 | 1.86 | 67.55
Long | Egoschema | 5.0K | 5,063 | 3.00 | 253.15
Long | LongVideoBench | 6.7K | 3,763 | 7.88 | 494.21
Long | Moviechat | 13.0K | 1,000 | 9.40 | 156.67
Long | MLVU | 3.1K | 1,730 | 15.50 | 446.92
Long | MoVQA | 21.9K | 100 | 16.53 | 27.55
Long | Video-MME | 2.7K | 900 | 16.97 | 254.55
Very Long | LVBench | 1.6K | 103 | 68.35 | 117.33
Very Long | ★ InfiniBench (Ours) | 91K | 1,219 | 52.59 | 1,068.45

Additional columns whose per-benchmark marks are not recoverable here: Question Type (MCQ, Open), QA Source (Video, Transcript, Summary), Annotations (Auto, Human).
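The Total Duration column appears to follow from the other two numeric columns: number of videos times average duration in minutes, divided by 60. The sketch below checks that relation on four rows copied from the table; the helper name `total_hours` is introduced here for illustration only.

```python
# (videos, average duration in minutes, reported total hours), copied from the table.
rows = {
    "TGIF-QA": (9575, 0.05, 7.98),
    "MoVQA": (100, 16.53, 27.55),
    "LVBench": (103, 68.35, 117.33),
    "InfiniBench": (1219, 52.59, 1068.45),
}


def total_hours(num_videos, avg_minutes):
    """Total footage in hours, assuming total = videos * average minutes / 60."""
    return num_videos * avg_minutes / 60.0


for name, (videos, avg_minutes, reported) in rows.items():
    computed = total_hours(videos, avg_minutes)
    print(f"{name}: computed {computed:.2f} h, reported {reported:.2f} h")
```

Because the averages in the table are rounded to two decimals, a recomputed total can differ from the reported one in the last digit.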
Comprehensive analysis across skills and video types, showcasing the breadth and depth of InfiniBench
Number of questions per skill
Number of videos per skill
Average video duration per skill
Episodic content with recurring characters and settings
Self-contained narratives with complete story arcs
Diverse sources of movies and TV shows carefully selected for quality and representation
Rigorous process to create challenging questions that test all 8 core skills
Multi-stage quality control to ensure accuracy and relevance of all benchmark questions
Explore the 8 key skills evaluated in InfiniBench through interactive examples
Ability to identify and track visual elements across the entire video duration
Understanding how scenes change and transition throughout the video narrative
Recognition and interpretation of character behaviors and actions over time
Comprehension of temporal sequences and time-based relationships in narratives
Ability to create concise and accurate summaries of complex video content
Advanced comprehension of implicit meanings and contextual relationships
Recognition of plot-revealing information and story elements that affect the narrative
Ability to connect and relate different events within the video timeline