# VANTAGE-BENCH
**Video ANalysis Tasks Across Generalized Environments**
## Dataset Description
VANTAGE-BENCH is the first public benchmark purpose-built for evaluating visual understanding of video captured by fixed infrastructure cameras. It spans three real-world domains: warehouse, smart city / Intelligent Transportation Systems (ITS), and smart spaces. Across these domains it covers six spatio-temporal video understanding tasks: video question answering (VQA), temporal localization, dense video captioning, event verification, spatial grounding, and spatio-temporal tracking.
This dataset is for evaluation purposes only.
## Dataset Owner(s)
NVIDIA Corporation
## Dataset Creation Date
April 24, 2026
## License/Terms of Use
Visit the NVIDIA Legal Release Process for instructions on obtaining legal support for license selection.
## Dataset Characterization
### Data Collection Method
Hybrid: Human, Synthetic, Automated. Video data is sourced from vendor-provided footage (GoPro captures of warehouse and smart space environments), synthetic generation (DriveSim collision and multi-camera scenarios), and publicly scraped sources (Dubuque highway/ITS footage).
### Labeling Method
Hybrid: Human, Synthetic, Pseudolabeled. Annotations for VQA, dense video captions, and temporal localization are primarily human-authored. Spatial grounding labels (2D/3D bounding boxes, referring expressions) use a combination of human annotation and pseudolabeling pipelines (detection + SAM for spatial pointing). Event verification labels are human-curated. Annotations are held server-side for evaluation only.
## Directory Structure
```
VANTAGE-BENCH/
├── vqa/                    # Video question answering
├── dense_captioning/       # Dense video captioning
├── temporal_localization/  # Temporal localization
├── event_verification/     # Event verification
├── 2dbbox/                 # 2D object localization
├── referring/              # 2D referring expressions
├── pointing/               # 2D spatial pointing
├── tracking/               # Spatio-temporal tracking
└── README.md               # Dataset documentation and submission instructions
```
## Evaluation
### Tasks and Submission Formats
| Category | Task | Expected Submission Format | Metric |
|---|---|---|---|
| Semantic | VQA | JSON (question-answer pairs) | Accuracy |
| Semantic | Event Verification | Binary labels per video/image (JSON) | F1 Score |
| Temporal | Dense Video Captioning | Timestamped captions (JSON) | SODA-c |
| Temporal | Temporal Localization | Temporal segments with event labels (JSON) | mAP@tIoU |
| Spatial | 2D Object Localization | KITTI format | mAP@IoU |
| Spatial | 2D Referring Expressions | Bounding box predictions (JSON) | Acc@IoU |
| Spatial | 2D Spatial Pointing | Point coordinates (JSON) | Pointing Accuracy |
| Spatial | Spatio-Temporal Tracking | MOT-compatible format | HOTA |
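As a minimal sketch, a JSON submission for the VQA task might be assembled as below. The field names (`question_id`, `answer`) and file name are illustrative assumptions; consult the dataset README and evaluation server for the exact schema.

```python
import json

# Hypothetical VQA submission: a list of predicted answers keyed by question id.
# Field names here are assumptions, not the confirmed VANTAGE-Bench schema.
predictions = [
    {"question_id": "vqa_0001", "answer": "A forklift enters the loading bay."},
    {"question_id": "vqa_0002", "answer": "Two pedestrians cross the intersection."},
]

# Write the predictions file that would be uploaded to the evaluation server.
with open("vqa_predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```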
### Metric Notes
- Accuracy: Percentage of correct predictions.
- SODA-c: Metric for dense video captioning quality across event coverage and language quality.
- mAP@tIoU: Mean Average Precision measured over temporal IoU thresholds.
- F1 Score: Harmonic mean of precision and recall.
- mAP@IoU: Mean Average Precision measured over spatial IoU thresholds.
- Acc@IoU: Correct grounding if predicted box overlaps target above IoU threshold.
- Pointing Accuracy: Percentage of correctly selected target regions.
- HOTA: Higher Order Tracking Accuracy, combining detection and association quality.
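The IoU-based metrics above share one building block: box overlap. The sketch below shows standard IoU for axis-aligned `(x1, y1, x2, y2)` boxes and an Acc@IoU computed from it; the helper names and the default 0.5 threshold are illustrative assumptions, not the benchmark's reference implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-Union for axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose box overlaps its target above the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```

mAP@IoU and mAP@tIoU apply the same overlap test (spatially or temporally) at several thresholds and average precision across them.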
### Evaluation Server
Predictions are submitted to the evaluation server hosted on HuggingFace. The server computes metrics against held-out annotations and updates the public leaderboard.
Evaluation server: VANTAGE-Bench
## Dataset Format
Video (MP4) and images (JPG).
## Dataset Quantification
| Category | Task | Videos/Images | Entries |
|---|---|---|---|
| Semantic | VQA | 296 | 1,257 |
| Semantic | Event Verification | 163 | 163 |
| Temporal | Dense Video Captioning | 104 | 717 |
| Temporal | Temporal Localization | 221 | 1,280 |
| Spatial | 2D Object Localization | 3 | 27,404 bounding boxes (628 frames) |
| Spatial | 2D Referring Expressions | 1,503 images | 3,276 expressions |
| Spatial | 2D Spatial Pointing | 1,005 | 5,018 images |
| Spatial | Spatio-Temporal Tracking | 200 clips (8 frames/clip) | 200 objects, 1,600 frames |
- Total unique videos: 312 (309 across VQA/DVC/Temporal + 3 exclusive to 2D BBox)
- Total entries (VQA + DVC + Temporal): 3,254
- Total data storage: 42 GB
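The stated totals can be cross-checked against the per-task rows above; the counts below are copied from this card, and the raw video counts overlap across tasks, which is why they exceed the 309 unique videos.

```python
# Entry counts per task, copied from the quantification table above.
entries = {"vqa": 1257, "dvc": 717, "temporal": 1280}

# Per-task video counts overlap (the same video can serve several tasks),
# so their raw sum (621) exceeds the 309 unique videos across these tasks.
videos = {"vqa": 296, "dvc": 104, "temporal": 221}

total_entries = sum(entries.values())
unique_videos = 309 + 3  # 309 shared-pool videos + 3 exclusive to 2D BBox
```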
## Potential Known Risks
- Ground truth annotations are not publicly released. All evaluation is performed server-side.
- Some warehouse videos are concatenated clips from longer recording sessions.
## Citations
```bibtex
@inproceedings{Fujita2020SODA,
  author    = {Soichiro Fujita and Tsutomu Hirao and Hidetaka Kamigaito and Manabu Okumura and Masaaki Nagata},
  title     = {{SODA}: Story Oriented Dense Video Captioning Evaluation Framework},
  booktitle = {Proc. ECCV},
  year      = {2020}
}

@inproceedings{Fu2024BLINK,
  author    = {Xingyu Fu and Yushi Hu and Bangzheng Li and Yu Feng and Haoyu Wang and Xudong Lin and Dan Roth and Noah A. Smith and Wei-Chiu Ma and Ranjay Krishna},
  title     = {{BLINK}: Multimodal Large Language Models Can See but Not Perceive},
  booktitle = {Proc. ECCV},
  year      = {2024}
}

@article{Sun2025RefDrone,
  author  = {Zhichao Sun and Yuda Zou and Xian Sun and Yingchao Feng and Wenhui Diao and Menglong Yan and Kun Fu},
  title   = {{RefDrone}: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes},
  journal = {arXiv preprint arXiv:2502.00392},
  year    = {2025}
}
```
TBD: Update with VANTAGE-BENCH paper citation when published.
## References
- HuggingFace dataset: nvidia/PhysicalAI-VANTAGE-Bench
## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this dataset in accordance with our terms of service, developers should work with their internal teams to ensure it meets the requirements of the relevant industry and use case and addresses unforeseen product misuse. Please report quality, risk, or security vulnerabilities, or NVIDIA AI concerns, here.
## Changelog
- 2026-04-14: Initial dataset release.
