
VANTAGE-BENCH

Video ANalysis Tasks Across Generalized Environments

Dataset Description

VANTAGE-BENCH is the first public benchmark purpose-built for evaluating visual understanding on video captured by fixed infrastructure cameras. It spans three real-world domains (warehouse, smart city / Intelligent Transportation Systems (ITS), and smart spaces) and six spatio-temporal video understanding tasks: video question answering (VQA), temporal grounding, dense video captioning, event verification, spatial grounding, and spatio-temporal tracking.

This dataset is for evaluation purposes only.

Dataset Owner(s)

NVIDIA Corporation

Dataset Creation Date

April 24, 2026

License/Terms of Use

Visit the NVIDIA Legal Release Process for instructions on getting legal support for a license selection.

Dataset Characterization

Data Collection Method
Hybrid: Human, Synthetic, Automated. Video data is sourced from vendor-provided footage (GoPro captures of warehouse and smart space environments), synthetic generation (DriveSim collision and multi-camera scenarios), and publicly scraped sources (Dubuque highway/ITS footage).

Labeling Method
Hybrid: Human, Synthetic, Pseudolabeled. Annotations for VQA, dense video captions, and temporal localization are primarily human-authored. Spatial grounding labels (2D/3D bounding boxes, referring expressions) use a combination of human annotation and pseudolabeling pipelines (detection + SAM for spatial pointing). Event verification labels are human-curated. Annotations are held server-side for evaluation only.

Directory Structure

VANTAGE-BENCH/
├── vqa/                     # Video question answering
├── dense_captioning/        # Dense video captioning
├── temporal_localization/   # Temporal localization
├── event_verification/      # Event verification
├── 2dbbox/                  # 2D object localization
├── referring/               # 2D referring expressions
├── pointing/                # 2D spatial pointing
├── tracking/                # Spatio-temporal tracking
└── README.md                # Dataset documentation and submission instructions

Evaluation

Tasks and Submission Formats

| Category | Task | Expected Submission Format | Metric |
| --- | --- | --- | --- |
| Semantic | VQA | JSON (question-answer pairs) | Accuracy |
| Semantic | Event Verification | Binary labels per video/image (JSON) | F1 Score |
| Temporal | Dense Video Captioning | Timestamped captions (JSON) | SODA-c |
| Temporal | Temporal Localization | Temporal segments with event labels (JSON) | mAP@tIoU |
| Spatial | 2D Object Localization | KITTI format | mAP@IoU |
| Spatial | 2D Referring Expressions | Bounding box predictions (JSON) | Acc@IoU |
| Spatial | 2D Spatial Pointing | Point coordinates (JSON) | Pointing Accuracy |
| Spatial | Spatio-Temporal Tracking | MOT-compatible format | HOTA |
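The card specifies only the container format for each task ("JSON (question-answer pairs)"), not the exact field names. As a minimal sketch, a VQA submission might look like the following; the keys `question_id` and `answer`, and the IDs shown, are assumptions for illustration, not the official schema (consult README.md in the dataset for the authoritative submission format).

```python
import json

# Hypothetical VQA submission file. The field names ("question_id", "answer")
# and example IDs are illustrative assumptions, not the official schema.
predictions = [
    {"question_id": "warehouse_0001_q3", "answer": "B"},
    {"question_id": "its_0042_q1", "answer": "The truck stops at the intersection."},
]

with open("vqa_predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```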

Metric Notes

  • Accuracy: Percentage of correct predictions.
  • SODA-c: Dense video captioning metric that jointly scores event coverage and caption quality.
  • mAP@tIoU: Mean Average Precision measured over temporal IoU thresholds.
  • F1 Score: Harmonic mean of precision and recall.
  • mAP@IoU: Mean Average Precision measured over spatial IoU thresholds.
  • Acc@IoU: A prediction counts as a correct grounding if its box overlaps the target above the IoU threshold.
  • Pointing Accuracy: Percentage of correctly selected target regions.
  • HOTA: Higher Order Tracking Accuracy, combining detection and association quality.
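The overlap measures behind several of these metrics can be sketched in a few lines. These are illustrative reimplementations of the standard definitions, not the official VANTAGE-Bench evaluation code:

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two [start, end] temporal segments (the basis of mAP@tIoU)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes (the basis of mAP@IoU and Acc@IoU)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy

    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (event verification)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

For example, Acc@0.5 for referring expressions counts a prediction as correct when `box_iou(pred, gt) >= 0.5`.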

Evaluation Server

Predictions are submitted to the evaluation server hosted on HuggingFace. The server computes metrics against held-out annotations and updates the public leaderboard.

Evaluation server: VANTAGE-Bench

Dataset Format

Video (MP4) and images (JPG).

Dataset Quantification

| Category | Task | Videos | Entries |
| --- | --- | --- | --- |
| Semantic | VQA | 296 | 1,257 |
| Semantic | Event Verification | 163 | 163 |
| Temporal | Dense Video Captioning | 104 | 717 |
| Temporal | Temporal Localization | 221 | 1,280 |
| Spatial | 2D Object Localization | 3 | 27,404 bounding boxes (628 frames) |
| Spatial | 2D Referring Expressions | 1,503 images | 3,276 expressions |
| Spatial | 2D Spatial Pointing | 1,005 | 5,018 images |
| Spatial | Spatio-Temporal Tracking | 200 clips (8 frames/clip) | 200 objects, 1,600 frames |

  • Total unique videos: 312 (309 across VQA/DVC/Temporal + 3 exclusive to 2D BBox)
  • Total entries (VQA + DVC + Temporal): 3,254
  • Total data storage: 42 GB

Potential Known Risks

  • Ground truth annotations are not publicly released. All evaluation is performed server-side.
  • Some warehouse videos are concatenated clips from longer recording sessions.

Citations

@inproceedings{Fujita2020SODA,
  author    = {Soichiro Fujita and Tsutomu Hirao and Hidetaka Kamigaito and Manabu Okumura and Masaaki Nagata},
  title     = {{SODA}: Story Oriented Dense Video Captioning Evaluation Framework},
  booktitle = {Proc. ECCV},
  year      = {2020}
}

@inproceedings{Fu2024BLINK,
  author    = {Xingyu Fu and Yushi Hu and Bangzheng Li and Yu Feng and Haoyu Wang and Xudong Lin and Dan Roth and Noah A. Smith and Wei-Chiu Ma and Ranjay Krishna},
  title     = {{BLINK}: Multimodal Large Language Models Can See but Not Perceive},
  booktitle = {Proc. ECCV},
  year      = {2024}
}

@article{Sun2025RefDrone,
  author    = {Zhichao Sun and Yuda Zou and Xian Sun and Yingchao Feng and Wenhui Diao and Menglong Yan and Kun Fu},
  title     = {{RefDrone}: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes},
  journal   = {arXiv preprint arXiv:2502.00392},
  year      = {2025}
}

TBD: Update with VANTAGE-BENCH paper citation when published.

References

Figure: VANTAGE-BENCH task overview across Semantic, Temporal, Spatial, and Spatio-Temporal understanding categories.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer teams to ensure this dataset meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Changelog

  • 2026-04-14: Initial dataset release.