# VANTAGE-BENCH
**Video ANalysis Tasks Across Generalized Environments**
## Dataset Description
VANTAGE-BENCH is the first public benchmark purpose-built for evaluating visual understanding of video captured by fixed infrastructure cameras. It spans three real-world domains: warehouse, smart city / Intelligent Transportation Systems (ITS), and smart spaces. Across these domains it covers six spatio-temporal video understanding tasks: video question answering (VQA), temporal localization, dense video captioning, event verification, spatial grounding, and spatio-temporal tracking.
This dataset is for evaluation purposes only.
## Dataset Owner(s)
NVIDIA Corporation
## Dataset Creation Date
April 24, 2026
## License/Terms of Use
Visit the NVIDIA Legal Release Process for instructions on obtaining legal support for license selection.
## Dataset Characterization
### Data Collection Method
Hybrid: Human, Synthetic, Automated. Video data is sourced from vendor-provided footage (GoPro captures of warehouse and smart space environments), synthetic generation (DriveSim collision and multi-camera scenarios), and publicly scraped sources (Dubuque highway/ITS footage).
### Labeling Method
Hybrid: Human, Synthetic, Pseudolabeled. Annotations for VQA, dense video captions, and temporal localization are primarily human-authored. Spatial grounding labels (2D/3D bounding boxes, referring expressions) use a combination of human annotation and pseudolabeling pipelines (detection + SAM for spatial pointing). Event verification labels are human-curated. Annotations are held server-side for evaluation only.
## Directory Structure
```
VANTAGE-BENCH/
├── vqa/                    # Video question answering
├── dense_captioning/       # Dense video captioning
├── temporal_localization/  # Temporal localization
├── event_verification/     # Event verification
├── 2dbbox/                 # 2D object localization
├── referring/              # 2D referring expressions
├── pointing/               # 2D spatial pointing
├── tracking/               # Spatio-temporal tracking
└── README.md               # Dataset documentation and submission instructions
```
## Evaluation
### Tasks and Submission Formats
| Category | Task | Expected Submission Format | Metric |
|---|---|---|---|
| Semantic | VQA | JSON (question-answer pairs) | Accuracy |
| Semantic | Event Verification | Binary labels per video/image (JSON) | F1 Score |
| Temporal | Dense Video Captioning | Timestamped captions (JSON) | SODA-c |
| Temporal | Temporal Localization | Temporal segments with event labels (JSON) | mAP@tIoU |
| Spatial | 2D Object Localization | KITTI format | mAP@IoU |
| Spatial | 2D Referring Expressions | Bounding box predictions (JSON) | Acc@IoU |
| Spatial | 2D Spatial Pointing | Point coordinates (JSON) | Pointing Accuracy |
| Spatial | Spatio-Temporal Tracking | MOT-compatible format | HOTA |
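As a minimal sketch, a JSON submission for the VQA task might be assembled as below. The field names (`question_id`, `answer`) and file name are illustrative assumptions; consult the dataset README and evaluation server for the exact schema.

```python
import json

# Hypothetical VQA submission: a list of predicted answers keyed by question id.
# Field names here are assumptions, not the confirmed VANTAGE-Bench schema.
predictions = [
    {"question_id": "vqa_0001", "answer": "A forklift enters the loading bay."},
    {"question_id": "vqa_0002", "answer": "Two pedestrians cross the intersection."},
]

# Write the predictions file that would be uploaded to the evaluation server.
with open("vqa_predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```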
### Metric Notes
- Accuracy: Percentage of correct predictions.
- SODA-c: Metric for dense video captioning quality across event coverage and language quality.
- mAP@tIoU: Mean Average Precision measured over temporal IoU thresholds.
- F1 Score: Harmonic mean of precision and recall.
- mAP@IoU: Mean Average Precision measured over spatial IoU thresholds.
- Acc@IoU: Correct grounding if predicted box overlaps target above IoU threshold.
- Pointing Accuracy: Percentage of correctly selected target regions.
- HOTA: Higher Order Tracking Accuracy, combining detection and association quality.
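The IoU-based metrics above share one building block: box overlap. The sketch below shows standard IoU for axis-aligned `(x1, y1, x2, y2)` boxes and an Acc@IoU computed from it; the helper names and the default 0.5 threshold are illustrative assumptions, not the benchmark's reference implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-Union for axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose box overlaps its target above the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```

mAP@IoU and mAP@tIoU apply the same overlap test (spatially or temporally) at several thresholds and average precision across them.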
### Evaluation Server
Predictions are submitted to the evaluation server hosted on HuggingFace. The server computes metrics against held-out annotations and updates the public leaderboard.
Evaluation server: VANTAGE-Bench
## Dataset Format
Video (MP4) and images (JPG).
## Dataset Quantification
| Category | Task | Videos/Images | Entries |
|---|---|---|---|
| Semantic | VQA | 296 | 1,257 |
| Semantic | Event Verification | 163 | 163 |
| Temporal | Dense Video Captioning | 104 | 717 |
| Temporal | Temporal Localization | 221 | 1,280 |
| Spatial | 2D Object Localization | 3 | 27,404 bounding boxes (628 frames) |
| Spatial | 2D Referring Expressions | 1,503 images | 3,276 expressions |
| Spatial | 2D Spatial Pointing | 1,005 | 5,018 images |
| Spatial | Spatio-Temporal Tracking | 200 clips (8 frames/clip) | 200 objects, 1,600 frames |
- Total unique videos: 312 (309 across VQA/DVC/Temporal + 3 exclusive to 2D BBox)
- Total entries (VQA + DVC + Temporal): 3,254
- Total data storage: 42 GB
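The stated totals can be cross-checked against the per-task rows above; the counts below are copied from this card, and the raw video counts overlap across tasks, which is why they exceed the 309 unique videos.

```python
# Entry counts per task, copied from the quantification table above.
entries = {"vqa": 1257, "dvc": 717, "temporal": 1280}

# Per-task video counts overlap (the same video can serve several tasks),
# so their raw sum (621) exceeds the 309 unique videos across these tasks.
videos = {"vqa": 296, "dvc": 104, "temporal": 221}

total_entries = sum(entries.values())
unique_videos = 309 + 3  # 309 shared-pool videos + 3 exclusive to 2D BBox
```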
## Potential Known Risks
- Ground truth annotations are not publicly released. All evaluation is performed server-side.
- Some warehouse videos are concatenated clips from longer recording sessions.
## Citations
```bibtex
@inproceedings{Fujita2020SODA,
  author    = {Soichiro Fujita and Tsutomu Hirao and Hidetaka Kamigaito and Manabu Okumura and Masaaki Nagata},
  title     = {{SODA}: Story Oriented Dense Video Captioning Evaluation Framework},
  booktitle = {Proc. ECCV},
  year      = {2020}
}

@inproceedings{Fu2024BLINK,
  author    = {Xingyu Fu and Yushi Hu and Bangzheng Li and Yu Feng and Haoyu Wang and Xudong Lin and Dan Roth and Noah A. Smith and Wei-Chiu Ma and Ranjay Krishna},
  title     = {{BLINK}: Multimodal Large Language Models Can See but Not Perceive},
  booktitle = {Proc. ECCV},
  year      = {2024}
}

@article{Sun2025RefDrone,
  author  = {Zhichao Sun and Yuda Zou and Xian Sun and Yingchao Feng and Wenhui Diao and Menglong Yan and Kun Fu},
  title   = {{RefDrone}: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes},
  journal = {arXiv preprint arXiv:2502.00392},
  year    = {2025}
}
```
TBD: Update with VANTAGE-BENCH paper citation when published.
## References
- HuggingFace dataset: nvidia/PhysicalAI-VANTAGE-Bench
## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this dataset in accordance with our terms of service, developers should work with their internal teams to ensure it meets the requirements of the relevant industry and use case and addresses unforeseen product misuse. Please report quality, risk, or security vulnerabilities, or NVIDIA AI concerns, here.
## Changelog
- 2026-04-14: Initial dataset release.
