Grounded Human-Attributed Description and Activity Recognition in Videos (GHADAR)
ECCV 2026 (under review)
Abstract
We introduce GHADAR, the task of per-person, open-set attribute and activity description in multi-person videos. To support it, we present AVA-Captions, the first large-scale grounded dataset of this kind, built by extending AVA-Actions with VLM-generated captions and identity-aware deduplication. We also propose CAMP (Constrained Attention Masking-based Pretraining), a two-stage VLM training strategy that explicitly leverages grounding through attention-mask constraints and outperforms state-of-the-art VLMs. Finally, we provide a VLM-driven evaluation framework that compares video and prediction at the concept level rather than via n-gram or embedding metrics.
BibTeX
@inproceedings{nazarovs2026ghadar,
  title     = {Grounded Human-Attributed Description and Activity Recognition in Videos (GHADAR)},
  author    = {Nazarovs, Jurijs and others},
  booktitle = {Under review at the European Conference on Computer Vision (ECCV)},
  year      = {2026}
}





