Abstract

Video captioning is the task of generating descriptive sentences for a video by learning its semantic information. It has a wide range of applications, including video retrieval, automatic subtitle generation, and assistance for blind users. Visual semantic information plays a decisive role in video captioning. However, traditional methods model video features only coarsely and fail to exploit both local and global features when interpreting temporal and spatial relationships. In this paper, we propose a video captioning model based on the Transformer and the graph convolutional network (GCN), called “Relation-Enhanced Spatial–Temporal Hierarchical Transformer” (RESTHT). To address the above issues, we present a spatial–temporal hierarchical network framework that jointly models local and global features in both time and space. For temporal modeling, our model learns the direct interactions between diverse video features and sentence features in the temporal sequence via the pre-trained GPT2, and the global feature construction encourages it to capture essential and relevant information. For spatial modeling, we use self-attention and GCNs to jointly learn spatial relationships from the appearance and motion perspectives. Through spatial–temporal modeling, our method can comprehend the global time–space relationships of complex events in videos and capture the interactions between different objects, generating more accurate descriptions that are applicable to general video captioning tasks. We conducted experiments on two widely used datasets. On the MSVD dataset in particular, our model improves the CIDEr score by 6.1 over the baseline and exceeds existing methods by 13. The results verify that our model can fully model temporal and spatial relationships and outperforms other related models.
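
The spatial modeling step mentioned in the abstract (self-attention combined with a GCN over object features) can be illustrated with a small sketch. This is not the RESTHT implementation: the function names, the toy features, and the choice to build the adjacency matrix from scaled dot-product attention are all assumptions made for illustration, and the abstract does not specify how the graph is actually constructed.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Plain matrix product of two lists-of-rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_adjacency(features):
    # Assumption: edge weights come from scaled dot-product similarity
    # between object features, normalised per row with softmax.
    d = len(features[0])
    scores = [
        [sum(x * y for x, y in zip(fi, fj)) / math.sqrt(d) for fj in features]
        for fi in features
    ]
    return [softmax(row) for row in scores]

def gcn_layer(features, adjacency, weight):
    # One graph convolution step: ReLU(A X W).
    aggregated = matmul(adjacency, features)
    projected = matmul(aggregated, weight)
    return [[max(0.0, v) for v in row] for row in projected]

# Hypothetical toy input: 3 detected objects, each with a 2-d feature.
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for illustration only
adjacency = attention_adjacency(objects)
relation_features = gcn_layer(objects, adjacency, weight)
```

In this sketch each object's output feature is a weighted mix of all objects' features, so objects that look similar influence each other more strongly. A real model would learn the projection weights and would typically run separate branches for appearance and motion features.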

Details

Title
RESTHT: relation-enhanced spatial–temporal hierarchical transformer for video captioning
Publication title
Volume
41
Issue
1
Pages
591-604
Publication year
2025
Publication date
Jan 2025
Publisher
Springer Nature B.V.
Place of publication
Heidelberg
Country of publication
Netherlands
Publication subject
ISSN
01782789
e-ISSN
14322315
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2024-04-18
Milestone dates
2024-03-02 (Registration); 2024-03-01 (Accepted)
First posting date
18 Apr 2024
ProQuest document ID
3159547827
Document URL
https://www.proquest.com/scholarly-journals/restht-relation-enhanced-spatial-temporal/docview/3159547827/se-2?accountid=208611
Copyright
Copyright Springer Nature B.V. Jan 2025
Last updated
2025-01-31
Database
ProQuest One Academic