Abstract
The video captioning task generates descriptive sentences for a video by learning its semantic information. It has a wide range of applications in areas such as video retrieval, automatic subtitle generation and assistance for the visually impaired. Visual semantic information plays a decisive role in video captioning. However, traditional methods model video features relatively coarsely and fail to harness local and global features to understand temporal and spatial relationships. In this paper, we propose a video captioning model based on the Transformer and the GCN network, called the “Relation-Enhanced Spatial–Temporal Hierarchical Transformer” (RESTHT). To address the above issues, we present a spatial–temporal hierarchical network framework that jointly models local and global features in both time and space. For temporal modeling, our model learns the direct interactions between diverse video features and sentence features along the temporal sequence via the pre-trained GPT2, and global feature reconstruction encourages it to capture essential and relevant information. For spatial modeling, we use self-attention and GCN networks to jointly learn spatial relationships from appearance and motion perspectives. Through spatial–temporal modeling, our method can comprehend the global time–space relationships of complex events in videos and capture the interactions between different objects, generating more accurate descriptions applicable to general video captioning tasks. We conducted experiments on two widely used datasets; on the MSVD dataset in particular, our model improves the CIDEr score by 6.1 over the baseline and surpasses existing methods by 13. The results verify that our model fully models the temporal and spatial relationships and outperforms other related models.
Introduction
The video captioning task requires the computer to output a description of a given video. Most methods [1–9] are based on the encoder-decoder framework: they obtain the video's semantic information in the encoder and perform visual reasoning in the decoder to generate sentences. Distinct from image captioning, video captioning requires joint visual reasoning over space and time.
Most existing video captioning methods [1, 2, 6, 10] focus on only one aspect and lack joint spatial–temporal modeling of video features. Recently, some works [4, 5] have employed LSTMs for spatial–temporal modeling. However, only the previous hidden state is used to guide the spatial–temporal modeling and integrate different features, which fails to make local and global features interact across frames and ignores small but critical surrounding objects.
In order to tackle the above problems, we propose a novel relation-enhanced spatial–temporal hierarchical transformer, RESTHT. (1) To tackle the inadequate spatial modeling in existing state-of-the-art models, we propose an enhanced spatial relation model that learns comprehensive relation messages from surrounding objects in the encoder; it conducts joint spatial modeling of objects from two perspectives with two distinct types of networks to improve representation ability. On the one hand, we capture appearance-similarity information between objects through a single transformer layer. On the other hand, we build the relation coefficient matrix of different objects from object motion and location information in the GCN network. (2) To address the issue of ignoring small objects and to achieve better local–global modeling, the visual context information of the current frame acts as a global guide to select salient objects from the detected candidates. An example is shown in Fig. 1, where a person is cutting ribs with a knife: the common area between the salient objects “ribs” and “knife” is smaller than that with the “chopping board,” resulting in low appearance similarity. However, our network can learn the motion relationships between small but critical objects under the guidance of motion features. (3) To address the lack of interaction between different features, we utilize the pre-trained GPT2 [11] to enable multimodal features from multiple frames to interact in the decoder, which fuses multiple input tokens through a multi-level attention mechanism. We take the global and local features of different frames as input tokens to the pre-trained GPT2 to conduct multi-frame joint modeling of the two kinds of visual features in time sequence and acquire a joint representation of the various video features and the text.
Fig. 1 [Images not available. See PDF.]
Example of object detection in a video, where the small common area between “ribs” and “knife” leads to low similarity in appearance
To summarize, our work has the following contributions:
We propose a new spatial–temporal hierarchical transformer framework, RESTHT, for video captioning. For spatial modeling, we combine a transformer and a GCN to learn local object relationships from two views. For temporal modeling, we utilize a transformer to jointly model local and global features in time sequence. Through this spatial–temporal structure, our model can better understand video content and generate detailed and diverse descriptions for a video.
For better local-global and appearance-motion modeling, we propose an enhanced spatial relation model to jointly learn the spatial relationship from appearance and motion perspectives. Firstly, we can capture the appearance-based spatial relationship between objects through the self-attention mechanism and learn important objects’ local details and relationships through the guided local object association modeling network. Secondly, to obtain the motion-based spatial relationship, we jointly consider the object position and motion-relevant information to build the object-relational graph and learn the critical motion interaction of objects through the GCN network.
In experiments, we present adequate evaluations, and the results show that the proposed RESTHT achieves comparable and even better performance than the state-of-the-art methods on the two benchmark datasets. On the MSVD dataset in particular, our method outperforms other state-of-the-art methods by about 13 points in CIDEr.
Related work
Video captioning
Video captioning is a multimodal generation task that learns a joint representation of video and language to generate video descriptions. It is necessary to conduct joint modeling of temporal and spatial relationships. Generally, global appearance features extracted by a 2D-CNN and motion features extracted by a 3D-CNN are used for temporal modeling, such as Inception-v1 [12], IncepResNetV2 [13], ResNet [14], VGG [15], C3D [16], I3D [17] and 3D ResNeXt [18], while object features extracted by an RCNN [19] are used for spatial modeling. For future real-world scenarios, one can also refer to real-time detection models such as DenseSPH-YOLOv5 [20], which enhances performance by introducing additional detection heads to better focus on small objects. We provide relevant background on previous work exploiting local and temporal features.
Object-level spatial modeling
In order to learn the interaction relationships and local details of objects, most networks have used pre-trained object detectors to extract object features and learn the interactions between local objects. ORG-TRL [6] constructed a learnable object-relational graph through the cosine distance between the appearance features of objects and employed a GCN to learn the local object relationships. OA-BTG [9] introduced a bidirectional temporal graph to capture the temporal evolution of salient objects in the video and designed VLAD to update behavior characteristics. STG-KD [10] used the normalized Intersection over Union (IoU) value between two objects to build a spatial graph. Regarding graph modeling, we are also concerned with multiple graph learning, which leverages complementary information from multiple graphs. For example, MGLNN [21] incorporated multiple graph information and acquired an optimal graph data representation for semi-supervised classification tasks, which may be used in spatial modeling by learning multiple object graphs.
Temporal modeling
Most networks obtain the temporal information of keyframes by extracting each keyframe's appearance and motion features as global features. Global features are widely used in video captioning owing to their rich visual semantics. Early work [1] obtained the video representation by fusing global video features in a global pooling layer, destroying the temporal structure of the frames. Then, in order to obtain a richer video representation from global features, HTM [4] and OA-BTG [9] utilized LSTMs to learn the time dependence of global features at the frame level; they generated hidden states at each time step, summarizing the previous frame information. ORG-TRL [6], STAT [5] and OpenBook [22] operated directly on global features; they applied the previous hidden state to aggregate global information through an attention mechanism and fed it into an RNN-based language decoder. Some networks used transformers: MART [23] directly input the 2D-CNN features into the transformer for training.
Regarding spatial modeling, most networks [6, 9, 10] are not guided by motion features and learn the spatial relationship from a single perspective with a single network. It is then easy to ignore surrounding objects with strong motion correlations but small common areas. To address this problem, our model learns the spatial interaction information of objects from multiple views: we capture the appearance relationships of objects through self-attention and learn their motion relationships with a GCN. Regarding temporal modeling, in most networks [1, 4, 5, 6, 9] the global features are not fully modeled and lack interaction with other video features. Meanwhile, transformer-based models [23] with excellent temporal modeling have yet to be well explored for video captioning. To tackle this shortcoming, we employ the pre-trained GPT2 as our temporal model and language decoder and feed it multiple tokens to model the appearance, motion and region features of different frames, which makes different features interact across frames and captures long-term dependencies and interactions.
Transformer
The transformer [24] has been widely used in natural language processing, such as machine translation, pre-trained language modeling and text generation. It can capture global and long-range dependencies among input tokens through the attention mechanism, overcoming the limitation that RNN models cannot be parallelized. In recent years, transformers have become popular in computer vision. ViT [25] and related networks are widely used in image classification tasks, and ViViT [26] and TimeSformer [27] have achieved good results in video classification through spatial–temporal modeling. However, large models such as ViT [25] are difficult to optimize on small datasets, and their results are often lower than those of CNN-based models. Furthermore, training models at the pixel level requires substantial hardware resources. Therefore, we extract video features with CNNs and feed them into a transformer, which preserves the inductive bias of convolution while modeling global dependencies through the self-attention mechanism.
Methods
Overview
As shown in Fig. 2, our model consists of a local encoder and a multimodal decoder. In the local encoder, we propose an enhanced spatial relation model that learns the spatial relationships of objects from two views: we employ a one-layer transformer to model the appearance relationships of the local features and use the GCN network to model the objects' position and motion relationships. An RNN is applied to obtain the context information of the appearance features of the current frame in order to choose salient objects. In the multimodal decoder, we input each frame's local and global features together with the caption into a GPT2-based model to learn a joint representation of the visual features and sentences, decode the sentences and reconstruct the global visual features.
Fig. 2 [Images not available. See PDF.]
Overview of the proposed RESTHT: Relation-Enhanced Spatial–Temporal Hierarchical Transformer. We use two modules, Appearance Spatial Modeling (ASM) and Motion Spatial Modeling (MSM), to jointly model spatial relationships from appearance and motion perspectives through self-attention and GCN. A temporal decoder is employed to acquire a joint representation and facilitate interactions among various features within a time sequence
Local encoder
In the local encoder, we propose an enhanced spatial relation model that models the spatial relationships of objects to augment each local feature. We first extract a certain number of objects in each frame and learn the relationships and local details of the objects through the spatial relation network. For a given series of video frames, we uniformly extract one set of keyframes for global features and another for local features. We extract appearance features with a 2D-CNN and motion features with a 3D-CNN, and adopt the pre-trained Faster-RCNN to extract local features representing the objects in each keyframe. We reduce the dimensionality of the local features through a fully connected layer:
$\hat{r}_i = W_r r_i + b_r$    (1)

where $W_r$ and $b_r$ are learnable parameters.
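As a concrete reading of Eq. (1), the following PyTorch sketch projects the Faster-RCNN region features to the model dimension. The feature sizes (2048-dimensional regions, 6 objects per keyframe, 768-dimensional output) follow the implementation details reported later; the class and variable names are our own.

```python
import torch
import torch.nn as nn

class LocalFeatureProjection(nn.Module):
    """Minimal sketch of Eq. (1): reduce Faster-RCNN region features to the model dimension."""
    def __init__(self, in_dim: int = 2048, out_dim: int = 768):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)  # W_r and b_r in Eq. (1)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (num_keyframes, num_objects, in_dim) -> (num_keyframes, num_objects, out_dim)
        return self.fc(regions)

proj = LocalFeatureProjection()
r = torch.randn(5, 6, 2048)   # 5 local keyframes, 6 detected objects each
r_hat = proj(r)               # reduced local features \hat{r}_i, shape (5, 6, 768)
```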
Appearance-based spatial relation modeling

Figure 3 shows our appearance-based spatial relation modeling (ASM) in detail. The appearance features of local objects with interactive relationships tend to have high similarity. To model the appearance relationships of objects, we feed the local features of the objects in each frame into a transformer layer and perform a weighted fusion over related objects, using the appearance similarity between objects as the attention weight. Because we employ only one layer, we do not adopt a residual connection. The calculation is as follows:
$Q = \hat{R} W_Q + b_Q,\quad K = \hat{R} W_K + b_K,\quad V = \hat{R} W_V + b_V$    (2)

$A^{app} = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)$    (3)

$\bar{R} = A^{app} V$    (4)

$R^{a} = \mathrm{LN}\!\left(\mathrm{FFN}(\bar{R})\right)$    (5)

where $W_Q$, $W_K$ and $W_V$ are learnable weights, $b_Q$, $b_K$ and $b_V$ are learnable biases, and $\mathrm{LN}$ denotes layer normalization [28].

Fig. 3 [Images not available. See PDF.]
Overview of the proposed Appearance Spatial Modeling (ASM). We learn the appearance relationship of objects through a transformer block
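A minimal PyTorch sketch of the ASM block as we read Eqs. (2)–(5): one self-attention layer over the objects of a frame, followed by a feed-forward network and layer normalization without a residual connection. The layer sizes (hidden size 256, 8 heads) follow the implementation details; the module structure is our assumption, not the authors' released code.

```python
import torch
import torch.nn as nn

class AppearanceSpatialModeling(nn.Module):
    """Sketch of ASM (Eqs. 2-5): self-attention over the objects of a frame,
    then an FFN and layer normalization, without residual connections."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (num_frames, num_objects, d_model)
        attended, _ = self.attn(obj_feats, obj_feats, obj_feats)  # appearance-similarity weighting
        return self.norm(self.ffn(attended))                      # single layer, no residual

asm = AppearanceSpatialModeling()
objs = torch.randn(5, 6, 256)   # 6 objects in each of 5 keyframes
r_a = asm(objs)                 # appearance-aggregated object features R^a
```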
Motion-based spatial relation modeling

Fig. 4 [Images not available. See PDF.]

Overview of the proposed Motion Spatial Modeling (MSM). We learn the motion relationship of objects through GCN under the guidance of the motion feature and the location information of objects

To learn more motion-related messages from the surrounding objects, we jointly consider the relative spatial location information and the motion-relevant information between two object regions $i$ and $j$. An overview of Motion Spatial Modeling (MSM) is shown in Fig. 4. We compute the location relation from the areas of the two objects and define the location association weight of object $j$ to object $i$ as follows:
$S^{loc}_{ij} = \dfrac{\mathrm{Area}(b_i \cap b_j)}{\mathrm{Area}(b_i \cup b_j)}$    (6)

where $b_i$ and $b_j$ are the bounding boxes of objects $i$ and $j$. $S^{loc}_{ij}$ indicates the influence of object $j$ on object $i$ from the location perspective. Thus, the location score matrix $S^{loc}$ is constructed, and two objects with a larger overlapping area exhibit a stronger correlation.
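A small sketch of how the location score matrix of Eq. (6) can be computed, assuming the overlap is measured with IoU between bounding boxes (the exact normalization is not visible in the extracted text):

```python
import torch

def location_score_matrix(boxes: torch.Tensor) -> torch.Tensor:
    """Pairwise location scores S^loc for one frame (sketch of Eq. 6, assuming IoU).
    boxes: (num_objects, 4) as (x1, y1, x2, y2)."""
    x1 = torch.max(boxes[:, None, 0], boxes[None, :, 0])
    y1 = torch.max(boxes[:, None, 1], boxes[None, :, 1])
    x2 = torch.min(boxes[:, None, 2], boxes[None, :, 2])
    y2 = torch.min(boxes[:, None, 3], boxes[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / union.clamp(min=1e-6)   # larger overlap -> stronger correlation

boxes = torch.tensor([[0., 0., 10., 10.], [5., 5., 15., 15.], [20., 20., 30., 30.]])
S_loc = location_score_matrix(boxes)       # (3, 3) location score matrix
```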
To consider the surrounding objects with low appearance and location correlation but high motion correlation, we use the motion feature to guide the calculation of motion correlation weight. We combine the aggregated object feature with motion feature as a query and take the aggregated object feature as a key to calculate the motion correlation weight:
$q_i = W_q\,[\,r^{a}_i ; m\,] + b_q$    (7)

$k_j = W_k\, r^{a}_j + b_k$    (8)

$S^{mot}_{ij} = \dfrac{q_i^{\top} k_j}{\sqrt{d}}$    (9)

where $W_q$, $b_q$, $W_k$ and $b_k$ are learnable parameters, $r^{a}_i$ denotes the object node with a $d$-dimensional feature aggregated by the self-attention mechanism, $m$ is the motion feature and $[\,;\,]$ denotes concatenation. $q_i$ contains the surrounding scene information, which is conducive to inferring the objects with high motion correlation. $S^{mot}_{ij}$ computes the relevance score between the motional scene feature of object $i$ and the scene feature of object $j$. Then, we mask $S^{mot}$ according to the location score matrix $S^{loc}$:

$\bar{S}_{ij} = \mathrm{Mask}\!\left(S^{mot}_{ij}, S^{loc}_{ij}\right)$    (10)

After the mask process, the relevance scores are normalized into the attention score matrix $A$:

$A_{ij} = \dfrac{\exp(\bar{S}_{ij})}{\sum_{k}\exp(\bar{S}_{ik})}$    (11)
For the given objects, we treat each object as a node and design a motional relation graph $G$ to aggregate the object features in the GCN. The motional relation graph combines the location information and the motion information and is defined as follows:

$G = S^{loc} \odot A$    (12)
We employ the GCN for motion-relevant reasoning and the update of the object features:

$H^{(l+1)} = \sigma\!\left(G H^{(l)} W^{(l)}\right)$    (13)

where $W^{(l)}$ is a layer-specific trainable weight matrix, $\sigma$ denotes the activation function, and $H^{(l)}$ denotes the matrix of activations in the $l$-th layer with $H^{(0)} = \hat{R}$. Then, we fuse the local features aggregated by the appearance and motion information in the two networks:

$R^{u} = W_f\,[\,R^{a} ; R^{g}\,] + b_f$    (14)

where $W_f$ and $b_f$ are learnable parameters, $R^{g}$ is the output of the GCN and $[\,;\,]$ denotes the concatenation of the two matrices.
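To make the flow of Eqs. (7)–(14) concrete, the sketch below chains the motion-guided relevance scores, the location mask, a single GCN layer over the resulting relation graph, and the final fusion with the ASM output. It is a schematic reading under our notational assumptions (in particular, how the mask and the graph combine the location and motion scores), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionSpatialModeling(nn.Module):
    """Sketch of MSM (Eqs. 7-14): motion-guided relevance scores, a location mask,
    one GCN layer over the relation graph, and fusion with the ASM output."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.w_q = nn.Linear(2 * d_model, d_model)   # query from [object ; motion] (Eq. 7)
        self.w_k = nn.Linear(d_model, d_model)       # key from object feature (Eq. 8)
        self.gcn = nn.Linear(d_model, d_model)       # layer-specific GCN weight W^(l) (Eq. 13)
        self.fuse = nn.Linear(2 * d_model, d_model)  # fusion of the two branches (Eq. 14)

    def forward(self, r_a, motion, s_loc):
        # r_a: (num_objects, d), motion: (d,), s_loc: (num_objects, num_objects)
        n, d = r_a.shape
        q = self.w_q(torch.cat([r_a, motion.expand(n, -1)], dim=-1))   # Eq. (7)
        k = self.w_k(r_a)                                              # Eq. (8)
        s_mot = q @ k.t() / d ** 0.5                                   # Eq. (9)
        s_mot = s_mot.masked_fill(s_loc <= 0, float("-inf"))           # Eq. (10), assumed mask rule
        a = F.softmax(s_mot, dim=-1)                                   # Eq. (11)
        g = s_loc * a                                                  # Eq. (12), assumed combination
        h = F.relu(g @ self.gcn(r_a))                                  # Eq. (13), one GCN layer
        return self.fuse(torch.cat([r_a, h], dim=-1))                  # Eq. (14)
```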
In order to obtain the spatial relationships and detailed features of important objects, we apply the appearance features of the video to guide the network. We input the appearance feature of the current frame into an RNN to obtain its context information as a guide:
$h_t = \tanh\!\left(W_a a_t + W_h h_{t-1}\right)$    (15)

where $a_t$ is the 2D-CNN feature of the current frame, $h_{t-1}$ is the hidden state at the previous time step, $h_t$ is the hidden state at this moment, and $W_a$ and $W_h$ are learnable parameters. Then, we input the local features $R^{u}$ of the frame as keys and values and the hidden state $h_t$ as a query into a transformer layer, where $d_h$ is the hidden layer dimension:

$Q = h_t W_Q + b_Q,\quad K = R^{u} W_K + b_K,\quad V = R^{u} W_V + b_V$    (16)
$\alpha = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_h}\right)$    (17)

$\tilde{r}_t = \alpha V$    (18)

where $W_Q$, $W_K$ and $W_V$ are learnable weights and $b_Q$, $b_K$ and $b_V$ are learnable biases. Finally, we obtain the updated local feature $\tilde{r}_t$ of the frame.
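The guidance step of Eqs. (15)–(18) can be sketched as follows: a simple RNN over the frame-level appearance features provides the context vector, which then queries the fused local features through a cross-attention layer. The use of nn.RNNCell and the module names are our assumptions.

```python
import torch
import torch.nn as nn

class GuidedObjectSelection(nn.Module):
    """Sketch of Eqs. (15)-(18): an RNN over frame appearance features provides the
    context h_t, which queries the fused local features via cross-attention."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.rnn = nn.RNNCell(d_model, d_model)                       # Eq. (15)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, appearance_seq, local_feats):
        # appearance_seq: (num_frames, d)  2D-CNN features of the keyframes
        # local_feats:    (num_frames, num_objects, d)  fused object features R^u
        h = appearance_seq.new_zeros(1, appearance_seq.size(-1))
        selected = []
        for t in range(appearance_seq.size(0)):
            h = self.rnn(appearance_seq[t:t + 1], h)                  # context of frame t
            q = h.unsqueeze(1)                                        # (1, 1, d) query
            out, _ = self.cross_attn(q, local_feats[t:t + 1],
                                     local_feats[t:t + 1])            # Eqs. (16)-(18)
            selected.append(out.squeeze(1))
        return torch.cat(selected, dim=0)                             # updated local feature per frame
```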
Multimodal decoder

As depicted in Fig. 5, the multimodal decoder is a pre-trained GPT2 model with a two-layer transformer. Following [29], we introduce two tasks to fine-tune our model. One task is to reconstruct the original global features given the caption and the other global features. The other task is to generate captions based on the video features and the previously generated words.
Fig. 5 [Images not available. See PDF.]
Overview of the proposed Multimodal Decoder (pre-trained GPT2)
To obtain more information, inspired by several prior works [30], we also obtain semantic information through a multi-label classification task and treat the predicted probability distribution over attributes as the semantic feature $s$. We concatenate the video global features (appearance and motion) and the local features with the semantic feature, respectively, and employ a fully connected layer to transform the appearance, motion and local features into video tokens of the same dimension. The words are embedded into textual tokens:
$x^{a}_t = W^{e}_a\,[\,a_t ; s\,] + b^{e}_a$    (19)

$x^{m}_t = W^{e}_m\,[\,m_t ; s\,] + b^{e}_m$    (20)

$x^{r}_t = W^{e}_l\,[\,\tilde{r}_t ; s\,] + b^{e}_l$    (21)

$x^{w}_j = E\, w_j$    (22)

where $W^{e}_a$, $b^{e}_a$, $W^{e}_m$, $b^{e}_m$, $W^{e}_l$ and $b^{e}_l$ are learnable parameters, and $E$ is a learnable word embedding matrix. We obtain the final representation after summing up the positional encoding and the segment embedding:

$Z^{v} = X^{v} + P^{v} + e^{v},\quad Z^{w} = X^{w} + P^{w} + e^{w}$    (23)
$Z = [\,Z^{v} ; Z^{w}\,]$    (24)

where $e^{v}$ and $e^{w}$ denote the video segment embedding and the caption segment embedding, respectively, $P$ is the positional encoding matrix, and $X^{v}$ denotes the video tokens. All input tokens consist of video tokens and textual tokens, and $[\,;\,]$ denotes the concatenation operation. We feed all tokens into the transformer blocks of GPT2:

$H^{(l)} = \mathrm{TransformerBlock}\!\left(H^{(l-1)}\right),\quad H^{(0)} = Z$    (25)
$P(w_t) = \mathrm{softmax}\!\left(W_p\, h^{w}_t + b_p\right)$    (26)

$\hat{v}_t = W_v\, h^{v}_t + b_v$    (27)

where $l = 1, \dots, L$ and $L$ is the number of layers, $h^{w}_t$ is the output target textual token corresponding to the input textual token, $P(w_t)$ is the probability distribution of the target word, $h^{v}_t$ is the output target video token corresponding to the input video token, $\hat{v}_t$ is the reconstructed feature with the same dimension as the input global feature, and $W_p$, $b_p$, $W_v$ and $b_v$ are learnable parameters. In the language modeling task, we minimize the negative log-likelihood loss to achieve the caption generation task given a video clip $v$ and the previous words $w_{<t}$:
$L_{c} = -\sum_{t=1}^{T} \log P\!\left(w_t \mid w_{<t}, v\right)$    (28)

where $w_t$ is the predicted word at time step $t$, $P$ is the probability distribution and $T$ denotes the length of the sentence. In the video reconstruction task, we train the model by minimizing the reconstruction loss:
$L_{r} = \sum_{t} \left\lVert \hat{v}_t - v_t \right\rVert_2^{2}$    (29)

The overall loss is the combination of $L_{c}$ and $L_{r}$:

$L = L_{c} + L_{r}$    (30)
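Putting the decoder together, the sketch below shows one way to assemble video and text tokens with segment embeddings, run them through a GPT2 backbone, and combine the two training objectives of Eqs. (28)–(30). It relies on the HuggingFace GPT2Model accepting inputs_embeds; equal weighting of the two losses and the head names are our assumptions, and the paper fine-tunes pre-trained GPT2 weights, whereas this sketch builds a matching two-layer configuration from scratch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2Config, GPT2Model

class MultimodalDecoder(nn.Module):
    """Sketch of the multimodal decoder (Eqs. 19-30): video and text tokens share a
    GPT2 backbone; a language-modeling head predicts words and a regression head
    reconstructs the global features."""
    def __init__(self, d_model: int = 768, vocab_size: int = 50257, n_layer: int = 2):
        super().__init__()
        self.backbone = GPT2Model(GPT2Config(n_embd=d_model, n_layer=n_layer, n_head=12))
        self.segment = nn.Embedding(2, d_model)            # 0: video tokens, 1: text tokens
        self.word_emb = nn.Embedding(vocab_size, d_model)  # word embedding E (Eq. 22)
        self.lm_head = nn.Linear(d_model, vocab_size)      # word distribution (Eq. 26)
        self.rec_head = nn.Linear(d_model, d_model)        # feature reconstruction (Eq. 27)

    def forward(self, video_tokens, word_ids, target_feats):
        # video_tokens: (B, Tv, d) projected appearance/motion/local tokens (Eqs. 19-21)
        # word_ids:     (B, Tw)    caption token ids; target_feats: (B, Tv, d)
        text_tokens = self.word_emb(word_ids)
        seg = torch.cat([torch.zeros_like(video_tokens[..., 0]),
                         torch.ones_like(text_tokens[..., 0])], dim=1).long()
        # GPT2 adds its positional embeddings internally; we only add segment embeddings here.
        embeds = torch.cat([video_tokens, text_tokens], dim=1) + self.segment(seg)   # Eqs. (23)-(24)
        hidden = self.backbone(inputs_embeds=embeds).last_hidden_state               # Eq. (25)
        Tv = video_tokens.size(1)
        logits = self.lm_head(hidden[:, Tv:-1])             # position t predicts word t+1
        loss_c = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                 word_ids[:, 1:].reshape(-1))                         # Eq. (28)
        loss_r = F.mse_loss(self.rec_head(hidden[:, :Tv]), target_feats)              # Eq. (29)
        return loss_c + loss_r                              # Eq. (30), equal weights assumed
```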
Experiments
Dataset
We use widely used metrics, including BLEU-4 [31], METEOR [32], ROUGE-L [33] and CIDEr [34], to evaluate our model on two video captioning datasets. Higher scores indicate better performance.
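For reference, all four metrics can be computed with the widely used pycocoevalcap package; the snippet below is a generic illustration of that interface (dictionaries mapping a video id to tokenized captions), not the exact evaluation script used in the paper.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# gts: reference captions, res: one generated caption per video.
# Captions are assumed to be pre-tokenized (PTBTokenizer is normally applied first).
gts = {"video1": ["a man is cutting ribs with a knife", "a person cuts meat"]}
res = {"video1": ["a man is cutting meat with a knife"]}

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE-L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    # Bleu(4) returns a list of BLEU-1..4 scores; take the last entry for BLEU-4.
    print(name, score[-1] if isinstance(score, list) else score)
```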
We conduct experiments on two benchmark datasets, detailed below.
MSVD [35] is a collection of 1970 open-domain video clips collected from YouTube. Each video is 10 to 25 s long, with roughly 40 English captions. We use the standard split, dividing the dataset into 1200 training videos, 100 validation videos and 670 test videos.
MSR-VTT [36] is another widely used video captioning dataset. It comprises 10,000 open-domain videos in 20 categories, and each video is accompanied by 20 English descriptions. Following existing works, we use the standard split for fair comparison, dividing the dataset into 6513 training videos, 497 validation videos and 2990 test videos.
Implementation details
Feature extraction
Following [37] to process the videos, we uniformly select 13 keyframes from each video for extracting global features and use InceptionResNetV2 [13] and I3D [17] as the 2D- and 3D-CNN feature extractors. Unlike for global features, we sample 5 keyframes from each video and apply Faster-RCNN [19], trained as in [38], to extract 6 region features per frame.
Training and model details
We use the Adam optimizer [39] and decay the learning rate from 1e-3 to 6.4e-4. The transformer of the local encoder has a hidden state size of 256 and 8 attention heads, and its GCN has 1 layer with a hidden state size of 512. We set the dimension of the local and global features and the word embedding size to 768. In the multimodal decoder, the number of transformer layers is 2. We train the model on both datasets for 20 epochs with a batch size of 8.
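For convenience, the hyperparameters reported above can be collected in a single configuration object. The sketch below simply restates the stated settings; the field names are our own, and the schedule between the two learning rates is kept as a start/end pair because it is not described in detail.

```python
from dataclasses import dataclass

@dataclass
class RESTHTConfig:
    # Feature extraction
    global_keyframes: int = 13      # InceptionResNetV2 (2D) and I3D (3D) features
    local_keyframes: int = 5        # Faster-RCNN region features
    objects_per_frame: int = 6
    feature_dim: int = 768          # local/global features and word embeddings
    # Local encoder
    encoder_hidden: int = 256
    encoder_heads: int = 8
    gcn_layers: int = 1
    gcn_hidden: int = 512
    # Multimodal decoder and optimization
    decoder_layers: int = 2
    lr_start: float = 1e-3
    lr_end: float = 6.4e-4
    epochs: int = 20
    batch_size: int = 8

cfg = RESTHTConfig()
```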
Ablation study
To verify the effectiveness of each part of the model, we set up a group of ablation experiments for comparison, including the following three structures, denoted as Baseline (only temporal decoder), Baseline + ASM (Appearance spatial modeling), and Baseline + ASM + MSM (Motion spatial modeling).
Baseline: In the baseline, we only feed the appearance, motion and semantic features into the multimodal decoder, as shown in Fig. 5, to validate the effectiveness of the global features and their interaction across frames. Through the attention mechanism, the baseline can learn the interactions among the global features and the captions. It can also utilize the rich semantic prior knowledge of GPT2 to generate better descriptions.
Baseline + ASM: In the model, we deploy appearance spatial modeling with the multimodal decoder to validate the effectiveness of appearance spatial modeling of local features. We augment a transformer layer to learn the appearance spatial relationship through appearance similarity between local objects.
Baseline + ASM + MSM: ASM structure only models objects from appearance similarity. In the model, we conduct more comprehensive spatial modeling and augment GCN to learn the motion spatial relationship through location information and motion relationship.
Effect of appearance spatial modeling
Comparing the results of Baseline and Baseline + ASM, we observe that Baseline + ASM outperforms Baseline on most metrics on both datasets, especially MSVD. The results indicate the effectiveness of appearance spatial modeling of local features. Through a transformer layer, each object can explore its visual associations with other objects. Subsequently, the context information of the appearance feature selects the salient objects of the current frame. The salient object, updated by ASM, contains the object's visual scene information. The added local information helps the model locate the salient object and learn the main events involving it during the temporal interaction of video features.
Effect of motion spatial modeling
From Table 1, it is evident that the overall model outperforms the two previous models. The overall model combines both appearance spatial modeling and motion spatial modeling in two distinct types of networks. It takes into account not only appearance associations but also motion associations, leading to a deeper understanding of the relationships between the salient object and the surrounding objects.
Table 1. Results of the ablation study with various settings on the MSVD and MSR-VTT datasets
| Dataset | Method | BLEU-4 | ROUGE-L | METEOR | CIDEr |
|---|---|---|---|---|---|
| MSVD | Baseline | 57.6 | 76.9 | 39.6 | 110.5 |
| | Baseline + ASM | 57.6 | 76.8 | 40.5 | 111.9 |
| | Baseline + ASM + MSM | 58.1 | 77.8 | 40.9 | 116.6 |
| MSR-VTT | Baseline | 41.9 | 61.4 | 28.4 | 50.3 |
| | Baseline + ASM | 43.3 | 62.2 | 29.3 | 50.7 |
| | Baseline + ASM + MSM | 43.5 | 62.1 | 29.4 | 51.5 |
The highest score is highlighted in bold among the metrics
Impact of the different sample rate of local feature
We also investigated the influence of the video frame sampling rate for local features on our model. We uniformly sampled frames and conducted a series of experiments on the MSVD dataset. Table 2 shows the performance of the model at different sampling rates. From Table 2, we can see that a moderate number of video frames provides better performance. A low sampling rate results in a lack of detailed object information, which makes it challenging to conduct spatial modeling. On the other hand, a high sampling rate brings redundant and noisy information, and an excess of tokens hampers the capture of essential information. In our model, we set the sampling rate of local features to 5 to achieve better performance while saving time and computing costs.
Table 2. Impact of the different sample rates of local features on the MSVD dataset
| Dataset | Sample rate of local features | BLEU-4 | ROUGE-L | METEOR | CIDEr |
|---|---|---|---|---|---|
| MSVD | 3 | 57.6 | 77.1 | 40.4 | 112.7 |
| | 5 | 58.1 | 77.8 | 40.9 | 116.6 |
| | 7 | 58.2 | 77.3 | 40.6 | 111.6 |
The highest score is highlighted in bold among the metrics
Impact of the input ways of the multimodal decoder
We explored whether inputting each feature separately as a video token to the decoder is more helpful. We employed another input method for video tokens as the comparison. In this method, we sampled 26 video frames of global and local features, concatenated the global and local features from the same frame along the feature dimension, and input them to the decoder; the video tokens were then required to reconstruct the global feature of the current frame. Table 3 shows the results of the two input ways. We observe that our input way is more effective, facilitating the interaction between various features across frames and preserving the feature information. Additionally, the task of reconstructing the global feature encourages the appearance feature and the motion feature to learn visual semantics from each other, effectively guiding the feature interaction. In contrast, the other input method primarily focuses on learning temporal relationships without achieving effective feature interactions.
Table 3. Impact of the input ways of multimodal decoder
| Dataset | Input way | BLEU-4 | ROUGE-L | METEOR | CIDEr |
|---|---|---|---|---|---|
| MSVD | Joint | 55.9 | 76.3 | 39.5 | 110.6 |
| | Separate | 58.1 | 77.8 | 40.9 | 116.6 |
“Joint” indicates that a video token consists of the overall video feature in the same frame. “Separate” indicates that the appearance feature, motion feature and local feature form video tokens individually
The highest score is highlighted in bold among the metrics
Impact of the transformer layer numbers in the multimodal decoder
To analyze the influence of the number of transformer layers in the multimodal decoder, we conducted experiments with different numbers of transformer layers in the decoder. Table 4 demonstrates the excellent performance of the decoder with 2 transformer layers. The video datasets are not large enough to train many layers, so the model is prone to overfitting with 3 or 4 transformer layers. On the other hand, having too few layers makes it challenging to perform language modeling. Therefore, we set 2 transformer layers in the multimodal decoder to mitigate these issues.
Table 4. Impact of the transformer layer numbers in the multimodal decoder
| Dataset | Transformer layers in decoder | BLEU-4 | ROUGE-L | METEOR | CIDEr |
|---|---|---|---|---|---|
| MSVD | 1 | 54.5 | 76.0 | 39.2 | 108.6 |
| | 2 | 58.1 | 77.8 | 40.9 | 116.6 |
| | 3 | 57.8 | 77.4 | 40.6 | 113.5 |
| | 4 | 58.7 | 77.0 | 39.9 | 111.1 |
The highest score is highlighted in bold among the metrics
Comparison with the state-of-the-art
To verify the model's performance, we compared it with other state-of-the-art methods on the two datasets; the results are presented in Table 5.
Table 5. Comparison with state-of-the-art methods on the MSVD and MSR-VTT datasets
| Dataset | Method | BLEU-4 | ROUGE-L | METEOR | CIDEr |
|---|---|---|---|---|---|
| MSVD | OA-BTG [9] | 56.9 | – | 36.2 | 90.6 |
| | HTM [4] | 54.7 | 72.5 | 35.2 | 91.3 |
| | ORG-TRL [6] | 54.3 | 73.9 | 36.4 | 95.2 |
| | STG-KD [10] | 52.2 | 73.9 | 36.9 | 93.0 |
| | SAM-SS [30] | 61.8 | 76.8 | 37.8 | 103.0 |
| | PMI-CAP [40] | 54.6 | – | 36.4 | 95.1 |
| | UniVL [41] | 56.9 | 75.8 | 38.8 | 102.0 |
| | RMN [37] | 54.6 | 73.4 | 36.5 | 94.4 |
| | SGN [42] | 52.8 | 72.9 | 35.5 | 94.3 |
| | SHAN [43] | 54.3 | 72.2 | 35.3 | 91.3 |
| | PDA [44] | 58.7 | 74.8 | 37.6 | 100.3 |
| | Ours | 58.1 | 77.8 | 40.9 | 116.6 |
| MSR-VTT | OA-BTG [9] | 41.4 | – | 28.2 | 46.9 |
| | HTM [4] | – | – | – | – |
| | ORG-TRL [6] | 43.6 | 62.1 | 28.8 | 50.9 |
| | STG-KD [10] | 40.5 | 60.9 | 28.3 | 47.1 |
| | SAM-SS [30] | 43.8 | 62.4 | 28.9 | 51.4 |
| | PMI-CAP [40] | 42.1 | – | 28.7 | 49.4 |
| | UniVL [41] | 41.6 | 62.1 | 29.4 | 52.8 |
| | RMN [37] | 42.5 | 61.6 | 28.4 | 49.6 |
| | SGN [42] | 40.8 | 60.8 | 28.3 | 49.5 |
| | SHAN [43] | 39.7 | 60.4 | 28.3 | 49.0 |
| | PDA [44] | 43.8 | 62.1 | 28.8 | 51.2 |
| | Ours | 43.5 | 62.1 | 29.4 | 51.5 |
The highest score is highlighted in bold among the metrics
As shown in Table 5, we can conclude that the spatial–temporal structure, which better comprehends video content, is more effective and comprehensive. This structure jointly captures the temporal and spatial information and the interactions across frames, contributing to a better comprehension of the video features and yielding favorable scores on most metrics. In particular, on the CIDEr metric, which was proposed specifically for captioning tasks and focuses on the video's main content, our method achieves superior results compared to previous methods.
Comparison to recent multimodal transformer models
To validate our approach's effectiveness, we conducted a comparison with several multimodal transformer models, such as TokenFusion [45] and CMNeXt [46]. For fairness, we input one-dimensional video features instead of two-dimensional video frames, as in our method. Moreover, we only input 13 frames of appearance and motion features, matching the requirement for the same number of different modal images in TokenFusion [45] and CMNeXt [46]. To adapt them to our task, we used the multimodal transformer models to model the appearance and motion features of the video, replaced a portion of the CNN with an FNN, and finally fed the processed features into the decoder for caption generation. Meanwhile, we compared the two input ways of the multimodal decoder with TokenFusion [45], whereas CMNeXt [46] fuses different modal features in its feature rectification and feature fusion modules. As shown in Table 6, our model exhibits superior performance, which proves that the video features interact comprehensively in our model under the guidance of global feature reconstruction. It is worth noting that our input way is equally applicable to TokenFusion [45] and outperforms the other way on all metrics.
Table 6. Comparison to recent multimodal transformer models on the MSVD dataset
| Dataset | Method | BLEU-4 | ROUGE-L | METEOR | CIDEr |
|---|---|---|---|---|---|
| MSVD | TokenFusion [45] (Separate) | 54.9 | 75.3 | 38.7 | 104.3 |
| | TokenFusion [45] (Joint) | 52.1 | 74.5 | 37.7 | 96.9 |
| | CMNeXt [46] | 55.0 | 75.8 | 39.5 | 105.2 |
| | Ours-baseline | 57.6 | 76.9 | 39.6 | 110.5 |
The highest score is highlighted in bold among the metrics
Comparison to multiple graph spatial modeling and motion spatial modeling
To validate the effectiveness of the multiple graphs of MGLNN [21], we compared multiple graph spatial modeling with our motion spatial modeling. Due to the absence of multi-graph labels, we took the location score matrix as the interaction graph and constructed a distance graph by calculating the Euclidean distance between objects. Following MGLNN [21], we introduced a learnable graph to fuse the information from the multiple graphs, instead of employing motion spatial modeling. The remaining settings of the network were consistent with those in RESTHT. As shown in Table 7, our model demonstrates superior performance, which suggests that our motion spatial modeling learns correct spatial relationships under the guidance of motion features. Although the learnable graph integrates information from multiple graphs, different videos have different spatial relation graphs, while the learned graph remains static after training. In contrast, our motion relation graph adapts to the motion features and object features, which is preferable for our task.
Table 7. Comparison to multiple graph spatial modeling and motion spatial modeling on the MSVD dataset
| Dataset | Method | BLEU-4 | ROUGE-L | METEOR | CIDEr |
|---|---|---|---|---|---|
| MSVD | Multiple graphs | 57.0 | 77.4 | 40.6 | 114.0 |
| | Ours | 58.1 | 77.8 | 40.9 | 116.6 |
The highest score is highlighted in bold among the metrics
Complexity analysis
To provide a thorough assessment of the efficiency and complexity of our method, we compared it, in terms of the number of parameters and running time, with several state-of-the-art methods whose source code is accessible, on the MSVD dataset. As shown in Table 8, within a certain performance range (CIDEr score > 100), our network requires fewer parameters for training than prior works, leading to a faster processing speed. Compared to RMN [37], our model significantly improves the scores across all metrics at the cost of slightly increased model complexity and runtime. SGN [42] is an exception, with the fewest parameters but a lower speed; this can be attributed to the time-consuming semantic grouping process applied in SGN [42]. Overall, the experimental results demonstrate that our method not only attains superior performance but also has higher efficiency.
Table 8. Complexity and efficiency analysis of the number of parameters and testing time on the MSVD dataset
| Method | BLEU-4 | ROUGE-L | METEOR | CIDEr | Param. Num | Testing time |
|---|---|---|---|---|---|---|
| SGN [42] | 52.8 | 72.9 | 35.5 | 94.3 | 14.0M | 84.9s |
| RMN [37] | 54.6 | 73.4 | 36.5 | 94.4 | 47.1M | 43.7s |
| SAM-SS [30] | 61.8 | 76.8 | 37.8 | 103.0 | 188.8M | 94.4s |
| UniVL [41] | 56.9 | 75.8 | 38.8 | 102.0 | 255.0M | 230.7s |
| Ours | 58.1 | 77.8 | 40.9 | 116.6 | 70.9M | 70.5s |
The highest score among BLEU-4, ROUGE-L, METEOR and CIDEr is highlighted in bold; the minimum number of parameters (Param. Num) and the shortest testing time are also highlighted in bold
Qualitative analysis
Figure 6 shows examples of generated captions for qualitative analysis on the MSVD and MSR-VTT datasets. The results clearly indicate that, when compared with the baseline method, our proposed method can generate accurate descriptions and capture precise information about fine-grained objects (e.g., “oil”, “track”, and “fingernails”) with the complement of local features and deduce the interaction and relationships between objects (e.g., “making a craft” and “paint fingernails”). It suggests that our enhanced spatial and temporal joint modeling are helpful in understanding the video contents. Moreover, our model tends to generate long and detailed sentences that closely approximate the quality and length of ground truth captions. This indicates that our model exhibits a stronger capacity for learning and generating captions.
Fig. 6 [Images not available. See PDF.]
Examples of generated video captions on the MSVD and MSR-VTT. We make a comparison between the baseline method, our method and ground truth
Limitation and future work
Currently, the vocabulary used in the video descriptions is relatively small, and the generated sentences are too monotonous to meet the requirements of real-life application scenarios. Furthermore, the knowledge domain remains fixed after model training, making expansion impossible. To tackle these issues, in the next phase we plan to design a retrieval network that searches for additional vocabulary and sentence supplements in a larger corpus or in relevant documents. This will allow us to expand the knowledge domain without further training and to generate a more diverse vocabulary.
Simultaneously, the video captioning task needs to consider the video and text information jointly. However, our model mainly leverages the video features, ignores part-of-speech (POS) tag labels, and fails to consider that different words follow different trends with respect to the video features. To address these issues, in the next phase we will design a network that predicts part-of-speech tags and generates the content words, using these tags to guide the integration of the various visual features and content words when generating descriptive sentences.
Conclusion
In this paper, we propose a new transformer-based video captioning model, the Relation-Enhanced Spatial–Temporal Hierarchical Transformer, which learns the semantic information of a video more effectively through joint spatial–temporal modeling. For spatial modeling, to tackle inadequate spatial modeling, we model the spatial relationships between objects in each frame through two modules, ASM and MSM, from the perspectives of appearance and motion association, and select the salient local objects through the context information of the appearance features. For temporal modeling, to address the lack of interaction between different features, we input multiple video features as separate tokens into the decoder and use video feature reconstruction to guide their interaction across frames. Experiments on two widely used datasets verify the effectiveness of our model. On the MSVD dataset in particular, our method surpasses other state-of-the-art methods by about 13 points in CIDEr.
Acknowledgements
This work is supported by Beijing Natural Science Foundation 4242028, the NSFC 62006015 and 61971446, and the CEFLA Audio-Video Restoration and Evaluation Key Lab of the Ministry of Culture and Tourism.
Data availability
The data included in this study may be available from the corresponding author on reasonable request.
Declarations
Conflict of interest
No potential conflict of interest was reported by the authors with regard to this work. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Venugopalan, S., Rohrbach, M., Donahue, J., et al.: Sequence to sequence-video to text[C]. In: Proceedings of the IEEE international conference on computer vision. 2015: 4534–4542. https://doi.org/10.1109/iccv.2015.515
2. Pan, P., Xu, Z., Yang, Y., et al.: Hierarchical recurrent neural encoder for video representation with application to captioning[C]. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 1029–1038. https://doi.org/10.1109/cvpr.2016.117
3. Peng, Y; Wang, C; Pei, Y et al. Video captioning with global and local text attention[J]. Vis. Comput.; 2022; 38,
4. Hu, Y., Chen, Z., Zha, Z. J., et al.: Hierarchical global-local temporal modeling for video captioning[C]. In: Proceedings of the 27th ACM International Conference on Multimedia. 2019: 774–783. https://doi.org/10.1145/3343031.3351072
5. Yan, C; Tu, Y; Wang, X et al. STAT: Spatial-temporal attention mechanism for video captioning[J]. IEEE Trans. Multim.; 2019; 22,
6. Zhang, Z., Shi, Y., Yuan, C., et al.: Object relational graph with teacher-recommended learning for video captioning[C]. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 13278–13288. https://doi.org/10.1109/cvpr42600.2020.01329
7. Sun, B; Wu, Y; Zhao, Y et al. Cross-language multimodal scene semantic guidance and leap sampling for video captioning[J]. Visual Comput.; 2022; [DOI: https://dx.doi.org/10.1007/s00371-021-02309-w]
8. Du, X; Yuan, J; Hu, L et al. Description generation of open-domain videos incorporating multimodal features and bidirectional encoder[J]. Vis. Comput.; 2019; 35, pp. 1703-1712. [DOI: https://dx.doi.org/10.1007/s00371-018-1591-x]
9. Zhang, J., Peng, Y.: Object-aware aggregation with bidirectional temporal graph for video captioning[C]. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 8327–8336. https://doi.org/10.1109/cvpr.2019.00852
10. Pan, B., Cai, H., Huang, D. A,, et al.: Spatio-temporal graph for video captioning with knowledge distillation[C]. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 10870–10879. https://doi.org/10.1109/cvpr42600.2020.01088
11. Radford, A; Wu, J; Child, R et al. Language models are unsupervised multitask learners[J]. OpenAI blog; 2019; 1,
12. Szegedy, C., Liu, W., Jia, Y., et al.: Going deeper with convolutions[C]. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 1–9. https://doi.org/10.1109/cvpr.2015.7298594
13. Szegedy, C., Ioffe, S., Vanhoucke, V., et al.: Inception-v4, inception-resnet and the impact of residual connections on learning[C]. In: Thirty-first AAAI conference on artificial intelligence. 2017. https://doi.org/10.1609/aaai.v31i1.11231
14. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition[C]. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770–778. https://doi.org/10.1109/cvpr.2016.90
15. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014. https://doi.org/10.48550/arXiv.1409.1556
16. Tran, D., Bourdev, L., Fergus, R., et al.: Learning spatiotemporal features with 3d convolutional networks[C]. In: Proceedings of the IEEE international conference on computer vision. 2015: 4489–4497. https://doi.org/10.1109/iccv.2015.510
17. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset[C]. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6299–6308. https://doi.org/10.1109/cvpr.2017.502
18. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?[C]. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2018: 6546–6555. https://doi.org/10.1109/cvpr.2018.00685
19. Ren, S; He, K; Girshick, R et al. Faster r-cnn: Towards real-time object detection with region proposal networks[J]. Adv. Neural Info. Process. Syst.; 2015; [DOI: https://dx.doi.org/10.1109/tpami.2016.2577031]
20. Roy, AM; Bhaduri, J. DenseSPH-YOLOv5: An automated damage detection model based on DenseNet and Swin-Transformer prediction head-enabled YOLOv5 with attention mechanism[J]. Adv. Eng. Inform.; 2023; 56, [DOI: https://dx.doi.org/10.1016/j.aei.2023.102007]
21. Jiang, B; Chen, S; Wang, B et al. MGLNN: Semi-supervised learning via multiple graph cooperative learning neural networks[J]. Neural Netw.; 2022; 153, pp. 204-214. [DOI: https://dx.doi.org/10.1016/j.neunet.2022.05.024]
22. Zhang, Z., Qi, Z., Yuan, C., et al.: Open-book video captioning with retrieve-copy-generate network[C]. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021 https://doi.org/10.1109/cvpr46437.2021.00971
23. Lei, J., Wang, L., Shen, Y., et al.: Mart: Memory-augmented recurrent transformer for coherent video paragraph captioning[J]. arXiv preprint arXiv:2005.05402, 2020. https://doi.org/10.18653/v1/2020.acl-main.233
24. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need[J]. Adv. Neural Info. Process. Syst. 2017, 30.
25. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al.: An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint arXiv:2010.11929, 2020. https://doi.org/10.48550/arXiv.2010.11929
26. Arnab, A., Dehghani, M., Heigold, G., et al.: Vivit: A video vision transformer[C]. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 6836–6846. https://doi.org/10.1109/iccv48922.2021.00676
27. Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding?[C]. In: ICML. 2021, 2(3): 4. https://doi.org/10.48550/arXiv.2102.05095
28. Ba, J. L., Kiros, J. R,, Hinton, G. E.: Layer normalization[J]. arXiv preprint arXiv:1607.06450, 2016. https://doi.org/10.48550/arXiv.1607.06450
29. Li, Z; Li, Z; Zhang, J et al. Bridging text and video: a universal multimodal transformer for audio-visual scene-aware dialog[J]. IEEE/ACM Trans. Audio Speech Lang. Process.; 2021; 29, pp. 2476-2483. [DOI: https://dx.doi.org/10.1109/taslp.2021.3065823]
30. Chen, H; Lin, K; Maye, A et al. A semantics-assisted video captioning model trained with scheduled sampling[J]. Front. Robot. AI; 2020; 7, [DOI: https://dx.doi.org/10.3389/frobt.2020.475767] 475767.
31. Papineni, K., Roukos, S., Ward, T., et al.: Bleu: a method for automatic evaluation of machine translation[C]. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 2002: 311–318. https://doi.org/10.3115/1073083.1073135
32. Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments[C]. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 2005: 65–72
33. Lin, C. Y.: Rouge: A package for automatic evaluation of summaries[C]//Text summarization branches out. 2004: 74–81
34. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: Consensus-based image description evaluation[C]. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 4566–4575. https://doi.org/10.1109/cvpr.2015.7299087
35. Chen, D., Dolan, W. B.: Collecting highly parallel data for paraphrase evaluation[C]. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. 2011: 190–200
36. Xu, J., Mei, T., Yao, T., et al.: Msr-vtt: A large video description dataset for bridging video and language[C]. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 5288–5296. https://doi.org/10.1109/cvpr.2016.571
37. Tan, G., Liu, D., Wang, M., et al.: Learning to discretely compose reasoning module networks for video captioning[J]. arXiv preprint arXiv:2007.09049, 2020. https://doi.org/10.24963/ijcai.2020/104
38. Anderson, P., He, X., Buehler, C., et al.: Bottom-up and top-down attention for image captioning and visual question answering[C]. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 6077–6086. https://doi.org/10.1109/cvpr.2018.00636
39. Kingma, D. P., Ba, J.: Adam: A method for stochastic optimization[J]. arXiv preprint arXiv:1412.6980, 2014. https://doi.org/10.48550/arXiv.1412.6980
40. Chen, S., Jiang, W., Liu, W., et al.: Learning modality interaction for temporal sentence localization and event captioning in videos[C]. In: European Conference on Computer Vision. Springer, Cham, 2020: 333–351. https://doi.org/10.1007/978-3-030-58548-8_20
41. Luo, H., Ji, L., Shi, B., et al.: Univl: A unified video and language pre-training model for multimodal understanding and generation[J]. arxiv preprint arxiv:2002.06353, 2020
42. Ryu, H., Kang, S., Kang, H., et al.: Semantic grouping network for video captioning[C]. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2021, 35(3): 2514–2522. https://doi.org/10.1609/aaai.v35i3.16353
43. Deng, J; Li, L; Zhang, B et al. Syntax-guided hierarchical attention network for video captioning[J]. IEEE Trans. Circuits Syst. Video Technol.; 2021; 32,
44. Wang, L; Li, H; Qiu, H et al. Pos-trends dynamic-aware model for video caption[J]. IEEE Trans. Circuits Syst. Video Technol.; 2021; [DOI: https://dx.doi.org/10.1109/tcsvt.2021.3131721]
45. Wang, Y., Chen, X., Cao, L., et al.: Multimodal token fusion for vision transformers[C]. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 12186–12195. https://doi.org/10.1109/cvpr52688.2022.01187
46. Zhang, J., Liu, R., Shi, H., et al.: Delivering Arbitrary-Modal Semantic Segmentation[C]. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 1136–1147. https://doi.org/10.1109/cvpr52729.2023.00116