RESTHT: relation-enhanced spatial–temporal hierarchical transformer for video captioning

Abstract

The video captioning task is generating description sentences by learning semantic information. It has a wide range of applications in areas such as video retrieval, automatic generation of subtitles and blind assistance. Visual semantic information plays a decisive role in video captioning. However, traditional methods are relatively rough for video feature modeling, failing to harness local and global features to understand temporal and spatial relationships. In this paper, we propose a video captioning model based on the Transformer and GCN network called “Relation-Enhanced Spatial–Temporal Hierarchical Transformer” (RESTHT). To address the above issues, we present a spatial–temporal hierarchical network framework to jointly model local and global features in terms of both time and space. For temporal modeling, our model learns the direct interactions between diverse video features and sentence features in the temporal sequence via the pre-trained GPT2, and the global feature construction encourages it to capture essential and relevant information. For spatial modeling, we use self-attention and GCN networks to learn the spatial relationship from appearance and motion perspectives jointly. Through spatial–temporal modeling, our method can comprehend the global time–space relationships of complex events in videos and catch the interaction between different objects to generate more accurate descriptions applicable to universal video captioning tasks. We conducted experiments on two widely used datasets, and especially in the MSVD dataset, our model improves the score of CIDEr by 6.1 compared to the baseline and excels present methods by 13. The results verify that our model can fully model the temporal and spatial relationship and outperforms other related models.

Full text

Translate

Turn on search term navigation

Introduction

Video captioning task refers to that given a video, and the computer outputs descriptions for the video. Most methods [1, 2, 3, 4, 5, 6, 7, 8–9] are based on the encoder-decoder framework. They obtain the video semantic information in the encoder part and perform visual reasoning in the decoder to generate sentences. Distinct from image captioning, video captioning requires joint visual reasoning over space and time.

Most existing video captioning methods [1, 2, 6, 10] usually focus on one side that lacks spatial–temporal joint modeling of video features. Recently, some works [4, 5] employ LSTM as spatial–temporal architecture modeling. Only the previous hidden state is applied to guide the spatial–temporal modeling and integrate different features, failing to make local and global features interact across frames and ignoring small but critical objects around.

In order to tackle the above problems, we propose a novel relation-enhanced spatial–temporal hierarchical transformer: RESTHT. (1) To tackle the inadequate spatial modeling in existing state-of-the-art models, we propose an enhanced spatial relation model to learn comprehensive relation messages from surrounding objects in the encoder, which conducts joint spatial modeling of objects through two perspectives and two distinct types of networks to improve representation ability. On the one hand, we capture the appearance of similar information of objects through a layer of transformer. On the other hand, we build the relation coefficient matrix of different objects through object motion and location information in the GCN network. (2) To address the issue of ignoring small objects and for better local–global modeling, the visual feature context information of the current frame acts as a global guide to select salient ones from several objects. An example is shown in Fig. 1, where a person is cutting ribs with a knife, and the common area between the salient object “ribs” and “knife” is smaller than that of the “chopping board,” resulting in low appearance similarity. However, our network can learn the motion relationships between small but critical objects guided by motion features. (3) To address the lack of interaction of different features, we utilize the pre-trained GPT2 [11] to enable multimodal features at multiple frames to interact on the decoder, which can feed multiple tokens fusing through a multi-level attention mechanism. We take the global and local features of different frames as input tokens into the pre-trained GPT2 to conduct multi-frame joint modeling of the two visual features in time sequence and acquire joint representation of various video features and texts.

Fig. 1 [Images not available. See PDF.]

Example of object detection in video where the small common area between “ribs” and “knife leads to low similarity in appearance

To summarize, our work has the following contributions:

We propose a new spatial−temporal hierarchical transformer framework RESTHT for video captioning. About spatial modeling, we combine transformer and GCN to learn local object relationships from two views. About temporal modeling, we utilize transformer to model local features and global features in time sequence jointly. Through spatial-temporal structure, our model can better understand video content and generate detailed and diverse descriptions for a video.
For better local-global and appearance-motion modeling, we propose an enhanced spatial relation model to jointly learn the spatial relationship from appearance and motion perspectives. Firstly, we can capture the appearance-based spatial relationship between objects through the self-attention mechanism and learn important objects’ local details and relationships through the guided local object association modeling network. Secondly, to obtain the motion-based spatial relationship, we jointly consider the object position and motion-relevant information to build the object-relational graph and learn the critical motion interaction of objects through the GCN network.
In experiments, we present adequate evaluations, and the results show that the proposed RESTHT achieves a comparable and even better performance than the state-of-the-art methods on the two benchmark datasets. Especially in the MSVD dataset, our method outperforms other state-of-the-art by about 13 scores in CIDEr.

Related work

Video captioning

Video captioning is a multimodal generation task with learning the joint representation of video and language to generate video descriptions. It is necessary to conduct joint modeling in temporal and spatial relationships. Generally, global appearance features extracted from a 2D-CNN and motion features extracted from a 3D-CNN are used for temporal model, such as Inception-v1 [12], IncepResNetV2 [13], ResNet [14], VGG [15], C3D [16], I3D [17] and 3D ResNeXt [18], while object features extracted from RCNN [19] are used for spatial model. For future real-world scenarios, we can refer to the real-time detection model, for example, the DenseSPH-YOLOv5 [20], which enhanced its performance by introducing additional detection heads to better focus on small objects. We provide relevant background on previous work exploiting local and temporal features.

Object-level spatial modeling

In order to learn the interaction relationship and local details of objects, most networks have used pre-trained object detectors to extract object features and learn the interactions between local objects. ORG-TRL [6] constructed a learnable object-relational graph through cosine distance between the appearance features of objects and employed GCN to learn the local object relationship. OA-BTG [9] introduced a bidirectional time graph to capture the temporal evolutions of salient objects in video and designed VLAD to update behavior characteristics. STG-KD [10] used the normalized Intersection over Union (IoU) value between two objects to build a spatial graph. Regarding graph modeling, we concern with multiple graph learning to leverage the supplementary data from multiple graphs. For example, MGLNN [21] incorporated multiple graph information and acquired an optimal graph data representation for semi-supervised classification tasks, which may be used in spatial modeling by learning multiple objects' graphs.

Temporal modeling.

Most networks fetch temporal information of keyframes by extracting each keyframe's appearance and motion features as global features. The global features are widely used in video captioning owing to its rich visual semantics. In the early days, [1] obtained video representation by fusing global video features in the global pooling layer, destroying global frames' temporal structure. Then, in order to obtain more video representation from global features, HTM [4] and OA-BTG [9] utilized LSTM to learn the time dependence of global features at frame-level. They generated hidden states at each time step, summarizing the previous frame information. ORG-TRL [6], STAT [5] and OpenBook [22] directly operated on global features. Ultimately, they applied previous hidden state to aggregate global information through an attention mechanism and inputted it into a language decoder based on RNNs. While some networks used transformer, MART [23] directly inputted the 2DCNN features into the transformer for training.

Regarding spatial modeling, most networks [6, 9, 10] are not guided by motion features and learn the spatial relationship from a singular perspective and in a single network. It is easy to ignore objects with strong correlations in motion but small common areas. To address the problems, our model learns the spatial interactive information of objects from multiple views. We capture the appearance relationship of objects through self-attention and learn the motion relationship of objects in GCN. Regarding temporal modeling, the global features are not modelled fully and lack interaction with different video features in most networks [1, 4, 5–6, 9]. Meanwhile, the transformer-based model [23] with excellent temporal modeling has yet to be well explored for video captioning. To tackle the shortcoming, we employ pre-trained GPT2 as our temporal model and language decoder and feed multiple tokens to model the appearance, motion and region features of different frames, which can make different features interact across frames and capture long-term dependency and interaction.

Transformer

Transformer [24] has been widely used in natural language processing, such as machine translation, pre-trained language modeling, text generation. It can capture the global and remote dependency of input tokens through the attention mechanism, which breaks the limitation that the RNN model cannot be used for parallel computing. In recent years, transformer has become popular in computer vision. VIT [25] and other networks are widely used in image classification tasks. VIVIT [26] and TimeSformer [27] have also achieved good results in video classification by spatial–temporal modeling. However, large models such as VIT [25] are difficult to optimize for small datasets, and the results are often lower than the model based on CNNs. Furthermore, the training of models at the pixel level necessitates robust hardware capabilities. Therefore, we extract video features through CNN and feed them into transformer, which can preserve the biased induction of convolution and model global dependency through the self-attention mechanism.

Methods

Overview

As shown in Fig. 2, our model consists of a local encoder and a multimodal decoder. In the local encoder part, we propose an enhanced spatial relation model to learn the spatial relationship of objects from two views. We employ a one-layer transformer to model the appearance relationship of local features and use the GCN network to model objects' position and motion relationships. The RNN network is applied to obtain the context information of appearance features of the current frame to choose salient objects. In the multimodal decoder part, we input each frame's local and global features and caption to the model implemented by GPT2 to learn joint representation between visual features and sentences to decode the sentences and reconstruct the global visual features.

Fig. 2 [Images not available. See PDF.]

Overview of the proposed RESTHT: Relation-Enhanced Spatial–Temporal Hierarchical Transformer. We use two modules, Appearance Spatial Modeling (ASM) and Motion Spatial Modeling (MSM), to jointly model spatial relationships from appearance and motion perspectives through self-attention and GCN. A temporal decoder is employed to acquire a joint representation and facilitate interactions among various features within a time sequence

Local encoder

In the local encoder part, we propose an enhanced spatial relation model to mainly model the spatial relationship of objects to augment each local feature. We first extract a certain number of objects in each frame and learn the relationship and local details of objects through the spatial relation network. For a given series of video frames, we uniformly extract $L_{g}$ keyframes for global features and $L_{r}$ keyframes for local features. We extract appearance features $F = {f_{i}}, i = 1, \dots, L_{g}$ from 2D-CNN, motion features $M = {m_{i}}, i = 1, \dots, L_{g}$ from 3D-CNN and adopt the pre-trained Faster-RCNN to extract local features $R^{i} = \{r_{k}^{i}\}, i = 1, \dots, L_{r}, k = 1, \dots, N, r_{k}^{i}$ representing the $k_{th}$ object in the $i_{th}$ keyframe. We reduce the dimensionality of local features through a fully connected layer:

\begin{matrix} R^{i} = W_{1} R^{i} + b_{1} \end{matrix}

where

W_{1}

and

b_{1}

are learnable parameters.

Appearance-based spatial relation modeling

Figure 3 shows our method for spatial relation modeling (ASM) in detail. The appearance features of local objects with interactive relationships have high similarity. To model the appearance relationship of objects, we input the local features of objects in each frame into the transformer. We perform weighted fusion on related objects by calculating the appearance similarity of objects as the attention weight. Because we employ one layer, we do not adopt residual connection. Its calculation formulas are as follows:

\begin{matrix} A t t (Q, K, V) = s o f t m a x (\frac{Q K^{T}}{\sqrt{d}}) V \end{matrix}

\begin{matrix} \begin{matrix} {R O}^{i} = M u l t i H e a d (R^{i}, R^{i}, R^{i}) \\ = W^{o} C o n c a t ({head}_{1}, \dots, {head}_{m}) \\ {where head}_{i} = A t t (W_{i}^{Q} R^{i}, W_{i}^{k} R^{i}, W_{i}^{V} R^{i}) \end{matrix} \end{matrix}

\begin{matrix} M L P ({R O}^{i}) = W_{3} R e L U (W_{2} {R O}^{i} + b_{2}) + b_{3} \end{matrix}

{LA}^{i} = LayerNorm (MLP ({RO}^{i}))

where

W_{*}

are learnable weights,

b_{*}

are learnable bias.

LayerNorm

denotes layer normalization[28]. Motion-based spatial relation modeling.

Fig. 3 [Images not available. See PDF.]

Overview of the proposed Appearance Spatial Modeling (ASM). We learn the appearance relationship of objects through a transformer block

To learn more related messages in motion from surrounding objects, we jointly consider relative spatial location information and motion-relevant information between two object regions $r_{i}$ and $r_{j}$ . An overview of Motion Spatial Modeling (MSM) is shown in Fig. 4. We compute the location relation through the areas of two objects. We define the location association weight $p_{(m, n)}$ of $r_{n}^{i}$ to $r_{m}^{i}$ as follows:

Fig. 4 [Images not available. See PDF.]

Overview of the proposed Motion Spatial Modeling (MSM). We learn the motion relationship of objects through GCN under the guide of motion feature and location information of objects

\begin{matrix} p_{(m, n)} = \frac{a r e a (r_{m}^{i}) ⋂ a r e a (r_{n}^{i})}{a r e a (r_{m}^{i})} \end{matrix}

The $p_{(m, n)}$ indicates the influence of $r_{n}^{i}$ to $r_{m}^{i}$ from the location perspective. Thus, the location scores matrix $P = {p_{(m, n)}}$ , $m = 1, \dots, N, n = 1, \dots, N$ is constructed, and the two objects with larger overlapping areas exhibit a stronger correlation.

To consider the surrounding objects with low appearance and location correlation but high motion correlation, we use the motion feature $M$ to guide the calculation of motion correlation weight. We combine the aggregated object feature with motion feature as a query and take the aggregated object feature as a key to calculate the motion correlation weight:

\begin{matrix} S Q = W_{4} m^{i} + W_{5} {L A}^{i} + b_{4} \end{matrix}

\begin{matrix} S K = W_{6} {L A}^{i} \end{matrix}

\begin{matrix} s e = W_{7} t a n h (S Q + S K) \end{matrix}

where

W_{4}

W_{5}

W_{6}

W_{7}

b_{4}

are parameters.

{L A}^{i} \in R^{N \times D}

denotes

N

object nodes with

D

-dimensional feature aggregated by self-attention mechanism.

{L A}^{i}

contains the surrounding scene information, which is conducive to inferring the objects with high motion correlation.

s e

computes the relevant score between the motional scene features of each object

S Q

and the scene features of each object

S K

. Then, we mask

s e

according to

P

matrix:

\begin{matrix} s e = \{\begin{matrix} {se}_{(k 1, k 2)}, {ifp}_{(k 1, k 2)} > 0 \\ - i n f \end{matrix}) \end{matrix}

After the mask process, the relevance scores are normalized as attention scores matrix $S P$ :

\begin{matrix} S P = \frac{exp (s, e)}{\sum_{k}^{N} exp ({s e}_{k})} \end{matrix}

For given $N$ objects, we treat each object as a node. We design a motional relation graph $A \in R^{N \times N}$ to aggregate the object features in GCN. The motional relation graph $A$ combines the location information and motion information, which is defined as follows:

\begin{matrix} A = S P + P \end{matrix}

We employ GCN for motion-relevant reasoning and the update of object feature $R^{i}$ :

\begin{matrix} {G H}^{(l + 1)} = σ ({\tilde{D}}^{- \frac{1}{2}}, A, {\tilde{D}}^{- \frac{1}{2}}, {G H}^{(l)}, W^{(l)}) \end{matrix}

where

W^{(l)}

is a layer-specific trainable weight matrix,

σ (\cdot)

denotes

ReLU (\cdot)

activation function and

{\tilde{D}}_{ii} = \sum_{j} A_{ij}

{G H}^{(l)} \in R^{N \times D_{g}}

denotes the matrix of activation in the

l_{th}

layer and

{{G H}^{0} = R}^{i}

. Then, we fuse the local features aggregated by the appearance and motion information in two networks:

\begin{matrix} L O = W_{8} [G H ; L A] + b_{8} \end{matrix}

where

W_{8}

and

b_{8}

are learnable parameters and

[G H ; L A]

denotes the concatenation of the two matrices.

In order to obtain the spatial relationships and detailed features of important objects, we apply the appearance features of the video to guide the network. We input the appearance feature $F = {f_{1}, f_{2}, \dots, f_{L_{g}}}$ to the RNN to obtain the context information of $F$ as a guide:

\begin{matrix} {H R}_{t} = t a n h (U * F_{t} + W_{9} * {H R}_{t - 1} + b_{9}) \end{matrix}

where

F_{t}

is the 2D-CNN feature of the current frame,

{H R}_{t - 1}

is the hidden state at the previous time,

{H R}_{t}

is the hidden state at this moment, and

U, W_{9}

and

b_{9}

are learnable parameters. Then, we input the local feature

{L O}^{i}

of the

i_{th}

frame as a key and value and the hidden state

{H R}^{i} \in R^{1 \times d h}

as a query into a transformer layer, where

dh

is the hidden layer dimension:

\begin{matrix} {RD}^{i} = M u l t i H e a d ({HR}^{i}, {LO}^{i}, {LO}^{i}) \\ = C o n c a t (h e a d_{1}, \dots, h e a d_{m}) W^{od} \\ w h e r e h e a d_{i} = A t t (W_{i}^{Qd} {HR}^{i}, W_{i}^{kd} {LO}^{i}, W_{i}^{Vd} {LO}^{i}) \end{matrix}

\begin{matrix} M L P ({R D}^{i}) = W_{10} R e L U (W_{11} {R D}^{i} + b_{11}) + b_{10} \end{matrix}

{tr}^{i} = LayerNorm (MLP, ({RD}^{i}))

where

W_{*}

are learnable weights;

b_{*}

are learnable bias. Finally, we can obtain the updated local feature

T R = \{{t r}^{1}, \dots, {t r}^{L_{r}}\} .

Multimodal decoder

As depicted in Fig. 5, the multimodal decoder is a pre-trained GPT2 model with a two-layer transformer. Following [29], we introduced two tasks to fine-tune our model. One task is to reconstruct the original global features, given the caption and other global features. Another task is to generate captions based on video features and previously generated words.

Fig. 5 [Images not available. See PDF.]

Overview of the proposed Multimodal Decoder (pre-trained GPT2)

To obtain more information, inspired by several prior works [30], we also obtain semantic information in a multi-label classification task. We treat the predicted probability distribution of attributes as semantic feature $S$ . We concatenate video global features $M$ , $F$ and local features $T R$ with semantic features $S,$ respectively. We employ a fully connected layer to transform the appearance, motion and local features into video tokens $U_{d}^{f}$ , $U_{d}^{m}$ and $U_{d}^{r}$ of the same dimension. And the words are embedded into textual tokens $U_{d}^{s}$ :

\begin{matrix} U_{d}^{s} = W_{e} y \end{matrix}

\begin{matrix} U_{d}^{f} = W_{12} [F ; S] + b_{12} \end{matrix}

\begin{matrix} U_{d}^{m} = W_{13} [M ; S] + b_{13} \end{matrix}

\begin{matrix} U_{d}^{r} = W_{14} [T R ; S] + b_{14} \end{matrix}

where

W_{12}

W_{13}

W_{14}

b_{12}, b_{13}

and

b_{14}

are learnable parameters, and

W_{e}

is a learnable word embedding matrix. We obtain the final representation after summing up positional encoding and segment embedding:

{h d v}_{0} = {W_{s}^{v} U}_{d}^{v} + W_{p}

{h d y}_{0} = W_{s}^{y} U_{d}^{s} + W_{p}

where

W_{s}^{v}

and

W_{s}^{y}

denote video segment embedding and caption segment embedding, respectively,

W_{p}

is positional encoding matrix, and

U_{d}^{v} = {{U_{d}^{f} ; U}_{d}^{m} ; U_{d}^{r}}

denotes video tokens. All input tokens

U_{d} = {{U_{d}^{v} ; U}_{d}^{s}}

consist video tokens and textual tokens.

{\cdot ; \cdot}

denotes the concatenation operation. We feed all tokens into the transformer blocks of GPT2:

\begin{matrix} {h d}_{l} = t r a n s f o r m e r b l o c k ({h d}_{l - 1}) \forall i \in [1, t n] \end{matrix}

\begin{matrix} P_{(u)} = s o f t m a x (W_{e}, {h d}_{tn}^{s}) \end{matrix}

\begin{matrix} T_{vg} = W_{15} {h d}_{tn}^{vg} + b_{15} \end{matrix}

where

{h d}_{0} = {{h d v}_{0} ; {h d y}_{0}}

tn

is the number of layers,

{h d}_{tn}^{s}

is the output target textual token corresponding to input textual token

U_{d}^{s}

P_{(u)}

is the probability distribution of target word,

{h d}_{tn}^{vg}

is the output target video token corresponding to input token

{{{U}_{d}^{f} ; U}_{d}^{m}}

T_{vg}

is the reconstructing feature with the same dimensional with

G = \{F ; M ; S\},

and

W_{15}

and

b_{15}

are learnable parameters.

In the language modeling, we minimize the negative log-likelihood loss function to achieve the caption generation task given a video clip v and previous words $y_{1 : t - 1}$ :

L c = - \sum_{t = 1}^{L_{s}} log (Pr (y_{t} | y_{1 : t - 1}, v))

where

y_{t}

is the predicted word at

t_{th}

time step,

Pr (\cdot)

is probability distribution and

Ls

denotes the length of the sentence.

In the video reconstruction task, we train the model by minimizing $L 2$ loss:

L v = \frac{1}{L_{g}} \sum_{t = 1}^{L_{g}} ‖ T_{vg} - G ‖_{2}^{2}

The overall loss $Lo$ is the combination of $Lc$ and $Lv$ :

\begin{matrix} L o = L c + L v \end{matrix}

Experiments

Dataset

We use widely used metrics, including BLEU4[31], METEOR[32], ROUGE-L [33] and CIDEr[34], to evaluate our model on two video captioning datasets. The higher scores can demonstrate that our model achieves better performance.

We conduct experiments on two benchmark datasets, detailed below.

MSVD [35] is a collection of 1970 open-domain video clips collected from YouTube. Each video is 10 to 25 s long, with roughly 40 English captions. We used standard segmentation to divide the dataset into 1,200 training videos, 100 verification videos, and 670 test videos.

MSR-VTT [36] is another widely used video captioning dataset. It comprises 10,000 open-domain videos with 20 categories, and 20 English descriptions accompany each video. Following the existing works, we used the standard segmentation for fair comparison and divided the dataset into 6513 training videos, 497 verification videos, and 2990 test videos.

Implementation details

Feature extraction

Following [37] to process the video, we uniformly select 13 keyframes in each video for extracting global features and use InceptionResNetV2 [13] and I3D [17] as the 2D and 3D CNN feature extractors. Unlike global features, we sample 5 keyframes in each video and apply FatserRCNN [19] trained by [38] to extract 6 region features per frame.

Training and model details

The optimizer of our model is Adam Optimizer [39], and we reduce learning rates from 1e-3 to 6.4e-4. The transformer of the local encoder has a hidden state size of 256 and 8 attention heads. The GCN of the local encoder has 1 layer and a hidden state size of 512. In particular, we set the dimension of local and global features and word embedding size as 768. In the multimodal decoder, the number of transformer layers is 2. We train the model on two datasets for 20 epochs with the batch size of 8.

Ablation study

To verify the effectiveness of each part of the model, we set up a group of ablation experiments for comparison, including the following three structures, denoted as Baseline (only temporal decoder), Baseline + ASM (Appearance spatial modeling), and Baseline + ASM + MSM (Motion spatial modeling).

Baseline: In the baseline, we only feed the appearance, motion, and semantic features into the multimodal decoder as shown in Fig. 5 to validate the effectiveness of global features and the intersection across frames. Through the attention mechanism, the baseline can learn the interaction among global features and captions. It also can utilize rich semantic prior knowledge of GPT2 to generate better descriptions.
Baseline + ASM: In the model, we deploy appearance spatial modeling with the multimodal decoder to validate the effectiveness of appearance spatial modeling of local features. We augment a transformer layer to learn the appearance spatial relationship through appearance similarity between local objects.
Baseline + ASM + MSM: ASM structure only models objects from appearance similarity. In the model, we conduct more comprehensive spatial modeling and augment GCN to learn the motion spatial relationship through location information and motion relationship.

Effect of appearance spatial modeling.

Comparing the results of Baseline and Baseline + ASM, we observe that Baseline + ASM outperforms Baseline in most metrics on both datasets, especially MSVD. The results indicate the effectiveness of appearance spatial modeling in local features. Through a transformer layer, the object can explore the visual association with other objects. Subsequently, the context information of the appearance feature can select the salient objects of current frame. The salient object, updated by ASM, contains the object's visual scene information. The added local information helps the model locate the salient object and learn the main events of the salient object in the temporal interactions of video features.

Effect of motion spatial modeling

From Table 1, it is evident that the overall model outperforms the two previous models. The overall model combines both appearance spatial modeling and motion spatial modeling in two distinct types of networks. It takes into account not only appearance associations but also motion associations, leading to a deeper understanding of the relationships between the salient object and the surrounding objects.

Table 1. Results of the ablation study with various settings on the MSVD and MSR-VTT datasets

Dataset	Method	BLEU-4	ROUGE-L	METEOR	CIDEr
MSVD	Baseline	57.6	76.9	39.6	110.5
	Baseline + ASM	57.6	76.8	40.5	111.9
	Baseline + ASM + MSM	58.1	77.8	40.9	116.6
MSR-VTT	Baseline	41.9	61.4	28.4	50.3
	Baseline + ASM	43.3	62.2	29.3	50.7
	Baseline + ASM + MSM	43.5	62.1	29.4	51.5