Audio-Visual Action Recognition Using Transformer

1. Introduction

As the sources of video data vary across diverse fields such as the military [1], unmanned systems [2], surveillance systems [3], and personal security [4], the range of human actions depicted in videos also becomes highly diversified. As a result, recognizing these actions presents a significant challenge. This challenge necessitates the utilization of more sophisticated methods and additional information sources to accurately differentiate between these actions. We note that the traditional approaches [5,6,7,8,9,10] to action recognition in videos have predominantly relied solely on visual information.

With significant advancements in video-based deep learning techniques [7,8,9,10,11,12,13,14], visual features have become the primary clues for tasks related to video understanding. This predominance stems from the visual nature of videos, where actions, movements, and scenes are primarily conveyed through visual cues. Visual features, such as shapes, colors, and motion patterns, provide critical information for interpreting content and discerning different actions or events within a video. In the two-stream approach [7,8,11], spatial and temporal features are extracted from each spatial and temporal stream. For capturing temporal features, [7,11] utilized optical flow, and [8] used dynamic images. By extending input data along the temporal dimensions, [9,10,12,13,14] enabled the learning of spatio-temporal features.

However, relying solely on visual information can be limiting, especially in scenarios where visual cues are ambiguous or obscured. We recognize that most videos inherently comprise both visual and audio components, which have the potential to complement each other. Audio can enrich video understanding by providing additional layers of information not always visible, such as dialogue, background sounds, or audio cues that correspond to off-screen activities. In challenging situations such as occlusions, poor lighting, or ambiguous actions, the audio component can provide indispensable contextual clues, thereby enhancing action recognition. For instance, the sound of footsteps or a vehicle in motion can offer vital insights into an action that is not visually discernible. Recent research [15,16,17] has begun exploring the potential of combined audio-visual data, emphasizing its significance in advancing video understanding. Wang et al. [15] proposed a framework to learn from video appearance, motion, and audio, investigating both early and late fusion. Arandjelovic et al. [16] utilized the correspondence between visual and audio information for training the network. Lastly, Xiao et al. [17] utilized a pathway that connects audio features to the layers learning visual features, thereby training a unified representation. This synergy between audio and visual elements allows for a more comprehensive approach to video analysis, leading to improved accuracy and robustness in action recognition tasks.

In previous CNN-based approaches, the fusion of different modalities was achieved through straightforward methods, such as early fusion and late fusion. Early fusion, also known as feature-level fusion, involves combining features from different modalities, like audio and visual, at the initial stage before feeding them into a learning model. This method can capitalize on the raw data’s synergy but may also introduce noise and complexity. On the other hand, late fusion, or decision-level fusion, occurs at a much later stage, where the decisions or predictions made from each individual modality are combined. While this approach maintains the purity of each modality’s data, it may fail to capture deeper, more complex inter-modal interactions. Meanwhile, with the advent of the transformer architecture, more sophisticated attention mechanisms, such as self-attention [18], have proven to be highly effective in capturing complex relationships within a single modality by enabling the model to assign varying degrees of importance to different elements in the input sequence. This powerful capability of self-attention can be harnessed to seamlessly integrate information from diverse modalities, like audio and visual data, by extending its application beyond a single modality. This enables the model to dynamically adjust its focus and allocate attention to the most relevant features from each modality, thereby facilitating the effective fusion of multi-modal information and enhancing the overall performance of tasks like action recognition.

Our approach is centered on two innovative methodologies aimed at enhancing the effectiveness and precision of multi-modal feature learning and fusion for action recognition. Firstly, we introduce a transformer model designed to learn and extract features from multi-modal data, specifically audio and visual information. This model processes individual modalities independently, capturing intricate relationships and patterns within each modality. It is adept at extracting high-level semantic features from both audio and visual data, ensuring a comprehensive understanding of each modality’s unique characteristics. Secondly, we present an attention transformer module that enables the effective integration of the extracted audio and visual features. This module is engineered to compute cross-modal attention between audio and visual elements, allowing the model to selectively emphasize the most relevant modality in a given context. Leveraging the attention mechanism, the model dynamically assigns importance weights to audio and visual features, leading to more robust and accurate action recognition by synthesizing the strengths of each modality.

2. Related Work

Audio-visual action recognition has attracted considerable attention in recent years as researchers strive to leverage the synergistic relationship between audio and visual information, thereby improving the performance of action recognition systems. One of the earliest approaches [19] to audio-visual action recognition focused on fusion techniques, including early fusion, late fusion, and intermediate fusion. These methods aimed to combine audio and visual features at different stages of the training process, whether at the input level, decision-making stage, or intermediate points throughout the training process.

With the advent of deep learning, significant advancements have been made in audio-visual action recognition. Researchers have employed convolutional neural networks (CNNs) to learn spatial features from video frames, and recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks to capture temporal patterns [20,21] in audio and visual data. Various strategies have also been proposed to fuse the features, including concatenation, element-wise summation, and attention mechanisms [22,23].

More recently, transformer-based models have gained popularity in audio-visual action recognition because of their ability to model long-range dependencies and capture complex interactions between audio and visual modalities. The Multimodal Bottleneck Transformer (MBT) [24] introduces a transformer architecture that constructs multi-modal bottleneck tokens to efficiently fuse video and audio features from image and audio transformers, surpassing traditional late-fusion methods.

In this paper, we deviate from using the Video Swin transformer [13] and opt for a single Swin Transformer [25] designed for still images. We apply this single transformer to all modalities—image, video, and audio—enabling greater resource efficiency without compromising competitive performance.

3. Method

In this section, we introduce our transformer-based architecture for audio-visual action recognition. As shown in Figure 1, the structure consists of three main components: data processing, feature extraction, and the modal fusion module (MFM). In the data processing part, three distinct operations are performed: randomly selecting a single frame ( $X_{i}$ ) from the video sequence (X), selecting a set of T frames from X, which are then passed through the motion module, and converting the audio signal into a spectrogram. We refer readers to [26] for the methodology of the frame selection. In the feature extraction part, features are extracted from each input modality produced by the data processing phase. Given that the outputs of the data processing module for the three modalities share a common 2D image format with three channels, we can employ a single Swin transformer [25] for subsequent feature extraction. Finally, within the modal fusion module, the transformer encoder structure is applied to execute feature fusion.

3.1. Data Processing

The data processing module efficiently transforms spatial, temporal, and audio data into a standardized 2D image format ( $W \times H \times C$ ), where W is the width, H is the height of the image in pixels, and C is the number of channels, ready for subsequent feature extraction by the Swin transformer. In the case of spatial data, a single frame ( $X_{i}$ ) is selected randomly from the video sequence (X). This random selection exposes the model to diverse frames during training, thereby improving its generalization capabilities and increasing data variability. In the temporal stream, T frames, each having dimensions of $W \times H \times C$ , are selected by uniform sampling. Then, these T frames are condensed into a single frame of $W \times H \times C$ by the S3D (Shallow 3D CNN) motion module [27]. Notably, the S3D module does not utilize fixed weights; instead, its weights are initialized and updated during training, enabling a more adaptive and robust representation of motion features. Lastly, for the audio stream, the raw audio signal associated with the video is transformed into a log-mel spectrogram representation. Figure 2 illustrates this transformation by showing both the original waveform and the resulting log-mel spectrogram. This transformation first applies a short-time Fourier transform (STFT) to the audio signal to obtain a time-frequency representation, which is then mapped onto the mel frequency scale and log-compressed. This allows the deep learning model to process the audio information alongside the visual features, capturing the frequency components of the sound over time.
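To make the size of the audio input concrete, the following sketch works out the STFT frame layout implied by the settings reported in Section 4.1 (22.05 kHz sampling, a window of 2048 samples, 50% overlap, 256 mel bands). The 10-second clip length, the absence of padding, and the helper name `stft_layout` are assumptions made for illustration only; the paper does not state them, nor how the spectrogram is resized to $W \times H \times C$.

```python
# Sketch: STFT frame layout for the audio stream.
SAMPLE_RATE = 22_050   # Hz (Section 4.1)
WINDOW_SIZE = 2_048    # samples per STFT window (Section 4.1)
OVERLAP = 0.5          # 50% overlap between windows (Section 4.1)
N_MELS = 256           # number of mel bands (Section 4.1)
CLIP_SECONDS = 10.0    # ASSUMPTION: hypothetical clip duration


def stft_layout(n_samples, window, overlap):
    """Return (hop, n_frames, n_freq_bins) for an STFT without padding."""
    hop = int(window * (1.0 - overlap))          # step between windows
    if n_samples < window:
        n_frames = 0                             # clip shorter than one window
    else:
        n_frames = 1 + (n_samples - window) // hop
    n_freq_bins = window // 2 + 1                # one-sided spectrum
    return hop, n_frames, n_freq_bins


n_samples = int(SAMPLE_RATE * CLIP_SECONDS)
hop, n_frames, n_freq_bins = stft_layout(n_samples, WINDOW_SIZE, OVERLAP)

# After the mel filterbank, the frequency axis shrinks from n_freq_bins to N_MELS.
spectrogram_shape = (N_MELS, n_frames)
print(hop, n_frames, n_freq_bins, spectrogram_shape)
```

Under these assumptions, each STFT column covers about 93 ms of audio and consecutive columns are about 46 ms apart.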

3.2. Audio-Visual Feature Extraction

The feature extraction module is designed to extract learned features from the three processed inputs of spatial, temporal, and audio data, all uniformly transformed into a standardized format of $W \times H \times C$ . This standardized approach ensures consistency in the feature extraction process across all modalities. To facilitate reproducibility, we specify that the Swin transformer is employed as a unified feature extraction network. This network employs shared parameters in a single Swin Transformer to efficiently learn and handle all modalities rather than using separate transformers for each. This approach ensures a consistent and integrated method for feature extraction across the different types of data. The extracted features, denoted as $f_{I}$ (spatial), $f_{V}$ (temporal), and $f_{A}$ (audio), are not combined all at once. Instead, they are paired and directed into three distinct modal fusion modules (MFMs). Each MFM is responsible for fusing two types of modalities: the first for spatial-temporal ( $f_{I}$ and $f_{V}$ ), the second for spatial-audio ( $f_{I}$ and $f_{A}$ ), and the third for temporal-audio ( $f_{V}$ and $f_{A}$ ) data. This pairwise fusion strategy, implemented in separate MFMs, allows for a more nuanced and effective integration of the multi-modal data. Through this approach, the model size is kept manageable, avoiding the complexity that would arise from employing distinct feature extraction networks for each modality.

3.3. Transformer-Based Feature Fusion

To integrate the features extracted by the feature extraction module, a transformer-based modal fusion module (MFM) is proposed. Its position in the overall architecture is shown in Figure 1, and its internal structure is shown in Figure 3. The MFM fuses the features through co-attention with transformer encoders. Rather than fusing all features simultaneously, it performs pairwise fusion. Specifically, the process involves the fusion of spatial and temporal features, audio and spatial features, and temporal and audio features, respectively. This methodical approach ensures that the unique characteristics of each modality are effectively combined, allowing for a more comprehensive understanding of the multi-modal data.

As shown in Figure 3, the MFM consists of two transformer encoders, each taking in different modal features $f_{modal_1}$ and $f_{modal_2}$ as inputs. The transformer encoder, as illustrated in Figure 4, follows the same structural design as the encoder used in the vision transformer [28]. However, unlike the conventional approach where the Query, Key, and Value inputs are identical, our implementation feeds different modal features into these components.

One transformer encoder performs modal fusion by taking $f_{modal_1}$ as the Key and Value and $f_{modal_2}$ as the Query, while the other transformer encoder performs modal fusion by using $f_{modal_1}$ as the Query and $f_{modal_2}$ as the Key and Value. The fused attention output vectors $f_{modal_{12}}$ and $f_{modal_{21}}$ can be represented as follows:

(1) $f_{modal_{12}} = Attention (f_{modal_2}, f_{modal_1}, f_{modal_1}) = Softmax (\frac{f_{modal_2} f_{modal_1}^{T}}{\sqrt{d_{k}}}) f_{modal_1}$

(2) $f_{modal_{21}} = Attention (f_{modal_1}, f_{modal_2}, f_{modal_2}) = Softmax (\frac{f_{modal_1} f_{modal_2}^{T}}{\sqrt{d_{k}}}) f_{modal_2},$

where $d_{k}$ is the dimensionality of the Key vector and T denotes the transpose operation. The attention function [18] in Equations (1) and (2) takes in Query, Key, and Value in that order. The fused attention output vectors are combined using concatenation to produce the final output Y.
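As a rough illustration of the cross-modal attention in Equations (1) and (2), the sketch below implements scaled dot-product attention on plain Python lists and concatenates the two outputs. It is a simplification under stated assumptions, not the authors' implementation: it omits the learned Query/Key/Value projections, multi-head splitting, residual connections, layer normalization, and MLP of a full transformer encoder, and it assumes both modalities have the same number of tokens so that the outputs can be concatenated along the feature axis. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention(query, key, value):
    """Softmax(Q K^T / sqrt(d_k)) V, with arguments in Query, Key, Value order."""
    d_k = len(key[0])
    scores = matmul(query, transpose(key))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, value)


def modal_fusion(f_modal_1, f_modal_2):
    """Two cross-attention passes followed by feature-wise concatenation."""
    f_12 = attention(f_modal_2, f_modal_1, f_modal_1)  # modal_2 queries modal_1
    f_21 = attention(f_modal_1, f_modal_2, f_modal_2)  # modal_1 queries modal_2
    # ASSUMPTION: equal token counts, concatenation along the feature axis.
    return [row_12 + row_21 for row_12, row_21 in zip(f_12, f_21)]


# Toy features: 2 tokens with 2 dimensions per modality (made-up numbers).
f_image = [[1.0, 0.0], [0.0, 1.0]]
f_audio = [[1.0, 1.0], [0.0, 2.0]]
y = modal_fusion(f_image, f_audio)
print(len(y), len(y[0]))
```

Because each row of attention weights sums to one, every output token is a convex combination of the Value rows, which here come from the other modality than the Query.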

In the fusion of spatial image and temporal video, image features $(f_{I})$ and video frame features $(f_{V})$ are fed into the MFM, yielding the fused features $f_{I V}$ and $f_{V I}$ . The fused attention output vectors are combined using concatenation to produce the final output $Y_{I V}$ . The fused attention output vectors $f_{I V}$ and $f_{V I}$ , and their concatenation $Y_{I V}$ , can be represented as follows:

(3) $f_{I V} = Attention (f_{V}, f_{I}, f_{I}) = Softmax (\frac{f_{V} f_{I}^{T}}{\sqrt{d_{k}}}) f_{I}$

(4) $f_{V I} = Attention (f_{I}, f_{V}, f_{V}) = Softmax (\frac{f_{I} f_{V}^{T}}{\sqrt{d_{k}}}) f_{V}$

(5) $Y_{I V} = Concat (f_{I V}, f_{V I})$

In the fusion of spatial image and audio, image features $(f_{I})$ and audio features $(f_{A})$ are combined in the MFM, producing the fused features $f_{I A}$ and $f_{A I}$ . The attention output vectors from these fusion processes are then combined using concatenation to create the final output for the spatial-audio combination, denoted as $Y_{I A}$ . The equations for the fused attention output vectors $f_{I A}$ , $f_{A I}$ , and $Y_{I A}$ are as follows:

(6) $f_{I A} = Attention (f_{I}, f_{A}, f_{A}) = Softmax (\frac{f_{I} f_{A}^{T}}{\sqrt{d_{k}}}) f_{A}$

(7) $f_{A I} = Attention (f_{A}, f_{I}, f_{I}) = Softmax (\frac{f_{A} f_{I}^{T}}{\sqrt{d_{k}}}) f_{I}$

(8) $Y_{I A} = Concat (f_{I A}, f_{A I})$

In the fusion of temporal video and audio, video frame features $(f_{V})$ and audio features $(f_{A})$ are combined in the MFM, producing the fused features $f_{V A}$ and $f_{A V}$ . The attention output vectors from these fusion processes are then combined using concatenation to create the final output for the temporal-audio combination, denoted as $Y_{V A}$ . The equations for the fused attention output vectors $f_{V A}$ , $f_{A V}$ , and $Y_{V A}$ are as follows:

(9) $f_{V A} = Attention (f_{V}, f_{A}, f_{A}) = Softmax (\frac{f_{V} f_{A}^{T}}{\sqrt{d_{k}}}) f_{A}$

(10) $f_{A V} = Attention (f_{A}, f_{V}, f_{V}) = Softmax (\frac{f_{A} f_{V}^{T}}{\sqrt{d_{k}}}) f_{V}$

(11) $Y_{V A} = Concat (f_{V A}, f_{A V})$

The final stage of our model involves integrating the outputs from the spatial-temporal ( $Y_{I V}$ ), temporal-audio ( $Y_{V A}$ ), and spatial-audio ( $Y_{I A}$ ) into one comprehensive output Y. This is achieved through concatenation, allowing for the preservation and combination of distinct features from each modality. The equation for this final concatenation is

(12) $Y = Concat (Y_{I V}, Y_{V A}, Y_{I A})$

Once Y is obtained, it is fed into a final classification layer (MLP), which is responsible for the action recognition task. This layer, typically a fully connected neural network, interprets the rich, multi-modal feature set represented by Y to accurately classify and recognize various actions in the input video.

4. Experimental Results

4.1. Implementation

All our experiments were conducted on an Intel i7-4790 CPU and an NVIDIA Tesla V100 GPU, using PyTorch 1.12.1 on Ubuntu 20.04 LTS. A Swin transformer [25] pre-trained on ImageNet [29] was adopted, and AdamW [30] was used as the optimizer. The batch size was set to 16, with an initial learning rate of $5 \times 10^{- 4}$ , and a cosine annealing scheduler [31] was adopted.
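For readers unfamiliar with cosine annealing, the sketch below evaluates the standard single-cycle schedule, starting from the initial learning rate of $5 \times 10^{-4}$ given above. The total number of steps and the minimum learning rate are not reported in the paper; the values used here (100 steps, minimum of 0) are assumptions for illustration, as is the function name.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=5e-4, lr_min=0.0):
    """Learning rate at `step` for one cosine cycle without restarts.

    lr_max is the initial learning rate from Section 4.1.
    ASSUMPTION: lr_min and total_steps are illustrative, not from the paper.
    """
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


TOTAL_STEPS = 100  # hypothetical schedule length
for step in (0, 25, 50, 75, 100):
    print(step, cosine_annealing_lr(step, TOTAL_STEPS))
```

The rate starts at the initial value, reaches half of it at the midpoint, and decays smoothly to the minimum at the end of the schedule.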

For the spatial stream, during training, a single frame was randomly selected from each video to enhance the model’s ability to generalize. For testing, the central frame of each video was consistently chosen to ensure uniformity and comparability of results. For the temporal stream processing in the S3D module, we utilized uniform sampling to select 16 frames as inputs, ensuring diverse and representative coverage of the video content.
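The frame selection described above can be sketched as follows. The paper defers the exact selection procedure to [26], so the rule used here for uniform sampling (taking the centre of each of T equal-length segments) is an assumption, and both function names are hypothetical.

```python
import random


def select_spatial_frame(n_frames, training, rng=random):
    """Index of the single frame used by the spatial stream."""
    if training:
        return rng.randrange(n_frames)  # random frame during training
    return n_frames // 2                # central frame during testing


def uniform_sample_indices(n_frames, t=16):
    """Indices of t frames spread evenly over the video.

    ASSUMPTION: each index is the centre of one of t equal segments;
    short videos simply repeat indices.
    """
    return [min(n_frames - 1, int((i + 0.5) * n_frames / t)) for i in range(t)]


print(select_spatial_frame(101, training=False))
print(uniform_sample_indices(160))
```

For a 160-frame video this yields one index every 10 frames, so the 16 inputs to the S3D module span the whole clip.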

The raw audio data were sampled at 22.05 kHz, and the input features were extracted using a short-time Fourier transform (STFT) with a window size of 2048, an overlap of 50%, and 256 Mel bands. We used the augmentation methods of random crop, random flip, color jittering, and autoaug [32]. The default version of the Swin transformer [25] is referred to as Swin-B. Additionally, there are versions Swin-T, Swin-S, and Swin-L, which have about 0.25×, 0.5×, and 2× the model size and computational complexity of Swin-B, respectively. Following the original Swin Transformer configuration, we set the window size to $M = 7$ and the query dimension of each head to $d = 32$ in our models. The other hyper-parameters of each model are as follows:

Swin-T: $C^{*} = 96$, $L = \{2, 2, 6, 2\}$,
Swin-S: $C^{*} = 96$, $L = \{2, 2, 18, 2\}$,
Swin-B: $C^{*} = 128$, $L = \{2, 2, 18, 2\}$,
Swin-L: $C^{*} = 192$, $L = \{2, 2, 18, 2\}$,

where $C^{*}$ is the channel number of the hidden layers in the first stage and L is the layer number of each stage. The metric for evaluating model performance is accuracy.
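The hyper-parameters listed above determine the width of every stage. The sketch below derives the per-stage channel counts and attention head counts for each variant. It assumes, following the original Swin Transformer design [25], that the channel number doubles at each stage and that the number of heads equals the stage width divided by the per-head query dimension $d = 32$. The dictionary and function names are hypothetical.

```python
# Variant hyper-parameters as listed in Section 4.1.
SWIN_VARIANTS = {
    "Swin-T": {"C": 96, "layers": (2, 2, 6, 2)},
    "Swin-S": {"C": 96, "layers": (2, 2, 18, 2)},
    "Swin-B": {"C": 128, "layers": (2, 2, 18, 2)},
    "Swin-L": {"C": 192, "layers": (2, 2, 18, 2)},
}
HEAD_DIM = 32  # query dimension of each head (d = 32)


def stage_layout(name):
    """Return (channels per stage, heads per stage) for one variant.

    ASSUMPTION: channels double at every stage, as in the original
    Swin Transformer design.
    """
    cfg = SWIN_VARIANTS[name]
    channels = [cfg["C"] * 2 ** i for i in range(len(cfg["layers"]))]
    heads = [c // HEAD_DIM for c in channels]
    return channels, heads


for name in SWIN_VARIANTS:
    print(name, *stage_layout(name))
```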

4.2. Datasets

For the performance evaluation of the proposed method, the UCF-sound [15] and Kinetics-sound [16,17] audio-visual datasets were adopted. The UCF-sound [15] is a dataset derived from UCF-101 [33] by keeping only the videos that have a valid soundtrack. The UCF-sound contains 6624 video clips in 50 classes. We divided the dataset into training and test sets, following the same split used in [15], resulting in 4733 samples for training and 1891 samples for testing. The classes in the UCF-sound dataset are diverse, covering a wide range of auditory activities, with each class incorporating both video and audio elements. These classes are listed below, with samples of these classes presented in Figure 5a.

applylipstick, archery, babycrawling, balancebeam, bandmarching, basketballdunk, blowdryhair, blowingcandles, bodyweightsquats, bowling, boxingpunchingbag, boxingspeedbag, brushingteeth, cliffdiving, cricketbowling, cricketshot, cuttinginkitchen, fieldhockeypenalty, floorgymnastics, frisbeecatch, frontcrawl, haircut, hammerthrow, hammering, handstandpushups, handstandwalking, headmassage, icedancing, knitting, longjump, moppingfloor, parallelbars, playingcello, playingdaf, playingdhol, playingflute, playingsitar, rafting, shavingbeard, shotput, skydiving, soccerpenalty, stillrings, sumowrestling, surfing, tabletennisshot, typing, unevenbars, wallpushups, writingonboard

The Kinetics-sound [16,17] is a dataset derived from the Kinetics video dataset [34]. Among the subsets of the Kinetics-sound dataset, one contains 34 human action classes that clearly contain audio-visual information. However, some classes were removed, and only 32 classes were used in the experiment, as in [17]. We divided the dataset into training and test sets, resulting in 22,914 samples for training and 1585 samples for testing. The classes in the Kinetics-sound dataset are primarily centered around musical instruments and everyday actions, with each class incorporating both video and audio elements. These classes are listed below, with samples of these classes presented in Figure 5b.

blowingnose, blowingoutcandles, blowingoutcandles, bowling, choppingwood, dribblingbasketball, laughing, mowinglawn, playingaccordion, playingbagpipes, playingbassguitar, playingclarinet, playingdrums, playingguitar, playingharmonica, playingkeyboard, playingorgan, playingpiano, playingviolin, playingxylophone, playingsaxophone, rippingpaper, shufflingcards, singing, stompinggrapes, strummingguitar, tapdancing, tappingguitar, tappingpen, tickling, playingtrumpet, playingtrombone

Figure 5. Examples of the actions in the UCF-sound and Kinetics-sound datasets: (a) UCF-sound and (b) Kinetics-sound.

4.3. Results

The performance of the proposed method was compared with that of several state-of-the-art approaches on the UCF-sound [15] and Kinetics-sound [17] datasets. The experiment was conducted across four distinct cases, examining various combinations of audio-visual data. These cases encompassed the following: (1) all elements, including a single frame (spatial), T frames (temporal), and audio; (2) single frame and T frames without audio; (3) single frame and its corresponding audio; and (4) T frames and their corresponding audio. This facilitates a comprehensive evaluation of the interactions between various audio-visual configurations and their influence on the experimental outcomes.

In the UCF-sound dataset, Table 1 presents an ablation study comparing the performance of different versions of the Swin transformer for the four cases (1)–(4). Case (1), which includes all elements (a single frame, T frames, and audio), achieves the best accuracy for Swin-S, Swin-B, and Swin-L and ties with case (3) for Swin-T (86.63%), indicating the effectiveness of integrating all modalities. Conversely, case (4), involving only T frames and their corresponding audio, shows the lowest performance for every variant, suggesting that the spatial data (single frame) significantly contributes to the model’s accuracy. The performance differences between cases (2) and (3) further emphasize the individual contributions of the temporal and audio modalities.

Among the two-modality cases, case (3) outperforms cases (2) and (4) for Swin-T, Swin-S, and Swin-B, while case (2) is marginally ahead of case (3) for Swin-L (91.95% vs. 91.84%). Case (1) improves on case (2) by only 0.58 to 3.26 percentage points, depending on the variant. Consequently, while the inclusion of audio data does contribute to an overall improvement in performance, visual data play a more crucial role in determining the effectiveness of the model on the UCF-sound dataset.

The results of the ablation study on the Kinetics-sound dataset are shown in Table 2. As on UCF-sound, case (1) achieves the highest accuracy for every Swin variant. This consistency across datasets underscores the importance of integrating spatial, temporal, and audio data for optimal action recognition. Notably, in this dataset, case (4) (temporal frames and audio) outperforms case (2) (single frame and T frames without audio), suggesting that the combination of motion and audio features is more effective than spatial and temporal features alone, whereas case (4) is the weakest configuration on UCF-sound. These variations between the two datasets highlight the impact of dataset-specific characteristics on the efficacy of different modality combinations.

Wang et al. [15] achieved results by either the early fusion (EF) or late fusion (LF) of spatial, temporal, and audio features and by making predictions using either neural networks (NNs) or SVM (see Table 3). EF involves concatenating features or transforming the feature space at the feature level, while LF refers to fusing different modalities at the decision level. Our Swin-B model achieves an accuracy of 93.00%, which is 10.5 percentage points higher than the best result of Wang et al. [15] (82.50% with LF-SVM).

As shown in Table 4, our proposed method (case (1)) with Swin-B achieves the highest accuracy of 89.34% on the Kinetics-sound dataset. This performance surpasses those of all other listed methods, including the Multi-level Attention Fusion Network (MAFnet) [35], AVSlowFast models [17], and the MBT [24]. Notably, our method’s accuracy is 4.34 percentage points higher than that of the nearest competing methods (85.00%), marking a significant improvement.

4.4. Visual Interpretation with Grad-CAM

Figure 6 and Figure 7 present GRAD-CAM [37] visualizations for the UCF-sound and Kinetics-sound datasets. GRAD-CAM, which stands for gradient-weighted class activation mapping, leverages the gradient information from the last transformer layer of our feature extraction network. In each figure, alongside the original image, we provide two GRAD-CAM images. The first GRAD-CAM image is derived using only visual information, while the second incorporates data from all modalities. A notable observation is that the activations highlighted in the multi-modal model are more prominent and concentrated around areas of significant movement or sound generation. This contrast suggests that incorporating audio data alongside visual information enhances the model’s ability to focus on relevant parts of the scene, leading to more accurate and interpretable results. The comparison between these two sets of GRAD-CAM images underscores the added value of audio data in enriching the model’s understanding of the scene.

5. Discussion

In this study, we propose a novel approach for enhancing action recognition by integrating visual and audio information from videos using a transformer structure. While our model exhibits promising capabilities, there are inherent limitations and areas for future exploration. Firstly, the model employs a shared feature extraction network for both audio and visual data. This design was primarily driven by the aim to maintain network efficiency; however, it may impose limitations on the model’s ability to distinctly capture modality-specific features. For instance, separate feature extractors, such as 3D CNNs [9,38] for visual temporal features, and VGGish [39] or audio spectrogram transformers (ASTs) [40] for audio, can potentially yield better recognition performance. Secondly, our work focused on the fusion of visual and audio data, and we evaluated its performance. Recently, with the rise of multi-modal learning, various large-scale audio-visual datasets like Epic-sounds [41] have emerged. It will be beneficial to conduct future experiments on these large-scale datasets to verify our model’s performance. Moreover, there is a growing trend of attempting multi-modal learning with various modalities, such as text or data from different sensors. It is necessary to explore whether our structure can be beneficial in these contexts as well. Our structure, designed for visual and audio data, uses three modal fusion modules. However, with the addition of other modal data, additional modal fusion modules may be required. This can lead to an increase in model size or may not be optimal structurally. To address this, an improved modal fusion module capable of accommodating more modalities is needed.
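The scaling concern raised above can be made concrete with a short calculation. With pairwise fusion, one MFM is needed for every unordered pair of modalities, so the number of modules grows quadratically with the number of modalities. The sketch below enumerates the pairs; the extra modality names are hypothetical examples, and the assumption that every pair receives its own MFM simply extrapolates the three-modality design described in this paper.

```python
from itertools import combinations
from math import comb


def fusion_pairs(modalities):
    """All unordered modality pairs, one modal fusion module per pair."""
    return list(combinations(modalities, 2))


# The three modalities used in this paper: image (I), video (V), audio (A).
print(fusion_pairs(["I", "V", "A"]))

# ASSUMPTION: further modalities (for example text or sensor data) would
# each be paired with every existing modality.
for n in range(3, 7):
    print(n, "modalities ->", comb(n, 2), "fusion modules")
```

Three modalities need three modules, as in the proposed architecture, whereas four would need six and five would need ten.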

6. Conclusions

In this study, we introduce a novel approach using a single Swin transformer across various modalities, including image, video, and audio. Our method simplifies the multi-modal fusion process with the integration of an attention transformer module in the modal fusion module (MFM), effectively fusing audio and visual features. This streamlined approach shows substantial performance improvements in robust action recognition compared to existing methods. The combined use of both visual and audio information is crucial for a comprehensive understanding of actions, which has significant implications for applications in areas such as surveillance, content analysis, and interactive media.

Author Contributions

Conceptualization, J.-H.K. and C.S.W.; methodology, J.-H.K. and C.S.W.; software, J.-H.K.; validation, J.-H.K.; formal analysis, J.-H.K.; investigation, J.-H.K.; resources, J.-H.K.; data curation, J.-H.K.; writing—original draft preparation, J.-H.K.; writing—review and editing, C.S.W.; visualization, J.-H.K.; supervision, C.S.W.; project administration, C.S.W.; funding acquisition, C.S.W. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found at the following locations. UCF-Sound: https://www.crcv.ucf.edu/research/data-sets/ucf101/, accessed on 8 January 2024. Kinetics-Sound: https://deepmind.com/research/open-source/kinetics, accessed on 8 January 2024. For information on how to obtain UCF-Sound and Kinetics-Sound from UCF-101 and Kinetics-400, go to https://github.com/kjunhwa/audiovisual_action, accessed on 8 January 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1. The overall transformer-based architecture for the audio-visual action recognition.

Figure 2. Visualization of audio transformation for the Kinetics-sound dataset. Top to bottom: raw audio waveforms followed by their respective log-mel spectrogram representations. Top-left: ‘Blowing Nose’, top-right: ‘Playing Accordion’, bottom-left: ‘Playing Trumpet’, bottom-right: ‘Tapping Pen’.

Figure 3. The general structure of the modal fusion module (MFM).

Figure 4. Transformer encoder.

Figure 6. GRAD-CAM visualization results of UCF-sound using Swin-L as a feature extractor. For each original image, its subsequent two GRAD-CAM images correspond to the outcomes using solely visual information and using all modalities, respectively.

Figure 7. GRAD-CAM visualization results of Kinetics-sound using Swin-L as a feature extractor. For each original image, its subsequent two GRAD-CAM images correspond to the outcomes using solely visual information and using all modalities, respectively.

Table 1

The ablation study for Swin transformer variants on the UCF-sound dataset: (1) all elements, including a single frame (spatial), T frames (temporal), and audio; (2) single frame and T frames without audio; (3) single frame and its corresponding audio; and (4) T frames and their corresponding audio. The bold numbers represent the best results for each Swin type.

Case	Swin-T	Swin-S	Swin-B	Swin-L
(1)	86.63%	91.05%	93.00%	92.53%
(2)	84.21%	87.79%	91.00%	91.95%
(3)	86.63%	90.16%	91.79%	91.84%
(4)	74.32%	79.26%	89.05%	90.84%

Table 2

The ablation study for Swin transformer on the Kinetics-sound dataset: (1) all elements, including a single frame (spatial), T frames (temporal), and audio; (2) single frame and T frames without audio; (3) single frame and its corresponding audio; and (4) T frames and its corresponding audio. The bold numbers represent the best results for each Swin type.

Case	Swin-T	Swin-S	Swin-B	Swin-L
(1)	87.50%	88.25%	89.34%	89.28%
(2)	82.24%	84.43%	85.38%	86.00%
(3)	86.48%	87.80%	88.59%	89.05%
(4)	85.25%	86.20%	88.18%	87.36%
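To read the two ablation tables side by side, it helps to compute how much accuracy is gained when audio is added to the visual inputs, that is, case (1) minus case (2) for each Swin variant. The short sketch below does this arithmetic on the values copied from Tables 1 and 2; the variable and function names are hypothetical and introduced only for illustration.

```python
# Top-1 accuracy (%) copied from Tables 1 and 2.
# "all" is case (1): single frame + T frames + audio.
# "visual_only" is case (2): single frame + T frames, no audio.
variants = ["Swin-T", "Swin-S", "Swin-B", "Swin-L"]
tables = {
    "UCF-sound": {
        "all": [86.63, 91.05, 93.00, 92.53],
        "visual_only": [84.21, 87.79, 91.00, 91.95],
    },
    "Kinetics-sound": {
        "all": [87.50, 88.25, 89.34, 89.28],
        "visual_only": [82.24, 84.43, 85.38, 86.00],
    },
}

def audio_gain(rows):
    """Percentage-point gain of case (1) over case (2), per Swin variant."""
    return [round(a - v, 2) for a, v in zip(rows["all"], rows["visual_only"])]

gains = {
    name: dict(zip(variants, audio_gain(rows))) for name, rows in tables.items()
}
for name, per_variant in gains.items():
    print(name, per_variant)
```

On these numbers, adding audio helps every variant on both datasets. The gain ranges from 0.58 to 3.26 percentage points on UCF-sound and from 3.28 to 5.26 percentage points on Kinetics-sound.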

Table 3

Comparison of multi-modal results on UCF-sound dataset. EF denotes early fusion, and LF denotes late fusion. NN denotes neural network. The bold numbers represent the best results.

Method	Accuracy
EF-NN [15]	80.06%
LF-NN [15]	61.00%
EF-SVM [15]	66.10%
LF-SVM [15]	82.50%
Proposed methods (Swin-B, case (1))	93.00%

Table 4

Comparison of multi-modal results on Kinetics-sound dataset. R50 denotes ResNet-50 [36], and R101 denotes ResNet-101 [36]. MAF denotes Multi-level Attention Fusion Network. The bold numbers represent the best results.

Method	Accuracy
$L^{3}$-Net [16]	74.00%
SlowFast [17], R50	80.50%
AVSlowFast [17], R50	83.70%
SlowFast [17], R101	82.70%
MAFnet [35]	83.94%
AVSlowFast [17]	85.00%
MBT [24]	85.00%
Proposed methods (Swin-B, case (1))	89.34%

References

1. Ukani, V.; Thakkar, P. A hybrid video based iot framework for military surveillance. Des. Eng.; 2021; 5, pp. 2050-2060.

2. Zhang, Q.; Sun, H.; Wu, X.; Zhong, H. Edge video analytics for public safety: A review. Proc. IEEE; 2019; 107, pp. 1675-1696. [DOI: https://dx.doi.org/10.1109/JPROC.2019.2925910]

3. Kim, D.; Kim, H.; Mok, Y.; Paik, J. Real-time surveillance system for analyzing abnormal behavior of pedestrians. Appl. Sci.; 2021; 11, 6153. [DOI: https://dx.doi.org/10.3390/app11136153]

4. Prathaban, T.; Thean, W.; Sazali, M.I.S.M. A vision-based home security system using OpenCV on Raspberry Pi 3. AIP Conf. Proc.; 2019; 2173, 020013.

5. Ohn-Bar, E.; Trivedi, M. Joint angles similarities and HOG2 for action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; Portland, OR, USA, 23–28 June 2013; pp. 465-470.

6. Wang, H.; Schmid, C. Action recognition with improved trajectories. Proceedings of the IEEE International Conference on Computer Vision; Sydney, NSW, Australia, 1–8 December 2013; pp. 3551-3558.

7. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Proceedings of the 28th Annual Conference on Neural Information Processing Systems 2014; Montreal, QC, Canada, 8–13 December 2014; Volume 27.

8. Bilen, H.; Fernando, B.; Gavves, E.; Vedaldi, A. Action Recognition with Dynamic Image Networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2017; 40, pp. 2799-2813. [DOI: https://dx.doi.org/10.1109/TPAMI.2017.2769085] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29990080]

9. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. Proceedings of the IEEE International Conference on Computer Vision; Santiago, Chile, 7–13 December 2015; pp. 4489-4497.

10. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 6299-6308.

11. Khan, S.; Hassan, A.; Hussain, F.; Perwaiz, A.; Riaz, F.; Alsabaan, M.; Abdul, W. Enhanced spatial stream of two-stream network using optical flow for human action recognition. Appl. Sci.; 2023; 13, 8003. [DOI: https://dx.doi.org/10.3390/app13148003]

12. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision; Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202-6211.

13. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA, 18–24 June 2022; pp. 3202-3211.

14. Wang, H.; Zhang, W.; Liu, G. TSNet: Token Sparsification for Efficient Video Transformer. Appl. Sci.; 2023; 13, 10633. [DOI: https://dx.doi.org/10.3390/app131910633]

15. Wang, C.; Yang, H.; Meinel, C. Exploring multimodal video representation for action recognition. Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN); Vancouver, BC, Canada, 24–29 July 2016; pp. 1924-1931.

16. Arandjelovic, R.; Zisserman, A. Look, listen and learn. Proceedings of the IEEE International Conference on Computer Vision; Venice, Italy, 22–29 October 2017; pp. 609-617.

17. Xiao, F.; Lee, Y.J.; Grauman, K.; Malik, J.; Feichtenhofer, C. Audiovisual slowfast networks for video recognition. arXiv; 2020; arXiv: 2001.08740

18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems; Long Beach, CA, USA, 4–9 December 2017; 30.

19. Atrey, P.K.; Hossain, M.A.; El Saddik, A.; Kankanhalli, M.S. Multimodal fusion for multimedia analysis: A survey. Multimed. Syst.; 2010; 16, pp. 345-379. [DOI: https://dx.doi.org/10.1007/s00530-010-0182-0]

20. Wöllmer, M.; Kaiser, M.; Eyben, F.; Schuller, B.; Rigoll, G. LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image Vis. Comput.; 2013; 31, pp. 153-163. [DOI: https://dx.doi.org/10.1016/j.imavis.2012.03.001]

21. Gupta, M.V.; Vaikole, S.; Oza, A.D.; Patel, A.; Burduhos-Nergis, D.P.; Burduhos-Nergis, D.D. Audio-Visual Stress Classification Using Cascaded RNN-LSTM Networks. Bioengineering; 2022; 9, 510. [DOI: https://dx.doi.org/10.3390/bioengineering9100510]

22. Zhang, Y.; Wang, Z.-R.; Du, J. Deep fusion: An attention guided factorized bilinear pooling for audio-video emotion recognition. Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN); Budapest, Hungary, 14–19 July 2019; pp. 1-8.

23. Duan, B.; Tang, H.; Wang, W.; Zong, Z.; Yang, G.; Yan, Y. Audio-visual event localization via recursive fusion by joint co-attention. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; Virtual Conference, 5–9 January 2021; pp. 4013-4022.

24. Nagrani, A.; Yang, S.; Arnab, A.; Jansen, A.; Schmid, C.; Sun, C. Attention bottlenecks for multimodal fusion. Adv. Neural Inf. Process. Syst.; 2021; 34, pp. 14200-14213.

25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision; Virtual Conference, 11–17 October 2021; pp. 10012-10022.

26. Kim, J.-H.; Won, C.S. Action Recognition in Videos Using Pre-trained 2D Convolutional Neural Networks. IEEE Access; 2020; 8, pp. 60179-60188. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.2983427]

27. Kim, J.-H.; Kim, N.; Won, C.S. Deep edge computing for videos. IEEE Access; 2021; 9, pp. 123348-123357. [DOI: https://dx.doi.org/10.1109/ACCESS.2021.3109904]

28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv; 2020; arXiv: 2010.11929

29. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; Miami, FL, USA, 20–25 June 2009; pp. 248-255.

30. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv; 2017; arXiv: 1711.05101

31. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv; 2016; arXiv: 1608.03983

32. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation policies from data. arXiv; 2018; arXiv: 1805.09501

33. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv; 2012; arXiv: 1212.0402

34. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P. et al. The kinetics human action video dataset. arXiv; 2017; arXiv: 1705.06950

35. Brousmiche, M.; Rouat, J.; Dupont, S. Multi-level attention fusion network for audio-visual event recognition. arXiv; 2021; arXiv: 2106.06736

36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA, 27–30 June 2016; pp. 770-778.

37. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision; Venice, Italy, 22–29 October 2017; pp. 618-626.

38. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell.; 2012; 35, pp. 221-231. [DOI: https://dx.doi.org/10.1109/TPAMI.2012.59] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22392705]

39. Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B. et al. CNN architectures for large-scale audio classification. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); New Orleans, LA, USA, 5–9 March 2017; pp. 131-135.

40. Gong, Y.; Chung, Y.-A.; Glass, J. AST: Audio Spectrogram Transformer. arXiv; 2021; arXiv: 2104.01778

41. Huh, J.; Chalk, J.; Kazakos, E.; Damen, D.; Zisserman, A. Epic-Sounds: A Large-Scale Dataset of Actions that Sound. Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Rhodes Island, Greece, 4–10 June 2023; pp. 1-5.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Our approach to action recognition is grounded in the intrinsic coexistence of and complementary relationship between audio and visual information in videos. Going beyond the traditional emphasis on visual features, we propose a transformer-based network that integrates both audio and visual data as inputs. This network is designed to accept and process spatial, temporal, and audio modalities. Features from each modality are extracted using a single Swin Transformer, originally devised for still images. Subsequently, these extracted features from spatial, temporal, and audio data are adeptly combined using a novel modal fusion module (MFM). Our transformer-based network effectively fuses these three modalities, resulting in a robust solution for action recognition.

Details

Title

Audio-Visual Action Recognition Using Transformer Fusion Network

Author

Jun-Hwa Kim; Chee Sun Won

First page

1190

Publication year

2024

Publication date

2024

Publisher

MDPI AG

e-ISSN

20763417

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/app14031190

ProQuest document ID

2923927787
