Visual perception enhancement fall detection algorithm based on vision transformer

Abstract

Fall detection is a crucial research topic in public healthcare. With advances in intelligent surveillance and deep learning, vision-based fall detection has gained significant attention. While numerous deep learning algorithms prevail in video fall detection due to excellent feature processing capabilities, they all exhibit limitations in handling long-term spatiotemporal dependencies. Recently, Vision Transformer has shown considerable potential in integrating global information and understanding long-term spatiotemporal dependencies, thus providing novel solutions. In view of this, we propose a visual perception enhancement fall detection algorithm based on Vision Transformer. We utilize Vision Transformer-Base as the baseline model for analyzing global motion information in videos. On this basis, to address the model’s difficulty in capturing subtle motion changes across video frames, we design an inter-frame motion information enhancement module. Concurrently, we propose a locality perception enhancement self-attention mechanism to overcome the model’s weak focus on local key features within the frame. Experimental results show that our method achieves notable performance on the Le2i and UR datasets, surpassing several advanced methods.

Full text

Translate

Turn on search term navigation

Introduction

Against the backdrop of the severe aging population issue, the health of the elderly demands urgent attention [1]. Falls, a major cause of accidental injuries and death, gravely affects their health and quality of life, increasing the need for long-term care and medical expenses [2]. Therefore, the application of effective fall detection technology is imperative in reducing casualties and economic losses.

With the rise of modern monitoring, vision-based fall detection has come to prominence. Unlike technologies based on wearable sensors and environmental sensors, it offers a wider monitoring range and higher deployment flexibility [3]. Recently, deep learning has promoted its progress in video analysis, some works have utilized 3D Convolutional Neural Network (3DCNN) [4, 5], Long Short-Term Memory (LSTM) [6, 7] or optical flow techniques [8, 9] to provide robust support for fall detection. However, their structural limitations often cause them to struggle with long-range spatiotemporal dependencies in videos. Challenges include difficulties in understanding the global context, low training efficiency and gradient issues. In contrast, the Vision Transformer (ViT) [10] takes global analysis as the starting point and brings a fresh perspective to image and video tasks. Although its application in fall detection is in the exploratory stage, the Multi-Head Self-Attention (MHSA) grants it advantages in perceiving global information and analyzing long-term dependencies. Therefore, this study utilizes Vision Transformer-Base (ViT-Base) [10] as the baseline model and introduces our novel approach.

However, ViT still faces two challenges in fall videos processing. Firstly, similar visual information across adjacent frames accumulates inter-frame redundancy, which not only consumes computing resources but also seriously obscures key details in videos. Although ViT can identify obvious motion, it has limitations in capturing subtle motion changes across adjacent frames. Inter-frame redundancy heightens the risk of model misjudgment. Secondly, when processing high-feature dimensional video frame data, numerous token interactions may cause the distribution of token attention weights within the frame to become relatively smooth. This greatly limits the model’s ability to focus on key features of the foreground moving target, causing it to over-focus on the background.

To address the above issues, we propose a visual perception enhancement fall detection algorithm based on Vision Transformer (VPE-ViT-FD). It aims to suppress redundancy across video frames and capture subtle motion changes, while optimizing the attention allocation of local key features within each frame. On one hand, we design a lightweight inter-frame motion information enhancement module (IMEM). First, a class token shift operation is performed, utilizing partial features of the class token to interact across adjacent frames. This enables continuous recognition of actions in long time series. Subsequently, a dual attention optimization unit (DAOU) is employed to deeply mine the spatial and channel feature dependencies. While suppressing redundancy and enhancing the feature representation, it makes up for the information loss caused by shifting and provides rich motion details for contextual analysis. On the other hand, we design a locality perception enhancement self-attention mechanism (LESA) to improve MHSA through two progressive strategies: a local feature association enhancement strategy (LFAES) and a trivial attention masking strategy (TAMS). The LFAES comprises two major components. Specifically, to promote the model’s understanding of relative positional relationships among tokens and establish stronger spatial connections for key visual elements, relative positional encoding is first introduced. After fully integrating the spatial relative positional information, a temperature-scaled Softmax function sharpens the distribution of token attention weights within the frame, initially enhancing the model's ability to aggregate local contexts. Furthermore, the TAMS is applied to the Softmax-normalized output. This strategy dynamically suppresses non-essential weights in the attention weight matrix to enhance LESA's attention to local key information within the frame. Our proposed VPE-ViT-FD can substantially enhance the model's comprehensive understanding of fall behavior in videos and effectively improve recognition accuracy.

In summary, our main contributions are as follows:

Considering the promising potential of ViT in the field of fall detection, we innovatively propose a ViT-based method called VPE-ViT-FD to comprehensively model the motion information in fall video sequences.
We design a lightweight IMEM that first promotes inter-frame information interaction via a class token shift module, then combines a DAOU to efficiently suppress redundancy while capturing subtle motion changes across video frames, thereby improving detection accuracy.
We propose LESA, which optimizes its ability to aggregate local contexts via two progressive strategies, LFAES and TAMS. LESA ensures that the model focuses on core features of the foreground target within the frame while reducing attention distractions from the background.
Our detailed experiments on two public benchmark fall video datasets, Le2i and UR, demonstrate the excellent detection performance and generalization ability of VPE-ViT-FD in the video fall detection task.

Related work

Fall detection aims to accurately and quickly identify falls and promptly alert caregivers to minimize casualties [11]. Vision-based fall detection methods stands out for their non-invasiveness, information diversity, and low cost [12] and have made substantial progress in deep learning. The attention mechanism [13, 14–15], a key breakthrough in deep learning, enhances the model’s focus on crucial features, improving interpretability and accuracy. In this context, the convolutional-free ViT (Pure-ViT) has emerged as a preferred architecture due to its excellent performance. Therefore, this section reviews vision-based deep learning methods and related work on the Pure-ViT architecture.

Vision-based deep learning methods

Early visual methods depended on 2D Convolutional Neural Network (2DCNN) for automatic feature mining, thus reducing the reliance on manual feature extraction. Yu et al. [16] pioneered CNN to fall detection, utilizing background subtraction to extract human contours from videos and input them into CNN to distinguish falls. Cai et al. [17] improved DenseNet’s architecture with a multi-channel convolutional fusion strategy to effectively detect falls and mitigate information loss. To address CNN’s limitations in time series processing, Recurrent Neural Network (RNN) and its variants were introduced. Chen et al. [6] proposed a fall detection method for complex backgrounds, utilizing Mask-RCNN and VGG16 to extract features in a noisy background, then passing them to a Bi-LSTM with attention to accurately identify falls. Reference [7] utilized YOLOv3 and Deep-Sort to detect and track pedestrians, and then fed features extracted by VGG16 into a spatiotemporal attention-guided LSTM to precisely locate falls. Additionally, deep neural networks are often combined with optical flow technology to model frame relationships. Reference [18] converted frames into optical flow maps to represent motion changes and used the improved VGG16 to process the stacked maps for fall classification. Carlier et al. [9] developed a fall detection system combining optical flow and CNN, optimizing decision-making in nursing homes to reduce false alarms and ensure accuracy. These methods primarily process features via 2DCNN. To fully capture the spatiotemporal information in videos, 3D networks are applied. Lu et al. [19] designed a detection algorithm based on 3DCNN, and used an LSTM-based visual attention scheme to focus on key areas for further accurate fall identification. In [5], an algorithm combining multi-stream 3DCNN and an image fusion approach was proposed, where each branch can handle different stages of falls separately, improving detection efficiency. Although the above work has achieved remarkable results in visual fall detection, these deep learning models still face challenges. Their structural design limits their ability to understand the long-term spatiotemporal feature associations in videos.

Pure vision transformer

Vision Transformer pushes self-attention to a new level and has achieved great success in computer vision. Dosovitskiy et al. [10] pioneered the use of pure ViT for image tasks, capturing global dependencies and laying the foundation for subsequent research. Due to the lack of convolutional inductive bias, ViT requires large amounts of training data. In response, some researchers have explored solutions. Lee et al. [20] introduced SPT and LSA into ViT to increase inductive bias and optimize its performance on small datasets. Reference [21] adopted a trivial attention suppression strategy to improve the training efficiency of ViT. There have also been efforts to improve ViT from positional encoding to understand the spatial relationship among tokens. Wu et al. [22] proposed a relative position encoding method called iRPE, which utilized various encoding strategies to optimize MHSA and notably improved the effect of spatial attention. To extend ViT to video analysis, researchers considered the design of temporal context modeling. TimeS-former [23] used four different designs to introduce temporal attention into ViT for action recognition. Zhang et al. [24] designed TokShift to improve ViT without adding additional parameters and computational costs. However, for both image and video tasks, the above Pure-ViT architecture faces challenges in processing local details. Although some work has focused on local features, there is still room for further exploration in improving MHSA.

Methodology

To thoroughly model the motion information in fall videos and address the limitations in capturing core fall features across and within frames, we propose VPE-ViT-FD. Specifically, we adopt ViT-Base as the baseline model to capture the long-range spatiotemporal dependencies in videos. On this foundation, IMEM mitigates inter-frame redundancy and enhances key feature representations to accurately capture fine-grained motion changes; LESA is designed to deeply comprehend spatial arrangements and intrinsic correlation of visual elements within each frame, improving model’s ability to focus on local key features of the moving target. We will introduce VPE-ViT-FD in detail, and its complete framework is shown in Fig. 1.

[See PDF for image]

Fig. 1

Complete framework of VPE-ViT-FD

Let the input be ${x \in R}^{T \times H \times W \times C}$ , each frame can be divided into non-overlapping patches of size $P \times P$ , denoted by $x_{0}^{i}$ for the $i$ -th patch. So $x$ can be reshaped as $\hat{x} = [x_{0}^{1}, \dots, x_{0}^{N}] \in R^{T \times N \times d}$ , where $T$ , $H$ , $W$ , $C$ represent the number of frames, height, width and number of channels per frame, respectively. $N = (H \times W) / P^{2}$ is the number of patches and $d = P^{2} \times C$ is the number of pixels in patch. The patches are linearly projected into a $D$ -dimensional embedding space via a trainable matrix $E \in R^{d \times D}$ to obtain patch tokens, and concatenated with a class token $c_{0} \in R^{T \times D}$ to integrate information. After adding the absolute positional encoding $E_{p} \in R^{(N + 1) \times D}$ , the input $y_{0} \in R^{T \times (N + 1) \times D}$ of the first encoder block can be obtained, as shown in Eq. (1):

y_{0} = [c_{0}, x_{0}^{1} E, \dots, x_{0}^{N} E] + E_{p}

Our model consists of 12 encoder blocks, each comprising LESA, IMEM, Layer Normalization (LN), and Feed Forward Network (FFN). They interconnect via residual connections and process each frame individually.

LESA LESA progressively improves MHSA with two strategies: the LFAES and the TAMS. Drawing inspiration from Suppress Masking [21] and iRPE [22], originally proposed for image tasks, we integrate them into LESA, considering their potential in video analysis. In addition, MHSA involves quadratic complexity in both memory usage and computation when processing tokens. While Suppress Masking and iRPE neglect the high computational cost they bring. This poses challenges to memory consumption and training efficiency. To this end, we adapt a method [25] to selectively discard patch tokens based on attention weights. The original intention of this method was to accelerate ViT training during adversarial settings. Nevertheless, our experiments reveal its equal effectiveness in boosting ViT’s efficiency during standard training. As a result, LESA ensures that model maintains high performance while achieving good training efficiency. Figure 2a depicts the overall design of LESA.

[See PDF for image]

Fig. 2

Complete framework of LESA (a) and IMEM (b)

Assume that the input of the $l$ -th encoder block is $y_{l - 1} \in R^{T \times (N + 1) \times D}$ , and the input of LESA at time $t$ is $L N (y_{l - 1}^{(t)}) \in R^{(N + 1) \times D}$ . Let the number of heads and the feature dimension per head be $H 1$ and $D_{h} = D / H 1$ . Then, the query matrix $Q^{(t)}$ , key matrix $K^{(t)}$ and value matrix $V^{(t)}$ of the $j$ -th head, namely $Q_{j}^{(t)}$ , $K_{j}^{(t)}$ , $V_{j}^{(t)}$ , are shown in Eq. (2), where $W_{Q}, W_{K}, W_{V} \in R^{D \times D_{h}}$ are the linear projection weights. For LFAES, we first utilize iRPE to deeply comprehend relative positional relationships among tokens and establish robust spatial connections for key visual elements. Specifically, we define trainable vectors $P_{ef} (K_{j}^{(t)})$ and $P_{ef} (Q_{j}^{(t)})$ , each with the same input dimension as LESA for $Q_{j}^{(t)}$ and $K_{j}^{(t)}$ , respectively. These vectors represent the relative position weight between position $e$ and $f$ , and interact with $Q_{j}^{(t)}$ and $K_{j}^{(t)}$ accordingly. The relative positional encoding $R_{ef}^{(t)}$ can be obtained by Eq. (3), where $T_{1}$ denotes matrix transpose. Based on a profound understanding of spatial arrangements, the distribution of token attention weights is sharpened using a Softmax function with a learnable temperature scaling parameter $τ$ (initially set to $\sqrt{D_{h}}$ ). LFAES preliminarily enhances the model's ability to aggregate local context. The output $y_{o 1}^{(t)} \in R^{(N + 1) \times (N + 1)}$ of LFAES is given by Eq. (4):

Q_{j}^{(t)}, K_{j}^{(t)}, V_{j}^{(t)} = (L, N, (y_{l - 1}^{(t)})) [W_{Q}, W_{K}, W_{V}]

R_{ef}^{(t)} = Q_{j}^{(t)} {(P_{ef}, (K_{j}^{(t)}))}^{T_{1}} + K_{j}^{(t)} {(P_{ef}, (Q_{j}^{(t)}))}^{T_{1}}

y_{o 1}^{(t)} = S o f t m a x (\frac{Q_{j}^{(t)} {(K_{j}^{(t)})}^{T_{1}} + R_{ef}^{(t)}}{τ})

On this basis, TAMS further enhances the model’s sensitivity to local key details. Specifically, let $x_{\max}^{(t)}$ be the maximum attention weight in a row of the attention weight matrix $y_{o 1}^{(t)}$ , and let $th$ be the threshold factor. A threshold is set as the product of $x_{\max}^{(t)}$ and $th$ . Attention weights below this threshold are considered trivial, and the others are non-trivial. Trivial attention weights refer to unnecessary attention weights that contribute little to the model’s performance, often accounting for a large proportion and introducing a lot of noise. Based on this, a trivial attention weights mask matrix $M^{(t)}$ with the same size as $y_{o 1}^{(t)}$ can be defined, where all trivial attention weights are set to 1, and the non-trivial ones are set to 0:

M^{(t)} = \{\begin{matrix} 1, & y_{o 1}^{(t)} \leq t h * x_{\max}^{(t)} \\ 0, & else \end{matrix})

Then, a learnable suppression factor $s$ is set to further suppress trivial attention weights. The output $y_{o 2}^{(t)} \in R^{(N + 1) \times (N + 1)}$ of TAMS can be expressed as follows, where $⊙$ denotes element multiplication:

y_{o 2}^{(t)} = (1 - M^{(t)}) ⊙ y_{o 1}^{(t)} + s * M^{(t)} ⊙ y_{o 1}^{(t)}

Let the output of the $j$ -th head be $h_{j}^{(t)} \in R^{(N + 1) \times D_{h}}$ , as shown in Eq. (7). If the outputs of all heads are directly concatenated, a tensor of shape $(N + 1) \times D$ can be obtained, but this will bring a high computational cost. To this end, we introduce a method [25] to selectively discard patch tokens based on attention magnitude. First, the Softmax-normalized $y_{o 1}^{(t)}$ of all heads undergoes a matrix operation to obtain the weighted average result $\frac{1}{H 1} \sum_{j = 1}^{H 1} {(y_{o 1}^{(t)})}^{j}$ that represents the attention of each token to all others. Each row of it reflects the importance of the input token to the output. Then, we sum the matrix by row to generate an index vector $id$ , where each element is the index of corresponding attention weight. Finally, the Top-K function selects the $K$ tokens with the highest weights from the $N$ patch tokens (excluding the class token) and discard the rest to obtain $y_{o 3}^{(t)} \in R^{(K + 1) \times D}$ , as shown in Eq. (8). If $W_{h} \in R^{D \times D}$ is the weight of linear projection, the calculation process for $y_{o}^{(t)} \in R^{(K + 1) \times D}$ , indicating the output $y_{o}$ of LESA at time $t$ , is as follows:

\begin{matrix} h_{j}^{(t)} = y_{o 2}^{(t)} V_{j}^{(t)} & , j = 1, \dots, H 1 \end{matrix}

y_{o 3}^{(t)} = \{C o n c a t [h_{1}^{(t)}, \dots, h_{H 1}^{(t)}], T o p K (i, d)\}

y_{o}^{(t)} = L E S A (L, N, (y_{l - 1}^{(t)})) = y_{o 3}^{(t)} W_{h}

This method drops a certain proportion of patch tokens after each LESA within an encoder block, according to the preset drop rate $dr$ . It effectively balances the training efficiency and performance of the model.

LESA makes full use of intrinsic feature association and spatial relationships among all visual elements. As a result, the model can concentrate on core fall features of the foreground target within the frame, while effectively reducing its excessive focus on the static background.

IMEM IMEM is a lightweight module comprising two main parts: a class token shift module and the DAOU. The structure of IMEM is shown in Fig. 2b.

In the $l$ -th encoder block, the input $y_{l}^{'} \in R^{T \times (K + 1) \times D}$ of IMEM can be obtained by Eq. (10):

y_{l}^{'} = y_{o} + y_{l - 1}

First, the class token shift module [24] assists ViT in recognizing complete actions in long time sequences by utilizing partial features of the class token to interact across adjacent frames. Let the input and output of the shift module be $y_{l, i}^{'}, y_{l, o}^{'} \in R^{T \times (K + 1) \times D}$ . Specifically, $y_{l, i}^{'} (t, n, d^{'})$ and $y_{l, o}^{'} (t, n, d^{'})$ represent the input and output with time $t$ , the $n$ -th token, and the $d^{'}$ -th channel (abbreviated as $(t, n, d^{'})$ ). The shift process is as follows:

y_{l, o}^{'} (t, 0, d^{'}) = \{\begin{matrix} y_{l, i}^{'} (t - 1,0, d^{'}) & , 1 < t \leq T, 1 \leq d^{'} < d_{1} \\ y_{l, i}^{'} (t + 1,0, d^{'}) & , 1 \leq t < T, d_{1} \leq d^{'} < d_{1} + d_{2} \\ y_{l, i}^{'} (t, 0, d^{'}) & , \forall t, d_{1} + d_{2} \leq d^{'} \leq D \end{matrix})

\begin{matrix} y_{l, o}^{'} (t, n, d^{'}) = y_{l, i}^{'} (t, n, d^{'}) & , \forall t, 1 \leq n < N, \forall d^{'} \end{matrix}

Equation (11) shows the temporal shift operation of the class token along the channel dimension $D$ . The first $d_{1}$ channels shift backward to $t - 1$ , and the next $d_{2}$ channels shift forward to $t + 1$ , and the remaining channels remain unchanged. Equation (12) represents the normal transmission of features other than the class token.

Then, DAOU adopts a feature grouping strategy to distribute features in parallel to $G$ groups of feature sub-channels along $D$ for separate processing, improving efficiency of screening useful features. For convenience, the output $y_{l, o}^{'}$ of the shift module can be recorded as $Y = [Y_{1}, \dots, Y_{G}] \in R^{T \times (K + 1) \times D}$ . Within each sub-channel, it is further split into two branches to mine key spatial and channel features, denoted as $Y_{K 1}, Y_{K 2} \in R^{T \times (K + 1) \times D / 2 G}$ .

In the channel attention branch, we utilize $G A P (∙)$ to perform a global average pooling operation on $Y_{K 1}$ . This aggregates the global feature representation of the channel and generates a channel-level statistical vector. First, the $R e L U (∙)$ function is employed to enhance the nonlinear representation capability of IMEM. Subsequently, a linear mapping $F c (∙)$ with the parameter $W_{1}, b_{1} \in R^{T \times 1 \times D / 2 G}$ is applied to achieve feature scaling and shifting. The Sigmoid activation function $σ (∙)$ can adaptively select the attention weight, and finally a residual design is used to fully utilize the original information, resulting in:

\begin{matrix} Y_{K 1}^{'} & = Y_{K 1} + σ (F, c, (R, e, L, U, (G, A, P, (Y_{K 1})))) * Y_{K 1} \\ = Y_{K 1} + σ (W_{1} * R e L U (G, A, P, (Y_{K 1})) + b_{1}) * Y_{K 1} \end{matrix}

In the spatial attention branch, group normalization $G N (∙)$ can obtain a spatial level statistical vector for $Y_{K 2}$ . It is then successively passed through $R e L U (∙)$ , a linear mapping $F c (∙)$ with $W_{2}, b_{2} \in R^{T \times 1 \times D / 2 G}$ and $σ (∙)$ to enhance the spatial features. Subsequently, the original information is integrated by residual design, as follows:

\begin{matrix} Y_{K 2}^{'} & = Y_{K 2} + σ (F, c, (R, e, L, U, (G, N, (Y_{K 2})))) * Y_{K 2} \\ = Y_{K 2} + σ (W_{2} * R e L U (G, N, (Y_{K 2})) + b_{2}) * Y_{K 2} \end{matrix}

Finally, the two branches are concatenated. Then, the features in the $G$ groups of feature sub-channels are fused through a feature aggregating strategy to obtain the output $y_{l}^{''} \in R^{T \times (K + 1) \times D}$ of IMEM.

Compared to works only performing token shifting [24, 26], IMEM makes up for the information loss caused by shifting and ensures feature integrity of context analysis. This enables the model to effectively suppress inter-frame redundancy and focus on capturing subtle motion changes.

In summary, the calculation process in the $l$ -th encoder block can be summarized as:

y_{l}^{'} = L E S A (L, N, (y_{l - 1})) + y_{l - 1}

y_{l}^{''} = I M E M (y_{l}^{'})

y_{l} = F F N (L, N, (y_{l}^{''})) + y_{l}^{'}

For the classification task, let $c_{L} \in R^{T \times D}$ be the class token from the last encoder. $F C (∙)$ is a fully connected layer of shape $D \times 2$ , where 2 represents the predicted categories, namely fall or non-fall. Thereby, the video label $y$ is obtained by averaging frame-level predictions:

y = \frac{1}{T} \sum_{t = 1}^{T} F C (c_{L})

Experiments

Dataset and experimental configuration

In this experiment, we evaluate our approach on two public video datasets for fall detection: the Le2i dataset [27] and the UR dataset [28]. Le2i was recorded by a single RGB camera with a frame rate of 25 fps and a resolution of 320 $\times$ 240. It contains 191 videos captured in four various life scenarios (coffee room, home, lecture, office). UR contains 70 RGB videos, including 30 falls and 40 activities of daily living with a frame rate of 32 fps and a resolution of 640 $\times$ 480. Both datasets consider various environmental factors, like lighting changes, shadows, etc.

We utilize ViT-Base as the baseline model with 12 encoders and set the number of heads of LESA to 12. The hidden layer dimension and FFN dimension are set to 768 and 3072. For ViT-Base, the original scale of the datasets is small. To increase the number and diversity of video samples, we cut them into frames and apply data enhancement strategies such as horizontal flipping, center cropping, etc. After preprocessing, the input frame size and patch size are set to 224 $\times$ 224 and 16 $\times$ 16. To stably assess the generalization of VPE-ViT-FD, we split the dataset into the training set and test set in a ratio of 8:2 and employ the five-fold cross-validation for average results.

Considering computing resources, we build the model on a server equipped with an NVIDIA Tesla V100-32 GB GPU and an Intel® Xeon® Platinum 8255C CPU. The model is built using the deep learning framework of Pytorch 1.11.0, with Python 3.8 and CUDA 11.3.

In this experiment, we choose Mini-Batch SGD as the optimizer and adopt a hyperparameter search strategy[29] to find the best assignment, as shown in Table 1. We use the model pre-trained on ImageNet-21 K [30] to initialize the weights. The initial learning rate is set to 0.00275 and decays to 10% of the original value every 15 epochs.

Table 1. Hyperparameter settings used in Mini-Batch SGD

Hyperparameter	Search space	Best assignment
Milestones	{10,20}	15
Gamma	{0.1,0.5}	0.1
Batchsize	{4,16}	12
Number of epochs	{20,50}	30
Base learning rate	{10e−2,5e−3; 10e−3,5e−4}	2.75e−3
Weight decay	{1e−5,5e−}	1e−5

Evaluation metrics

To evaluate the comprehensive performance of the proposed algorithm, we use precision, sensitivity, specificity, accuracy and F-Score as evaluation metrics:

Accuracy is defined as the ratio of correctly classified instances to the total number of samples:

A c c u r a c y = \frac{(T P + T N)}{(T P + T N + F P + F N)}

Sensitivity represents the proportion of all fall instances that are correctly identified as falls, emphasizing the performance of the model in detecting fall instances:

S e n s i t i v i t y = \frac{TP}{(T P + F N)}

Specificity is defined as the proportion of all non-fall instances that are correctly detected as non-fall behaviors, focusing on the model's ability to detect non-fall instances:

S p e c i f i c i t y = \frac{TN}{(T N + F P)}

Precision represents the proportion of correct detections among all instances classified as falls, that is:

P r e c i s i o n = \frac{TP}{(T P + F P)}

The F-Score of the model is the harmonic mean of sensitivity and precision, which is particularly effective under the condition of imbalanced datasets, that is:

F - S c o r e = 2 \times \frac{P r e c i s i o n \times S e n s i t i v i t y}{P r e c i s i o n + S e n s i t i v i t y}

Parameter tuning experiments

To investigate the impact of the internal parameter configuration of IMEM and LESA on VPE-ViT-FD, we conduct tuning experiments on Le2i. These experiments are crucial for gaining a deeper understanding of VPE-ViT-FD and lay the foundation for further experiments.

To examine the specific impact of the number of feature groups of DAOU in IMEM on model performance, we conduct experiments. Table 2 presents the accuracy results of VPE-ViT-FD under different number of groups.
Table 2. Accuracy of VPE-ViT-FD under different number of groups

Groups
2
4
8
16
32
Accuracy (%)
97.92
98.09
99.25
98.26
98.15

Bold indicates the best
We believe that the effectiveness of feature grouping depends significantly on the number of groups. Too few groups (2 or 4) may weaken the interaction among feature sub-channels; while excessive groups(16 or 32) may cause information loss problems, resulting in a decline in model performance. Therefore, we set the number of groups of DAOU to 8, which achieves an accuracy of 99.25%.
We tune the drop rate $dr$ in LESA to find the best balance between model performance and training efficiency. To this end, we compare the accuracy of VPE-ViT-FD under different $dr$ and two common metrics for measuring computational complexity: Params(in millions, M) and FLOPs(in billions, G), as shown in Table 3:
Table 3. Results of related metrics of VPE-ViT-FD under different $dr$

$dr$
Params(M)
FLOPs (G)
Accuracy (%)
0
85.88
142.15
99.48
0.05
85.88
102.76
99.31
0.075
85.88
88.77
99.25
0.1
85.88
76.81
98.44
0.2
85.88
46.35
82.81

Bold indicates the best
As presented in Table 3, $dr$ mainly affects FLOPs and accuracy. Specifically, with $dr$ at 0.05, the model maintains an accuracy close to non-drop configuration, but the FLOPs are still high; when $dr$ increases to 0.075, the accuracy remains reasonable, while the decrease in FLOPs is significant, and the model’s training efficiency is effectively improved; however, as $dr$ further increases, although FLOPs continue to decrease, the accuracy drops notably. This indicates that excessive discarding of patch tokens will severely weaken the model’s feature representation ability. Therefore, we set $dr$ to 0.075 to achieve the best balance between accuracy and efficiency.
We employ the grid search method [21] for the threshold factor $th$ and suppression factor $s$ of the TAMS in LESA to determine which combination can best suppress trivial attention weights. The setting range for $th$ is [0.25, 0.1, 0.075, 0.05, 0], the range for $s$ is [0.75, 0.5, 0.25, 0.1, 0]. The accuracy results of VPE-ViT-FD under different combinations are shown in Table 4:
Table 4. Accuracy of VPE-ViT-FD under different combinations

$s$
0.75
0.5
0.25
0.1
0
$th$
0.25
97.22
96.93
96.88
96.70
95.95
0.1
97.40
97.92
98.26
97.74
96.06
0.075
97.57
98.44
99.25
98.15
96.53
0.05
96.35
97.11
98.09
97.28
95.66
0
97.05

Bold indicates the best
As can be observed, the optimal combination of TAMS is [ $th$ , $s$ ] = [0.075, 0.25], with an accuracy of 99.25%. The video tasks involve the understanding of temporal information and complex dynamic scenes. Additionally, we have implemented the LFAES before suppressing the trivial attention, thereby making our parameter settings and conclusions not completely consistent with those of [21] for 2D image tasks.

Groups	2	4	8	16	32
Accuracy (%)	97.92	98.09	99.25	98.26	98.15

$dr$	Params(M)	FLOPs (G)	Accuracy (%)
0	85.88	142.15	99.48
0.05	85.88	102.76	99.31
0.075	85.88	88.77	99.25
0.1	85.88	76.81	98.44
0.2	85.88	46.35	82.81

$s$	0.75	0.5	0.25	0.1	0
0.25	97.22	96.93	96.88	96.70	95.95
0.1	97.40	97.92	98.26	97.74	96.06
0.075	97.57	98.44	99.25	98.15	96.53
0.05	96.35	97.11	98.09	97.28	95.66
0	97.05

Based on Table 4, we make the following inferences:

Threshold factor $th$ If $th$ is set too high, non-trivial attention may be misclassified as trivial, preventing the model from fully utilizing useful information; if $th$ is set too low, more trivial attention will be retained, which will introduce excessive noise into the training process and affect the model’s performance.
Suppression factor $s$ The trivial attention is determined based on the set threshold, which also contains some useful information. If $s$ is set too large, trivial attention will dominate, making it difficult for the model to focus on non-trivial attention; if $s$ is set too small, trivial attention will be overly suppressed, preventing the model from fully learning the useful information.
Zero value Without applying suppression (the last row of Table 4, $th$ =0), the accuracy sometimes exceeds suppressed cases. This may be due to improper parameter settings in some cases. Yet, after gradually optimizing the parameter configuration via grid search, it can be observed that the accuracy with suppression is higher in most cases. When all trivial attentions are removed (the last column of Table 4, $s$ =0), the model’s accuracy drops significantly, indicating that trivial attentions contain a certain amount of useful information. Completely removing these attentions will severely impact the model’s performance.

Given the similarity in task between UR and Le2i, and to maintain consistency in our experimentation, we adopt the same parameter settings identified for Le2i in our subsequent experiments on UR. This enables us to leverage the insights gained from Le2i and efficiently investigate the generalization capability of VPE-ViT-FD.

Ablation experiments

In this section, we design ablation experiments to examine the performance gains of the innovative components of proposed algorithm on the baseline model. We compare the performance of the baseline algorithm (ViT-FD), the variant with only IMEM added (IMEM-FD), the variant with only LESA added (LESA-FD), and the proposed integrated algorithm (VPE-ViT-FD) on Le2i and UR. The parameter settings for all four algorithms are kept consistent, and all are pre-trained on ImageNet-21 K based on ViT-Base to ensure a fair comparison.

To efficiently observe the impact of parameter settings on the experimental results, the experiment adopts the method of simultaneous training and testing. The curve of the test accuracy varying with epochs is shown in Fig. 3.

[See PDF for image]

Fig. 3

Epoch-Accuracy curves of VPE-ViT-FD and related algorithms on Le2i (a) and UR (b)

It can be observed in (a) that despite the application of data enhancement and the import of pre-trained weights, Le2i remains a small dataset for ViT. In early training, the curves of all four algorithms exhibit fluctuations. MHSA, as the core of ViT, serves as the primary driver of its overall performance. Therefore, LESA-FD and VPE-ViT-FD, which are specifically optimized for MHSA, can learn the intrinsic patterns of the data more quickly than IMEM-FD, which improves model’s capability through plug-in modules. Consequently, their training curves will get rid of the fluctuation stage earlier.

From an overall perspective, in the initial few epochs, the learning trends of four curves are similar. However, as iterations increase, the performance differences gradually emerge. Specifically, ViT-FD, due to its lack of inductive bias, has a limited understanding of local details, so it exhibits a convergence trend before reaching the preset Milestones (Epoch = 15); IMEM-FD provides ViT with rich details by suppressing inter-frame redundancy and capturing subtle motion changes, so its performance is more significantly improved than ViT-FD; LESA-FD effectively alleviates the smoothing issue of token attention weights distribution and focuses more on key features of moving target, outperforming IMEM-FD; our VPE-ViT-FD fully models motion information and considers the limitations of capturing core features across and within frames, and demonstrate the best performance. The learning trends of curves observed in (b) for UR mirror those found on Le2i. Notably, the smaller size of UR leads to an initially higher accuracy and faster convergence of the curves. The results on UR validate the generalization capability of our method.

To comprehensively evaluate the performance of the four algorithms on Le2i and UR, Table 5 and Table 6 list details regarding the metrics described in Sect. 4.2.

Table 5. Ablation study of VPE-ViT-FD and related algorithms on Le2i

Methods	Precision (%)	Sensitivity (%)	Specificity (%)	Accuracy (%)	F-Score
ViT-FD	96.11	95.03	90.06	95.42	0.96
IMEM-FD	97.71	97.16	94.32	97.40	0.97
LESA-FD	98.41	97.88	95.77	98.15	0.98
VPE-ViT-FD	99.32	99.18	98.36	99.25	0.99

Bold indicates the best

Table 6. Ablation study of VPE-ViT-FD and related algorithms on UR

Methods	Precision (%)	Sensitivity (%)	Specificity (%)	Accuracy (%)	F-Score
ViT-FD	96.88	95.24	90.48	96.08	0.96
IMEM-FD	97.76	97.20	95.24	97.55	0.97
LESA-FD	98.78	98.21	96.43	98.53	0.98
VPE-ViT-FD	99.52	99.43	98.86	99.48	0.99

Bold indicates the best

As indicated in Tables 5 and 6, VPE-ViT-FD outperforms other algorithms in all five metrics. This proves that VPE-ViT-FD's improvements in addressing both inter-frame and intra-frame issues contribute significantly to its superior detection performance.

Visualization experiments

To intuitively compare the focusing ability of all four.

algorithms on foreground moving target in fall videos, we adopt Grad-Cam [31] for visualization experiments. Figure 4 shows the visualization results of the target at the moment of final falls in five scenarios: four from Le2i (coffee room, lecture, office and home) and one from UR.

[See PDF for image]

Fig. 4

Visualization results of all four algorithms on Le2i and UR

By analyzing the visualization results on Le2i, we find all four algorithms detect fallen targets in coffee room and lecture. While VPE-ViT-FD shows a better focusing effect, its difference with other algorithms is not notable. However, in office and home scenarios, the results show obvious differences. It can be observed that a considerable amount of attention is easily drawn to the background. We speculate that this may be related to the fact that the shooting angle of the home and office is relatively higher than that of the previous ones. Hence, obstacles like tables and chairs may more easily interfere with fallen targets detection. Specifically, due to ViT-FD's limited ability to understand local details, its detection effect is not ideal; although IMEM-FD and LESA-FD alleviate this issue to an extent by improving the model’s ability to capture local details, their discrimination ability remains limited; in contrast, VPE-ViT-FD can comprehensively capture core features across and within frames, and maximize the focus on the target. This advantage is evident in the visualization experiment for UR as well, thus proving the effectiveness and robustness of VPE-ViT-FD in video fall detection.

Comparative experiments

To further objectively validate the effectiveness of VPE- ViT-FD, we conduct comparative experiments on Le2i and UR, using the metrics reported by four advanced methods: two methods based on handcrafted features [32, 33] and two deep learning methods [8, 9]. As presented in Table 7 and Table 8, our method outperforms advanced algorithms across all five metrics. Compared to traditional methods [32, 33], VPE-ViT-FD automatically extracts rich features for classification, and saves manual extraction efforts. Moreover, it produces more robust and accurate detection results in complex scenarios. In comparison to deep learning methods [8, 9], VPE-ViT-FD directly processes RGB data without extra modal preprocessing, while exhibiting notable advantages in parsing global information and long sequence temporal correlations. On this basis, we optimize the capture of local details and dynamic changes across and within frames, as well as the suppression of redundancy, thereby.

Table 7. Comparison of VPE-ViT-FD and other existing advanced algorithms on Le2i, including the publication year and sources from Journal (J) and Conference (C)

Methods	Source	Year	Precision (%)	Sensitivity (%)	Specificity (%)	Accuracy (%)	F-Score
Carlier et al. [9]	C	2020	92.80	93.50	97.60	–	0.94
Dentamaro et al. [32]	C	2021	87.00	90.00	90.00	91.00	0.91
Beddiar et al. [33]	J	2022	70.70	69.30	61.30	65.90	0.70
Berlin et al. [8]	J	2022	97.00	96.00	96.00	96.00	0.96
Ours	J	2024	99.32	99.18	98.36	99.25	0.99

The symbol “–” indicates unknown and bold indicates the best

Table 8. Comparison of VPE-ViT-FD and other existing advanced algorithms on UR, including the publication year and sources from Journal (J) and Conference (C)

Methods	Source	Year	Precision (%)	Sensitivity (%)	Specificity (%)	Accuracy (%)	F-Score
Carlier et al. [9]	C	2020	85.30	96.70	94.10	–	0.94
Dentamaro et al. [32]	C	2021	94.00	93.00	88.00	90.00	0.90
Beddiar et al. [33]	J	2022	92.00	91.00	93.00	87.00	0.87
Berlin et al. [8]	J	2022	96.00	87.00	97.00	96.00	0.91
Ours	J	2024	99.52	99.43	98.86	99.48	0.99

The symbol “–” indicates unknown and bold indicates the best

further improving VPE-ViT-FD’s overall performance.

Conclusion

We propose VPE-ViT-FD and adopt ViT-Base as the baseline model to learn global feature representation in video sequences. To effectively suppress redundancy and capture subtle motion changes across video frames, we further design the IMEM. Simultaneously, we design the LESA to improve the model's ability to focus on local key features related to falls within the frame. Experimental results on Le2i and UR show that, compared with several existing advanced algorithms, our VPE-ViT-FD can accurately and efficiently detect fall behavior.

Author contribution

Cai, X. was responsible for conceptualization, methodology, investigation, formal analysis, and supervision. Wang, X.C. focused on data curation and writing the original draft. Bao, K.X. contributed to visualization design and investigation. Chen, Y.N. provided resources and software support. Jiao, Y conducted literature reviews and references. Han, G. participated in conceptualization, supervision, as well as writing the review and editing of the manuscript. All authors reviewed the manuscript, jointly ensuring the accuracy and integrity of the research.

Data availability

The research uses the Le2i [27] and UR [28] datasets from publicly available fall detection datasets. Data will be made available on reasonable request.

Declarations

Conflict of interest

The authors declare no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Montero-Odasso, M; Van Der Velde, N; Martin, FC et al. World guidelines for falls prevention and management for older adults: a global initiative [J]. Age Age.; 2022; 51, 9 205. [DOI: https://dx.doi.org/10.1093/ageing/afac205] 0777.90019

2. Newaz, NT; Hanada, EJS. The methods of fall detection: a literature review [J]. Sensors.; 2023; 23, 11 5212. [DOI: https://dx.doi.org/10.3390/s23115212] 07792984

3. Alam, E; Sufian, A; Dutta, P et al. Vision-based human fall detection systems using deep learning: A review. Comput. Bio. Med.; 2022; 146, 105626. [DOI: https://dx.doi.org/10.1016/j.compbiomed.2022.105626] 1501.35039

4. Zou, S; Min, W; Liu, L et al. Movement tube detection network integrating 3d cnn and object detection framework to detect fall. Electronics; 2021; 10, 8 898. [DOI: https://dx.doi.org/10.3390/electronics10080898] 1304.34139

5. Alanazi, T; Muhammad, GJD. Human fall detection using 3D multi-stream convolutional neural networks with fusion. Diagnostics; 2022; 12, 12 3060. [DOI: https://dx.doi.org/10.3390/diagnostics12123060] 1498.05237

6. Chen, Y; Li, W; Wang, L et al. Vision-based fall event detection in complex background using attention guided bi-directional LSTM. IEEE Access.; 2020; 8, pp. 161337-161348. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3021795]

7. Feng, Q; Gao, C; Wang, L et al. Spatio-temporal fall event detection in complex scenes using attention guided LSTM. Pattern Recognit. Lett.; 2020; 130, pp. 242-249. [DOI: https://dx.doi.org/10.1016/j.patrec.2018.08.031] 1174.60445

8. Berlin, SJ; John, MJJOAI; Computing, H. Vision based human fall detection with Siamese convolutional neural networks. J. Ambient Int. Human. Comput.; 2022; 13, 12 pp. 5751-5762. [DOI: https://dx.doi.org/10.1007/s12652-021-03250-5]

9. Carlier A, Peyramaure P, Favre K, et al. Fall detector adapted to nursing home needs through an optical-flow based CNN, In Proceedings of the 2020 42nd annual international conference of the IEEE engineering in medicine & biology society (EMBC), F, 2020 [C]

10. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16 x 16 words: Transformers for image recognition at scale [J]. 2020

11. Tanwar R, Nandal N, Zamani M, et al. Pathway of trends and technologies in fall detection: a systematic review, In Proceedings of the healthcare, F, 2022 [C]. MDPI

12. Lage, I; Braga, F; Almendra, M et al. Older people living alone: A predictive model of fall risk. Int. J. Environ. Res. Public Health.; 2023; 20, 13 6284. [DOI: https://dx.doi.org/10.3390/ijerph20136284] 1546.93586

13. Hu J, Shen L, Sun G. Squeeze-and-excitation networks, In Proceedings of the proceedings of the IEEE conference on computer vision and pattern recognition, F, 2018 [C]

14. Woo S, Park J, Lee J-Y, et al. Cbam: convolutional block attention module, In Proceedings of the proceedings of the European conference on computer vision (ECCV), F, 2018 [C]

15. Zhang Q-L, Yang Y-B. Sa-net: Shuffle attention for deep convolutional neural networks, In Proceedings of the ICASSP 2021–2021 IEEE international conference on acoustics, speech and signal processing (ICASSP), F, 2021 [C]

16. Yu M, Gong L, Kollias S. Computer vision based fall detection by a convolutional neural network, In Proceedings of the proceedings of the 19th ACM international conference on multimodal interaction, F, 2017 [C]

17. Cai, X; Liu, X; An, M et al. Vision-based fall detection using dense block with multi-channel convolutional fusion strategy. IEEE Access; 2021; 9, pp. 18318-18325. [DOI: https://dx.doi.org/10.1109/ACCESS.2021.3054469]

18. Núñez-Marcos, A; Azkune, G; Arganda-Carreras, IJWC et al. Vision-based fall detection with convolutional neural networks. Wireless Commun. Mobile Comput.; 2017; 2017, 1 9474806.

19. Lu, N; Wu, Y; Feng, L et al. Deep learning for fall detection: Three-dimensional CNN combined with LSTM on video kinematic data. IEEE J. Biomed. Health Inform.; 2018; 23, 1 pp. 314-23. [DOI: https://dx.doi.org/10.1109/JBHI.2018.2808281] 1509.34017

20. Lee S H, Lee S, Song B C J A P A. Vision transformer for small-size datasets [J]. 2021

21. Chen X, Hu Q, Li K, et al. Accumulated trivial attention matters in vision transformers on small datasets, In Proceedings of the proceedings of the IEEE/CVF winter conference on applications of computer vision, F, 2023 [C]

22. Wu K, Peng H, Chen M, et al. Rethinking and improving relative position encoding for vision transformer, In Proceedings of the proceedings of the IEEE/CVF international conference on computer vision, F, 2021 [C]

23. Bertasius G, Wang H, Torresani L. Is space-time attention all you need for video understanding?, In Proceedings of the ICML, F, 2021 [C]

24. Zhang H, Hao Y, Ngo C-W. Token shift transformer for video classification, In Proceedings of the proceedings of the 29th ACM international conference on multimedia, F, 2021 [C]

25. Wu B, Gu J, Li Z, et al. Towards efficient adversarial training on vision transformers, In Proceedings of the European conference on computer vision, F, 2022 [C]

26. Hashiguchi R, Tamaki T J A P A. Vision transformer with cross-attention by temporal shift for efficient action recognition [J]. 2022

27. Charfi, I; Miteran, J; Dubois, J et al. Optimized spatio-temporal descriptors for real-time fall detection: comparison of support vector machine and Adaboost-based classification. J. Electron. Imag.; 2013; 22, 4 041106. [DOI: https://dx.doi.org/10.1117/1.JEI.22.4.041106] 1109.94323

28. Kwolek, B; Kepski, MJCM; Biomedicine, PI. Human fall detection on embedded platform using depth maps and wireless accelerometer. Comput. Methods Programs Biomed.; 2014; 117, 3 pp. 489-501. [DOI: https://dx.doi.org/10.1016/j.cmpb.2014.09.005] 1058.68671

29. Dodge J, Gururangan S, Card D, et al. Show your work: improved reporting of experimental results [J]. 2019

30. Krizhevsky A, sutskever I, Hinton G E J A I N I P S. Imagenet classification with deep convolutional neural networks [J]. 2012, 25

31. Selvaraju R R, Cogswell M, Das A, et al. Grad-cam: visual explanations from deep networks via gradient-based localization, In Proceedings of the proceedings of the IEEE international conference on computer vision, F, 2017 [C]

32. Dentamaro V, Impedovo D, Pirlo G. Fall detection by human pose estimation and kinematic theory, In Proceedings of the 2020 25th international conference on pattern recognition (ICPR), F, 2021 [C]. IEEE

33. Beddiar, DR; Oussalah, M; Nini, BJJOVC et al. Fall detection using body geometry and human pose estimation in video sequences. J. Visual Commun. Image Represent.; 2022; 82, 103407. [DOI: https://dx.doi.org/10.1016/j.jvcir.2021.103407] 1479.35224

Word count: 6190

Show less

$s$	0.75	0.5	0.25	0.1	0
$th$	0.75	0.5	0.25	0.1	0
0.25	97.22	96.93	96.88	96.70	95.95
0.1	97.40	97.92	98.26	97.74	96.06
0.075	97.57	98.44	99.25	98.15	96.53
0.05	96.35	97.11	98.09	97.28	95.66
0	97.05

Visual perception enhancement fall detection algorithm based on vision transformer

Content area

Abstract

Full text