MSCDGCN: Multi-scale central difference graph convolution network for skeleton-based action recognition

Abstract

Effectively distinguishing fine-grained actions remains a critical challenge in skeleton-based action recognition. Existing Graph Convolutional Network methods often overlook directional motion cues and fail to integrate spatiotemporal features efficiently. To address these limitations, this paper proposes a novel Multi-Scale Central Difference Graph Convolutional Network (MSCDGCN) for skeleton-based action recognition. This model introduces a sparse self-attention-based central difference graph convolution that highlights key joints, enhancing local feature extraction while capturing contextual dependencies and intrinsic skeletal topology. A Spatial Temporal Joint Focus (STJF) module is designed to efficiently fuse spatial features extracted by the self-attention central difference graph convolution with temporal features obtained through Multi-Scale Separable Temporal Convolution (MSSTC). Experiments on NTU RGB + D 60 and NTU RGB + D 120 demonstrate state-of-the-art performance. Specifically, MSCDGCN achieves accuracies of 92.8% (X-Sub) and 96.8% (X-View) on NTU RGB + D 60, surpassing previous methods. On the larger NTU RGB + D 120 dataset, it attains accuracies of 89.5% (C-Sub) and 91.0% (C-Set). Furthermore, the recognition accuracy rate on the cross-dataset Kinetics is recorded at 39.8%, thereby validating its performance advantage. Ablation studies confirm the contributions of each module: SACD and STJF collectively enhance accuracy by 1.0% through directional feature learning and spatiotemporal fusion mechanisms. These results show that MSCDGCN effectively tackles fine-grained recognition and computational efficiency. This framework offers a robust solution for real-world applications necessitating precise motion differentiation, such as surveillance systems and human–computer interaction scenarios.

Full text

Translate

Turn on search term navigation

Introduction

With the rapid development of human–computer interaction, video surveillance, autonomous driving, and related fields (Karim et al. 2024; Selvi et al. 2022; Shi 2024), action recognition has attracted increasing attention from researchers. Its goal is to identify human behaviors accurately and efficiently from visual data. Traditional pixel-based or optical-flow-based methods, however, often struggle with unstable performance when facing complex backgrounds, illumination changes, viewpoint variations, or human occlusions. In contrast, skeleton-based action recognition has gained prominence due to its robustness in challenging environments (Miao and Meunier 2022) and its ability to capture the intrinsic nature of human motion. Skeleton sequences directly encode the human body’s topological structure. This representation captures semantic motion features that improve recognition accuracy and help distinguish subtle action differences—an essential requirement in dynamic real-world settings. Skeleton-based action recognition methods utilizing GCNs typically identify actions by fusing spatial and temporal features extracted from skeletal data (Xiong et al. 2023). Yan et al. (2018) proposed ST-GCN, which captures deep spatial–temporal patterns in skeleton data by stacking graph convolution operations across spatial and temporal dimensions. Although ST-GCN and numerous subsequent GCN-based approaches (Shi et al. 2019; Tu et al. 2023) have demonstrated notable advancements in action classification and recognition, they still face challenges when handling long-term sequences and fine-grained actions. To address this, Filtjens et al. (2024) introduced MS-GCN, a framework that leverages dilated temporal convolutions to derive temporal features, enabling the extraction of long-term dependencies based on spatial hierarchies between joints. Liu et al. (2022a, b) proposed a spatially-focused attention mechanism for fine-grained skeleton-based tasks, which constructs hierarchical tree-structured attention graphs to capture local joint dependencies. Shi et al. (2019) developed 2 s-AGCN, incorporating non-local neural network modules to adaptively learn graph structures.

Early approaches primarily relied on handcrafted feature extraction and classifier design, requiring customized features for different action classes. While effective to some extent, these methods lacked generalization. In recent years, deep learning has achieved remarkable success in action recognition due to its superior feature extraction capabilities. Deep learning–based methods (Cheng et al. 2023; Nguyen et al. 2023; Qiu et al. 2022; Shu et al. 2022) typically represent skeletons as 2D/3D joint coordinates or pseudo-image sequences, and then apply Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs) to capture intra-frame features and temporal dependencies. Although RNNs and CNNs have proven effective in fields such as image classification and medical diagnostics (Awotunde et al. 2023; Oguntoye et al. 2023), skeleton data are more naturally represented as a graph, where joints correspond to nodes and bones to edges. This makes Graph Convolutional Networks (GCNs) (Liu et al. 2024; Shi et al. 2020a, b) particularly well-suited for modeling skeletal topology in action recognition.

GCN-based approaches typically fuse spatial and temporal features extracted from skeleton sequences (Xiong et al. 2023). For example, Yan et al. (2018) proposed ST-GCN, which captures spatial–temporal patterns by stacking graph convolutions across both dimensions. Subsequent works (Shi et al. 2019; Tu et al. 2023) have improved action classification and recognition, yet challenges remain in handling long-term dependencies and fine-grained actions. To address this, Filtjens et al. (2024) introduced MS-GCN, which employs dilated temporal convolutions to capture long-term dependencies. Liu et al. (2022a, b) proposed a spatially focused attention mechanism that constructs hierarchical tree-structured attention graphs to model local joint dependencies. Similarly, Shi et al. (2019) developed 2 s-AGCN, incorporating non-local modules to adaptively learn graph structures.

Despite these advancements, current methods still suffer from inefficient modeling of directional motion cues and limited integration of spatial–temporal features, reducing their ability to discriminate fine-grained actions. Conventional GCNs aggregate features without explicitly capturing directional motion (e.g., differences between “putting on a hat” and “taking off a hat”), weakening recognition of subtle behaviors. Moreover, many apply attention only in the spatial domain, ignoring temporal dynamics. This decoupled processing often leads to information loss, especially under viewpoint variations. This decoupled processing often causes information loss in dynamic scenarios involving viewpoint changes. In addition, multi-stream architectures commonly used in prior works increase model complexity and computational cost.

To overcome these challenges, the Multi-Scale Central Difference Graph Convolution Network (MSCDGCN) is proposed. The model incorporates three key components:

Direction-aware graph convolution

The SACD module integrates sparse self-attention with central difference graph convolution, enabling precise capture of directional motion patterns and effective spatial feature learning while suppressing noise from non-critical joints.

Efficient multi-scale temporal modeling

The MSSTC module combines hourglass-shaped separable convolutions with dilated kernels, achieving hierarchical temporal feature extraction while reducing computational complexity compared to standard implementations.

Unified spatial–temporal optimization

The STJF module introduces a novel joint attention mechanism that simultaneously processes spatial and temporal dimensions, addressing the feature misalignment problem in existing decoupled approaches.

The MSCDGCN framework is trained with an Information Bottleneck–based loss, ensuring compact yet discriminative skeletal representations. Together, these innovations significantly enhance the recognition of fine-grained actions while maintaining efficiency, making the model suitable for real-world deployment.

Related work

In the field of human action recognition, researchers have explored multiple modalities, including RGB images, optical flow, and skeletal data. The skeletal modality encodes the human body as a topology of joints and bones. Under challenging conditions—such as changes in illumination, viewpoint, or motion speed—skeleton data are more stable than other modalities. As a result, skeleton-based action recognition has become a central focus of recent research.

Early approaches (Vemulapalli et al. 2014) primarily relied on handcrafted features designed to capture kinematic aspects such as joint rotations and translations. However, these methods often ignored intrinsic relationships between joints, and their feature engineering processes were both complex and yielded suboptimal performance. With advances in deep learning, neural network-based approaches for skeleton action recognition have gained prominence. Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) have been widely adopted for this task. RNN-based methods model skeleton data as sequences of coordinate vectors, each corresponding to a human joint. For example, Shu et al. (2022) proposed SC-RNN, which jointly captures spatial consistency among joints and temporal dynamics across sequences. Ma et al. (2021) introduced MESN, which uses a multi-reservoir architecture to model multi-scale temporal dependencies. Zhang et al. (2019) developed an LSTM network with viewpoint adaptation to improve recognition from diverse angles. Nevertheless, RNNs often overemphasize temporal information while neglecting spatial relationships between joints. On the other hand, CNNs (Qiu et al. 2022), known for learning structural features from 2D data, have been applied by encoding skeleton joints into 2D pseudo-images (Cheng et al. 2023) for feature extraction. Wang et al. (2018), for instance, proposed the JTM method, which converts 3D skeleton sequences into multiple 2D images processed by CNNs. Despite their strengths, CNNs are limited in modeling temporal dependencies, making integrated spatiotemporal feature learning a persistent challenge. Moreover, although the skeleton naturally forms a graph in non-Euclidean space (Liu et al. 2022a, b), many methods still underutilize this structural property and generalize poorly to arbitrary skeleton configurations.

Graph Convolutional Network (GCN)-based methods have excelled in extracting discriminative spatiotemporal features from skeleton sequences (Liu et al. 2018). Within GCNs, the construction of the adjacency matrix is crucial for model performance. Recent approaches (Gao et al. 2022; Liu et al. 2020) often employ learnable adjacency matrices to capture latent relationships among joints—even those not physically connected. Li et al. (2019) proposed AS-GCN, which uses a learnable adjacency matrix for spatial feature aggregation and an encoder-decoder to capture action-specific dependencies. Si et al. (2019) introduced AGC-LSTM, which replaces fully connected layers in LSTM gates with graph convolutions to leverage skeletal topology in long-term temporal modeling. Shi et al. (2019) presented 2 s-AGCN, incorporating second-order features like joint coordinates and bone vectors (i.e., directional vectors between adjacent joints) to improve classification accuracy. Ye et al. (2020) designed a context-aware GCN that automates topology learning and integrates global context during joint-pair interaction analysis. Other methods combine GCNs with complementary mechanisms: Cheng et al. (2020a, b) proposed Shift-GCN, which uses temporal feature shifting to efficiently model motion dynamics; Zhang et al. (2020) developed SGN, incorporating semantic information (e.g., joint type and frame index) and hierarchical relationships to enhance representation while reducing complexity; and Liu et al. (2020) introduced a disentangled GCN with inter-temporal edges to aggregate features across frames. Despite these advances, many methods still overlook subtle inter-node differences and directional motion cues.

Attention mechanisms (Xia et al. 2022) allow models to emphasize informative regions of the skeleton—such as action-relevant joints or bones—thereby boosting recognition performance. Several studies have integrated attention into action recognition. Zhou et al. (2020) proposed a MAT module that uses asymmetric attention in a two-stream encoder to fuse motion and appearance features hierarchically. Lee et al. (2018) introduced GAM, which combines RNNs and self-attention to focus on relevant regions in graph-structured data. Vrahatis et al. (2024) proposed GAT, using multi-head self-attention followed by concatenation or averaging to compute node representations influenced by neighbors. Zhang et al. (2018) designed a gated graph self-attention mechanism that adaptively weights attention heads and refines graph topology, showing gains in classification accuracy. Xu et al. (2024) developed a framework featuring spatiotemporal decoupling and contrastive learning to obtain richer action representations by comparing compressed spatial and temporal features. Although these methods reveal the potential of attention in global spatiotemporal modeling, body-informed topological priors remain essential for spatial representation. Relying solely on attention to infer spatial structure (Bai et al. 2022; Shi et al. 2020a, b) proves inadequate for capturing dynamic spatial relationships in action sequences. Furthermore, channel-specific variations in topology warrant deeper investigation. Additionally, attention mechanisms alone often fail to effectively capture fine-grained local motion patterns.

Method

Existing methods often rely on fixed physical topologies and struggle to capture the intrinsic joint relationships critical for action understanding. Many approaches overlook multi-scale temporal dynamics or fail to reduce spatiotemporal redundancy, limiting representation quality. To address these gaps, this paper proposes MSCDGCN, which learns the intrinsic topology, models multi-scale temporal dependencies, and enhances feature correlations through spatial–temporal joint attention.

Preliminaries

Skeleton action recognition involves using deep learning techniques to extract human body key points from images or videos, followed by identifying specific behaviors or actions by analyzing the movement patterns and spatial relationships of these key points. The human skeleton can be depicted as a system of joints and bones, with joints serving as connections between adjacent bones. Consequently, the skeleton can be simplified into a graph composed of points and edges. The human skeleton can be represented as a graph $G = (V, E)$ , where $V = {1,2, . . . N}$ is the set of $N$ human joints in space and bone as edges $E$ . $A$ is the set of edges between the joint points of the human body. If joints $i$ and $j$ are physically connected, then $A_{i, j} = 1$ ; if joints i and j are physically disconnected, then $A_{i, j} = 0$ .

Architecture overview

In this paper, a new model Multi-Scale Central Difference Graph Convolution Network (MSCDGCN) is constructed, as shown in Fig. 1. The lower panels of Fig. 1 provide an intuitive visualization of how each module processes skeleton sequences. The model consists of three parts: SACD module to learn the spatial structure of the skeleton data; MSSTC module for temporal modelling, and STJF module to further enhance the correlation of spatial and temporal information. The sparse self-attention mechanism is used in SACD module to capture the spatial dependencies of the input data and further fuses and extracts the spatial features through central difference map convolution. MSSTC module is used to capture the temporal features on different time scales. The final output containing rationally adjusted spatial and temporal features is obtained through joint attention of STJF module. Finally, in order to avoid overfitting, the generated feature map is sent to the global average pooling layer for dimensionality reduction, and then the $Softmax$ function is used for action category prediction.

[See PDF for image]

Fig. 1

MSCDGCN Overall Architecture. The encoder comprises three key modules that infer the intrinsic topological structure of joints, offering contextual insights beyond physical connections. The blue line at the bottom illustrates the learned intrinsic topology. The SACD module (left) extracts spatial features; the MSSTC module (middle) captures temporal dependencies through multi-scale separable temporal convolutions; and the STJF module (right) reduces spatiotemporal redundancy via joint attention, producing optimized behavioral representations

Self-attention central difference module

The SACD module employs sparse self-attention to capture global dependencies in input poses, combines temporal weights to infer human body topology. Additionally, it utilizes central difference graph convolution to focus on the local motion features, and reduces the input noise by calculating the differences between neighboring nodes. The architecture of the SACD module is illustrated in Fig. 2.

[See PDF for image]

Fig. 2

Structure of the SACD module. The blue block generates a global dependency graph, where non-critical connections are filtered out using an adaptive threshold. Subsequently, the processed pose vector is combined with the gradient adjacency matrix to compute directional differences between joints, thereby enhancing the perception of local motion patterns

The sequence of skeleton maps is represented as a joint feature vector $X_{t} \in R^{T \times N \times C}$ , where $T$ and $C$ denote the temporal and feature dimensions, respectively. MSCDGCN transforms $X_{t}$ into a $D$ -dimensional vector via linear projection with learnable parameters ( $D$ is the dimension of the input vector). Positional embeddings $PE$ are then added to incorporate joint positional information, yielding the initial transformed hidden state $H_{t}^{(l)}$ , as shown in Eq. (1).

H_{t}^{(0)} = L i n e a r (X_{t}) + P E

let

H_{t}^{(l)}

denote the hidden state after the

l

-th graph convolution layer, where

H_{t}^{(0)}

(explicitly annotated with layer index

l = 0

) represents the initial transformed hidden state obtained after feature embedding, and

H_{t}^{(0)}, P E \in R^{N \times D}

After integrating positional embeddings $PE$ , the sparse self-attention mechanism extracts the topological structure of the skeletal vector $H_{t}^{(l)}$ , enriching it with contextual dependences. This mechanism attends to all potential relationships between joints and related nodes, filters out non-critical information for spatial structure description, and assigns edge weights to reflect relationship strengths. Specifically, it first computes a self-attention map, then applies temporal weights and sparsity thresholds to infer the skeleton topology, thereby precisely capturing key motion features and dependencies in the pose vector. The inferred topology map is further utilized as neighborhood vertex information for central difference graph convolution.

The joint representation vector $H_{t}^{(l)}$ is linearly projected to $D'$ dimension queries and keys to generate the self-attention map $M_{t}^{' (l)}$ , as defined in Eq. (2).

M_{t}^{' (l)} = \frac{H_{t}^{(l)} W_{K}^{(l)} {(H_{t}^{(l)}, W_{Q}^{(l)})}^{T}}{\sqrt{D'}}

where

W_{K}^{(l)}, W_{Q}^{(l)} \in R^{D \times D^{'}}

is the learnable matrix and

D' = D / 8

The design time window size is $p$ , the current time step is $t$ , and the range of time steps within the time window is $[t - p + 1, t]$ . The self-attention map $M_{t}^{' (l)}$ is reshaped according to the time window to obtain the weighted attention score $M_{t}^{(l)}$ as shown in Eq. (3).

M_{t}^{(l)} = σ (\sum_{T = t - p + 1}^{t}, M_{T}^{' (l)}, W_{T}^{(l)})

where

W_{T}^{(l)}

is the learnable temporal weight matrix,

σ

represents the activation function.

The threshold $τ^{(l)}$ for the current input vector is computed from $M_{t}^{(l)}$ via Eq. (4).

τ^{(l)} = σ (l, i, n, e, a, r, (m, a, x, p, o, o, l, (M_{t}^{(l)})))

where

maxpool

denotes the maximum pooling operation,

linear

is the linear transformation.

Denote ${\tilde{M}}_{t}^{(l)}$ as the sparse self-attention score of the pose vector $X_{t}$ at moment $t$ , as shown in Eq. (5).

{\tilde{M}}_{t}^{(l)} = σ (M_{t}^{(l)} - τ^{(l)})

Additionally, a shared topology across time and instances is combined within the sparse self-attention mechanism, where a shared topology is a shared and fixed structure of skeleton connections across multiple time steps or instances. The shared topology and sparse self-attention map comprise $M$ heads. For each head $m$ ( $1 \leq m \leq M$ ), the shared topology graph $A_{m} \in R^{N \times N}$ and the sparse self-attention map ${\tilde{M}}_{t}^{(l)}$ are combined to derive the intrinsic topology, which reflects valid connectivity between joints at the current timestep.

Denote ${\tilde{H}}_{t}^{(l)}$ as the transformed input pose vector at moment $t$ , as given by Eq. (6)

{\tilde{H}}_{t}^{(l)} = A_{m} ⊙ {\tilde{M}}_{t}^{(l)}

where

⊙

denotes the element-by-element multiplication of the broadcast and

M = 3

The updated vector $H_{t}^{(l + 1)}$ after graph convolution is expressed in Eq. (7).

H_{t}^{(l + 1)} = σ ({\hat{A}}_{t}^{(l)}, {\tilde{H}}_{t}^{(l)}, W^{(l)})

where

H_{t}^{(l + 1)}

is the vector representation after graph convolutional,

{\hat{A}}_{t}^{(l)}

is the normalized adjacency matrix (

{\hat{A}}_{t}^{(l)} = D^{- \frac{1}{2}} (A + I) D^{- \frac{1}{2}}

, where

A

is the adjacency matrix),

I

is the identity matrix, and

D

is the diagonal matrix of

A + I

W^{(l)} \in R^{D^{(l)} \times D^{(l + 1)}}

is the learnable parameter matrix at layer

l

D^{(l)}

and

D^{(l + 1)}

denote the feature dimensions of the

l

th and

l + 1

th layers, respectively.

While graph convolution effectively aggregates neighborhood information, it primarily relies on adjacency and feature matrix operations, which may lack sensitivity to local node variations and struggle to capture fine-grained features. In contrast, Central Difference Graph Convolution (CDGC) enhances discriminative power by incorporating locally focused directional features into the graph convolution process, thereby improving representational capacity and generalization. As visualized in Fig. 3a, the SACD module computes directional gradients within a localized sampling region (orange area) centered on each joint. These gradients enhance the model’s ability to detect subtle motion cues crucial for fine-grained action recognition. Figure 3b further illustrates how these directional differences are aggregated into the graph convolution process, enabling the network to propagate spatial topology information while preserving motion directionality.

[See PDF for image]

Fig. 3

Illustration of the Central Difference Graph Convolution (CDGC) process. (a) Sampling region of a central node (black) and its neighboring nodes (orange), where directional differences are computed to capture local motion gradients. (b) Update process of the central node and its neighbors via directional difference aggregation, enhancing sensitivity to fine-grained joint movements and local spatial structure

In this work, CDGC processes neighborhood information generated by sparse self-attention to produce the output vector $H^{(l + 1)}$ , as formulated in Eq. (8).

H_{t}^{(l + 1)} = σ (({\hat{A}}_{t}^{(l)} {\tilde{H}}_{t}^{(l)} - α \cdot {\overset{⌣}{A}}_{t}^{(l)} ⊙ {\tilde{H}}_{t}^{(l)}), W^{(l)})

where

{\hat{A}}_{t}^{(l)}

is the normalized adjacency matrix at moment

t

{\overset{⌣}{A}}_{t}^{(l)} = \bar{A} ⊙ 1

\bar{A}

is the

N \times 1

vector obtained by summing the second dimension of the adjacency matrix

A

, and

1

is the

1 \times C

dimension vector. The hyperparameter

α \in [0,1]

balances contributions between node and gradient information.

Multi-scale separable temporal convolution module

To capture the temporal characteristics of the human skeleton, this study introduces MSSTC module, depicted in Fig. 4. The MSSTC module includes three branches, each with varying convolution kernel sizes and dilation rates, and the dashed part of the dotted line represents the branch in which the Hourglass Shaped Separable Convolution (HSSC) structure is used. The first two branches first obtain the convolutional feature map $T_{s}$ by HSSC, followed by new feature Maps by expansion convolution operations with expansion rates of 1 and 2, respectively, with different expansion rates controlling the range of the convolution's operation on the time series data. The third branch then extracts the most Significant feature information of the input vectors by 3× 1 Max pooling and 1× 1 convolution operation.

[See PDF for image]

Fig. 4

MSSTC module. The module employs an Hourglass-Shaped Separable Convolution (HSSC) structure combined with dilated convolution to capture temporal features, ultimately outputting multi-scale temporal representations through weighted fusion

The output $H_{t}^{(l + 1)}$ of the SACD module is used as the input of the MSSTC module, and the output convolutional feature map $T_{s}$ is computed by separable convolution as shown in Eq. (9).

T_{s} = C o n v (σ, (H_{t}^{(l + 1)})) + R e s i d u a l (H_{t}^{(l + 1)})

where

Conv

represents two layers of deep convolution and one layer of point convolution operations,

Residual

is residual connection.

The output after multi-scale separable time convolution processing is $T_{o}$ as shown in Eq. (10).

T_{o} = σ (\sum_{i = 1}^{3}, {Branch}_{i}, (C o n v 2 d, (T_{s})) + R e s i d u a l (T_{s})

where Conv2d is temporal convolution, which extracts features by applying a convolution kernel in both dimensions of the input skeletal image,

{Branch}_{i}

is the output of branch

i

, and

Residual

is the residual join.

Unlike traditional convolution, separable temporal convolution combines deep and point-wise convolution to reduce computational effort while maintaining high performance. The model adopts an hourglass type separable convolution structure, where deep convolution is used first so that each input channel is only connected to the corresponding output channel to focus on local features, point-by-point convolution integrates the channel information, and deep convolution is used again to further capture the global dependency, thereby lowering model complexity while preserving spatial information.

The proposed module adopts a three-branch architecture, where the large-dilation-rate branch (dilation = 3) captures long-term temporal dependencies, expanding the receptive field to mitigate the locality constraints of standard convolutions; the medium-dilation-rate branch (dilation = 2) models intermediate-range temporal relationships, balancing local detail preservation with global context aggregation; and the non-dilated branch (dilation = 1) extracts short-term motion features, enabling precise localization of transient dynamics.

Spatial temporal joint focus module

Attention-based methods for skeleton movement recognition often calculate attention scores separately for each channel or spatial dimension, while other dimensions rely on global averaging for processing. For example, in the spatial dimension, if one focuses on the motion trajectory of a particular joint, the attention score for that joint at all time steps may be computed, while for other joints, it may be processed using global averaging. This study introduces the STJF module to identify the most significant joint points across the entire skeleton sequence. Figure 5 provides a detailed visual representation of the STJF module’s dual-path mechanism. The temporal compression path captures global motion patterns across frames, allowing the model to focus on significant temporal events. Simultaneously, the spatial compression path highlights key joint activations relevant to the action. These two attention pathways are then fused through an adaptive weighting mechanism, ensuring precise spatial–temporal feature alignment for robust action recognition.

[See PDF for image]

Fig. 5

Architecture of the Spatial–Temporal Joint Focus (STJF) module. The left path compresses temporal information by aggregating frame-wise dynamics, while the right path compresses spatial information across joints. Both compressed representations are then fused through attention scoring to highlight critical joint-frame interactions, ensuring fine-grained spatial–temporal feature alignment

Firstly, the output $T_{o}$ of MSSTC module is average pooled at frame level and joint level respectively, and then the pooled feature vectors are spliced together to obtain $f_{inner}$ by information compression through the fully connected layer as shown in Eq. (11).

f_{inner} = θ (({AvgPool}_{t} (T_{o}) \oplus {AvgPool}_{v} (T_{o})) \cdot W)

where

{AvgPool}_{t}

and

{AvgPool}_{v}

are the average pooling operations at the frame level and joint level, respectively,

\oplus

denotes matrix addition,

W \in R^{C \times \frac{C}{r}}

is the learnable parameter matrix (parameter

r

is used for compression of features), and

θ

is the

HardSwish

activation function.

Next, two separate FC layers are used to derive frame-level and joint-level attention scores. The attention fraction matrix of frame and joint level is calculated to obtain the attention fraction $f_{out}$ of the entire action sequence, as shown in Eq. (12).

f_{out} = f_{in} ⊙ (σ (f_{inner} \cdot W_{t}) ⊙ σ (f_{inner} \cdot W_{v}))

where ⊙ denotes the element-by-element multiplication of the broadcast,

{pool}_{t}

and

{pool}_{v}

are the average pooling operations on the frame level and joint level, respectively,

W_{t} \in R^{C \times \frac{C}{r}}

and

W_{v} \in R^{C \times \frac{C}{r}}

are the learnable parameter matrices (the parameter

r

is used for compression of the features).

Training loss

The loss function adopted in this study is designed based on the learning objective of the Information Bottleneck (IB) principle. The IB principle provides a theoretical framework for learning compressed yet information-rich representations of data. Skeletal sequences inherently exhibit high dimensionality and noise due to sensor errors, occlusions, or irrelevant joint movements. The IB loss addresses these challenges by suppressing non-discriminative joints and compressing task-irrelevant spatial details to learn viewpoint-invariant representations. The proposed loss function combines two variants of IB: the Variational Information Bottleneck (VIB) (Zaidi and Aguerri 2020) and the Conditional Entropy Bottleneck (CEB) (Fischer 2020). Specifically, VIB is used to minimize the mutual information between the latent representation $Z$ and the input $X$ , which helps suppress redundant information and prevent overfitting. In parallel, CEB encourages the preservation of information relevant to the output $Y$ , thus improving the separability of features across classes. The combined loss formulation enhances both generalization and robustness, especially in complex scenarios with noisy or redundant skeleton features.

Thus, the IB objective aims to design a stochastic latent variable $Z$ , which contains compressed information about the input variable $X$ (Skeleton Sequences) while retaining maximum information about the target variable $Y$ (Action Labels). Assuming the relationships among variables follow the graphical model $Z \leftarrow X \leftrightarrow Y$ , the accessible content is the stochastic encoder $p (z ; x)$ , which maximizes the IB objective $R (Z)$ as shown in Eq. (13).

R (Z) = L (Z ; Y) - γ_{1} L (Z ; X) - γ_{2} L (Z ; X | Y)

where

γ_{1}

and

γ_{2}

are control parameters, and

L (;)

denotes mutual information.

In Eq. (13), the first term $L (Z ; Y)$ forces $Z$ to provide sufficient information to predict $Y$ . The second term ensures $Z$ remains concise while retaining information about the input $X$ . The third term compresses the latent variable $Z$ conditioned on the class labels.

This study employs tractable variational bounds to estimate the mutual information in $R (Z)$ . First, a variational classifier $q (y | z)$ is utilized to derive the variational lower bound of $L (Z ; Y)$ . Then, following (Fischer 2020), the paper define $r (z)$ and $r (z | y)$ as the variational marginal and conditional marginal distributions, respectively, to obtain the variational upper bounds of $L (Z ; X)$ and $L (Z ; X | Y)$ . By transforming Eq. (13), wderive Eq. (14) as follows:

R (Z) \geq E_{p (x) p (z | x)} [l, o, g, q, (y | z)] - γ_{1} E_{p (x) p (z | x)} [l, o, g, \frac{p (z | x)}{r (z)}] - γ_{2} E_{p (x) p (z | x) p (y | x)} [l, o, g, \frac{p (z | x)}{r (z | y)}]

Formula (14) can be approximated by the empirical loss of the prediction network combining the encoder and the classifier, and the resulting loss function is shown in Eq. (15).

L_{TOTAL} = - \frac{1}{|D|} \sum_{x_{i}, y_{i} \in D} E_{p (Z | X_{i})} [l, o, g, q, (y_{i}, | z)] + γ_{1} {∣ ∣ μ_{p (z)} - μ_{r (z)} ∣ ∣}_{2}^{2} + γ_{2} \sum_{y} {∣ ∣ μ_{p (z | y)} - μ_{r (z | y)} ∣ ∣}_{2}^{2}

where

D = {(x_{i}, y_{i})}

is the given dataset; the first term uses a variational classifier

q (y | z)

to compute the variational lower bound; the second and the third terms are marginal Maximum Mean Discrepancy (MMD) loss,

μ_{p (z)} = \frac{1}{|D|} \sum_{x_{i}, y_{i} \in D} E_{p (z |, x_{i})} [z]

is the mean of the stochastic encoder’s latent distribution,

μ_{r (z)}

is the mean of the variational marginal distribution

r (z)

μ_{p (z | y)} = \frac{1}{|D_{y}|} \sum_{x_{i}, y_{i} \in D} E_{p (z |, x_{i})} [z]

, where

D_{y} = {(x_{i}, y_{i} | y_{i} = y)}

μ_{r (z | y)}

is the mean of the variational conditional marginal distribution

r (z | y)

To improve mathematical clarity, the symbols used in Sect. 3.2 are summarized in Table 1 for quick reference.

Table 1. Notations used in Sect. 3.2

Symbol	Definition
$G = (V, E)$	Skeleton graph, where $V$ is the set of joints (nodes) and $E$ is the set of bones (edges)
$A \in R^{N \times N}$	Original adjacency matrix of the skeleton graph (physical bone connections)
$N$	Number of joints per skeleton
$T$	Number of frames in a skeleton sequence
$C$	Feature dimensions
$D$	The dimension of the input vector
$D^{'}$	Dimension of hidden features after projection
$X_{t} \in R^{T \times N \times C}$	Input skeleton feature matrix, with $T$ frames, $N$ joints, and $C$ -dimensional features
$H_{t}^{(l)}$	Hidden representation at the $l$ -th graph convolutional layer
$H_{t}^{(0)}$	Initial hidden state after linear projection and positional embedding
$PE$	Positional embedding vector encoding temporal and joint position information
$W_{K}^{(l)}, W_{Q}^{(l)} \in R^{D \times D'}$	Learnable matrix
$M_{t}^{' (l)}$	Initial self-attention map
$M_{t}^{(l)}$	The weighted attention score map
$W_{T}^{(l)}$	The learnable temporal weight matrix
$τ^{(l)}$	The threshold for the current input vector is computed
${\tilde{M}}_{t}^{(l)}$	the sparse self-attention score of the pose vector
$A_{m} \in R^{N \times N}$	The shared topology graph
$I \in R^{N \times N}$	Dentity matrix (self-connections)
$\hat{A} = D^{- \frac{1}{2}} (A + I) D^{- \frac{1}{2}}$	Normalized adjacency matrix, where D is the degree matrix of A + I
$\overset{⌣}{A}$	Normalized adjacency matrix, ${\overset{⌣}{A}}_{t}^{(l)} = \bar{A} ⊙ 1$
$\bar{A}$	$N \times 1$ vector obtained by summing the second dimension of the adjacency matrix $A$
${\tilde{H}}_{t}^{(l)}$	The transformed input pose vector
$W^{(l)} \in R^{D^{(l)} \times D^{(l + 1)}}$	The learnable parameter matrix at layer $l$
$α \in [0,1]$	The hyperparameter balances contributions between node and gradient information
$T_{s}$	The output convolutional feature map is computed by separable convolution
$T_{o}$	The output after multi-scale separable time convolution processing
${AvgPool}_{t}$	The average pooling operations at the frame level
${AvgPool}_{v}$	The average pooling operations at the joint level
$W \in R^{C \times \frac{C}{r}}$	The learnable parameter matrix (parameter $r$ is used for compression of features)
$W_{t}, W_{v} \in R^{C \times \frac{C}{r}}$	The learnable parameter matrices
$Z$	Latent variable
$X$	Input variable (Skeleton Sequences)
$Y$	Target variable (Action Labels)
$p (z ; x)$	The stochastic encoder
$μ_{p (z)}$	The stochastic encoder’s latent distribution
$μ_{r (z)}$	The variational marginal distribution $r (z)$

Experiments

To validate the effectiveness of the proposed model MSCDGCN, experiments were carried out on two large-scale datasets alongside baseline methods, with ablation studies conducted to assess the impact of individual components.

Datasets

The NTU RGB + D 60 (Shahroudy et al. 2016) dataset contains 56,880 skeletal sequences performed by 40 different participants and 60 different movement categories. The dataset was captured from multiple angles by three cameras, ensuring diverse and comprehensive motion data. In the dataset, samples of skeletal motion sequences were captured by the Kinect depth sensor. Specifically, the sensors extract three-dimensional (3D) positional information about the human skeleton from each frame of the image, and each sample contains up to two tracked objects. The dataset uses two evaluation segmentation criteria to test and validate the performance of the human action recognition model:

Cross-Subject (X-sub): The dataset is divided into a training set and a validation set. For 40 subjects' movement categories, the training set includes 40,320 samples from 20 performers, while the test set contains 16,560 samples from the remaining 20 performers. This setup evaluates the model's generalization across different performers.
Cross-View (X-view): Two camera views are used for training, and the third is reserved for validation. The training set includes 37,920 action samples captured at 0° and 45° angles, while the validation set comprises 18,960 samples from the −45° camera angle.

NTU RGB + D 120 (Liu et al. 2019) expands on NTU RGB + D 60 by adding 57,367 new motion samples. The dataset includes 114,480 skeletal movement samples across 120 action categories, performed by 106 subjects, and recorded from multiple angles using three cameras. The dataset provides two evaluation criteria:

Cross-Subject (C-sub): The actions performed by 106 subjects were split into a training set and a validation set, with 63,026 samples from 53 subjects used for training and 50,922 samples from the remaining 53 subjects reserved for testing.
Cross-Setup (C-set): The training set and test set are distinguished based on the numbering of the cameras. The 54,471 skeleton movement sequence samples collected from even-numbered cameras are designated as the training set, and the 59,477 skeleton movement sequence samples from odd-numbered cameras constitute the test set. The dataset visualization is shown in Fig. 6.

[See PDF for image]

Fig. 6

dataset visualization presentation

Kinetics (Carreira and Zisserman 2017) is a large-scale human action recognition dataset consisting of approximately 300,000 video clips across 400 categories, providing only RGB data. Reference (Yan et al. 2018) estimated the 2D coordinates and confidence scores of 18 body joints to create the Kinetics-Skeleton dataset. This version functions as one of the evaluation benchmarks presented in this paper.The dataset is split into 240,000 clips for training and 20,000 clips for validation.

Results

All experiments in this work are performed using the Pytorch framework on NVIDIA GeForce RTX 3060 GPUs. For NTU RGB + D 60 & 120, preprocessing follows the method in (Zhang et al. 2020), with skeleton spine alignment achieved through view-invariant transformations. Each skeleton consists of 25 joints, and the datasets include 60 and 120 action categories, respectively. To infer latent vectors $Z$ , the model employs three fully connected (FC) layers. The first FC layer transforms pooled features from encoded blocks into latent dimension vectors, while the MeanFC and CovFC layers model the conditional latent distribution by generating the mean and covariance of a multivariate Gaussian distribution. Training spans 110 epochs, with a 5-epoch preheating phase (Chen et al. 2021a, b). The learning rate is set to 0.01, with a weight decay of 5 × 10⁻⁴ and loss coefficients $γ_{1}$ = 1 × 10–4 and $γ_{2}$ = 1 × 10–1.

This research adopts a multi-modal fusion approach for experimental validation. The model is trained using both individual and combined modal representations, leveraging joint relative positions to enrich complementary features. Four modalities are utilized for evaluation: "joint flow" (J) representing joint coordinates, "bone flow" (B) capturing coordinate differences between connected joints, "joint motion" (J-M) reflecting coordinate differences across consecutive frames, and "bone motion" (B-M) indicating bone differences between adjacent frames. The integration of all four modalities is referred to as 4-stream (4S). In line with established protocols, the study evaluates the MSCDGCN using J, B, J + B, and 4S configurations.

The training results of MSCDGCN were compared with the state-of-the-art methods SGCN, Shift-GCN, DC-GCN + ADG, Dynamic GCN, MS-G3D, MST-GCN, 2S-AGCN, CTR-GCN, InfoGCN were compared and the results of these four integrations are reported in Tables 2 and 3. * denotes the results obtained in this paper by retraining the model using its officially released code. To ensure fair and rigorous evaluation, all baselines were retrained under the same experimental setup as MSCDGCN.

Table 2. Experimental results on NTU RGB + D 60

Dataset	NTU RGB + D 60
Setting	X-Sub				X-View
Method/Modality	J	B	J + B	4S	J	B	J + B	4S
SGCN (Zhang et al. 2020)	-	-	89.0	-	-	-	94.5	-
Shift-GCN (Cheng et al. 2020a, b)	87.8	-	89.7	90.7	95.1	-	96.0	96.5
DC-GCN + ADG (Cheng et al. 2020a, b)	-	-	90.8	-	-	-	96.6	-
Dynamic GCN (Ye et al. 2020)	-	-	-	91.5	-	-	-	96.0
MS-G3D (Liu et al. 2020)	89.4	90.1	91.5	-	95.0	95.3	96.2	-
MST-GCN (Chen et al. 2021a, b)	89.0	89.5	91.1	91.5	95.1	95.2	96.4	96.6
2S-AGCN (Shi et al. 2019)	-	-	88.5	-	93.7	93.2	95.1	-
CTR-GCN (Chen et al. 2021a, b)	-	-	-	92.4	-	-	-	96.8
Action-Embed-TR (Ahmad et al. 2023)	-	-	-	91.5	-	-	-	96.8
InfoGCN (Chi et al. 2022)	89.8	90.6	91.6	92.7	95.2	95.5	96.5	96.9
InfoGCN*	89.4	90.6	91.3	92.3	95.2	95.4	96.2	96.7
MSCDGCN	90.3	91.5	92.3	92.8	95.3	95.6	96.6	96.8

Table 3. Experimental results on NTU RGB + D 120

Dataset	NTU RGB + D 120
Setting	C-Sub				C-Set
Method/Modality	J	B	J + B	4S	J	B	J + B	4S
SGCN (Zhang et al. 2020)	-	-	79.2	-	-	-	81.5	-
Shift-GCN (Cheng et al. 2020a, b)	80.9	-	85.3	85.9	83.2	-	86.6	87.6
DC-GCN + ADG (Cheng et al. 2020a, b)	-	-	86.5	-	-	-	88.1	-
Dynamic GCN (Ye et al. 2020)	-	-	-	87.3	-	-	-	88.6
MS-G3D (Liu et al. 2020)	-	-	86.9	-	-	-	88.4	-
MST-GCN (Chen et al. 2021a, b)	82.8	84.8	87.0	87.5	84.5	86.3	88.3	88.8
2S-AGCN (Shi et al. 2019)	-	-	-	-	-	-	-	-
CTR-GCN (Chen et al. 2021a, b)	-	85.7	88.7	88.9	-	87.5	90.1	90.6
Action-Embed-TR (Ahmad et al. 2023)	-	-	-	87.7	-	-	-	88.5
InfoGCN (Chi et al. 2022)	85.1	87.3	88.5	89.4	86.3	88.5	89.7	90.7
InfoGCN*	84.2	86.9	88.2	89.2	86.3	88.5	89.4	90.7
MSCDGCN	85.7	88.0	89.0	89.5	86.8	89.4	90.6	91.0

Table 2 provides a detailed comparison of action recognition results on the NTU RGB + D 60. The proposed MSCDGCN model demonstrates superior accuracy across multiple modalities compared to other methods. In the X-Sub evaluation, using the J modality, the MSCDGCN model achieves a 0.9% accuracy improvement in action recognition over the top-performing models, MS-G3D and InfoGCN, on the same modality. On the B modality, MSCDGCN also achieves significant outperformance over models such as MS-G3D. On the J + B modality, the accuracy improvement of MSCDGCN in recognizing actions is even more Significant, ranging from 0.8% to 4%. In the 4S modality, the accuracy of MSCDGCN recognizing actions is also steadily improved compared to the comparison methods. In X-View evaluation, the model MSCDGCN also shows strong generalization ability. Although the accuracy of MSCDGCN in recognizing actions is slightly lower than that of model ST-TR-GCN in the J modality, it achieves to outperform models such as ST-TR-GCN, Shift-GCN, and MS-G3D in all other modalities. The improvement of MSCDGCN accuracy is especially obvious in J + B modality and 4S modality. On the most comprehensively integrated 4S modality, the accuracy of model MSCDGCN in recognizing actions is 0.3%, 0.8%, 0.2%, and 0.1% higher than that of models Shift-GCN, Dynamic GCN, MST-GCN, and InfoGCN, respectively, which further proves the superiority of this paper's model in dealing with complex action recognition tasks.

Table 3 presents the comparison results on the NTU RGB + D 120 dataset. The proposed MSCDGCN model achieves outstanding performance in both C-Sub and C-Set evaluations. The specific performance in C-Sub evaluation is as follows: on J modality, the accuracy of model MSCDGCN in recognizing actions Significantly outperforms that of models ST-TR-GCN, Shift-GCN, MST-GCN, and InfoGCN, with an improvement of 3.0%, 4.8%, 2.9%, and 1.5%, respectively. On the B modality, MSCDGCN recognizes actions with the same better accuracy than the models MST-GCN, CTR-GCN and InfoGCN, with accuracy improvements of 3.2%, 2.3% and 1.1%, respectively. In the J + B modality, MSCDGCN has an outstanding performance advantage, with a Significant improvement in accuracy compared to models such as SGCN, CTR-GCN, and InfoGCN. On the 4S modality, MSCDGCN recognizes actions with 3.6%, 2.2%, 2.0%, 0.6%, and 0.3% higher accuracy than the models Shift-GCN, Dynamic GCN, MST-GCN, CTR-GCN, and InfoGCN, respectively. The model MSCDGCN also Maintains excellent performance in the C-Set evaluation. Specifically, on the J modality, MSCDGCN accuracy is about 0.5% to 3.6% higher than models such as ST-TR-GCN; on the B modality, it is about 0.9% to 3.1% higher than models such as MST-GCN; on the J + B modality, the improvement in accuracy is especially Significant, which is about 0.5% to 9.1% higher than models such as SGCN; and on the 4S modality, the MSCDGCN recognizes action The accuracy of MSCDGCN in recognizing actions in 4S modalities is also 3.4%, 2.4%, 2.2%, 0.4%, and 0.3% higher than the models Shift-GCN, Dynamic GCN, MST-GCN, CTR-GCN, and InfoGCN, respectively. In summary, the MSCDGCN achieves significantly higher action recognition accuracy across all four modalities in both C-Sub and C-Set evaluations, demonstrating its effectiveness and advanced design.

Although some of the accuracy improvements reported in Tables 2 and 3 are relatively small (e.g., below 1%), these gains have been confirmed to be statistically significant through paired t-tests, as detailed in Sect. 4.5. This ensures that the observed improvements are not due to random variation and reflect meaningful performance advantages. In addition, more substantial gains (2–3%) are observed in fine-grained action pairs (see Tables 5 and 7), further demonstrating the practical value of the proposed framework.

To access model efficiency, this study analyzes accuracy and complexity on X-sub of the NTU RGB + D 60, with results summarized in Table 4. As shown in the table, compared to the initial GCN baseline (SGCN), the proposed model achieves a 3.8% accuracy improvement while reducing FLOPs by 3.06 times, despite a 2.5-fold increase in parameters. Although MSCDGCN has more parameters than SGN, CTR-GCN, and InfoGCN, it demonstrates faster computation, with FLOPs reduced by 3.06, 1.10, and 1.04 times, respectively.

Table 4. Accuracy (%), FLOPs (G) and number of parameters (M) on X-sub

Method	ACC	FLOPs(G)	PARAMS(M)
SGCN (Zhang et al. 2020)	89.0	5.48	0.69
Shift-GCN (Cheng et al. 2020a, b)	90.7	10.0	2.76
DC-GCN + ADG (Cheng et al. 2020a, b)	90.8	1.83	4.96
Dynamic GCN (Ye et al. 2020)	91.5	1.99	14.40
MS-G3D (Liu et al. 2020)	91.5	5.22	2.80
MST-GCN (Chen et al. 2021a, b)	91.5	1.82	12.00
2S-AGCN (Shi et al. 2019)	88.5	37.32	6.94
CTR-GCN (Chen et al. 2021a, b)	92.4	1.97	1.50
Action-Embed-TR (Ahmad et al. 2023)	91.5	18.3	-
InfoGCN (Chi et al. 2022)	92.3	1.84	1.60
MSCDGCN	92.8	1.79	1.70

In addition, this paper included the recent transformer-only framework Action-Embed-TR in comparison. However, it’s purely attention-based design comes with increased computational overhead and does not explicitly utilize the skeletal topology. MSCDGCN outperforms this transformer-only baseline by 1.3% on NTU60 and 1.8% on NTU120, highlighting the benefits of integrating graph convolution with attention. These results indicate that combining structural priors with global temporal reasoning provides a more balanced and effective solution for skeleton-based action recognition.

The experimental results on two widely used action recognition datasets show that MSCDGCN's action recognition accuracy rates in most of the four modalities significantly outperform those of the other comparative methods, which proves the validity and sophistication of MSCDGCN.

Module performance analysis

MSCDGCN achieves significant performance improvements on skeleton-based action recognition tasks by systematically addressing key limitations of existing methods, including directional insensitivity, computational inefficiency, and suboptimal spatio-temporal fusion. In this section, InfoGCN serves as a benchmark to demonstrate the advantages of MSCDGCN. The InfoGCN baseline is a powerful model that has integrated graph convolution and self-attention mechanisms.

Traditional GCNs apply equal aggregation over all neighboring nodes, limiting the model's ability to capture direction-aware motion patterns. Although InfoGCN incorporates self-attention, it tends to over-focus on irrelevant joints in complex scenes. To address these issues, MSCDGCN introduces the Self-Attention Central Difference (SACD) module, which uses adaptive thresholds to implement sparse self-attention and dynamically filter key joints, thereby reducing interference from non-informative nodes. Moreover, it incorporates central difference graph convolution to compute directional gradients between joints, enhancing local motion sensitivity. As shown in Table 5, adding SACD to InfoGCN improves accuracy to 92.6%, reduces FLOPs to 1.81G, and slightly reduces parameters.

Table 5. Compares the accuracy, FLOPs and PARAMS of InfoGCN with those after adding each module of this model (NTU RGB + D 60 X-Sub)

Method	ACC(%)	FLOPs (G)	PARAMS(M)
InfoGCN (Baseline)	92.3	1.84	1.60
InfoGCN + SACD	92.6	1.81	1.58
InfoGCN + MSSTC	92.3	1.80	1.58
InfoGCN + STJF	92.6	1.84	1.62

Standard temporal convolution suffers from large parameter counts and limited long-range modeling capability. MSCDGCN adopts HSSC structure to preserve feature extraction capacity while significantly reducing computation. In addition, it designs a multi-branch dilated convolution setup to capture multi-scale temporal dependencies via differentiated dilation rates. As indicated in Table 5, integrating the MSSTC module reduces FLOPs by 0.04G and parameters by 0.02 M, confirming its role in enhancing computational efficiency.

Most existing methods compute spatial and temporal attention separately, leading to a loss of interactive information across dimensions. MSCDGCN addresses this through the STJF module, which synchronously optimizes attention across both joints and frames, mitigating information fragmentation. Although adding STJF increases parameter count slightly, accuracy improves to 92.6%.

By combining SACD and MSSTC, the complete MSCDGCN model further boosts accuracy to 92.8%, demonstrating the synergistic effect of its components. Notably, separable convolution combined with sparse SACD computation achieves a 2.7% reduction in FLOPs compared to the baseline. The directional awareness from SACD and coordinated spatio-temporal optimization by STJF together yield a 0.5% gain in accuracy. While the parameter count is marginally higher than the baseline, it remains significantly lower than that of multi-stream architectures.

In Eq. (8), the hyperparameter $α$ controls the contribution of directional gradient information in the CDGC. To assess the model's sensitivity to $α$ , experiments were conducted under the on the NTU RGB + D 60 X-Sub with varying $α$ values. The results are presented in Table 6. When $α$ =0 (i.e., CDGC degenerates into a standard graph convolution), an accuracy of 92.3% is achieved under the X-Sub setting. As $α$ increases, the recognition performance gradually improves, reaching a peak accuracy of 92.8% at $α$ =0.3. These results demonstrate that a moderate incorporation of spatial gradient cues can effectively enhance action recognition. Based on this analysis, $α$ =0.3 is adopted as the default value for all subsequent experiments.

Table 6. Impact of different α values on the accuracy of CDGC (NTU RGB + D 60 X-Sub)

α	0	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	1.0
Cross-Subject Accuracy (%)	92.3	92.4	93.6	92.8	92.7	92.7	92.6	92.5	92.5	92.3	92.1

As shown in Table 7, CDGC significantly outperforms CTR-GCN in distinguishing fine-grained action pairs such as "putting on a shoe" vs. "taking off a shoe," with an average improvement of 3.0% in accuracy. This performance gain is primarily attributed to the CDGC’s ability to model dynamic motion gradients, which effectively capture subtle directional reversals inherent in fine-grained action categories.

Table 7. Comparison of MSCDGCN and CTR-GCN on Fine-Grained Action Pairs (NTU RGB + D 60 X-Sub)

Method	Put on glasses VS Take off glasses (ACC%)	Put on a shoe VS Take off a shoe (ACC%)
CTR-GCN	85.3 vs. 83.7	89.1 vs. 87.5
MSCDGCN	88.1 vs. 86.9	92.0 vs. 90.6

Method

Put on glasses VS

Take off glasses (ACC%)

Put on a shoe VS

Take off a shoe (ACC%)

CTR-GCN

85.3 vs. 83.7

89.1 vs. 87.5

MSCDGCN

88.1 vs. 86.9

92.0 vs. 90.6

Ablation studies

To assess the contribution of each module, ablation experiments are conducted. Table 8 shows the accuracy results under the C-sub on NTU RGB + D 120 after removing specific modules: NO_SACD (removing the SACD module), NO_MSSTC (removing the MSSTC module), and NO_STJF (removing the STJF module). Comparing the complete model with these variations helps analyze the impact of each module on overall performance.

Table 8. Accuracy of each module under C-sub on NTU RGB + D 120

Methods	C-Sub
Methods	ACC(%)	FLOPs (G)	PARAMS(M)	Inference Speed
NO_SACD	89.2	1.84	1.74	19.8 ms(51 ps)
NO_MSSTC	89.5	1.95	1.75	20.7 ms(48 ps)
NO_STJF	89.4	1.79	1.69	19.1 ms(52 ps)
MSCDGCN	89.5	1.79	1.70	19.1 ms(52 ps)

According to the results in Table 8, removing the SACD module leads to a 0.3% drop in recognition accuracy (from 89.5% to 89.2%) due to the loss of directional information, which weakens the model's ability to distinguish fine-grained actions. This finding highlights the positive contribution of SACD in capturing directional cues to enhance recognition performance. Additionally, the sparse self-attention mechanism in SACD effectively suppresses non-informative nodes, improving computational efficiency. When SACD is removed, FLOPs increase by 3%, and inference time rises by 0.7 ms. Further analysis is provided in Fig. 7, which visualizes the topology inferred by the SACD module. It can be observed that similar poses exhibit different topological structures depending on the context. By introducing temporal weighting, SACD is able to focus on the most critical frames, thereby enabling the model to better distinguish subtle behavior patterns. Moreover, since the flow of information between joints may vary with direction, the central difference graph convolution in SACD enhances directional sensitivity, effectively highlighting the topological relationships between key joints. These spatial features provide a solid foundation for subsequent spatio-temporal joint modeling. In summary, the SACD module plays a crucial role in both joint information propagation and topology inference, making it an indispensable component of the proposed CDMST architecture.

[See PDF for image]

Fig. 7

Divergent Topological Structures for Similar Actions. This figure compares the context-dependent intrinsic topologies inferred by the SACD module for two visually similar actions: put off a hat and salute. Colored Lines represent the inferred topological connections between a specified joint (e.g., hand or foot) and all other joints. Black-Dashed Bounding Boxes highlight regions where similar poses exhibit distinct intrinsic topologies due to contextual differences. Black Arrows denote critical motion directions of primary joints, revealing how directional features drive topological adaptation

As shown in Table 8, removing the MSSTC module does not affect the model’s recognition accuracy, which remains consistent with that of the complete CDMST model. This indicates that the MSSTC module does not directly contribute to performance improvement in terms of accuracy. However, its removal results in an increase of 0.05 M in parameter count, an 8% rise in FLOPs, and a notable increase in inference time to 20.7 ms. These results suggest that the primary contribution of the MSSTC module lies in enhancing computational efficiency rather than improving recognition accuracy. The MSSTC module adopts a Hourglass-Shaped Separable Convolution (HSSC) architecture, designed to significantly reduce FLOPs while preserving feature extraction capacity. To further validate the efficacy of MSSTC in optimizing computational costs, this article conducts a comparative analysis of three distinct types of separable convolutional structures: SEP, BOTTLE (He et al. 2016), and HSSC. As illustrated in Fig. 8, these architectures differ in design and computational efficiency. In Table 9, “MSS-TC (BOTTLE)” refers to the use of a BOTTLE-style separable convolution, while “MSS-TC (SEP)” and “MSS-TC (HSSC)” represent SEP and HSSC-based structures, respectively. The experimental results show that, under the same parameter constraints, the HSSC structure achieves the lowest FLOPs at only 8.12G. By leveraging 1 × 1 convolutions for dimensionality reduction and expansion, HSSC retains essential features while significantly reducing unnecessary computational overhead. In addition, the MSSTC module integrates separable convolutions with dilated convolutions, enhancing the model’s ability to capture long-term temporal dependencies. In summary, although MSSTC has minimal impact on recognition accuracy, it plays a critical role in reducing computational cost through structural optimization, thereby improving the overall efficiency of the model. Thus, MSSTC is essential for achieving a balanced trade-off between performance and computational overhead.

[See PDF for image]

Fig. 8

Comparison of three separable convolution designs used in the MSSTC module: BOTTLE, SEP, and HSSC. HSSC achieves the lowest FLOPs while maintaining accuracy, validating its efficiency in temporal modeling

Table 9. Number of parameters (× 10⁶) and FLOPs (× 10⁹) for different separable convolutional structures in the MSSTC module

Methods	Param.(× 10⁶)	FLOPs(G)
MSSTC (BOTTLE)	0.13	8.66
MSSTC (SEP)	0.14	8.19
MSSTC (HSSC)	0.13	8.12

As shown in Table 8, removing the STJF module causes the model's accuracy to drop to 89.4%, while having no Significant impact on computational cost. This indicates that STJF plays a positive role in improving recognition performance. Although the MSSTC module in CDMST does not directly contribute to accuracy when removed, it Significantly reduces parameter count and computational load, thereby improving model efficiency. Moreover, the SACD module enhances the learning of spatial relationships among joints and integrates effectively with the temporal features captured by MSSTC and STJF. The STJF module, in particular, enables the model to jointly attend to both spatial and temporal dimensions. When STJF is removed individually, the model experiences a 0.1% accuracy drop, confirming its effectiveness in capturing spatiotemporal dependencies. When the SACD, MSSTC, and STJF modules are integrated, the proposed CDMST model is able to focus on the most informative cues while suppressing irrelevant Signals. This allows the model to Maintain robust performance during training and ultimately achieve a high accuracy of 89.5%. These results strongly validate the effectiveness and superiority of the CDMST design.

According to Tables 8 and 10, removing the SACD module alone results in an accuracy of 89.2%, while removing MSSTC alone yields 89.5%. However, when both SACD and MSSTC are removed Simultaneously, the accuracy drops to 89.0%. This means that the accuracy decreases by 0.2% compared to removing SACD alone, and by 0.5% compared to removing MSSTC alone. These results suggest that the STJF module helps compensate for the accuracy loss when SACD or MSSTC is removed.

Table 10. Model recognition performance under extreme conditions with two modules removed (NTU RGB + D 120 C-sub)

Method	ACC(%)
NO_SACD + NO_MSSTC	89.0%
NO_SACD + NO_STJF	88.5%
NO_MSSTC + NO_STJF	89.1%

When both SACD and STJF modules are removed, the accuracy drops to 88.5%, compared to 89.2% when only SACD is removed and 89.4% when only STJF is removed. The combined removal leads to a performance degradation of 0.7% and 0.9%, respectively, indicating a strong synergistic effect between SACD and STJF. Since STJF relies on the spatial representations provided by SACD for effective spatio-temporal fusion, removing both modules severely impairs this fusion process, resulting in a significant accuracy decline.

When MSSTC and STJF are removed together, the accuracy drops to 89.1%, whereas removing MSSTC or STJF individually results in 89.5% and 89.4%, respectively. The combined removal leads to smaller drops of 0.4% and 0.3%, suggesting a weaker synergy between these two modules. This may be attributed to the fact that the remaining SACD module can still provide directional and structural information to partially offset the loss of temporal modeling and fusion.

These results reveal a strong synergy between SACD and STJF. SACD provides enriched directional spatial representations, which serve as critical inputs for the joint spatial–temporal attention performed in STJF. Without SACD, the STJF module loses key spatial cues, weakening the overall spatiotemporal modeling. This interdependence highlights the necessity of both modules for optimal performance.

In summary, SACD and STJF work collaboratively to enhance the extraction and integration of spatio-temporal features, while MSSTC primarily optimizes computational efficiency in the temporal dimension. Therefore, removing both SACD and STJF leads to significant performance degradation due to impaired spatio-temporal representation. In contrast, when MSSTC and STJF are removed, the SACD module still contributes directional cues that help retain acceptable performance.

As shown in Fig. 9, several representative actions are selected from the NTU60 dataset. In Fig. 9a, under the X-Sub protocol, the actions “reading” and “writing” achieve accuracies of 80% and 83%, respectively. Both actions involve fine-grained hand movements, indicating that the proposed model effectively improves recognition performance for hand-centric actions. Additionally, the actions “wiping face” and “making a phone call” achieve 85% and 89% accuracy, respectively. These actions feature prominent hand–head interactions, and the results demonstrate the model's ability to better capture and recognize such spatial interactions. As shown in Fig. 9b, similar performance is observed under the X-View protocol, confirming that the proposed model consistently enhances the recognition accuracy of fine-grained actions across different viewpoints.

[See PDF for image]

Fig. 9

Confusion matrices for partial actions on NTU60

In this study, the impact of different modules on the model performance is evaluated on a combined training set of multi-modal representations. K-sets denote different kinds of modal descriptions. Table 8 indicates that multi-modal representation boosts the variety of input features and the scope of training models, compensating for issues such as incomplete information or noise by combining data from multiple sources, thus improving integration effectiveness.

From Table 11, it can be observed that on NTU RGB + D 120 X-Sub protocol, all 4S modalities are more accurate than B or J modalities alone. Removing the SACD module alone (NO SACD) increased the accuracy of the 4S modality by 1.5% and 4.1% compared to the B and J modalities alone; removing the MSSTC module alone increased the accuracy of the 4S modality by 1.3% and 3.6% compared to the B and J modalities alone; removing the STJF module alone increased the accuracy of the 4S modality by 1.8% and 4.2% compared to the B and J modalities respectively; and on cross-subject assessment, the accuracy of the J + B and 4S modalities increased by 1.7% and 3.8%, respectively, compared to the accuracy of the joints only. This indicates that multi-modal representation boosts the variety of input features and the scope of training models, compensating for issues such as incomplete information or noise by combining data from multiple sources, thus improving integration effectiveness.

Table 11. Individual modal results for each module under X-Sub for NTU RGB + D 120

Methods	Modes	k-set	ACC(%)
NO SACD	B	{1}	87.7
	J	{k}	85.1
	4S	{1, k}	89.2
NO MSSTC	B	{1}	88.2
	J	{k}	85.9
	4S	{1, k}	89.5
NO STJF	B	{1}	87.6
	J	{k}	85.8
	4S	{1, k}	89.4

In addition, real-world skeleton data often contain occlusions or noisy joint detections due to camera angle, lighting, or sensor limitations. Future extensions of MSCDGCN may incorporate robust design elements such as confidence-aware joint weighting or topological graph filtering to enhance resilience against imperfect input.

Statistical significance analysis

To ascertain that the modest performance gains of the proposed model were statistically sound, a rigorous significance test was conducted, ruling out random variation as a cause. To assess the statistical significance of MSCDGCN's average accuracy improvement over the baseline, the following hypotheses are formulated:

Null Hypothesis (H₀): There is no significant difference in average accuracy between MSCDGCN and the baseline.

Alternative Hypothesis (H₁): MSCDGCN achieves a significantly higher average accuracy than the baseline.

The paired t-test was used as the evaluation criterion. The test statistic is defined in Eq. (16).

t = \frac{\bar{d}}{s_{d} / \sqrt{n}}

where

\bar{d}

is the mean difference in accuracy,

s_{d}

is the standard deviation of the differences, and n = 5 is the number of independent trials. Table 12 presents the results of the dataset X-Sub.

Table 12. The t-test results of the X-Sub dataset

Method	1	2	3	4	5	Avg. Accuracy	p-value
MSCDGCN	92.3%	92.5%	92.1%	92.4%	92.6%	92.38%	< 0.001
BASELINE	91.6%	91.8%	91.5%	91.7%	91.9%	91.70%	(reference)
DIFFERENCE	+ 0.7%	+ 0.7%	+ 0.6%	+ 0.7%	+ 0.7%	+ 0.68%

With $\bar{d}$ =0.68%, $s_{d}$ =0.05%, t = 30.4, the resulting p-value < 0.001, indicating that the improvement is statistically significant. The same procedure was applied to other datasets, and in all cases, the null hypothesis was rejected, confirming that the performance improvements of MSCDGCN are statistically significant.

Although the observed improvements in overall accuracy appear relatively small—typically ranging from 0.1% to 0.3%—they are both statistically significant and practically meaningful. Given the large scale of benchmark datasets such as NTU RGB + D 60&120, even a 0.3% increase translates into a considerable number of correctly classified samples. Moreover, as demonstrated in Table 8, the proposed method yields notable improvements of 2.0% to 3.0% in distinguishing fine-grained action pairs (e.g., “put on a shoe” vs. “take off a shoe”, “put on glasses” vs. “take off glasses”), where subtle motion differences are critical for accurate recognition. These results indicate that marginal gains in average accuracy often reflect substantial performance enhancements in challenging, semantically similar categories. Considering the saturated performance of existing state-of-the-art models, such improvements are non-trivial and substantiate the effectiveness of the proposed framework in real-world scenarios.

Practical application of the model

The real-time capability of the proposed model for practical deployment was evaluated using randomly collected skeleton sequence data from real-world scenarios as test samples. The model was deployed and evaluated on NVIDIA GeForce RTX 3060 GPUs. As shown in the Table 13, the inference speed reaches 19.1 ms on the NTU RGB + D 60 & 120 datasets, and 19.8 ms on the real-world skeleton test sequences, demonstrating that the model satisfies the requirements for real-time performance.

Table 13. Inference speed of MSCDGCN on different datasets

Datasets	Inference Speed
NTU RGB + D 60&120	19.1 ms(52 ps)
Self-test sample	19.8 ms(51 ps)

The error rate on our collected real-world test set of 250 samples was approximately 12.3%, which is higher than on NTU but expected due to the noisiness of in-the-wild pose estimation. These errors were predominantly due to the aforementioned occlusion and estimation noise. Future work will explicitly address these limitations by incorporating robust training techniques such as skeleton data augmentation with random joint masking and integrating confidence scores from the pose estimator to down-weight unreliable joints.

The generalization ability of the proposed model was assessed through cross-dataset experiments.As shown in the Table 14, the model achieved an accuracy of 39.8%, demonstrating its ability to transfer knowledge across domains. In addition, the model was tested on randomly collected real-world samples. Figure 10 illustrates example test sequences for actions such as putting on shoes, drinking water, saluting, and squatting. For the “putting on shoes” action, the model, aided by the Sparse Self-Attention Central Difference (SACD) module, focuses more precisely on the motion trajectories of the feet and hands while suppressing irrelevant regions, thereby capturing the most informative features. By learning spatial and temporal dependencies among joints, the model identifies the interaction between the hand and foot and infers the correct topological structure of the action, ultimately leading to accurate classification. These results suggest that the model exhibits strong generalization in real-world scenarios. However, under challenging conditions such as joint occlusion or rare actions, recognition performance may degrade. As a remedy, future work could incorporate random joint masking for data augmentation and leverage kinematic priors to estimate the positions of occluded joints.

Table 14. Recognition results of the model on the Kinetics dataset

Kinetics
Methods	Top-1	Top-5
ST-GCN	30.7	52.8
2 s-AGCN	36.1	58.7
DGNN	36.9	59.6
GCD-NAS	37.1	60.1
MS-AAGCN	37.8	61.0
Dynamic GCN	37.9	61.3
MS-G3D	38.0	60.9
DualHead-Net	38.4	61.3
MST-GCN	38.1	60.8
MSCDGCN	39.8	62.5

[See PDF for image]

Fig. 10

Qualitative action recognition results of MSCDGCN in real-world scenarios. Examples include drinking, putting on shoes, saluting, and squatting. The model successfully focuses on key motion areas (e.g., hand-to-head or hand-to-foot interactions) to recognize actions in noisy and unconstrained environments

Conclusion

In this paper, a novel skeleton-based action recognition model named MSCDGCN is proposed, introducing a sparse self-attention central difference module. This design effectively addresses the limitations of traditional graph convolution, which often overlooks local motion details when processing skeletal data. A multi-scale separable temporal convolution module is proposed to capture temporal features, which not only extracts fine-grained temporal information but also significantly reduces the number of parameters, thereby improving model efficiency. Furthermore, a joint spatial–temporal focus module is introduced to integrate skeletal spatial location information with temporal dependencies, enhancing the model's capability to recognize complex action patterns.

Extensive experiments on the NTU RGB + D 60 and NTU RGB + D 120 datasets demonstrate that MSCDGCN outperforms all baseline methods across the four modalities: J, B, J + B, and 4S. Specifically, MSCDGCN achieves top recognition accuracy of 92.8% (X-Sub) and 96.8% (X-View) on NTU RGB + D 60, and 89.5% (C-Sub) and 91.0% (C-Set) on NTU RGB + D 120. Ablation studies confirm the contribution of each proposed module. Moreover, MSCDGCN attains a competitive accuracy of 39.8% on the cross-dataset Kinetics benchmark.

Future work will focus on designing more compact feature extraction networks and developing efficient methods to optimize skeletal data structures for lightweight applications. Improving skeletal representations through enhanced modeling of joint positions, distances, and angles, along with reducing graph convolution layers to lower computational complexity, could further enhance accuracy and efficiency. To facilitate deployment on resource-constrained devices—such as mobile platforms and embedded systems—model compression techniques including pruning and quantization will be explored. These strategies can reduce model size and computation overhead while maintaining accuracy, enabling more feasible real-time inference in edge computing environments.

Finally, the importance of ethical considerations in skeleton-based action recognition is acknowledged. Although skeleton data appears anonymized, it may still reveal sensitive behavioral patterns. Future research should incorporate privacy-preserving techniques, ensure data security, and uphold informed consent to promote responsible and trustworthy AI applications.

Author contributions

Xinlu Zong conceptualized the research design, Siyu Dong designed the methodology, Jiawei Guo took charge of data compilation and verified the paper.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 62472149, 62376089).

Data availability

The data generated and analyzed during the current study available from the corresponding author on reasonable request.

Declarations

Competing interest

There are no financial or personal conflicts of interest.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Ahmad, T; Rizvi, STH; Kanwal, N. Transforming spatio-temporal self-attention using action embedding for skeleton-based action recognition. J vis Commun Image Represent; 2023; 95, [DOI: https://dx.doi.org/10.1016/j.jvcir.2023.103892] 103892.

Awotunde J, Bhoi A, Jimoh R et al (2023) Internet of things based enabled convolutional neural networks in healthcare. In: IoT-enabled convolutional neural networks: techniques and applications, 2rd edn. River Publishers, New York, pp 27–63. https://doi.org/10.1201/9781003393030

Bai R, Li M, Meng B et al (2022) Hierarchical graph convolutional skeleton transformer for action recognition. In: Proceedings of the IEEE international conference on multimedia and expo (ICME), pp 1–6. https://doi.org/10.1109/ICME52920.2022.9859781

Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 4724–4733. https://doi.org/10.1109/CVPR.2017.502

Chen, Z; Li, S; Yang, B et al. Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. Proc AAAI Conf Artif Intell; 2021; 35, pp. 1113-1122.

Chen Y, Zhang Z, Yuan C et al (2021) Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 13359–13368. https://doi.org/10.1109/ICCV48922.2021.01311

Cheng K, Zhang Y, He X et al (2020) Skeleton-based action recognition with shift graph convolutional network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 183–192. https://doi.org/10.1109/CVPR42600.2020.00026

Cheng K, Zhang Y, Cao C et al (2020) Decoupling GCN with DropGraph module for skeleton-based action recognition. In: Proceedings of the European conference on computer vision (ECCV). Lect Notes Comput Sci 12369:536–553. https://doi.org/10.1007/978-3-030-58586-0_32

Cheng, Q; Cheng, J; Ren, Z et al. Multi-scale spatial–temporal convolutional neural network for skeleton-based action recognition. Pattern Anal Appl; 2023; 26, pp. 1303-1315. [DOI: https://dx.doi.org/10.1007/s10044-023-01156-w]

Chi H, Ha M, Chi S et al (2022) InfoGCN: Representation learning for human skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 20154–20164. https://doi.org/10.1109/CVPR52688.2022.01955

Filtjens, B; Vanrumste, B; Slaets, P. Skeleton-based action segmentation with multi-stage spatial-temporal graph convolutional neural networks. IEEE Trans Emerg Top Comput; 2024; 12, 1 pp. 202-212. [DOI: https://dx.doi.org/10.1109/TETC.2022.3230912]

Fischer, I. The conditional entropy bottleneck. Entropy; 2020; 22, 9 999.4220449 [DOI: https://dx.doi.org/10.3390/e22090999]

Gao L, Ji Y, Yang Y, Shen H (2022) Global-local cross-view Fisher discrimination for view-invariant action recognition. In: Proceedings of the 30th ACM international conference on multimedia, pp 5255–5264. https://doi.org/10.1145/3503161.3548280

He K, Zhang X, Ren S et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778. https://doi.org/10.1109/CVPR.2016.90

Karim, M; Khalid, S; Aleryani, A et al. Human action recognition systems: a review of the trends and state-of-the-art. IEEE Access; 2024; 12, pp. 36372-36390. [DOI: https://dx.doi.org/10.1109/ACCESS.2024.3373199]

Lee J, Rossi R, Kong X (2018) Graph classification using structural attention. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’18), pp 1666–1674. https://doi.org/10.1145/3219819.3219980

Li M, Chen S, Chen X et al (2019) Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3590–3598. https://doi.org/10.1109/CVPR.2019.00371

Liu, J; Shahroudy, A; Xu, D et al. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Trans Pattern Anal Mach Intell; 2018; 40, 12 pp. 3007-3021. [DOI: https://dx.doi.org/10.1109/TPAMI.2017.2771306]

Liu, J; Shahroudy, A; Perez, M. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Trans Pattern Anal Mach Intell; 2019; 42, 10 pp. 2684-2701. [DOI: https://dx.doi.org/10.1109/TPAMI.2019.2916873]

Liu Z, Zhang H, Chen Z et al (2020) Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 140–149. https://doi.org/10.1109/CVPR42600.2020.00022

Liu, K; Li, Y; Xu, Y et al. Spatial focus attention for fine-grained skeleton-based action tasks. IEEE Signal Process Lett; 2022; 29, pp. 1883-1887. [DOI: https://dx.doi.org/10.1109/LSP.2022.3199670]

Liu, C; Fu, R; Li, Y et al. A self-attention augmented graph convolutional clustering networks for skeleton-based video anomaly behavior detection. Appl Sci; 2022; 12, 1 4. [DOI: https://dx.doi.org/10.3390/app12010004]

Liu, J; Wang, X; Wang, C et al. Temporal decoupling graph convolutional network for skeleton-based gesture recognition. IEEE Trans Multimedia; 2024; 26, pp. 811-823. [DOI: https://dx.doi.org/10.1109/TMM.2023.3271811]

Ma, Q; Chen, E; Lin, Z et al. Convolutional multitimescale echo state network. IEEE Trans Cybern; 2021; 51, 3 pp. 1613-1625. [DOI: https://dx.doi.org/10.1109/TCYB.2019.2919648]

Miao, F; Meunier, J. Skeleton graph-neural-network-based human action recognition: a survey. Sensors; 2022; 22, 6 2091. [DOI: https://dx.doi.org/10.3390/s22062091]

Nguyen, H-C; Nguyen, T et al. Deep learning for human activity recognition on 3D human skeleton: survey and comparative study. Sensors; 2023; 23, 11 5121. [DOI: https://dx.doi.org/10.3390/s23115121]

Oguntoye, J; Awodoye, O; Oladunjoye, J et al. Predicting COVID-19 from chest X-ray images using optimized convolution neural network. LAUTECH J Eng Technol; 2023; 17, pp. 28-39.

Qiu, J; Yan, X; Wang, W et al. Skeleton-based abnormal behavior detection using secure partitioned convolutional neural network model. IEEE J Biomed Health Inform; 2022; 26, 12 pp. 5829-5840. [DOI: https://dx.doi.org/10.1109/JBHI.2021.3137334]

Selvi, E; Adimoolam, M; Karthi, G et al. Suspicious actions detection system using enhanced CNN and surveillance video. Electronics; 2022; 11, 24 4210. [DOI: https://dx.doi.org/10.3390/electronics11244210]

Shahroudy A, Liu J, Ng T et al (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1010–1019. https://doi.org/10.1109/CVPR.2016.115

Shi, X. Driver distraction behavior detection framework based on the DWPose model, Kalman filtering, and multi-transformer. IEEE Access; 2024; 12, pp. 80579-80589. [DOI: https://dx.doi.org/10.1109/ACCESS.2024.3406605]

Shi L, Zhang Y, Cheng J et al (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 12018–12027. https://doi.org/10.1109/CVPR.2019.01230

Shi, L; Zhang, Y; Cheng, J et al. Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Trans Image Process; 2020; 29, pp. 9532-9545. [DOI: https://dx.doi.org/10.1109/TIP.2020.3028207]

Shi L, Zhang Y, Cheng J et al (2020) Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In: Proceedings of the Asian conference on computer vision (ACCV), pp 38–53. https://doi.org/10.1007/978-3-030-69541-5_3

Shu, X; Zhang, L; Qi, G et al. Spatiotemporal co-attention recurrent neural networks for human-skeleton motion prediction. IEEE Trans Pattern Anal Mach Intell; 2022; 44, 6 pp. 3300-3315. [DOI: https://dx.doi.org/10.1109/TPAMI.2021.3050918]

Si C, Chen W, Wang W et al (2019) An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1227–1236. https://doi.org/10.1109/CVPR.2019.00132

Tu, Z; Zhang, J; Li, H et al. Joint-bone fusion graph convolutional network for semi-supervised skeleton action recognition. IEEE Trans Multimed; 2023; 25, pp. 1819-1831. [DOI: https://dx.doi.org/10.1109/TMM.2022.3168137]

Vemulapalli R, Arrate F, Chellappa R (2014) Human action recognition by representing 3D skeletons as points in a Lie group. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 588–595. https://doi.org/10.1109/CVPR.2014.82

Vrahatis, AG; Lazaros, K; Kotsiantis, S. Graph attention networks: a comprehensive review of methods and applications. Future Internet; 2024; 16, 9 318. [DOI: https://dx.doi.org/10.3390/fi16090318]

Wang, P; Li, W; Li, C et al. Action recognition based on joint trajectory maps with convolutional neural networks. Knowl-Based Syst; 2018; 158, pp. 43-53. [DOI: https://dx.doi.org/10.1016/j.knosys.2018.05.029]

Xia, R; Li, Y; Luo, W. Laga-net: local-and-global attention network for skeleton-based action recognition. IEEE Trans Multimed; 2022; 24, pp. 2648-2661. [DOI: https://dx.doi.org/10.1109/TMM.2021.3086758]

Xiong, X; Min, W; Wang, Q et al. Human skeleton feature optimizer and adaptive structure enhancement graph convolution network for action recognition. IEEE Trans Circuits Syst Video Technol; 2023; 33, 1 pp. 342-353. [DOI: https://dx.doi.org/10.1109/TCSVT.2022.3201186]

Xu, B; Shu, X; Zhang, J et al. Spatiotemporal decouple-and-squeeze contrastive learning for semi-supervised skeleton-based action recognition. IEEE Trans Neural Netw Learn Syst; 2024; 35, 8 pp. 11035-11048. [DOI: https://dx.doi.org/10.1109/TNNLS.2023.3247103]

Yan, S; Xiong, Y; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. Proc AAAI Conf Artif Intell; 2018; [DOI: https://dx.doi.org/10.1609/aaai.v32i1.12328]

Ye F, Pu S, Zhong Q et al (2020) Dynamic GCN: Context-enriched topology learning for skeleton-based action recognition. In: Proceedings of the ACM international conference on multimedia (MM ’20), pp 55–63. https://doi.org/10.1145/3394171.3413941

Zaidi A, Aguerri IE (2020) Distributed deep variational information bottleneck. In: Proceedings of the IEEE international workshop on signal processing advances in wireless communications (SPAWC), pp 1–5. https://doi.org/10.1109/SPAWC48557.2020.9154315

Zhang J, Shi X, Xie J et al (2018) GaAN: Gated attention networks for learning on large and spatiotemporal graphs. In: Proceedings of the conference on uncertainty in artificial intelligence (UAI ’18), pp 339–349. https://api.semanticscholar.org/CorpusID:3973810

Zhang, P; Lan, C; Xing, J et al. View adaptive neural networks for high-performance skeleton-based human action recognition. IEEE Trans Pattern Anal Mach Intell; 2019; 41, 8 pp. 1963-1978. [DOI: https://dx.doi.org/10.1109/TPAMI.2019.2896631]

Zhang P, Lan C, Zeng W et al (2020) Semantics-guided neural networks for efficient skeleton-based human action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1109–1118. https://doi.org/10.1109/CVPR42600.2020.00119

Zhou, T; Li, J; Wang, S et al. MATNet: motion-attentive transition network for zero-shot video object segmentation. IEEE Trans Image Process; 2020; 29, pp. 8326-8338. [DOI: https://dx.doi.org/10.1109/TIP.2020.3013162]

Word count: 10936

Show less

© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

MSCDGCN: Multi-scale central difference graph convolution network for skeleton-based action recognition

Content area

Abstract

Full text