Recently, the challenge of detecting the boundaries of isolated signs in a continuous sign video has been studied by researchers. To enhance model performance, replace the handcrafted feature extractor, and also consider the hand structure in these models, we propose a deep learning-based approach that combines a Graph Convolutional Network (GCN) and a Transformer model with a post-processing mechanism for final boundary detection. More specifically, the proposed approach includes two main steps: pre-training on isolated sign videos and deployment on continuous sign videos. In the first step, the enriched spatial features obtained from the GCN model are fed to the Transformer model to capture the temporal information in the video stream. This model is pre-trained using only the pre-processed isolated sign videos with the same frame lengths. In the second step, a sliding window with a pre-defined size is moved over the continuous sign video, which contains un-processed isolated signs with different frame lengths. More concretely, the content of each window is processed by the pre-trained model obtained from the first step, and the class probabilities from the Fully Connected (FC) layer embedded in the Transformer model are fed to the post-processing module, which aims to detect the accurate boundaries of the un-processed isolated signs. In addition, we present a non-anatomical graph structure to better represent the movements of, and relations between, the hand joints during signing. Relying on the proposed non-anatomical hand graph structure as well as the self-attention mechanism of the Transformer model, the proposed model successfully tackles the challenges of boundary detection in continuous sign videos. Experimental results on two datasets show the superiority of the proposed model in isolated sign boundary detection in continuous sign sequences.
Introduction
Hand gestural communications are widely used in different types of communication in our daily life1. Sign language, as the most grammatically structured category of these communications, is the primary language of the Deaf community2. In addition, many other application areas use gestural communication, such as motion analysis and human-computer interaction1. Vision-based dynamic hand sign language recognition aims to recognize hand sign labels from video inputs. This task is an active research area in computer graphics and human-computer interaction with a wide range of applications in VR/AR3, healthcare4, and robotics5. Generally, there are two types of vision-based dynamic hand sign language recognition: isolated and continuous. Continuous hand sign recognition differs from isolated hand sign classification2 and from sign spotting6, in which predefined hand signs are detected in a video stream and the supervision includes exact temporal locations for each sign. The critical point in continuous hand sign recognition is that each video stream of a sign sentence contains its ordered gloss labels but no time boundaries for each gloss. So, the main problem in continuous hand sign recognition is learning the correspondence between the image time series and the sequences of glosses.
Recently, Deep Learning models have achieved breakthroughs in many tasks, such as Human Action Recognition7,8, Hand Gesture Recognition9, and Isolated Sign Language Recognition10, 11, 12, 13, 14, 15, 16, 17–18. However, continuous sign language recognition remains challenging and non-trivial. More concretely, the recognition system needs to obtain spatio-temporal features from the weakly supervised, unsegmented video stream. Since the data include the video sequences along with the gloss labels at the sentence level, a large amount of data samples is required to align the signs and gloss labels correctly without overfitting. Furthermore, the boundaries of isolated signs are not available in the datasets. To tackle this challenge, some models have recently been presented for the boundary detection of isolated signs in a continuous sign video16,28. These models can efficiently detect the isolated signs in a continuous video stream. However, using handcrafted features and employing the raw data without considering the connections between the hand keypoints are the main drawbacks of these models. To improve the recognition accuracy, use fully automated features, and consider the connections between the hand keypoints, we propose a deep learning-based model, building on the recent breakthroughs of Graph Convolutional Networks (GCN) and Transformer models in many research areas, especially sign language and gesture recognition19,28. To this end, we propose a deep learning-based approach using a GCN, a Transformer, and a post-processing mechanism for isolated sign boundary detection in continuous sign videos.
The majority of the proposed models for gesture and hand sign recognition have used deep Convolutional Neural Networks (CNNs) to obtain the features in each frame of the continuous video8,11,13. After that, these features are fed to a temporal learning model, such as an RNN, LSTM, or GRU. This combination of a CNN and RNN/LSTM/GRU can be sensitive to noisy backgrounds, occlusions, and different camera viewpoints. To tackle these challenges, another modality, the human body/hand skeleton, has been used by many researchers in related research areas1,6,8,13. A more compact representation, better robustness against occlusion and viewpoint changes, and higher expressive capability in capturing features in both the temporal and spatial domains are some of the advantages of the skeleton modality over the image/video modalities. A natural way to represent the human skeleton is with graphs. To this end, the skeleton joints and bones are defined as graph nodes and edges, respectively. Here, the point is how to optimize the topology of such a graph. Generally, most works adopt the anatomical structure of the hand, using the natural connections between hand joints, and benefit from a biologically accurate representation of the hand, more realistic and natural-looking generated hand movements, and better interpretability. However, considering the scope of this paper and analyzing the hand movement as well as the interaction between the hand joints, we propose to use a non-anatomical graph structure for continuous hand sign language recognition, aiming to revolutionize the graph structure used in this area. Results confirm that we need to rethink the graph structure used in human/hand skeleton-based graphs. Our contributions can be summarized as follows:
Non-Anatomical Graph Structure for Hand Modeling: A novel non-anatomical hand graph structure is proposed, departing from conventional anatomical or predefined skeletal connections. This design is derived from an analysis of functional hand joint interactions observed during signing. By focusing on movement-driven connectivity rather than static anatomy, more semantically meaningful and adaptive spatial representations are enabled. To the best of our knowledge, this is the first time such a graph structure is proposed for continuous hand sign language recognition.
GCN-Transformer Fusion with Confidence-Based Post-Processing: A hybrid architecture is introduced that integrates Graph Convolutional Networks (GCNs) for spatial representation with Transformer encoders for modeling long-range temporal dependencies. This combination facilitates accurate boundary detection of isolated signs within continuous sign sequences. To further enhance robustness, a softmax-based confidence thresholding mechanism is applied in post-processing to filter ambiguous or low-confidence predictions. This integrated approach improves the system’s reliability, particularly for visually similar signs, a limitation not explicitly addressed by prior models.
Superior Performance on Dual-Form Datasets: The proposed model is evaluated on two comprehensive datasets–RKS-PERSIANSIGN and ASLLVD–each containing both isolated and continuous sign sequences. It achieves superior recognition performance, with average softmax confidence scores of 0.9955 and 0.6850, respectively. Beyond accuracy, computational efficiency is demonstrated through runtime and FLOPs analysis. The model outperforms state-of-the-art methods by achieving lower runtime and reduced FLOPs, enabling real-time inference and making it suitable for deployment in resource-constrained environments. These results underscore the model’s practical applicability alongside its strong recognition capabilities.
The remainder of the paper is organized as follows. Related literature is reviewed in the literature review section. Details of the proposed methodology are described in the proposed model section. The proposed graph structure is presented in the anatomical vs. non-anatomical graph structure section. Results are discussed in the results section. Finally, the work is discussed and concluded in the discussion and conclusion section.
Literature review
Here, we briefly review recent works in hand gesture recognition, Sign Language Recognition (SLR), and Human Action Recognition (HAR) using GCN and other models, such as CNN and transformer models.
Recognition using GCN: Meng and Li proposed a multi-scale and dual sign language recognition network (SLR-Net) using a GCN on the skeleton data extracted from RGB video inputs. The proposed model consists of three main parts: a multi-scale attention network (MSA), a multi-scale spatiotemporal attention network (MSSTA), and an attention-enhanced temporal convolution network (ATCN). These parts are used to learn the dependencies between long-distance vertices, the spatiotemporal features, and the long-term temporal dependencies, respectively. Results on two datasets, CSL-500 and DEVISIGN-L, show that the proposed model outperforms state-of-the-art models in SLR20. However, it is not clear how much these three attention mechanisms overlap. Vázquez-Enríquez et al. proposed a graph-based model for skeleton-based Isolated Sign Language Recognition (ISLR). They benefit from the advantages of a multi-scale spatial-temporal graph convolution operator, MSG3D, to exploit the semantic connectivity among non-neighboring nodes of the graph at a flexible temporal scale. Results on the AUTSL dataset are promising for ISLR21, although the impact of each scale is not clear. Aiming to include handcrafted features in the GCN model and facilitate the learning process, Degardin et al. introduced a model entitled REasoning Graph Convolutional Networks IN Human Action Recognition (REGINA). To this end, a handcrafted Self-Similarity Matrix (SSM) is applied to the temporal graph convolution part, aiming to enhance global connectivity across the temporal axis. Results on the NTU RGB+D dataset are competitive with state-of-the-art models in skeleton-based action recognition22. Since REGINA aims to contribute to performance improvements in other models, its reliance on handcrafted features is a limitation. In another work, Duhme et al. proposed a graph-based model for action recognition from various sensor data modalities. The modalities are fused on two dimensionality levels: the channel dimension and the spatial dimension. Results on two publicly available datasets, UTD-MHAD and MMACT, demonstrate that the proposed model obtains a relative improvement margin of up to 12.37% (F1-Measure) with the fusion of skeleton estimates and accelerometer measurements23. However, this model suffers from misclassification errors in the absence of object interactions in the data. Li et al. proposed an encoder-decoder structure to obtain action-specific latent dependencies from actions. To this end, they include higher-order dependencies of the skeletal data in the graph structure. By stacking several instances of the proposed graph, the model aims to learn both spatial and temporal features for action recognition. Results on the NTU RGB+D and Kinetics datasets show that the proposed model outperforms state-of-the-art models in action recognition24. However, the impact of future pose prediction on the model performance needs further discussion.
Recognition using other models: Samir Elons et al. proposed an approach for tackling pose variations in 3D hand recognition for SLR using a Pulse-Coupled Neural Network (PCNN). This network is employed for image feature generation from two different viewing angles. Using a fitness function, the generated features are evaluated before the classification step to dynamically assign weights to each camera viewing angle. Results on a multi-view Arabic sign language dataset show the effectiveness of the proposed approach for SLR. However, the generalization of the proposed approach needs to be evaluated on more datasets26. Rastgoo et al. proposed a two-stage model combining a CNN, SVD, and LSTM. After training with the isolated signs, a post-processing algorithm is applied to the Softmax outputs obtained from the first part of the model in order to separate the isolated signs in continuous sign videos. Results on the continuous sign videos, created using two public datasets in Isolated Sign Language Recognition (ISLR), RKS-PERSIANSIGN and ASLLVD, confirm the efficiency of the proposed model in dealing with isolated sign boundary detection16. However, the challenge of using handcrafted features remains in this model. Aiming to tackle this challenge, a Transformer-based model has been proposed that benefits from the parallel computing and self-attention mechanism of the Transformer model and can accurately detect the isolated sign boundaries in the continuous sign video. However, the joint connections are ignored in this model28. Zhang et al. designed a Transformer-based model, namely the Heterogeneous Attention-based Transformer (HAT), for attention generation from diverse spatial and temporal contextual levels in sign language translation. Compared to conventional Transformer models, HAT obtains more effective visual-text representations. Results on the PHOENIX2014T dataset show the superiority of the proposed model compared to state-of-the-art models in sign language translation27. However, the model generalization needs to be evaluated on more datasets. A dynamic sign word recognition method has been suggested in29, in which multi-scale spatio-temporal features are combined with a lightweight feature selection strategy and an End-to-End Fourier Convolutional Neural Network (EFCNN). 3D hand motion features are extracted using pixel-weighted spatial alignment and processed in the Fourier domain. To enhance performance, a feature selection variant, FS-EFCNN, is introduced, where compact and semantically meaningful features are selected using several advanced methods. High accuracy is achieved on the ASL, BSL, and GSL datasets. Additionally, an enhanced approach to weak spatial modeling for sign language recognition has been developed through the use of CNN-based deep skeletal feature transformation. Challenges associated with capturing subtle spatial dynamics in skeletal data are addressed by introducing a novel framework in which raw skeletal sequences are transformed into structured spatial representations. Deep spatial patterns are extracted using a convolutional neural network, allowing for improved robustness to motion variations and signer differences. The effectiveness of the proposed method is demonstrated through experiments on benchmark sign language datasets, where higher recognition accuracy and reduced computational complexity are achieved compared to conventional skeletal modeling techniques30.
In addition, a novel framework for ASL word recognition has been suggested in31, where spatio-temporal prosodic and angular features are extracted and modeled through a sequential learning approach. Hand motion dynamics and joint angles are encoded to capture both temporal rhythm and spatial structure. These features are processed using a recurrent learning model, allowing sequential dependencies to be effectively learned. The proposed method is evaluated on benchmark ASL datasets, where improved recognition accuracy is achieved compared to existing approaches, demonstrating the effectiveness of combining prosodic cues with spatial angular information. Another model, entitled Fsign-Net, has been introduced in32 for sign word recognition using depth sensor data and frame-based Fourier feature modeling. Multi-scale spatio-temporal features are extracted from aggregated depth video frames, where pixel-weighted alignment is applied to capture dynamic hand movements. These features are transformed in the Fourier domain to encode temporal dependencies efficiently. A low-cost feature selection mechanism is integrated to retain the most informative and semantically relevant features. The proposed method is evaluated on multiple sign language datasets, where superior recognition accuracy is achieved, demonstrating the effectiveness of Fsign-Net in real-world dynamic sign recognition scenarios.
Proposed model
Here, we describe the details of the proposed model, as Fig. 1 shows. Recently, the challenge of detecting the boundaries of isolated signs in a continuous sign video has attracted researchers16,28. To enhance model performance, replace the handcrafted feature extractor of previous models, and employ the graph structure of the hands, we propose a deep learning-based model using the GCN and Transformer models combined with a post-processing mechanism to accurately detect the isolated sign boundaries in continuous sign videos. More concretely, the proposed model contains two main steps:
Pre-training on the isolated sign videos: During the pre-training phase, only the isolated signs are used in the model. After frame extraction from the input isolated sign video, each frame is fed to the OpenPose model33, a state-of-the-art model for pose estimation, and 21 3D keypoints per hand are obtained and used in the proposed model. For simplicity, we normalize these coordinates to the [0, 96] interval. After analyzing the hand skeleton with anatomical and non-anatomical structures, we were encouraged to replace the anatomical graph structure with non-anatomical graph structures. We experimentally checked many different graph structures; however, only three structures are shown and reported in this work (see Fig. 2). Our study shows that we do not necessarily need to use the anatomical structure of the hand skeleton. Each proposed graph includes 21 nodes, where the estimated 3D hand keypoints are used as the node properties/messages. More details on the graph structure are presented in the next section. Using the GCN model as an efficient feature extractor to enrich the 3D hand features of both hands, we obtain richer embeddings in the nodes. The GCN aggregates information from the neighboring nodes in the graph to strengthen the node features, so the output graph obtained from the GCN carries richer features for each node. A separate two-layer GCN is used for each hand. The properties/messages of all nodes in the graph are flattened into a vector called the graph embedding. This process is performed separately for each hand. The graph embeddings corresponding to the left and right hands are concatenated frame by frame and fed to the next part of the model, the Transformer Encoder. Relying on the self-attention mechanism and parallel computing of the Transformer model, richer features, including a vector of the 3D hand keypoints, are obtained. Finally, the FC layers are used to enhance the model learning. The proposed GCN-Transformer model is pre-trained using the pre-processed isolated sign videos with the same frame lengths.
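A minimal PyTorch sketch of this spatial-temporal pipeline is given below. Only the overall flow follows the description above (a two-layer GCN per hand on 21 three-dimensional keypoints, flattening to a graph embedding, frame-by-frame concatenation of both hands, a Transformer encoder with 12 layers and 8 heads as in the ablation of Table 2, and the 512-256-128-100 FC head); the hidden widths, the mean pooling over frames, and the tensor shapes are illustrative assumptions rather than the exact implementation. The matrix `a_hat` denotes a normalized (21, 21) adjacency built from one of the graph structures described in the next section.

```python
import torch
import torch.nn as nn


class SimpleGCNLayer(nn.Module):
    """One graph convolution: X' = ReLU(A_hat @ X @ W), with A_hat a fixed normalized adjacency."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, a_hat):            # x: (B*T, 21, in_dim), a_hat: (21, 21)
        return torch.relu(self.linear(torch.matmul(a_hat, x)))


class HandGCNTransformer(nn.Module):
    """Two-layer GCN per hand -> flattened graph embeddings -> frame-wise concatenation
    of both hands -> Transformer encoder -> FC classification head (sketch)."""
    def __init__(self, a_hat, num_classes=100, node_dim=3, hid=16,
                 d_model=256, n_layers=12, n_heads=8):
        super().__init__()
        self.register_buffer("a_hat", a_hat)                    # (21, 21) normalized adjacency
        self.gcn_left = nn.ModuleList([SimpleGCNLayer(node_dim, hid), SimpleGCNLayer(hid, hid)])
        self.gcn_right = nn.ModuleList([SimpleGCNLayer(node_dim, hid), SimpleGCNLayer(hid, hid)])
        self.proj = nn.Linear(2 * 21 * hid, d_model)            # concatenated left/right embeddings
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Sequential(                              # 512-256-128-100 FC stack
            nn.Linear(d_model, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes))

    def _embed(self, hand, layers):                             # hand: (B, T, 21, 3)
        b, t, n, d = hand.shape
        x = hand.reshape(b * t, n, d)
        for layer in layers:
            x = layer(x, self.a_hat)                            # enrich node features via neighbors
        return x.reshape(b, t, -1)                              # flatten nodes -> graph embedding

    def forward(self, left, right):                             # each hand: (B, T, 21, 3)
        z = torch.cat([self._embed(left, self.gcn_left),
                       self._embed(right, self.gcn_right)], dim=-1)
        z = self.encoder(self.proj(z))                          # self-attention over the T frames
        return self.head(z.mean(dim=1))                         # clip-level class logits
```

Calling `model(left, right)` on per-hand keypoint tensors of shape (batch, 50, 21, 3) yields clip-level class logits that are compared against the isolated sign labels during pre-training; the mean pooling over the encoder output is only a simple placeholder for whatever temporal aggregation the full model uses.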
Deploying on the continuous sign videos: After pre-training the GCN-Transformer model on the isolated sign videos, the model is employed in the second step on continuous sign videos containing isolated signs with different frame lengths. To this end, the sliding window mechanism with a predefined window size is applied to the continuous sign videos. While processing the frames corresponding to each window, the pre-trained model is used and the class probabilities obtained from the Softmax activation function, embedded in the last FC layer, are passed to the post-processing mechanism16 to detect the isolated sign boundaries. The intuition behind the post-processing methodology is to enhance the recognition accuracy by removing untrained and repetitive signs encountered by the sliding window approach. To enhance the clarity and reproducibility of our methodology, we present the pseudocode for both the offline training and online inference processes, along with the overall model flowchart, in Figs. 3, 4 and 5. Specifically, Fig. 3 illustrates the pseudocode for the offline training stage, detailing the pre-training steps of the GCN-Transformer model using isolated sign videos with uniform frame lengths. Fig. 4 presents the pseudocode for the online inference stage, which includes sliding window-based processing of continuous sign videos and post-processing for boundary detection. Finally, Fig. 5 provides a comprehensive flowchart of the proposed model, encompassing both the offline training and online inference stages. These additions are intended to improve the transparency of our workflow and facilitate the replication or further development of our approach.
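The online stage can be sketched as follows, assuming the HandGCNTransformer sketch above as the pre-trained model. The window size, the stride, and the duplicate-label suppression shown here are illustrative simplifications of the post-processing mechanism of16; the 0.51 acceptance threshold is the value reported later in the paper for the post-processing configuration.

```python
import torch


@torch.no_grad()
def detect_sign_boundaries(model, frames_l, frames_r, window=50, stride=10, threshold=0.51):
    """Slide a fixed-size window over a continuous sign video and keep only
    confident, non-repetitive window-level predictions (post-processing sketch).

    frames_l / frames_r: (T, 21, 3) keypoint tensors for the left/right hand.
    Returns a list of (start_frame, end_frame, class_id, confidence) segments.
    """
    model.eval()
    segments, last_cls = [], None
    t_total = frames_l.shape[0]
    for start in range(0, max(t_total - window + 1, 1), stride):
        wl = frames_l[start:start + window].unsqueeze(0)       # (1, W, 21, 3)
        wr = frames_r[start:start + window].unsqueeze(0)
        probs = torch.softmax(model(wl, wr), dim=-1)[0]        # class probabilities for this window
        conf, cls = probs.max(dim=-1)
        # Post-processing sketch: reject low-confidence windows and repeated labels.
        if conf.item() >= threshold and cls.item() != last_cls:
            segments.append((start, start + window, cls.item(), conf.item()))
            last_cls = cls.item()
    return segments
```

Any window whose highest class probability falls below the threshold is discarded, and consecutive windows repeating the same label collapse into a single detection, mirroring the intuition of removing untrained and repetitive signs described above.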
Fig. 1. The proposed model.
Fig. 2. Three perspectives of the hand: (a) Anatomical, (b) Hand keypoints, (c) Hand skeleton.
Fig. 3. The pseudocode for the offline training stage of the proposed model.
Fig. 4. The pseudocode for the online inference stage of the proposed model.
Fig. 5. The flowchart of the proposed model.
Fig. 6. The anatomical hand skeleton graph used in the proposed model. This graph is defined as (a).
Fig. 7. The non-anatomical graph structure used in the proposed model. This graph is defined as (b).
Fig. 8. The non-anatomical graph structure used in the proposed model. This graph is defined as (c).
Anatomical vs. non-anatomical graph structure
Fig. 2 presents the anatomical structure, hand keypoints, and hand skeleton used in this study. Moreover, Figs. 6, 7 and 8 show one anatomical and two non-anatomical graph structures of the hand. As these figures show, in the anatomical structure, the hand is separated into two parts: the fingers and the palm. Considering the joints of each finger, four keypoints are placed on each finger, and only one keypoint is assigned to the hand palm. After investigating the hand anatomy and the patterns of different hand gestures, we observed that there are many alternatives to the anatomical graph structure, which is extensively used in the literature. We experimentally checked and analyzed many of them; however, we only present three of them in this paper: the anatomical graph structure and the two best non-anatomical graph structures, i.e., those with the highest accuracy. All of the proposed structures consider the natural hand, with five fingers and four joints per finger. To model the hand in this structure, four keypoints are considered on each finger, as Fig. 2 (a) shows, and another keypoint is placed on the hand palm. So, we have 21 keypoints on each hand in all proposed hand structures. The difference between the proposed hand structures is in the connections between these keypoints. Details of these structures are described as follows:
Anatomical structure: This structure is based on the natural anatomy of the hand and is extensively used in the literature. In this structure, the keypoint placed on the hand palm is connected to all keypoints adjacent to it, and the keypoints on each finger are connected to each other in a chain. Here, we have 20 connections between the 21 hand keypoints. The pattern of this structure is shown in Fig. 6.
Non-Anatomical structure with one finger plus one point as references: In this structure, the four keypoints placed on one finger, together with the keypoint corresponding to the hand palm, are considered as reference points. The connections go from each reference keypoint on this finger to the aligned keypoints on the other fingers. Finally, the palm keypoint is connected to all keypoints on the reference finger. As Fig. 7 shows, there are 20 connections in this structure. The intuition behind selecting all keypoints on one finger plus the palm keypoint as the reference points is the smooth movement of this finger and of the palm keypoint across different signs; therefore, all connections are referenced to this finger.
Non-Anatomical structure with one reference point: Similar to the anatomical structure, we consider the palm keypoint as a reference point. However, the connections are referenced from this point to all other keypoints. As Fig. 8 shows, we have 20 connections from the palm keypoint to the other 20 hand keypoints. The intuition behind this reference point is that the palm keypoint moves smoothly in different signs, so it can be considered as a center of gravity that anchors the other keypoints. A sketch of the edge lists for all three structures is given below.
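The sketch below encodes the three connection patterns, assuming the OpenPose 21-keypoint indexing with node 0 on the palm/wrist and nodes 1-4, 5-8, 9-12, 13-16, and 17-20 running from the base to the tip of the thumb, index, middle, ring, and little fingers; which finger serves as the reference in structure (b) is an assumption for illustration. The resulting normalized adjacency is what a standard GCN layer consumes.

```python
import numpy as np

# Assumed OpenPose hand indexing: 0 = palm/wrist, 1-4 thumb, 5-8 index,
# 9-12 middle, 13-16 ring, 17-20 little finger (base -> tip).
PALM = 0
FINGERS = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16], [17, 18, 19, 20]]


def anatomical_edges():
    """(a) Natural hand anatomy: palm -> finger bases, chain along each finger (20 edges)."""
    edges = [(PALM, f[0]) for f in FINGERS]
    edges += [(f[i], f[i + 1]) for f in FINGERS for i in range(3)]
    return edges


def one_finger_plus_palm_edges(ref=FINGERS[0]):
    """(b) Reference finger + palm: each reference joint links to the aligned joint of
    every other finger, and the palm links to all reference joints (20 edges).
    The choice of reference finger is an assumption for illustration."""
    others = [f for f in FINGERS if f is not ref]
    edges = [(ref[i], f[i]) for f in others for i in range(4)]   # 4 x 4 = 16 aligned links
    edges += [(PALM, j) for j in ref]                            # 4 palm-to-reference links
    return edges


def star_edges():
    """(c) Single reference point: palm connected to every other keypoint (20 edges)."""
    return [(PALM, j) for f in FINGERS for j in f]


def normalized_adjacency(edges, n=21):
    """Symmetrically normalized adjacency with self-loops, as consumed by a standard GCN layer."""
    a = np.eye(n)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    return d @ a @ d


assert len(anatomical_edges()) == len(one_finger_plus_palm_edges()) == len(star_edges()) == 20
```

All three structures keep the same 21 nodes and the same edge count (20) and differ only in the connectivity pattern; `normalized_adjacency` can be converted to a torch tensor and passed as `a_hat` to the GCN-Transformer sketch shown earlier.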
Results
In this section, we present details of the datasets and results. First of all, the implementation details are described. After that, two datasets, used for evaluation, are briefly introduced. Finally, the details of the ablation analysis of the proposed graphs and also the comparison with state-of-the-art models are discussed.
Implementation details
Our evaluation was performed on an Intel(R) Xeon(R) E5-2699 CPU (2 processors) with 90 GB of RAM, running the Microsoft Windows 10 operating system and Python, with an NVIDIA Tesla K80 GPU. The PyTorch library has been used for model implementation. The parameters used in the implementation are shown in Table 1. More specifically, we clarify the hyperparameter choices in our experimental design as follows:
Training Epochs (10,000 with Early Stopping): We initially set a high upper limit of 10,000 training epochs to ensure that the model had sufficient opportunity to converge, particularly given the complexity of spatio-temporal patterns in sign language data. However, to prevent overfitting and unnecessary computation, we applied an early stopping mechanism with patience set to 50 epochs, which typically led to convergence much earlier (often within 200–300 epochs, depending on the dataset). This strategy allowed the model to train adaptively without requiring manual tuning of the epoch count for each dataset.
GCN Architecture (512-256-128 and 100 Units): The GCN layers with dimensions 512, 256, and 128 were empirically selected based on preliminary experiments aimed at balancing model capacity and overfitting risk. These dimensions progressively reduce feature dimensionality while enabling the model to extract hierarchical spatial representations. The final dense layer with 100 units corresponds to the number of classes in the dataset, ensuring alignment with the classification output space.
Model Selection Criteria: Model selection was based on the obtained accuracy. Specifically, we performed ablation studies (see Table 2) to evaluate the impact of different architectural choices using different combinations of GCN, BiLSTM, Attention, and Transformer, further guiding our model configuration.
Table 1. Details of the parameters used in the proposed model.
Parameter | Value | Parameter | Value |
|---|---|---|---|
Weight decay | 1e-4 | Epoch numbers (Early stopping) | 200 |
Learning rate | 0.005 | Number of frames per video sample | 50 |
Batch size | 50 | Processing way | GPU |
Keypoint dimension | 21x3 | Number of GCN layer | 2 |
Optimizer | Adam | Dataset split ratio for test data | 20% |
GCN layer numbers | 2 | Aggregation type | ‘Sum’ |
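A minimal training-loop sketch is given below to tie the Table 1 settings (Adam, learning rate 0.005, weight decay 1e-4, batch size 50) to the early-stopping policy described above (patience of 50 epochs against a 10,000-epoch ceiling). The dataset objects are assumed to yield per-hand keypoint tensors and labels, and the checkpoint path is a placeholder, not the actual implementation.

```python
import torch
from torch.utils.data import DataLoader


def train(model, train_set, val_set, max_epochs=10_000, patience=50, device="cuda"):
    """Training-loop sketch with the Table 1 hyperparameters and early stopping."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    train_loader = DataLoader(train_set, batch_size=50, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=50)

    best_acc, epochs_without_improvement = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for left, right, labels in train_loader:                 # (B, 50, 21, 3) per hand
            opt.zero_grad()
            loss = loss_fn(model(left.to(device), right.to(device)), labels.to(device))
            loss.backward()
            opt.step()

        # Validation accuracy drives early stopping (patience = 50 epochs).
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for left, right, labels in val_loader:
                preds = model(left.to(device), right.to(device)).argmax(dim=-1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.numel()
        acc = correct / total
        if acc > best_acc:
            best_acc, epochs_without_improvement = acc, 0
            torch.save(model.state_dict(), "best_model.pt")      # placeholder checkpoint path
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                            # early stop
    return best_acc
```

With this policy the 10,000-epoch limit acts only as a ceiling; in practice training halts once validation accuracy stops improving, consistent with the convergence behavior described above.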
Datasets
The available datasets in continuous sign language recognition do not contain pairs of continuous sign videos and the corresponding isolated signs in each video. To cope with this challenge, two datasets for isolated sign recognition, RKS-PERSIANSIGN13 and ASLLVD25, are utilized for evaluation. In RKS-PERSIANSIGN, 10 contributors performed 100 Persian signs 10 times each against a simple background, with a maximum distance of 1.5 meters between the contributor and the camera; thus, there are 10,000 RGB isolated sign videos in this dataset. The ASLLVD dataset includes annotated American isolated signs. Due to the different number of samples in each class, 100 signs with at least seven video samples each are selected and used. Through pre-processing, all video samples have equal frame numbers during the training phase, whereas no pre-processing is performed on the test data, which is used with different frame numbers.
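The training-time pre-processing that equalizes frame counts can be sketched as uniform temporal resampling to the 50 frames per sample listed in Table 1; the specific resampling strategy is an assumption, since the text only states that training videos are made equal in length.

```python
import numpy as np


def sample_to_fixed_length(keypoints, target_len=50):
    """Resample a (T, 21, 3) keypoint array to a fixed number of frames.

    Shorter videos repeat frames and longer videos are uniformly sub-sampled, so every
    training sample ends up with `target_len` frames. Test videos skip this step and
    keep their original lengths.
    """
    t = keypoints.shape[0]
    idx = np.linspace(0, t - 1, num=target_len).round().astype(int)  # uniform temporal indices
    return keypoints[idx]
```

Applying this to each training video yields per-hand arrays of shape (50, 21, 3), matching the "Number of frames per video sample" entry in Table 1.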
Ablation analysis
Here, we perform an ablation analysis on the two datasets used for evaluation (Table 2). In the first step, we used the GCN along with an LSTM network. To improve the model performance, different numbers of GCN, LSTM, and FC layers were tried and analyzed. The analysis showed that the model performance improved with more FC layers; however, the improvement stopped at four FC layers. So, we had the highest performance using two, one, and four layers for the GCN, LSTM, and FC parts, respectively. After that, we substituted the LSTM with a Bi-LSTM network. Unlike the standard LSTM, the input flows in a Bi-LSTM in both directions, so it can utilize information from both sides. Results confirmed the efficiency of the Bi-LSTM network. We also analyzed the impact of different numbers of Bi-LSTM layers. As the trend in this table shows, a higher accuracy is obtained using three Bi-LSTM layers. Finally, we substituted the Bi-LSTM network with the Transformer Encoder. Different numbers of layers and heads were analyzed during our experiments. As the results in Table 2 show, the highest recognition accuracy is obtained using the combination of the GCN with a Transformer Encoder with 12 layers and 8 heads. We only report the results of the three graph structures with the GCN, Bi-LSTM, FC, and Transformer combinations. It is worth mentioning that since each isolated sign input consists of a sequence of frames, the proposed hybrid model is designed to effectively capture both spatial and temporal features. Using only a GCN or only an LSTM may not sufficiently or simultaneously extract these complementary features; to address this, we combined both models to leverage their strengths.
Table 2. Ablation analysis of the proposed model using different graph topology and model architecture on two datasets. GT: Graph Topology, LH: LSTM Hidden, TLH: Transformer Layers and Heads.
GT | Model | FC | LH/TLH | Epoch | Accuracy (RKS-PERSIANSIGN) | Accuracy (ASLLVD) |
|---|---|---|---|---|---|---|
(a) | GCN-BiLSTM-Attention | 4 (512-256-128-100) | 400-400 | 10000 | 87.40 | 86.00 |
(a) | GCN-BiLSTM-BiLSTM-BiLSTM-Attention | 4 (512-256-128-100) | 400-400 | 10000 | 87.80 | 87.00 |
(a) | GCN-Transformer | 4 (512-256-128-100) | 12-8 | 10000 | 88.90 | 87.80 |
(b) | GCN-BiLSTM-BiLSTM-Attention | 4 (512-256-128-100) | 400-400 | 10000 | 99.80 | 94.80 |
(b) | GCN-Transformer | 4 (512-256-128-100) | 12-8 | 10000 | 99.80 | 95.20 |
(b) | GCN-BiLSTM-BiLSTM-BiLSTM-Attention | 4 (512-256-128-100) | 400-400 | 10000 | 99.85 | 95.50 |
(c) | GCN-BiLSTM-BiLSTM-Attention | 4 (512-256-128-100) | 400-400 | 10000 | 99.85 | 95.25 |
(c) | GCN-Transformer | 4 (512-256-128-100) | 12-8 | 10000 | 99.90 | 96.00 |
Statistical analysis
We conducted separate one-sample t-tests at a 5% significance level for each dataset to assess whether the mean recognition accuracies from two independent sets of runs–one comprising 50 trials and the other 10 trials–differ significantly. As shown in Table 3, the statistical analysis indicates no significant difference between the two sets, suggesting that the mean recognition accuracies are statistically equivalent.
Table 3. Statistical analysis of the recognition accuracies using t-test method.
 | RKS-PERSIANSIGN | ASLLVD |
|---|---|---|
Mean accuracy (50 runs) | 99.90 | 99.90 |
Mean accuracy (10 runs) | 99.90 | 99.90 |
t-test (5% significance level) | H0 is accepted | H0 is accepted |
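A sketch of how such a per-dataset test can be run is shown below; the accuracy arrays stand in for the 50-trial and 10-trial runs, and the exact hypothesis formulation used in the paper may differ from this illustration.

```python
import numpy as np
from scipy import stats


def compare_runs(acc_50_trials, acc_10_trials, alpha=0.05):
    """One-sample t-test at the 5% level: does the mean of the 50-trial accuracies
    differ from the mean observed over the 10-trial runs?"""
    hypothesized_mean = np.mean(acc_10_trials)
    t_stat, p_value = stats.ttest_1samp(acc_50_trials, popmean=hypothesized_mean)
    return p_value >= alpha   # True -> no significant difference (H0 accepted)
```

A return value of True corresponds to the outcome reported in Table 3, i.e., the two sets of runs are statistically equivalent.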
Comparison with state-of-the-art
Here, we compare the current model proposed in this work with previous approaches16,28 that were designed for isolated sign separation in continuous sign language videos.
Given that we have used two datasets whose most important feature is the inclusion of both sign sentences and the corresponding isolated signs within those sentences, the availability of comparable methods is further limited. To our knowledge, no other publicly available datasets exhibit this unique characteristic, which is central to our proposed framework. Given this limitation, we selected the two most relevant and competitive baseline methods available in the literature that align as closely as possible with our problem setting. These were chosen based on their widespread adoption and relevance to key aspects of our task16,28.
As Tables 4 and 5 show, the model proposed in this work performs better. As the number of false recognitions and the average of the recognized Softmax outputs on the RKS-PERSIANSIGN and ASLLVD datasets show, some false recognitions still occur, stemming from similarities between signs in the datasets. In the proposed model, the averages of the recognized Softmax outputs on RKS-PERSIANSIGN and ASLLVD are 0.99 and 0.68, respectively.
Finally, a detailed comparison of model efficiency–in terms of both runtime and FLOPs–against state-of-the-art approaches is presented in Table 6. These metrics are reported alongside recognition accuracy to provide a more comprehensive evaluation of the proposed method. While our primary objective was to enhance recognition accuracy and boundary detection in continuous sign language sequences, we also recognize the importance of evaluating computational efficiency. All models listed in Table 6 incorporate the SSD model for hand detection, for which we allocate a consistent training time of 72 hours. In addition, the training durations for the models in16,28 and our proposed method are 12 hours, 7 hours, and 8 hours, respectively. Notably, our methodology for boundary detection relies on training the GCN and the Transformer; the proposed model additionally uses the pre-trained SSD and OpenPose models, and the post-processing methodology does not require any training. To mitigate overfitting and improve generalization to unseen data, we adopt a k-fold cross-validation strategy. This approach facilitates fine-tuning and enhances the robustness of the model's performance. Furthermore, the proposed model benefits from the structural simplicity of a non-anatomical graph and the parallel processing capabilities of the Transformer, both of which contribute to efficient inference. Unlike the model in16, our integration of the GCN enables effective feature learning without the need for handcrafted features, reducing pre-processing complexity and improving scalability. Importantly, our model achieves superior recognition accuracy compared to both16 and28, underscoring its effectiveness in data-rich scenarios. Overall, these comparisons highlight that the proposed method not only advances recognition performance but also delivers meaningful improvements in computational efficiency over existing methods.
Table 4. Comparison with the state-of-the-art model.
Model | Number of false recognitions (RKS-PERSIANSIGN) | Number of false recognitions (ASLLVD) |
|---|---|---|
16 | 24 | 12 |
28 | 9 | 8 |
Proposed model | 9 | 7 |
Table 5. The average of recognized Softmax outputs on the RKS and ASL datasets.
Model | Average of recognized Softmax outputs (RKS-PERSIANSIGN) | Average of recognized Softmax outputs (ASLLVD) |
|---|---|---|
16 | 0.9800 | 0.5900 |
28 | 0.9930 | 0.6820 |
Proposed model | 0.9955 | 0.6850 |
Table 6. Analysis on the runtime of the proposed model and comparative models.
Model | Runtime (RKS-PERSIANSIGN) | Runtime (ASLLVD) | FLOPs |
|---|---|---|---|
16 | 84 h | 78 h | 58B |
28 | 79 h | 75.5 h | 100B |
Proposed model | 80 h | 75.8 h | 102B |
Practical limitations
Here, we emphasize the importance of assessing the practical limitations of our method, particularly in real-world scenarios where factors such as lighting conditions, camera orientation, and user gesture variability can significantly affect performance. While our current study primarily focuses on developing a robust model for boundary detection in continuous sign videos using standardized datasets, we acknowledge the need for deeper analysis of these external factors for real-world deployment. Regarding lighting sensitivity, although our skeleton-based hand joint features help mitigate the impact of illumination changes and the datasets used (RKS-PERSIANSIGN and ASLLVD) include moderate lighting variation, we recognize that further controlled testing under extreme conditions is needed. In terms of camera orientation, our model currently assumes a fixed frontal view, and we plan to incorporate viewpoint-invariant representations or synthetic data augmentation to address this limitation. For user gesture variability, although our sliding window approach and Transformer-based temporal modeling help accommodate differences in gesture speed, style, and hand size, we agree that broader evaluation across diverse signers and dialects is essential to strengthen generalization and robustness.
Table 7. Details of the recognition accuracy of the proposed post-processing algorithm on the ASLLVD dataset. CSV: Concatenated Sign Video, ASORC: Avg. of Softmax Output of Recognized Class, GTWC: Ground Truth Word Class, SOGTWC: Softmax output of Ground Truth Word class, RC: Recognized Class, SORC: Softmax output of Recognized class.
CSV | ASORC | GTWC | SOGTWC | RC | SORC |
|---|---|---|---|---|---|
1 | 0.68 | 18 | 0.33 | 8 | 0.35 |
2 | 0.69 | 8 | 0.37 | 18 | 0.38 |
3 | 0.68 | 80 | 0.44 | 50 | 0.45 |
4 | 0.69 | 18 | 0.38 | 8 | 0.39 |
5 | 0.68 | 64 | 0.34 | 51 | 0.35 |
6 | 0.69 | 51 | 0.34 | 64 | 0.37 |
7 | 0.68 | 51 | 0.34 | 64 | 0.35 |
Table 8. Details of the recognition accuracy of the proposed post-processing algorithm on the RKS-PERSIANSIGN dataset. CSV: Concatenated Sign Video, ASORC: Avg of Softmax Output of Recognized Class, GTWC: Ground Truth Word Class, SOGTWC: Softmax output of Ground truth Word class, RC: Recognized Class, SORC: Softmax output of Recognized class.
CSV | ASORC | GTWC | SOGTWC | RC | SORC | CSV | ASORC | GTWC | SOGTWC | RC | SORC | CSV | ASORC | GTWC | SOGTWC | RC | SORC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.98 | 17 | 0.45 | 19 | 0.46 | 34 | 0.97 | 86 | 0.44 | 66 | 0.49 | 68 | 0.98 | 63 | 0.48 | 45 | 0.49 |
2 | 0.98 | - | - | - | - | 35 | 0.99 | - | - | - | - | 69 | 0.99 | - | - | - | - |
3 | 0.98 | - | - | - | - | 36 | 0.99 | - | - | - | - | 70 | 0.99 | - | - | - | - |
4 | 0.99 | - | - | - | - | 37 | 0.99 | - | - | - | - | 71 | 0.99 | - | - | - | - |
5 | 0.99 | - | - | - | - | 38 | 0.99 | - | - | - | - | 72 | 0.99 | - | - | - | - |
6 | 0.99 | - | - | - | - | 39 | 0.99 | - | - | - | - | 73 | 0.99 | - | - | - | - |
7 | 0.98 | - | - | - | - | 40 | 0.98 | - | - | - | - | 74 | 0.99 | - | - | - | - |
8 | 0.99 | - | - | - | - | 41 | 0.99 | - | - | - | - | 75 | 0.99 | - | - | - | - |
9 | 0.99 | - | - | - | - | 42 | 0.99 | - | - | - | - | 76 | 0.99 | - | - | - | - |
10 | 0.98 | 19 | 0.46 | 17 | 0.47 | 43 | 0.99 | - | - | - | - | 77 | 0.97 | 63 | 0.47 | 45 | 0.48 |
11 | 0.99 | - | - | - | - | 44 | 0.98 | 45 | 0.45 | 63 | 0.46 | 78 | 0.99 | - | - | - | - |
12 | 0.99 | - | - | - | - | 45 | 0.99 | - | - | - | - | 79 | 0.99 | - | - | - | - |
13 | 0.99 | - | - | - | - | 46 | 0.99 | - | - | - | - | 80 | 0.99 | - | - | - | - |
14 | 0.99 | - | - | - | - | 47 | 0.99 | - | - | - | - | 81 | 0.99 | - | - | - | - |
15 | 0.99 | - | - | - | - | 48 | 0.99 | - | - | - | - | 82 | 0.99 | - | - | - | - |
16 | 0.99 | - | - | - | - | 49 | 0.99 | - | - | - | - | 83 | 0.99 | - | - | - | - |
17 | 0.99 | - | - | - | - | 50 | 0.99 | - | - | - | - | 84 | 0.99 | - | - | - | - |
18 | 0.99 | - | - | - | - | 51 | 0.99 | - | - | - | - | 85 | 0.99 | - | - | - | - |
19 | 0.99 | - | - | - | - | 52 | 0.99 | - | - | - | - | 86 | 0.99 | - | - | - | - |
20 | 0.99 | - | - | - | - | 53 | 0.99 | - | - | - | - | 87 | 0.99 | - | - | - | - |
21 | 0.99 | - | - | - | - | 54 | 0.99 | - | - | - | - | 88 | 0.99 | - | - | - | - |
22 | 0.99 | - | - | - | - | 55 | 0.99 | - | - | - | - | 89 | 0.99 | - | - | - | - |
23 | 0.99 | - | - | - | - | 56 | 0.99 | - | - | - | - | 90 | 0.99 | - | - | - | - |
24 | 0.99 | - | - | - | - | 57 | 0.99 | - | - | - | - | 91 | 0.99 | - | - | - | - |
25 | 0.97 | 19 | 0.46 | 17 | 0.49 | 58 | 0.99 | - | - | - | - | 92 | 0.99 | - | - | - | - |
26 | 0.99 | - | - | - | - | 59 | 0.98 | 17 | 0.46 | 19 | 0.48 | 93 | 0.97 | 45 | 0.45 | 63 | 0.46 |
27 | 0.99 | - | - | - | - | 60 | 0.99 | - | - | - | - | 94 | 0.99 | - | - | - | - |
28 | 0.99 | - | - | - | - | 61 | 0.99 | - | - | - | - | 95 | 0.99 | - | - | - | - |
29 | 0.99 | - | - | - | - | 62 | 0.99 | - | - | - | - | 96 | 0.99 | - | - | - | - |
30 | 0.99 | - | - | - | - | 63 | 0.99 | - | - | - | - | 97 | 0.99 | - | - | - | - |
31 | 0.99 | - | - | - | - | 64 | 0.99 | - | - | - | - | 98 | 0.99 | - | - | - | - |
32 | 0.99 | - | - | - | - | 65 | 0.99 | - | - | - | - | 99 | 0.99 | - | - | - | - |
66 | 0.99 | - | - | - | - |
Discussion and conclusion
Recently, different models have significantly contributed to skeletal-based sign language recognition through various spatial-temporal modeling strategies16,28, yet they exhibit key limitations that our proposed method addresses. Unlike approaches that rely on fixed transformations, handcrafted features, or CNN/LSTM-based sequential models, our framework introduces a non-anatomical GCN to adaptively encode spatial hand structures and a Transformer to effectively capture long-range temporal dependencies. This design enables superior performance in continuous sign boundary detection–a critical yet overlooked aspect in prior work. Furthermore, by eliminating reliance on handcrafted features and integrating a post-processing strategy for precise segmentation, our model achieves better scalability, efficiency, and generalization across diverse datasets. These advancements position our method as a more robust and accurate solution for real-world continuous sign language recognition tasks. More specifically, we discuss the proposed model from three perspectives as follows:
Non-anatomical graph structure: In this work, we proposed non-anatomical graph structures, aiming to revolutionize the common graph structure used in sign language recognition. After investigating the hand anatomy and the patterns of different hand signs, we observed that there are many alternatives to the anatomical graph structure, which has been extensively used in the literature. While both the non-anatomical and anatomical graph structures consider 21 keypoints on the hand, the difference between them is in the connections between these keypoints. Relying on the experimental results, the highest performance is obtained using the non-anatomical structure with one reference point. This structure considers the palm keypoint as a reference point from which the connections are referenced to the other keypoints.
Model: Considering the recent advances in boundary detection of isolated signs in continuous sign sequences, we aimed to enhance the model performance, replace the handcrafted feature extractor of previous models, and employ the graph structure of the hands. To this end, we proposed a deep learning-based model using the GCN and Transformer models combined with a post-processing mechanism to accurately detect the isolated sign boundaries in continuous sign videos. Relying on the capability of the GCN model for feature enrichment, the self-attention mechanism and parallel computing embedded in the Transformer model, and also the post-processing mechanism, the proposed model improved the accuracy of boundary detection of isolated signs in continuous sign videos. Different ablation analyses have been performed on different parts of the model to show the necessity of each part. Our analysis confirmed the efficiency of the Transformer model vs. the Bi-LSTM and LSTM networks, owing to the self-attention mechanism and parallel computing. As Table 2 shows, the highest accuracy is obtained using the third graph topology combined with the Transformer Encoder.
Performance: To enhance the model performance, we proposed a deep learning-based model benefiting from the capabilities of the GCN and Transformer models. Since there is no dataset containing both continuous sign videos and the corresponding isolated signs, we utilized datasets for isolated sign language recognition without any pre-processing and concatenated them to make the continuous sign videos. Experimental results confirm the efficiency of the proposed model, the graph structure, and the post-processing methodology for separating isolated signs in continuous sign videos, obtaining a higher recognition accuracy than the other models. Considering the GCN capabilities combined with the Transformer model, the proposed model has a more discriminative capability to correctly classify similar signs. Furthermore, instead of using handcrafted features, our model uses the deep features obtained from the GCN model. Similar to the configuration of the post-processing methodology used in previous works16,28, we used a predefined threshold, 0.51, to accept or reject a recognized class in the current sliding window. Results on two datasets, as shown in Tables 7 and 8, confirmed that when the model makes a false recognition, the recognized Softmax outputs for all classes are lower than the predefined threshold. As these tables show, the proposed model obtains averages of recognized Softmax outputs of 0.9955 and 0.6850 on the RKS-PERSIANSIGN and ASLLVD datasets, respectively. We have a higher recognition accuracy on the RKS-PERSIANSIGN dataset than on the ASLLVD dataset due to the higher number of video samples in each class. This follows from the general capability of deep learning-based models to perform better when trained on a large amount of data. Furthermore, some false recognitions occur for similar signs, such as 'Congratulation', 'Excuse', 'Upset', 'Blame', 'Fight', and 'Competition'. For instance, the 'Excuse' and 'Congratulation', 'Upset' and 'Blame', and 'Fight' and 'Competition' signs contain many similar frames. Thus, extending the sign samples in these similar classes could lead to learning more powerful features and a better representation of sign categories. This can also decrease misclassifications due to low inter-class variability. In future work, we aim to employ our collected dataset, including more realistic continuous sign videos and the corresponding isolated sign videos. Relying on this dataset, we can check the performance of the proposed model in a realistic scenario. Furthermore, we aim to apply our non-anatomical graph structure idea to Human Action Recognition.
Acknowledgement
This work has been partially supported by the Spanish project PID2022-136436NB-I00 and by ICREA under the ICREA Academia programme.
Author contributions
R.R. contributed to the methodology, experiments, writing the original draft, editing and reviewing, and visualization. K.K. contributed to the methodology, editing, and reviewing. S.E. contributed to the methodology, editing and reviewing, and funding.
Data availability
Availability of data and material (data transparency): The datasets (RKS-PERSIANSIGN and ASLLVD) analyzed during the current study are available at https://doi.org/10.1016/j.eswa.2020.113336 and https://crystal.uta.edu/~athitsos/projects/asl_lexicon.
Declarations
Competing interests
The authors declare no competing interests.
Consent for publication
All authors confirm their consent for publication.
Sanctions law and regulations
The authors residing in the sanctioned country prepared this article in their personal capacity, not on behalf of a sanctioned government. All authors are faculty members of the university. They are concentrated only on research and do not have any other collaborations with the government.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Rastgoo, R., Kiani, K. & Escalera, S. Diffusion-Based Continuous Sign Language Generation with Cluster-Specific Fine-Tuning and Motion-Adapted Transformer. IEEE Computer Society Conference On Computer Vision And Pattern Recognition Workshops, pp. 4088–4097, 2025.
2. Li, Y., He, Z., Ye, X., He, Z. & Han, K. Spatial temporal graph convolutional networks for skeleton-based dynamic hand gesture recognition. EURASIP Journal On Image And Video Processing78 (2019).
3. Rastgoo, R; Kiani, K; Escalera, S; Athitsos, V; Sabokrou, M. A survey on recent advances in Sign Language Production. Expert Systems With Applications; 2024; 243, 122846.
4. Li, Y; Huang, J; Tian, F; Wang, H; Dai, G. Gesture interaction in virtual reality. Virtual Reality And Intelligent Hardware; 2019; 1, pp. 84-112.
5. Mahmoud, N; Fouad, H; Soliman, A. Smart Healthcare Solutions Using the Internet of Medical Things for Hand Gesture Recognition System. Complex Intell. Syst.; 2021; 7, pp. 1253-1264.
6. Gao, X., Jin, Y., Dou, Q. & Heng, P. Automatic Gesture Recognition in Robot-assisted Surgery with Reinforcement Learning and Tree Search. IEEE International Conference On Robotics And Automation (ICRA), Paris, France, 2020.
7. Nguyen, N; Phan, T; Kim, S; Yang, H; Lee, G. 3D Skeletal Joints-Based Hand Gesture Spotting and Classification. Appl. Sci.; 2021; 11, 4689.
8. Khan, N; Ghani, M. A Survey of Deep Learning Based Models for Human Activity Recognition. Wireless Personal Communications.; 2021; 120, pp. 1593-1635.
9. Yadav, S; Tiwari, K; Pandey, H; Akbar, S. Skeleton-based human activity recognition using ConvLSTM and guided feature learning. Soft Computing; 2022; 26, pp. 877-890.
10. Al Farid, F; Hashim, N; Abdullah, J; Bhuiyan, M; Isa, W; Uddin, J; Haque, M; Husen, M. A Structured and Methodological Review on Vision-Based Hand Gesture Recognition System. J. Imaging; 2022; 8, 153.
11. Rastgoo, R., Kiani, K., Escalera, S. & Sabokrou, M. Sign language production: A review. IEEE Computer Society Conference On Computer Vision And Pattern Recognition Workshops, 2021.
12. Rastgoo, R; Kiani, K; Escalera, S. Video-based isolated hand sign language recognition using a deep cascaded model. Multimedia Tools And Applications; 2020; 79, pp. 22965-22987.
13. Rastgoo, R; Kiani, K; Escalera, S. Multi-modal deep hand sign language recognition in still images using Restricted Boltzmann Machine. Entropy; 2018; 20, pp. 1-15.
14. Rastgoo, R; Kiani, K; Escalera, S. Hand sign language recognition using multi-view hand skeleton. Expert Systems With Applications; 2020; 150, 113336.
15. Rastgoo, R; Kiani, K; Escalera, S. Sign Language Recognition: A Deep Survey. Expert Systems With Applications; 2021; 164, 113794.
16. Rastgoo, R; Kiani, K; Escalera, S. Real-time isolated hand sign language recognition using deep networks and SVD. Journal Of Ambient Intelligence And Humanized Computing; 2022; 13, pp. 591-611.
17. Rastgoo, R., Kiani, K. & Escalera, S. Word separation in continuous sign language using isolated signs and post-processing. Expert Systems With Applications249, Part B, 123695 (2024).
18. Rastgoo, R; Kiani, K; Escalera, S; Sabokrou, M. Multi-Modal Zero-Shot Sign Language Recognition. Expert Systems With Applications; 2024; 247, 123349.
19. Rastgoo, R; Kiani, K; Escalera, S. ZS-GR: zero-shot gesture recognition from RGB-D videos. Multimedia Tools And Applications; 2023; 82, pp. 43781-43796.
20. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C. & Yu, P. A Comprehensive Survey on Graph Neural Networks. arXiv:1901.00596, 2019.
21. Meng, L; Li, R. An Attention-Enhanced Multi-Scale and Dual Sign Language Recognition Network Based on a Graph Convolution Network. Sensors; 2021; 21, 1120.
22. Vázquez-Enríquez, M., Alba-Castro, J., Docío-Fernández, L. & Rodríguez-Banga, E. Isolated Sign Language Recognition with Multi-Scale Spatial-Temporal Graph Convolutional Networks. IEEE Computer Society Conference On Computer Vision And Pattern Recognition Workshops, 2021.
23. Degardin, B; Lopes, V; Proenc, H. REGINA-Reasoning Graph Convolutional Networks in Human Action Recognition. IEEE Transactions On Information Forensics And Security; 2021; 16, pp. 5442-5451.
24. Duhme, M., Memmesheimer, R. & Paulus, D. Fusion-GCN: Multimodal Action Recognition Using Graph Convolutional Networks. Pattern Recognition, Lecture Notes In Computer Science, 265-281 (2021).
25. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y. & Tian, Q. Actional-Structural Graph Convolutional Networks for Skeleton-based Action Recognition. IEEE Computer Society Conference On Computer Vision And Pattern Recognition, Long Beach, CA, USA, 2019.
26. Neidle, C., Thangali, A. & Sclaroff, S. Challenges in Development of the American Sign Language Lexicon Video Dataset (ASLLVD) Corpus. Proc. Of 5th Workshop On The Representation And Processing Of Sign Languages: Interactions Between Corpus And Lexicon, LREC, 1-8 (2012).
27. Samir Elons, A., Abull-ela, M. & Tolba, M. A proposed PCNN features quality optimization technique for pose-invariant 3D Arabic sign language recognition. Applied Soft Computing13, 1646-1660 (2013).
28. Zhang, H; Sun, Y; Liu, Z; Liu, Q; Liu, X; Jiang, M; Schafer, G; Fang, H. Heterogeneous attention-based transformer for sign language translation. Applied Soft Computing; 2023; 144, 110526.
29. Rastgoo, R., Kiani, K. & Escalera, S. A transformer model for boundary detection in continuous sign language. Multimedia Tools And Applications, (2024).
30. Abdullahi, S.B, Chamnongthai, K., Bolon-Canedo V. & Cancela, B. Spatial–temporal feature-based End-to-end Fourier network for 3D sign language recognition. Expert Systems with Applications248, (2024).
31. Alamri, F.S., Abdullahi, S.B., Rehman Khan, A. & Saba, T. Enhanced Weak Spatial Modeling Through CNN-Based Deep Sign Language Skeletal Feature Transformation. IEEE Access12, pp. 77019 - 77040, (2024).
32. Abdullahi, S.B. & Chamnongthai, K. American Sign Language Words Recognition Using Spatio-Temporal Prosodic and Angle Features: A Sequential Learning Approach. IEEE Access10, pp. 15911-15923, (2022).
33. Abdullahi, S.B., Chamnongthai, K., Gabralla, L.A. & Chiroma, H. Fsign-Net: Depth Sensor Aggregated Frame-Based Fourier Network for Sign Word Recognition. IEEE Sensors Journal24, pp. 37630-37645, (2024).
34. Cao, Z; Hidalgo, G; Simon, T; Wei, S; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Transactions On Pattern Analysis And Machine Intelligence; 2021; 43, pp. 172-186.
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License").