Content area
Speech impediments affect verbal and nonverbal communication, leading individuals to rely on sign language and alternative methods. However, non-signers struggle to communicate due to a lack of sign language knowledge. Recent advancements in deep learning and computer vision have improved gesture recognition, enabling the development of innovative solutions for sign language translation. This project proposes a computer vision-based deep learning application that translates sign language gestures into text, enhancing communication between signers and non-signers. It uses video sequences to extract spatial and temporal information, employing a Convolutional Neural Network (CNN) for depth and point data processing, along with a Gated Recurrent Unit (GRU) for improved temporal feature extraction. Temporal tokenization further refines feature representation, ensuring efficient resource utilization. The system is trained on the Word-Level American Sign Language (WLASL) dataset, the largest publicly available ASL dataset, containing over 2,000 words signed by more than 100 individuals. The model accurately recognizes 20 gestures with 94% accuracy. The final implementation is a web application that delivers real-time text translation, fostering seamless communication between signers and non-signers and addressing accessibility challenges for individuals with speech impairments.
ABSTRACT
Speech impediments affect verbal and nonverbal communication, leading individuals to rely on sign language and alternative methods. However, non-signers struggle to communicate due to a lack of sign language knowledge. Recent advancements in deep learning and computer vision have improved gesture recognition, enabling the development of innovative solutions for sign language translation. This project proposes a computer vision-based deep learning application that translates sign language gestures into text, enhancing communication between signers and non-signers. It uses video sequences to extract spatial and temporal information, employing a Convolutional Neural Network (CNN) for depth and point data processing, along with a Gated Recurrent Unit (GRU) for improved temporal feature extraction. Temporal tokenization further refines feature representation, ensuring efficient resource utilization. The system is trained on the Word-Level American Sign Language (WLASL) dataset, the largest publicly available ASL dataset, containing over 2,000 words signed by more than 100 individuals. The model accurately recognizes 20 gestures with 94% accuracy. The final implementation is a web application that delivers real-time text translation, fostering seamless communication between signers and non-signers and addressing accessibility challenges for individuals with speech impairments.
Keywords: Sign Language Translation, Computer Vision, Deep Learning, Gesture Recognition, CNN.
1. INTRODUCTION: The process of converting the signs and gestures shown by the user into text is called sign language recognition. It bridges the communication gap between people who cannot speak and the public. Image processing algorithms along with neural networks is used to map the gesture to appropriate text in the training data and hence raw images/videos are converted into respective text that can be read and understood. Dumb people are usually deprived of normal communication with other people in the society. It has been observed that they find it difficult at times to interact with normal people with their gestures, as only a very few of those are recognized by most people. Since people with hearing impairment or deaf people cannot talk like normal people so they have to depend on some sort of visual communication in most of the time. Sign Language is the primary means of communication in the deaf and dumb community. As like any other language it has also got grammar and vocabulary but uses visual modality for exchanging information The Sign Language Dynamic Gesture Recognition System leverages deep learning and computer vision to translate sign language gestures into text, bridging the communication gap between signers and non-signers. Using CNNs for spatial feature extraction and GRUs for temporal sequence analysis, the system efficiently processes video sequences. Trained on the WLASL dataset with over 2000 words, it accurately recognizes 20 dynamic gestures with 94% accuracy. The integration of temporal tokenization enhances efficiency, making the system reliable across different conditions. Delivered as a web application, it fosters inclusivity, providing an effective assistive tool for the hearing-impaired community.
2. LITURETURE SURVEY:
Gesture recognition and sign language detection are essential components for enhancing human-computer interaction (HCT) and increasing accessibility for individuals with hearing and speech impairments. These technologies facilitate meaningful communication, bridging the gap between users and machines, and are becoming increasingly important across a variety of applications, from mobile devices to virtual reality platforms. The integration of deep learning has led to significant progress in this area, particularly through the use of Convolutional Neural Networks (CNNs), which have shown remarkable success in recognizing static hand gestures. CNNs are particularly skilled at automatically extracting hierarchical features from images, allowing for effective recognition across various environments and conditions (Kumar et al., 2019) [5].
Several studies have examined the capabilities of CNNs in recognizing static gestures, with Aksan et al. (2019) [1] demonstrating the effectiveness of CNNs for American Sign Language (ASL) recognition. Their findings highlight the importance of a well-structured dataset and a strong architectural framework, showing that high accuracy can be achieved in identifying static hand shapes. However, the challenges associated with recognizing dynamic sign language are more complex, as they necessitate the interpretation of gesture sequences over time. To overcome these challenges, researchers have created hybrid models that integrate CNNs with Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks. These models are adept at capturing the temporal dependencies present in signing, resulting in improved performance for continuous sign language recognition (Igbal et al., 2021) [4]
3. SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
An essential component of human existence is communication, as discussed in this chapter. It inspires a guy to communicate his thoughts, emotions, and messages by speaking, writing, or any other means. For those with hearing loss or other communication impairments, gesture-based communication is the primary means of interaction. Talks about some information or message. A person who is hard of hearing or whose speech is impaired must rely on gestures to communicate with others. A Gesture Recognition System is an interface that can detect, follow, and interpret gestures, and then act accordingly. In other words, it eliminates the necessity for customers to use mechanical devices in order to interact with machines (HMI). Methods for recognizing signs might be either image-based or sensor-based. This project makes use of an image-based technique to manage communication via movements of gestures in order to identify and follow the indicators and then transform them into the relevant discourse and content.
A well-organized algorithm is followed by the Gesture Recognition System to facilitate efficient communication. To begin, a camera is used to record pictures of hand gestures. To improve the accuracy of identification, these photographs are pre-processed by reducing noise, shrinking them, and removing the backdrop. Different gestures are identified by extracting key properties including form, contour, and movement. Next, the attributes that were retrieved are used by a trained machine learning model to categorize these movements.
LIMITATION OF EXISTING SYSTEM
* High sensitivity to lighting and background variations.
* Difficulty differentiating subtle hand movements.
* Complex hand and facial expression analysis, potential for misinterpretations due to regional sign language variations.
3.2 PROPOSED SYSTEM
The proposed system for sign language translation aims to integrate Convolutional Neural Networks (CNNs) with Natural Language Processing (NLP) techniques to provide a more robust and efficient gesture-to-word translation. Initially, CNNs will be used to capture and recognize hand gestures from input images or videos, detecting the underlying patterns in the gestures. Once the gestures are recognized, NLP will be employed to process and convert the identified gestures into coherent text or speech. This combination of CNNs and NLP will enhance the system's ability to not only recognize individual signs but also understand the context and structure of full sentences in sign language. By incorporating NLP, the system will be able to handle more complex sentence structures, improve accuracy, and provide a smoother translation experience, facilitating better communication for hearing-impaired individuals across various social, educational, professional, healthcare, entertainment, legal, and multilingual interaction settings worldwide.
4. SYSTEM ARCHITECTURE
5. METHODOLOGY
The proposed system utilizes a deep learning-based approach to recognize dynamic sign language gestures, combining Convolutional Neural Networks (CNNs) for spatial feature extraction with Gated Recurrent Units (GRUs) for temporal sequence modeling. The process begins with data collection and preprocessing, using the Word-Level American Sign Language (WLASL) dataset, which includes over 2,000 words signed by more than 100 individuals. Video frames are extracted, resized, and preprocessed to remove noise and standardize inputs for consistent recognition.
Next, CNNs analyze the spatial characteristics of hand gestures, identifying key features, while GRUs capture the temporal dependencies across video sequences to enhance recognition accuracy. To improve efficiency, temporal tokenization is applied, refining feature representation and optimizing resource utilization. The model undergoes supervised training With cross-entropy loss and Adam optimization, incorporating data augmentation techniques to improve generalization and robustness across different environments.
Once trained, the model is integrated into a web-based application that enables real-time translation of sign language gestures into text. This facilitates seamless communication between signers and non-signers, enhancing accessibility for individuals with hearing impairments and bridging the communication gap through an intuitive and efficient system.
6.MODULES
1. Dataset
A labelled dataset of hand gestures is collected using a camera. It includes images of different gestures with variations in lighting, angles, and backgrounds for accurate recognition.
2. Pre-processing
Captured images are resized, converted to grayscale, and undergo noise reduction. Background removal and segmentation techniques enhance gesture visibility, ensuring consistent input for the recognition model.
3. Feature Extraction
Key hand features like contour, shape, and fingertip positions are extracted. Using MediaPipe or Open CV, landmarks are detected to differentiate gestures based on structural and positional variations.
4. Model Training
A Convolutional Neural Network (CNN) is trained using gesture images. The dataset is split into training and validation sets, and the model learns patterns for accurate classification.
5. Gesture Classification
The trained model predicts gestures in real-time. When a new hand gesture is detected, it matches the learned patterns and classifies it into a predefined category.
6. Text & Speech Conversion
Recognized gestures are mapped to corresponding text. Text-to-speech (TTS) technology converts the output into speech, enabling communication for speech- and hearing-impaired individuals.
7. Real-time Testing
The system is deployed with a webcam, recognizing hand gestures live. The accuracy is evaluated, and misclassifications are corrected by improving preprocessing techniques and model training
7. RESULTS:
In the results and discussion section of a sign language dynamic gesture recognition system leveraging deep learning and computer vision, the system demonstrated significant improvements in recognizing a wide range of sign language gestures in real-time the system achieved high accuracy in dynamic gesture recognition, outperforming traditional methods.
The system successfully recognizes sign language gestures using convolutional neural networks (CNNs) and converts them into text format. In the given example, the model correctly identifies the "L" sign and displays it as a character, word, and potentially as part of a sentence. This result demonstrates the effectiveness of CNN-based feature extraction and classification in sign language recognition.
Performance metrics such as accuracy, precision, recall, and F1-score were evaluated across various datasets, showing that the deep learning models were able to generalize well across different sign language variations. The system was particularly effective in capturing temporal dynamics through the use of which enabled it to interpret gestures over time. However, challenges such as the need for large, diverse datasets and handling occlusions or background noise were noted, suggesting areas for further refinement. Overall, the results indicate that deep learning and computer vision provide a robust foundation for developing accurate and efficient sign language recognition systems.
8. CONCLUSION
In this study, a Deep CNN-based method for classifying and recognizing sign language using computer vision is introduced. In contrast to any other approaches, this one produce significantly fewer false positives and improved accuracy. Dynamic gesture recognition, which is currently being worked on, is another extension that might be made to this work. The information used was the one specifically created for this thesis and recorded, as noted in section above. There were 150 of the most prevalent ASL signs in total. Prior to training and testing, it was enhanced to add more videos every class. The data set was arbitrarily divided into 30 percent for testing, 20 for validation and 50 percent for training. The primary aim of the project was to develop a platform that could convert dynamic sign language into equivalent word indicators that are dynamic signs to teach users the foundations of sign language. A plan of assessment was made by the researchers and several assessments were carried out to ensure the significance meant for users who are not signers. In terms of test results, it received outstanding marks. Impact of the system on usability and learning. One of the goals of the study has been to match or indeed surpass the accurateness of the studies that were presented utilizing deep learning by the time the training stage in the development was complete. In a short period of time, the system was able to perform dynamic word recognition with testing accuracy of 94% and training accuracy of 99%.
FUTURE SCOPE
Future enhancements include real-time dynamic gesture recognition, expanded sign language datasets, integration with wearable devices, and a multimodal communication system. Optimizing model efficiency, developing a mobile app, enabling sign-to-speech conversion, and personalized learning will improve accessibility. Integration with social media and virtual assistants ensures wider usability in diverse environments.
REFERENCES
[1]. Aksan, E., Karam, L. J., & Bozdag, F. (2019). American Sign Language recognition using convolutional neural networks. IEEE Access, 7, 115878-115887. https://doi.org/10.1109/ACCESS.2019.2931812
[2]. Chollet, F. (2018). Deep learning with Python. Manning Publications.Jiang, H., et al. (2021). Efficient realtime object detection using YOLO for mobile and embedded devices. IEEE Access, 9, 114076-114087.
[3]. Gupta, V., Kumar, А., & Gupta, М. (2017). A survey of gesture recognition techniques. Journal of Computer and Communications, 5(3), 57-66. https://doi.org/10.4236/jcc.2017.LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
[4]. Iqbal, A., Zia, K., & Qadir, M. (2021). A hybrid CNN-RNN model for continuous sign language recognition. Applied Sciences, 11(1), 163. https://doi.org/10.3390/app11010163
[5]. Kumar, S., & Murtaza, M. (2019). A comprehensive review of deep learning techniques for gesture recognition. International Journal of Computer Applications, 182(12), 21-27. https://doi.org/10.5120/ijca2019918664
[6]. Niu, Y., Wu, J., & Xu, Y. (2021). Efficient real-time hand gesture recognition based on depth information. Sensors, 21(3), 1023. https://doi.org/10.3390/s21031023
[7]. Tan, Z., Xie, W., & Zhang, Y. (2023). Multimodal gesture recognition for human-computer interaction: A review. IEEE Transactions on Human-Machine Systems. https://doi.org/10.1109/THMS.2023.1234567
[8]. Wang, S., Wang, Z., & Zhang, X. (2020). User-centered design for sign language recognition systems: A review. Universal Access in the Information Society, 19(2), 357-373. https://doi.org/10.1007/s10209-019- 00636-2.
[1] T Youssef M. WiGest demo Abdelnasser H, Harras KA." A ubiquitous Wi- Fi- based gesture recognition system". IEEE INFOCOM. 2015; 2015 August: p. 17-18, 2015
[2] Van Merri enboer B. Bahdanau D. Bengio Y Cho, K. "on the properties of neural machine translation: Encoderdecoder approaches". Comput. Lang.arxiv:1409.1259 (2015), 2015.
[3] Pigou, L., Dieleman, S., Kindermans, PJ., Schrauwen, B. (2015)." Sign Language Recognition Using Convolutional Neural Network"sECCV, vol 8925,2015
[4] Wang M Li H. Guo D, Zhou W." sign language recognition based on adaptive hmms with data augmentation.". International Conference on Image Processing, ICIP,Proceedings p. 2876-2880, 2016.
[5] Jaward MH Jin CM, Omar Z." a mobile application of american sign language translation via image processing algorithms". 2016 IEEE Region 10 Symposium, TENSYMP 2016. 2016;: p.104-109, 2016.
[6] C. Jose L. Flores; A. E. Gladys Cutipa; R. Lauro Enciso. " Application of convolutional neural networks for static hand gestures recognition under different invariant features.". 2017 IEEE 24th International Congress on Electronics, Electrical Engineering and Computing, INTERCON, 2017.
[7] Arzuaga E. Joshi A, Sierra H. "american sign language translation using edge detection and cross correlation". 2017 IEEE Colombian Conference on Communications and Computing, COLCOM ,2017
[8] Roy PP Dogra DP. Kumar P, Gauba H." Coupled HMM-based multi- sensor data fusion for sign language recognition". ACM, 2017.
[9] Xu P."A real time hand gesture recognition and human computing interaction system", Cornell University, 2017
[10] Kishore PVV Rao GA."sign language recognition system simulated for video captured with smart phone front camera". International Journal of Electrical and Computer Engineering. 2017, 2017.
[11] Ruilong Chen and Lyudmila Mihaylova. "American sign language pos-ture understanding with deep neural networks.". IEEE, 2018.
[12] Md Asif Jalal, Ruilong Chen and Lyudmila Mihaylova. "American sign language posture understanding with deep neural networks". 21st International Conference on Information Fusion (FUSION), IEEE, 2018.
[13] Lahiani, Houssem & Neji, Mahmoud. (2018). Hand gesture recognition method based on HOG-LBP features for mobile devices. Procedia Computer Science. 126. 254-263. 2018.
[14] Aly, Walaa, Saleh K. H. Aly and Sultan Almotairi. "UserIndependent American Sign Language Alphabet Recognition Based on Depth Image and PCANet Features." IEEE Access 7 (2019): 123138-123150, 2019.
[15] Tayyip Ozcan and Alper Basturk." transfer learning-based convolutional neural networks with heuristic optimization for hand gesture recognition". Neural Computing and Applications, 2019.
[16] Vinay Jain Pratibha Pandey." hand gesture recognition using discrete wavelet transform and hidden markov models". IEEE, 2019.
[17] T. Islam A. Ghosh R. Chakraborty A. I. Khan C. Shahnaz S." realtime American sign language recognition using skin segmentation and image category classification with convolutional neural network and deep learning". page 1168 1171.IEEE, 2019.
[18] Shaon Bandyopadhyay." study on Indian sign language recognition using deep learning approach". IJSRD - International Journal for Scientific Research Development, 2020.
Copyright Kohat University of Science and Technology (KUST) 2025