https://doi.org/10.31449/inf.v49i14.7596 Informatica 49 (2025) 119–126 119 Research on Sign Language Recognition for Hearing-Impaired People Through the Improved YOLOv5 Algorithm Combining CBAM with Focal CioU Niqin Jing * , Yi Hu, Yanxia Wang 1 Beijing Polytechnic, Beijing 100176, China E-mail: jingnqniq@hotmail.com * Corresponding author Keywords: deep learning, hearing-impaired people, sign language recognition, YOLOv5 Received: November 14, 2024 Sign language recognition has become increasingly important as the number of hearing-impaired people increases. This paper optimized the you only look once version 5 (YOLOv5) algorithm from perspectives of attention mechanism and loss function. The convolutional block attention module (CBAM) was added to the network, and the original intersection over union (IoU) loss function was improved to focal complete IoU (CIoU). Experimental analyses were performed on the American Sign Language (ASL) dataset in the Windows 10 environment. Moreover, the ten-fold cross-validation was used. The experiments found that adding the CBAM to the neck part of YOLOv5 showed the most effective sign language recognition results. The improved algorithm showed improvements of 0.95% in P value, 4.19% in R value, and 2.66% in mean average precision (mAP) compared to the baseline algorithm. When comparing different loss functions, the focal CIoU performed the best. Compared with other recognition algorithms, the improved YOLOv5 algorithm performed better in sign language recognition, achieving P value, R value, and mAP of 93.26%, 96.77%, and 98.12%, respectively. These results verify the reliability of the improved YOLOv5 algorithm in sign language recognition for hearing-impaired people. It can be applied in practice. Povzetek: Članek raziskuje prepoznavanje znakovnega jezika za naglušne osebe z izboljšanim algoritmom YOLOv5, ki združuje CBAM z Focal CioU. Avtorji so optimizirali algoritem YOLOv5 z dodajanjem pozornostnega mehanizma CBAM in izboljšanjem funkcije izgube IoU na Focal CIoU. 1 Introduction Sign language is a main communication tool for hearing- impaired people [1]. The study of sign language has gained more attention as the number of people with hearing impairments continues to increase. Sign language, a type of body language, conveys complex meanings through gestures, which can be understood after specialized learning. However, the general population has limited exposure to sign language, posing significant challenges for hearing-impaired individuals in communicating with the outside world. With the continuous advancement of computer technology, using computers to achieve sign language recognition can provide reliable assistance for communication among the hearing-impaired population [2]. Sign language recognition can be categorized into the recognition of static sign language images and the recognition of dynamic sign language videos, which have been extensively investigated [3]. FAl Rafi et al. [4] studied the identification of Bengali sign language using pre-trained MobileNetV2 and a conditional deep convolutional generative adversarial network, achieving a test accuracy of 94.74%. Takahashi et al. [5] proposed a network that combined a 3D convolutional neural network (CNN) with a Transformer for isolated sign language identification. They demonstrated its effectiveness through experiments on LSA64. Yu et al. [6] explored Chinese sign language identification based on wearable sensors and used a deep belief network to recognize captured electromyography, accelerometer, and gyroscope signals, achieving favorable recognition accuracy. Joshi et al. [7] studied dynamic Gujarati sign language recognition. They extracted features based on the Mediapipe algorithm, established a deep learning model with six layers based on long short- term memory, and found high accuracy through experiments. Wang et al. [8] developed a gesture recognition method based on the Transformer model and trained it on a large corpus. Through experiments, it was found that the average word error rate of this method was 21.6%. Sharma et al. [9] proposed an attention-based real- time embedded long short-term memory (LSTM) for dynamic sign language identification and achieved a real- time recognition rate of 99.7%. Kourbane et al. [10] put forward a new deep learning-based framework to achieve hand pose estimation. Through extensive experiments on two datasets, they found that this method was superior to the existing methods. This paper primarily focused on the recognition of static sign language images. To address challenges such as feature extraction difficulties and poor recognition performance of sign language images and to further enhance the recognition performance of sign language images, an optimized you only look once version 5 (YOLOv5) model was developed based on deep learning. The effectiveness of this model in sign language recognition was verified through experiments, offering a more accurate approach for recognizing static sign language images. Moreover, the method enhanced 120 Informatica 49 (2025) 119–126 N. Jing et al. communication efficiency between hearing-impaired people and the outside world. The results also provide theoretical support for further utilization of deep learning methods. 2 Related works The improved YOLOv5 algorithm developed in this paper was compared with some existing target recognition methods, and the following results were obtained. Table 1: Comparison of related works. P/% R/% mAP@0.5/ % Faster region-CNN [11] 80.12 ± 1.87 67.89 ± 1.65 79.84 ± 1.77 YOLOv3 [12] 87.77 ± 2.01 80.12 ± 1.77 90.31 ± 2.01 YOLOv4 [13] 88.05 ± 1.97 83.25 ± 1.56 92.56 ± 2.33 YOLOv5 88.12 ± 2.07 90.33 ± 1.64 94.21 ± 2.14 MobileNetV 2 [14] 91.12 ± 2.56 81.94 ± 1.82 91.27 ± 2.05 ShuffleNetV 2 [15] 91.08 ± 2.33 82.11 ± 2.01 91.26 ± 2.17 Improved YOLOv5 93.26 ± 2.77 96.77 ± 2.68 98.12 ± 2.32 The results in Table 1 verified the reliability of the improved YOLOv5 algorithm in recognition of static sign language images. Compared with the existing target detection methods, in this paper, based on the traditional model, the improvement of the detection performance was achieved through the introduction of the attention mechanism and the optimization of the loss function, enabling the model to pay more attention to the samples that are difficult to classify. 3 Improved YOLOv5 algorithm 3.1 Sign language and deep learning Hearing impairment is a global health issue [16]. Based on the data published by the China Disabled Persons’ Federation, the number of hearing-impaired people in China reached 20.54 million in 2010, accounting for the most significant proportion of disabilities (24.16%). Among them, children have a relatively high prevalence of Grade 1 and Grade 2 hearing disabilities. Moreover, at least 20,000 newborns are affected by hearing impairment annually, with a prevalence rate of 0.1%-0.3% for congenital hearing impairment in newborns and 0.27% for children under five years old. The hearing-impaired people usually use sign language for communication. However, sign language interpreters are often necessary for effective communication between the general population and people who rely on sign language. Unfortunately, the severe shortage of such interpreters cannot meet the communication needs of these people. As technology develops, artificial intelligence-based sign language recognition has emerged as a prominent solution to address hearing-impaired people’s communication requirements. As a non-verbal communication, sign language does not rely on auditory language but utilizes a unique grammatical structure. It is the visual language for individuals with hearing impairments and plays a crucial role in communication [17]. Sign language recognition can aid hearing-impaired people in communicating with the society. It can be categorized into static and dynamic sign language recognition. The former involves identifying gestures in images and has wide applications in hospitals and banks. The latter refers to a series of movements within a short time. Hand trajectory is combined with position for accurate recognition; therefore, it is more complex than static gestures. In recognizing static sign language images, rich gesture features are extracted from them, and a classifier is used for accurate recognition. There are two main approaches to feature extraction. The first approach involves extracting visual features from sign language images pre-processed by denoising and segmentation [18]. Sign language recognition can be achieved using methods like support vector machines (SVM) or extreme gradient boosting (XGBoost), which learn a limited number of features. The second approach is based on deep learning, which can learn advanced features from images and achieve faster training. It has shown excellent performance in tasks like image identification and target detection [19], making it increasingly popular in sign language recognition [20]. A convolutional neural network (CNN) is a basic deep learning approach [21]. Image features are extracted by convolution. The convolution operation is conducted on the input feature maps to get new feature maps. The formula for convolution operation is: 𝑌 𝑘 𝑚 = 𝑓 (∑ 𝑊 𝑗𝑘 𝑚 ∗ 𝑌 𝑗 𝑚 −1 + 𝑏 𝑘 𝑚 𝑗 ∈𝑇 ), where 𝑇 is the set of feature 𝑦 𝑗 𝑚 −1 in 𝑚 − 1, 𝑊 𝑗𝑘 𝑚 is the weight of the convolution kernel, 𝑏 𝑘 𝑚 is the bias, and ∗ is the convolution operation. The pooling layer reduces dimensionality through feature selection, which reduces the computation amount and avoids overfitting. Generally, there are two operations: maximum pooling and average pooling (Figure 1). 1 6 3 5 2 7 9 3 1 2 2 1 6 3 5 4 4 5 3 3 7 9 6 5 Mean pooling Maximum pooling Figure 1: An example of pooling operations. For the features that are learned by convolution and pooling, the CNN converts them into classification results in the output layer through a fully connected layer. A Research on Sign Language Recognition for Hearing-Impaired… Informatica 49 (2025) 119–126 121 Dropout layer is usually added to the network to avoid overfitting: 𝑦 ̃ (𝑙 ) = 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖 (𝑝 ) × 𝑦 (𝑙 ) , 𝑧 𝑖 (𝑙 +1) = 𝑤 𝑖 (𝑙 +1) 𝑦 ̃ (𝑙 ) + 𝑏 𝑖 (𝑙 +1) , 𝑦 𝑖 (𝑙 +1) = 𝑓 (𝑧 𝑖 (𝑙 +1) ), where 𝑦 (𝑙 ) stands for the output vector of the 𝑙 layer, 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖 (𝑝 ) is the Bernoulli function, 𝑤 𝑖 (𝑙 +1) and 𝑏 𝑖 (𝑙 +1) are the weight and bias of the 𝑙 + 1 layer, and 𝑧 𝑖 (𝑙 +1) is the input vector of the 𝑙 + 1 layer. In CNN, nonlinear factors are introduced through activation functions to enhance the fitting ability of the network. Commonly used activation functions are: (1) sigmoid function: 𝑦 = 1 1+𝑒 −𝑥 , (2) Tanh function: 𝑦 = 𝑒 𝑥 −𝑒 −𝑥 𝑒 𝑥 +𝑒 −𝑥 , (3) rectified linear unit (ReLU) function: 𝑦 = 𝑚𝑎𝑥 {0, 𝑥 } = { 𝑥 , 𝑥 ≥ 0 0, 𝑥 < 0 . 3.2 YOLOv5 algorithm Based on a CNN, the YOLO algorithm has various versions, such as YOLOv2 and YOLOv3. Among these versions, the most widely used is the YOLOv5 algorithm [22], which has a more lightweight structure and provides outstanding advantages in detection speed and accuracy. The YOLOv5 algorithm has five versions, namely n, s, m, l, and x, which differ in width and depth. The YOLOv5s algorithm is the lightest version and is particularly suitable for mobile deployment. Thus, this paper presents a sign language recognition method for hearing-impaired people based on the YOLOv5s algorithm. The YOLOv5 network can be segmented into the following parts. (1) Input end Mosaic data augmentation is employed to expand the dataset and increase the diversity of the data. Moreover, the scaling of the input image is adaptively adjusted to enhance recognition accuracy and efficiency. (2) Backbone network ① Focus module: The input image is sliced to get multiple low-resolution sub-images to reduce the amount of computation. ② Cross stage partial (CSP) network module: Convolution operation is combined with residual components to enhance the feature extraction capability of the model. (3) Neck network ① Spatial pyramid pooling (SPP) structure: The feature maps of different sizes are divided into four blocks, which are subjected to maximum pooling of 1×1, 5×5, 9×9, and 13×13, and then the resulting feature maps are spliced and input to the next layer. ② Feature pyramid network (FPN) and path aggregation network (PAN) structures: They have multiple bottom-up and top-down paths to acquire more information. (4) Head network The feature maps output from the backbone and neck networks are post-processed to obtain the final recognition results. The binary cross entropy loss (BCELoss) is used as the classification loss function: 𝐵𝐶𝐸𝐿𝑜𝑠𝑠 = − 1 𝑛 ∑[𝑦 𝑛 ln 𝑥 𝑛 + (1 − 𝑦 𝑛 )ln(1 − 𝑥 𝑛 )], where 𝑥 𝑛 is the first probability of the 𝑛 -th sample and 𝑦 𝑛 is the binary label value (0 or 1). The complete intersection over union (CIoU) loss is used as the bounding box loss function: 𝐿 𝐶𝐼𝑜𝑈 = 1 − 𝐼𝑜𝑈 + 𝜌 2 (𝑏 ,𝑏 𝑔𝑡 ) 𝑐 2 + 𝛼𝑣 , 𝛼 = 𝑣 (1−𝐼𝑜𝑈 )+𝑣 , 𝑣 = 4 𝜋 2 (arctan 𝑤 𝑔𝑡 ℎ 𝑔𝑡 − arctan 𝑤 ℎ ) 2 , where 𝐼𝑜𝑈 is the IoU between the predictive box and true box, 𝜌 2 (𝑏 , 𝑏 𝑔𝑡 ) is the Euclidean distance between predictive box 𝑏 and true box 𝑏 𝑔𝑡 , 𝑐 is the diagonal length of the minimum outer rectangle of the predictive box and true box, 𝛼 is the weighting function, 𝑣 is the width-to- height ratio similarity, 𝑤 𝑔𝑡 and ℎ 𝑔𝑡 are the width and height of the predictive box, 𝑤 and ℎ are the width and height of the predictive box. The YOLOv5 algorithm also employs non-maximum suppression (NMS) as a post-processing technique to eliminate duplicate recognition results and filter out the best detection box: 𝑠 𝑖 = { 𝑠 𝑖 , 𝑖𝑜𝑢 (𝑀 , 𝑏 𝑖 ) < 𝑁 0, 𝑖𝑜𝑢 (𝑀 , 𝑏 𝑖 ) ≥ 𝑁 , where 𝑠 𝑖 is the confidence level of the 𝑖 -th predictive box, 𝑀 is the current predictive box with the highest confidence level, 𝑏 𝑖 is the 𝑖 -th predictive box, and 𝑁 is the IoU threshold. 3.3 Improved YOLOv5 algorithm This paper optimized the YOLOv5 algorithm in terms of both the attention mechanism and the loss function to further improve its performance in sign language recognition. Adding the attention mechanism can make the model allocate greater focus towards essential parts and thus improve the recognition performance, which has promising applications in machine vision, natural language processing, and other fields [23]. This paper adds the convolutional block attention module (CBAM) [24] to the YOLOv5 algorithm to enhance the network’s generalization ability. The CBAM module has been well applied in image recognition tasks, such as remote sensing images [25] and radar images [26]. The structure of CBAM is presented in Figure 2. Input feature Channel attention module Spatial attention module Refined feature Figure 2: CBAM structure. For feature map 𝐹 ∈ 𝑅 𝐶 ×𝐻 ×𝑊 , 𝐶 is the number of channels, and 𝐻 and 𝑊 are length and width. The formula for channel attention is: 𝑀 𝐶 (𝐹 ) = 𝜎 (𝑊 1 (𝑊 2 (𝐹 𝑎 𝑣𝑔 𝐶 )) + 𝑊 1 (𝑊 2 (𝐹 𝑚𝑎𝑥 𝐶 ))), 122 Informatica 49 (2025) 119–126 N. Jing et al. where 𝐹 𝑎𝑣𝑔 𝐶 and 𝐹 𝑚𝑎𝑥 𝐶 are feature maps after mean pooling and maximum pooling, 𝜎 is the sigmoid activation function, 𝑊 1 and 𝑊 2 are weights. The input of spatial attention is the multiplication result of 𝑀 𝐶 and original feature map 𝐹 . The calculation formula is: 𝑀 𝑆 (𝐹 𝑆 ) = 𝜎 (𝑓 7×7 ([𝐹 𝑎𝑣𝑔 𝑆 ; 𝐹 𝑚𝑎𝑥 𝑆 ])), 𝐹 𝑆 = 𝑀 𝐶 ⨂𝐹 . The computation formula of the output feature map is: 𝑀 𝐹 (𝐹 ) = 𝑚𝑎𝑥 (0, ( 𝑀 𝑆 ⨂𝐹 𝑆 )⨁𝐹 ). In sign language recognition, CIoU loss may not fully take into account the diversity of sign language in shape. In order to better focus on the difficult-to-recognize gestures, this paper introduces focal loss [27] as a loss function. Focal loss can assign higher weights to samples that are difficult to classify. The combination of focal loss with CIoU enables it to pay better attention to difficult samples, reduce missed detections, and improve detection performance. 𝐿 𝐹𝑜𝑐𝑎𝑙𝐶𝐼𝑜𝑈 = (1 − 𝐼𝑜𝑈 + 𝜌 2 (𝑏 ,𝑏 𝑔𝑡 ) 𝑐 2 + 𝛼𝑣 ) 𝛾 , where IoU refers to the intersection over union between the prediction box and the true box, 𝜌 2 (𝑏 , 𝑏 𝑔𝑡 ) is the Euclidean distance between prediction box 𝑏 and true box 𝑏 𝑔𝑡 , 𝑐 is the diagonal length of the minimum enclosing rectangle of the prediction box and the true box, 𝛼 is the weight function, 𝑣 refers to the aspect ratio similarity, and 𝛾 is an adjustment factor to mitigate the effect of sample imbalance on identification, 1.5 here. 4 Results and analysis 4.1 Experimental setup The experiment was conducted in a Windows 10 environment, and the specific configuration is displayed in Table 2. Table 2: Experimental environment. Configuration Operating system Windows 10 Compute unified device architecture 11.0 Programming language Python 3.7 Deep learning framework PyTorch 1.7.0 Central processing unit Intel(R) Xeon(R) Gold 5218 Graphic processing unit Tesla T4 YOLOv5 version YOLOv5 v6.1 Image processing library OpenCV 4.1.2; Pillow 8.2.0 Table 3 presents the parameter settings in the improved YOLOv5 algorithm. Table 3: The training parameters of the improved YOLOv5 algorithm. Numerical value IoU threshold 0.5 Epochs 200 Batch size 16 Optimizer Stochastic gradient descent Initial learning rate 0.01 Weight decay factor 0.0005 The following indicators were used to evaluate the effectiveness of sign language recognition: (1) 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇𝑃 𝑇𝑃 +𝐹𝑃 , (2) 𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑇𝑃 𝑇𝑃 +𝐹𝑁 , (3) 𝑚𝐴𝑃 = ∑ 𝑃 (𝐾 )∆𝑅 (𝐾 ) 𝑁 𝐾 =1 𝐶 . In the above equations, 𝑇𝑃 denotes the quantity of positive samples identified as positive, 𝐹𝑃 is the quantity of negative samples identified as positive, 𝐹𝑁 is the quantity of positive samples identified as negative, 𝑁 is the sample size of the test set, 𝐶 is the number of categories, 𝑃 (𝐾 ) is the 𝑃 value when simultaneously identifying 𝐾 samples, and ∆𝑅 (𝐾 ) is the change of the 𝑅 value when the number of samples to be identified changes from 𝐾 − 1 to 𝐾 . The mean average precision (mAP) when the IoU threshold was 0.5 was used. Static sign language recognition has significant social significance in practice and can provide convenience for hearing-impaired people. Therefore, this paper mainly studied static sign language identification. The static sign language images used were from the American Sign Language (ASL) dataset [28]. This dataset contains 26 English letters and has been widely applied in the current research of static sign language recognition. Moreover, it involved 36 sign languages: space, del, nothing, and the letters A-Z, and included 87,000 images in a size of 200×200. Thirty thousand images were randomly selected for the experiments. Ten-fold cross-validations were used, and the results were expressed as mean ± standard deviation. Moreover, statistical tests and analyses were conducted in the SPSS26.0 software. 4.2 Experimental results In order to determine the optimal location of the CBAM in the YOLOv5 network, the effects of different CBAM locations on sign language recognition were compared. The YOLOv5 algorithm without CBAM was used as a baseline model, and the CBAM was added at the following locations: (1) after the CSP structure of the backbone network, (2) after the SPP structure of the neck network, (3) before the convolutional structure of the head network. It is assumed that if CBAM is added after the SPP structure of the neck network, it can pay more attention to the easily ignored targets. Table 4: Effects of different locations of CBAM on sign language recognition. P/% R/% mAP@0.5 % Base 88.12 ± 2.74 90.33 ± 3.01 94.21 ± 2.81 Backbon e 88.97 ± 2.78 90.59 ± 3.98 94.77 ± 3.68 Research on Sign Language Recognition for Hearing-Impaired… Informatica 49 (2025) 119–126 123 Neck 89.07 ± 3.01 94.52 ± 4.27 96.87 ± 3.56 Head 81.17 ± 2.89 95.12 ± 3.64 95.07 ± 3.62 F value 3.695 3.841 3.261 P value 0.001** 0.002** 0.004** Note: **: p < 0.01 As shown in Table 4, the addition of the CBAM at different locations within the YOLOv5 network had an impact on sign language recognition results. For instance, when the CBAM was added to the head section, the P value was the lowest, only 81.17%, but the R value was improved to 95.12±3.64%, and the final mAP value was 95.07±3.62%. Moreover, when the CBAM was added in the neck section, the P value was the highest, the R value was second only to the head, and the mAP value was also the highest, reaching 96.87 ± 3.56%. It was found through comparison that different locations of CBAM led to significant differences in sign language recognition results (p < 0.01). The performance was the best when the CBAM module was added to the neck part. In order to assess the optimization effectiveness of focal CIoU on the YOLOv5 algorithm, the loss function, including IoU, generalized IoU (GIoU) [29], distance IoU (DIoU) [30], CIoU, and focal CIoU, were respectively used in the original YOLOv5 algorithm. Table 5 shows that the traditional YOLOv5 algorithm (with the IoU loss function) had a low P value, R value, and mAP, suggesting a poor performance in sign language recognition. However, after improving the loss function, the sign language recognition performance of the YOLOv5 algorithm showed an improvement. It was found through comparison that different loss functions resulted in significant differences in sign language recognition results (p < 0.01), and the performance was best when focal CIoU was used. Table 5: Effects of loss function on handwriting recognition. P R mAP@0.5% IoU 88.12 ± 2.74 90.33 ± 3.01 94.21 ± 2.81 GIoU 89.07 ± 2.68 90.56 ± 2.87 94.33 ± 2.79 DIoU 90.12 ± 2.77 91.88 ± 2.93 94.95 ± 2.87 CIoU 90.54 ± 2.76 92.37 ± 2.84 95.12 ± 3.12 Focal CIoU 91.67 ± 2.61 94.87± 3.21 96.64 ± 3.07 F value 3.564 3.528 3.425 P value 0.002** 0.007** 0.009** Note: **: p < 0.01 Ablation experiments were performed on the improved algorithm to analyze the effect of various module improvements on the model’s performance (Table 6). Table 6: Ablation experiments. P/% R/% mAP@0.5/% Base 88.12 ± 2.74 90.33 ± 3.01 94.21 ± 2.81 Base+ CBA M 89.07 ± 2.64 94.52 ± 2.32 96.87 ± 2.56 Base+ CBA M+Fo cal CIoU 93.26 ± 2.77 96.77 ± 2.68 98.12 ± 2.32 F value 3.784 3.452 3.415 P value 0.007** 0.005** 0.006** Note: **: p < 0.01 It was found that adding the CBAM to the YOLOv5 algorithm significantly improved the R value. Introducing focal CIoU based on CBAM further enhanced the model’s recognition performance. It was found through comparison that the differences were significant (p < 0.01). These results validated the effectiveness of the improvement made to the YOLOv5 algorithm. Moreover, the improved YOLOv5 algorithm was compared with other recognition methods (Table 7). The Faster region-CNN algorithm was less effective in sign language recognition. Among the YOLO series algorithms, the YOLOv3 and YOLOv4 algorithms achieved mAP values slightly lower than the improved YOLOv5 algorithm. The results demonstrated the effectiveness of experiments on the improved YOLOv5 algorithm. Comparing the improved YOLOv5 algorithm with MobileNetV2 and ShuffleNetV2, the improved YOLOv5 algorithm achieved a higher mAP value. The statistical tests also suggested significant differences. These findings further validated the effectiveness of the proposed approach for sign language recognition. Table 7: Comparison with other recognition algorithms. P/% R/% mAP@0.5/% Faster region- CNN 80.12 ± 1.87 67.89 ± 1.65 79.84 ± 1.77 YOLOv3 87.77 ± 2.01 80.12 ± 1.77 90.31 ± 2.01 YOLOv4 88.05 ± 1.97 83.25 ± 1.56 92.56 ± 2.33 YOLOv5 88.12 ± 2.07 90.33 ± 1.64 94.21 ± 2.14 MobileNetV2 91.12 ± 2.56 81.94 ± 1.82 91.27 ± 2.05 ShuffleNetV2 91.08 ± 2.33 82.11 ± 2.01 91.26 ± 2.17 Improved YOLOv5 93.26 ± 2.77 96.77 ± 2.68 98.12 ± 2.32 F value 3.427 3.714 3.526 P value 0.008** 0.007** 0.008** Note: **: p < 0.01 124 Informatica 49 (2025) 119–126 N. Jing et al. 5 Discussion This paper developed a YOLOv5 algorithm combining the CBAM attention module and focal CIoU to recognize static sign language images. The performance of the proposed method in static sign language identification was verified through experiments on the ASL dataset. The results showed that adding the CBAM attention module and focal CIoU improved the detection performance of the YOLOv5 algorithm. CBAM can adaptively learn which pixels and channels are more important, which can not only improve the accuracy but also reduce the complexity of the model and alleviate overfitting. It has extensive applications in deep neural networks. The experimental results on the ASL dataset also verified the reliability of embedding the CBAM module into the YOLOv5 structure. Focal CIoU improves the detection performance by better focusing on the targets that may be ignored. Through comparison, it was found that compared with other loss functions, the P, R, and mAP values of focal CIoU were all higher, and the differences were significant (p < 0.01). The results verified the performance of the improved YOLOv5 algorithm in recognizing static sign language images. Therefore, this method can be extended to the recognition of other static images, and it can also be introduced into the recognition of dynamic sign language videos by converting dynamic sign language videos into static sign language images. However, there are also some limitations in this study. For instance, experiments were only conducted on a single dataset, and the recognition of continuous sign language was not achieved. In future work, further verification will be carried out on a broader range of datasets, and the recognition issues of dynamic and continuous sign language will be considered. 6 Conclusion This paper presents an improved YOLOv5 algorithm for sign language identification in hearing-impaired people. The performance of the proposed algorithm was assessed using the ASL dataset. The results demonstrated that adding the CBAM enhanced the algorithm’s recognition performance. Specifically, introducing the CBAM into the neck section yielded the best results. Moreover, focal loss further improved the algorithm’s performance in sign language recognition. These results highlight the practical applicability of the proposed approach in actual sign language recognition, ultimately aiding in communication for people with hearing impairments. References [1] Nandhini MAS, Shiva Roopan D, Shiyaam S, Yogesh S. Sign language recognition using convolutional neural network. Journal of Physics: Conference Series, 1916(1), pp. 1-11. https://doi.org/10.1088/1742-6596/1916/1/012091. [2] Sahoo AK (2021). Indian sign language recognition using machine learning techniques. Macromolecular Symposia, 397(1), pp. 2000241-1-2000241-7. https://doi.org/10.1002/masy.202000241. [3] Xu B, Huang S, Ye Z (2021). Application of tensor train decomposition in S2VT model for sign language recognition. IEEE Access, 9, pp. 35646- 35653, https://doi.org/10.1109/ACCESS.2021.3059660. [4] Al Rafi A, Hassan R, Rabiul Islam M, Nahiduzzaman M (2023). Real-time lightweight bangla sign language recognition model using pre- trained MobileNetV2 and conditional DCGAN. Proceedings of International Conference on Information and Communication Technology for Development, 2023, pp. 263-276. https://doi.org/10.1007/978-981-19-7528-8_21. [5] Takahashi R, Saito H (2022). Sign language recognition with 3D CNN transformer. Proceedings of the Annual Conference of JSAI, , pp. 4C1GS703- 4C1GS703. https://doi.org/10.11517/pjsai.JSAI2022.0_4C1GS7 03. [6] Yu Y, Chen X, Cao S, Zhang X, Chen X (2020). Exploration of chinese sign language recognition using wearable sensors based on deep belief net. IEEE Journal of Biomedical and Health Informatics, 24(5), pp. 1310-1320. https://doi.org/10.1109/JBHI.2019.2941535. [7] Joshi JM, Patel DU (2024). GIDSL: Indian-Gujarati isolated dynamic sign language recognition using deep learning. SN Computer Science, 5, pp. 527. https://doi.org/10.1007/s42979-024-02776-7 [8] Wang QS, Zheng ZW, Wang Q, Deng D, Zhang J (2024). Generalizations of wearable device placements and sentences in sign language recognition with transformer-based model. IEEE Transactions on Mobile Computing, 23(10), pp. 10046-10059. https://doi.org/10.1109/TMC.2024.3373472 [9] Sharma V, Sharma A, Saini S (2024). Real-time attention-based embedded LSTM for dynamic sign language recognition on edge devices. Journal of Real-Time Image Processing, 21(2), pp. 53.1-53.13. [10] Kourbane I, Genc Y (0021). Skeleton-aware multi- scale heatmap regression for 2D hand pose estimation. Informatica, 45(4), pp. 593-604. https://doi.org/10.48550/arXiv.2105.10904. [11] Ren S, He K, Girshick R, Sun J (2017). Faster R- CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), pp. 1137- 1149. https://doi.org/10.1109/TPAMI.2016.2577031. [12] Yeh CC, Chang YL, Alkhaleefah M, Hsu PH, Eng W, Koo VC, Huang B, Chang L (2021). YOLOv3- based matching approach for roof region detection from drone images. Remote Sensing, 13(1), pp. 1-23. https://doi.org/10.3390/rs13010127. [13] Wang L, Zhao Y, Liu S, Li Y, Chen S, Lan Y. (2022). Precision detection of dense plums in orchards using the improved YOLOv4 model. Frontiers in Plant Research on Sign Language Recognition for Hearing-Impaired… Informatica 49 (2025) 119–126 125 Science, 13, pp. 839269. https://doi.org/10.3389/fpls.2022.839269. [14] Sandler M, Howard A, Zhu M, Zhmoginov A. Chen LC (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510- 4520. https://doi.org/10.1109/CVPR.2018.00474. [15] Ma N, Zhang X, Zheng HT, Sun J. (2018). ShuffleNet V2: Practical guidelines for efficient cnn architecture design. European Conference on Computer Vision, pp. 122-138. https://doi.org/10.1007/978-3-030-01264-9_8. [16] Ogawa T, Uchida Y, Nishita Y, Tange C, Sugiura S, Ueda H, Nakada T, Suzuki H, Otsuka R, Ando F, Shimokata H (2019). Hearing-impaired elderly people have smaller social networks: A population- based aging study - ScienceDirect. Archives of Gerontology and Geriatrics, 83, pp. 75-80. https://doi.org/10.1016/j.archger.2019.03.004. [17] Enikeev DG, Mustafina SA (2021). Sign language recognition through Leap Motion controller and input prediction algorithm. Journal of Physics: Conference Series, 1715(1), pp. 1-7. https://doi.org/10.1088/1742-6596/1715/1/012008. [18] Tyagi A, Bansal S (2022). Hybrid FiST_CNN approach for feature extraction for vision-based Indian sign language recognition. The International Arab Journal of Information Technology, 19, pp. 403-411. https://doi.org/10.34028/iajit/19/3/15. [19] Fu L, Yu H, Li X, Przybyla CP, Wang S. Deep learning for object detection in materials-science images: a tutorial. IEEE Signal Processing Magazine, 39(1), pp. 78-88. https://doi.org/10.1109/MSP.2021.3121558. [20] Mopidevi S, Prasad MVD, Kishore PVV (2023). Multiview meta-metric learning for sign language recognition using triplet loss embeddings. Pattern Analysis and Applications: PAA, 26(3), pp. 1125- 1141. https://doi.org/10.1007/s10044-023-01134-2. [21] Das S, Biswas S K, Purkayastha B (2024). Occlusion robust sign language recognition system for indian sign language using CNN and pose features. Multimedia Tools and Applications, 83(36), pp. 84141-84160. https://doi.org/10.1007/s11042-024- 19068-0. [22] Yadav YG, Kiran VS, Karthik V, Thadikamalla GA, Kumaran P (2024). Real time sign language recognition using custom convolutional neural network and YOLOv5. International Conference on Intelligent Computing, Smart Communication and Network Technologies, pp. 157-171. https://doi.org/10.1007/978-3-031-75957-4_14. [23] Nath B, Sarkar S, Das S, Mukhopadhyay S (2022). Neural machine translation for Indian language pair using hybrid attention mechanism. Innovations in Systems and Software Engineering, 20, pp. 175-183. https://doi.org/10.1007/s11334-021-00429-z. [24] Zhu W, Shu Y, Liu S (2022). Power grid field violation recognition algorithm based on enhanced YOLOv5. Journal of Physics: Conference Series, 2209(1), pp. 1-10. https://doi.org/10.1088/1742- 6596/2209/1/012033. [25] Lv S, Liu X, Cao Y (2024). Remote sensing image recognition of dust cover net construction waste: a method combining convolutional block attention module and U-Net. Sensors & Materials, 36(7, Part 3), pp. 3131. https://doi.org/10.18494/SAM5182. [26] Li R, Wang X, Wang J, Song Y, Lei L (2020). SAR target recognition based on efficient fully convolutional attention block CNN. IEEE Geoscience and Remote Sensing Letters, 19, pp. 1-5. https://doi.org/10.1109/LGRS.2020.3037256. [27] Wang S, Chen M, Ratnavelu K, Shibghatullah ASB, Keoy KH (2024). Online classroom student engagement analysis based on facial expression recognition using enhanced YOLOv5 for mitigating cyberbullying. Measurement Science and Technology, 36(1), pp. 015419. https://doi.org/10.1088/1361-6501/ad8a80. [28] Sharma A, Chopra A, Singh M, Pandey A (2022). American sign language gesture analysis using tensorflow and integration in a drive-through. International Conference on Advances in Computing and Data Sciences, pp. 399-414. https://doi.org/10.1007/978-3-031-12638-3_33. [29] Qian X, Zhang N, Wang W (2023). Smooth GIoU loss for oriented object detection in remote sensing images. Remote Sensing, 15, pp. 1259. https://doi.org/10.3390/rs15051259. [30] Yuan D, Shu X, Fan N, Chang X, Liu Q, He Z (2022). Accurate bounding-box regression with distance- IoU loss for visual tracking. Journal of Visual Communication and Image Representation, 83, pp. 1.1-1.10. https://doi.org/10.1016/j.jvcir.2021.103428. 126 Informatica 49 (2025) 119–126 N. Jing et al.