1. Introduction

Lip reading [1] is a human–computer interaction technology that has been widely used in various fields in recent years. Computer vision techniques continuously detect the speaker’s face in images and videos, extract the sequence of lip-shape changes produced during speech, and feed these mouth features into a lip-motion recognition model that infers what the speaker is saying. Lip reading is currently applied in assisted automatic driving, assisted communication for hearing-impaired people, and public safety [2,3]. Research on lip reading dates back to the 1950s, when Sumby, W.H. [1] first proposed that the lip movements made during speech could serve as a source of information, which introduced the concept of lip reading and opened the field. At present, lip reading systems fall into two groups: traditional systems and systems based on deep learning. A traditional system consists of two parts: mouth feature extraction and lip movement recognition. Mouth feature extraction is usually pixel-based: the pixel values of the mouth region are taken as the visual input, and image features are extracted by the discrete cosine transform (DCT) [4] and principal component analysis (PCA) [5]. For lip movement recognition, the Hidden Markov Model (HMM) [6] is the most typical choice. It treats a lip movement sequence as linear over a very short period of time, so that each short segment can be represented by a linear parametric model, and connecting these models in series over time forms a Markov chain. Given an observed lip movement sequence and the known model parameters, the recognition result is the one with the maximum probability.
In recent years, with the continuous development of deep learning, deep models have gradually entered the field of lip reading. Researchers from Oxford University, Google DeepMind, and the Canadian Institute for Advanced Research (CIFAR) jointly designed the end-to-end lip reading model LipNet, which achieved 93.4% accuracy on the GRID corpus [7]. Jake Burton, David Frank, and others found that a CNN + LSTM fusion network was the most powerful recognition network in their comparison [8]. Chen X., Du J. and others proposed DenseNet + resBiLSTM, a deep learning model for Mandarin sentence-level lip reading, which recognizes lips well and improves on LipNet by 13.91% [9].

In recent years, the recognition rate of lip reading models has kept rising while the networks have kept growing. As networks deepen and the amount of data increases, they place heavy demands on the computing power and storage capacity of devices, so lightweight model design has gradually become a research focus. In 2020, the Huawei team proposed GhostNet [10]. Inspired by the similarity among the feature maps generated during convolution, it introduces the Ghost module, which generates the same number of feature maps using fewer parameters. Based on this, this paper adopts GhostNet as the backbone network for spatial feature extraction and improves it by replacing the original Squeeze-and-Excitation module [12] with the Efficient Channel Attention module [11], which improves performance through a local cross-channel interaction strategy without dimensionality reduction and also reduces the number of parameters. Since lip reading requires learning the forward and backward associations within a lip-movement sequence, the improved Efficient-GhostNet is combined with a GRU [13] network to extract temporal features. We also created our own dataset. Asian volunteers were selected for the recordings, and data enhancement [14,15] was performed by recording from three angles, changing the brightness of the data, and adding Gaussian noise to expand the dataset. The rest of the paper is organized as follows: Section 2 presents the network structure of the lip-recognition model, Section 3 presents the data used with our proposed method together with the results and their analysis, and Section 4 gives the conclusion and future research directions.

2. Framework

This section introduces the research framework and the main research steps of this paper; the framework is shown in Figure 1. First, a fixed number of frames is extracted from each lip video. The Dlib model is then used to obtain the face features and locate the lips, and the resulting lip-motion sequence is cropped to a fixed size. Next, the lip-movement sequence is fed into the improved Efficient-GhostNet, which outputs the feature vectors of the sequence. Finally, these feature vectors are used as the input of the GRU, which extracts the temporal features of the lip-movement sequence, and a SoftMax layer [16] produces the final prediction.

2.1. Image Feature Extraction-GhostNet

2.1.1. Ghost Module

With the continuous development of deep learning, more and more network architectures have appeared. As data volumes grow, attention has turned to lightweight models that fit limited memory and computational resources, which has led to model compression methods such as network pruning [17,18,19], quantization [18], and knowledge distillation [20]. In addition to these methods, some efficient network architectures are lightweight by design: MobileNet [21,22], ShuffleNet [23], and DenseNet [24]. MobileNet builds its basic unit from depthwise and pointwise convolutions, which together replace an ordinary convolution layer. ShuffleNet achieves a lightweight model by changing the channel structure. In this paper, we use GhostNet, a lightweight model, to perform feature extraction for lip reading. The first step is to build the Ghost module, whose structure is shown in Figure 2. It uses low-cost linear operations to generate many feature maps while still capturing the information carried by the intrinsic feature maps. Specifically, an ordinary 2D convolution first generates part of the feature maps, a depth-separable convolution is then applied to these maps to obtain the remaining part, and the two parts are stacked to form the output.

Assume that the input data are $X \in R^{c \times h \times w}$, where c is the number of channels of the input data and h and w are its height and width, respectively. The operation of any convolutional layer that generates n feature maps can be expressed as follows:

(1) $Y = X * f + b$

where * indicates the convolution operation, b is the bias term, $Y \in R^{h^{'} \times w^{'} \times n}$ is the output feature map with n channels, and $f \in R^{c \times k \times k \times n}$ is the convolution filter of the layer. $h^{'}$ and $w^{'}$ are the height and width of the output data, respectively, and k × k is the kernel size of the convolution filter. The number of FLOPs required by this convolution is $n \cdot h^{'} \cdot w^{'} \cdot c \cdot k \cdot k$. Since the number of filters n and the number of channels c are large, the FLOPs count is also large.

From Equation (1), it is clear that the number of parameters to be optimized (in f and b) is determined by the dimensions of the input and output feature maps. As shown in Figure 3, the output feature maps of convolutional layers often contain a large amount of redundancy, and some of them are very similar to one another. We believe that it is not necessary to generate all of these redundant feature maps one by one, because doing so consumes a large number of FLOPs and parameters, which is a waste of resources. Suppose instead that the output feature maps are “ghosts” of a set of intrinsic feature maps, obtained from them by inexpensive transformations; the intrinsic feature maps are usually small in size and are generated by ordinary convolution kernels. In particular, the m intrinsic feature maps $Y^{'} \in R^{h^{'} \times w^{'} \times m}$ are generated using a basic convolution operation:

(2) $Y^{'} = X * f^{'}$

where $f^{'} \in R^{c \times k \times k \times m}$ is the filter used. To obtain the desired n feature maps, ordinary convolution and depth-separable convolution operations are applied to the intrinsic feature maps in $Y^{'}$ to generate s ghost features according to the following function:

(3) $y_{i j} = Φ_{i, j} (y_{i}^{'}), \forall i = 1, \dots, m, j = 1, \dots, s,$

where $y_{i}^{'}$ denotes the i-th intrinsic feature map in $Y^{'}$, and $Φ_{i, j}$ is the j-th linear operation, used to generate the j-th ghost feature map $y_{i j}$. That is, $y_{i}^{'}$ can have one or more ghost feature maps ${\{y_{i j}\}}_{j = 1}^{s}$. The last operation, $Φ_{i, s}$, is the identity mapping that preserves the intrinsic feature map, as shown in Figure 2b. With Equation (3), $n = m \cdot s$ feature maps $Y = [y_{11}, y_{12}, \dots, y_{m s}]$ are obtained as the output of the Ghost module in Figure 2. Since the linear operations act on each channel separately and the module combines depth-separable convolution with a smaller ordinary convolution, its computational cost is much lower than that of a layer computed entirely with ordinary convolutions.
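To make the saving concrete, the following sketch compares FLOP counts for an ordinary convolution, using the expression given with Equation (1), and for a Ghost module that produces the same n output maps. The cost model for the cheap operations (a d × d kernel applied to one channel at a time, with the identity mapping costing nothing), the function names, and all layer sizes are assumptions chosen for illustration; they are not values reported in this paper.

```python
def conv_flops(n, h_out, w_out, c, k):
    """FLOPs of an ordinary convolution: n * h' * w' * c * k * k."""
    return n * h_out * w_out * c * k * k


def ghost_flops(n, h_out, w_out, c, k, s, d):
    """Approximate FLOPs of a Ghost module producing n = m * s maps.

    Assumed cost model: one ordinary convolution yields the m intrinsic
    maps, then (s - 1) cheap d x d single-channel operations per intrinsic
    map yield the ghosts; the identity mapping is treated as free.
    """
    if n % s != 0:
        raise ValueError("n must be divisible by s so that n = m * s")
    m = n // s
    primary = conv_flops(m, h_out, w_out, c, k)
    cheap = (s - 1) * m * h_out * w_out * d * d
    return primary + cheap


# Illustrative layer: 32 input channels, 64 output maps of size 56 x 56.
ordinary = conv_flops(n=64, h_out=56, w_out=56, c=32, k=3)
ghost = ghost_flops(n=64, h_out=56, w_out=56, c=32, k=3, s=2, d=3)
print(ordinary, ghost, round(ordinary / ghost, 2))
```

Under these assumptions the ratio comes out slightly below s, which matches the intuition that a Ghost module with s ghosts per intrinsic map needs roughly 1/s of the computation.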

2.1.2. Ghost Bottleneck

The two Ghost bottleneck structures, one for each stride, are shown in Figure 4. The bottleneck with stride 1 is made by stacking two Ghost modules and adding a residual connection, and is intended to deepen the network. The bottleneck with stride 2 also consists of two Ghost modules; the difference is that a depth-separable convolution with stride 2 is inserted between them. The same residual structure as in ResNet [25] is added to the output, so that the output is $H (x) = F (x) + x$, which effectively avoids vanishing gradients and performance degradation. Batch normalization [26] is applied after each layer, and the ReLU function [27,28] is applied after the first Ghost module of the bottleneck. The purpose of the stride-2 structure is to deepen the network while changing the size of the feature layer.

2.1.3. Efficient-GhostNet

The improved GhostNet structure, built from Ghost bottlenecks, is shown in Figure 5. The first layer is a 2D convolution with 3 × 3 kernels that produces a feature layer with 16 channels. It is followed by a stack of Ghost bottlenecks with different strides, which increase the number of channels and change the size of the feature map. Finally, global average pooling [29] and a 2D convolution [30] convert the features into a 1280-dimensional feature vector. Compared with the original GhostNet, the squeeze-and-excitation module is replaced by the more efficient channel attention module, which reduces the number of parameters while obtaining comparable accuracy. The symbol * in the figure indicates multiplication.

2.1.4. Efficient Channel Attention

Channel attention mechanisms are now widely used in convolutional neural networks. However, most current methods pursue better performance through increasingly complex designs, which increases the number of parameters and the amount of computation. In this paper, we introduce the more efficient ECA module. It avoids the side effects of dimensionality reduction and uses a local cross-channel interaction strategy to improve performance while reducing the number of parameters. It also selects the convolutional kernel size adaptively, i.e., the kernel size changes with the number of channels, in order to better exchange information between channels. The structure is shown in Figure 6. The input feature map first undergoes global average pooling, followed by a fast one-dimensional convolution of size k, where k indicates the coverage of local cross-channel interaction, i.e., how many neighboring channels take part in the attention prediction of one channel. Finally, the resulting attention weights are multiplied element-wise with the original feature map to obtain the final output feature map.
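The three steps described above (global average pooling, a one-dimensional convolution across channels, and channel-wise rescaling) can be sketched in plain Python as follows. This is an illustration rather than the implementation used in this paper: the function name, the zero padding at the channel boundaries, and the sigmoid applied to the convolution output are assumptions, and the kernel weights would normally be learned rather than fixed by hand.

```python
import math


def eca_forward(feature_map, kernel):
    """Apply a minimal ECA-style channel attention.

    feature_map: list of C channels, each an H x W nested list of floats.
    kernel: odd-length list of 1D convolution weights (size k).
    Returns (rescaled feature map, per-channel attention weights).
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel size k must be odd")

    # Step 1: global average pooling, one scalar per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]

    # Step 2: 1D convolution across neighboring channels (zero padding is
    # an assumption), followed by a sigmoid (also an assumption).
    pad = k // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    weights = []
    for c in range(len(pooled)):
        z = sum(kernel[j] * padded[c + j] for j in range(k))
        weights.append(1.0 / (1.0 + math.exp(-z)))

    # Step 3: multiply every element of a channel by that channel's weight.
    rescaled = [
        [[w * value for value in row] for row in channel]
        for w, channel in zip(weights, feature_map)
    ]
    return rescaled, weights


# Two 2 x 2 channels; the kernel [0, 1, 0] looks only at the channel itself.
demo_map = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
demo_out, demo_weights = eca_forward(demo_map, [0.0, 1.0, 0.0])
print(demo_weights)
```

Note that the only learnable parameters in this sketch are the k kernel weights, which is why the module adds so few parameters compared with an attention block built from fully connected layers.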

Since the module must find an appropriate range of cross-channel interaction, it needs to determine the coverage of the interaction, that is, the value of k, which is the size of the sliding window. Tuning k manually for each network structure would consume a relatively large amount of computational resources. It is more reasonable to assume that the coverage is positively correlated with the channel dimension, that is, there is a mapping ϕ between the channel dimension C and k:

(4) $C = ϕ (k)$

The simplest mapping is a linear function, $ϕ (k) = γ * k - b$. Since a linear function is limited in the relationships it can characterize, and the number of channels is usually set to a power of 2, the linear function is extended to a nonlinear one as a possible solution:

(5) $C = ϕ (k) = 2^{γ * k - b}$

Then, given the channel dimension C, the convolution kernel size k can be determined as:

(6) $k = Ψ (C) = {\left| \frac{\log_{2} C}{γ} + \frac{b}{γ} \right|}_{odd}$

where ${|t|}_{odd}$ denotes the odd number closest to t, and γ and b are set to 2 and 1, respectively.
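As a worked instance of Equation (6) with γ = 2 and b = 1, the sketch below computes k for a few channel counts. The rounding rule used here (truncate to an integer, then move an even result up to the next odd number) is an assumption about how “the odd number closest to t” is resolved, and the function name is hypothetical.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Kernel size k from Equation (6): |log2(C)/gamma + b/gamma|_odd."""
    if channels < 1:
        raise ValueError("channels must be a positive integer")
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    # Assumed rounding rule: keep odd values, bump even values up by one.
    return t if t % 2 == 1 else t + 1


for c in (16, 64, 256, 1024):
    print(c, eca_kernel_size(c))
```

For example, C = 256 gives log2(C) = 8 and t = 8/2 + 1/2 = 4.5, which this rule maps to k = 5, so wider layers interact over a wider neighborhood of channels.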

2.2. Gated Recurrent Unit (GRU)

Lip reading requires analyzing the whole speaking process, so the lip-movement sequence must be treated as a time-dependent sequence. In deep learning, RNNs are the most commonly used models for temporally correlated problems, but they struggle with long-term dependencies and suffer from vanishing and exploding gradients. In 1997, the LSTM [31] proposed by Hochreiter and Schmidhuber addressed these problems by introducing gate structures that retain or discard features. The Gated Recurrent Unit (GRU) is a variant of the LSTM with a simpler network structure, as shown in Figure 7.

To mitigate vanishing and exploding gradients in RNNs, the GRU uses an update gate and a reset gate in place of the three gates of the LSTM (input gate, output gate, and forget gate), which decide whether information is passed on as output. The update gate controls how much information from the previous state is carried into the current state, and the reset gate controls how much information from the previous state is written into the current candidate state, so that relevant sequential information is retained and irrelevant information is discarded. The formulas for each part of the GRU network are as follows:

(7) $r_{t} = σ (W_{x r} x_{t} + W_{h r} h_{t - 1} + b_{r})$

(8) $z_{t} = σ (W_{x z} x_{t} + W_{h z} h_{t - 1} + b_{z})$

(9) $\tilde{h}_{t} = \tanh (W_{x \tilde{h}} x_{t} + W_{h \tilde{h}} (r_{t} ⊙ h_{t - 1}) + b_{\tilde{h}})$

(10) $h_{t} = z_{t} ⊙ \tilde{h}_{t} + (1 - z_{t}) ⊙ h_{t - 1}$

(11) $y_{t} = σ (W_{0} \cdot h_{t})$

where $W_{x z}$ and $W_{h z}$ are the weights of the update gate and $b_{z}$ is its bias vector; the update gate is passed through the sigmoid function σ to obtain the gating value. $W_{x r}$, $W_{h r}$, and $b_{r}$ play the same roles for the reset gate. $W_{x \tilde{h}}$ and $W_{h \tilde{h}}$ denote the weights of the candidate state and $b_{\tilde{h}}$ is its bias vector; the magnitude of $r_{t}$ determines how much of the past information is retained by the model in the candidate state. $z_{t}$ is the update gate signal mentioned above, which controls how the candidate state and the previous state are combined into the output $h_{t}$.
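Equations (7)–(10) describe one time step of the recurrence, which can be sketched in plain Python as follows. The helper names, the dictionary keys (with `W_xh`, `W_hh`, and `b_h` standing for the candidate-state parameters), and the all-zero parameters in the example are assumptions chosen so that the result is easy to verify by hand; the output layer of Equation (11) is left out.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def vadd(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-v)) for v in vector]


def gru_step(x_t, h_prev, p):
    """One GRU step following Equations (7)-(10)."""
    # (7) reset gate and (8) update gate.
    r_t = sigmoid(vadd(matvec(p["W_xr"], x_t), matvec(p["W_hr"], h_prev), p["b_r"]))
    z_t = sigmoid(vadd(matvec(p["W_xz"], x_t), matvec(p["W_hz"], h_prev), p["b_z"]))
    # (9) candidate state, computed from the reset-gated previous state.
    gated = [r * h for r, h in zip(r_t, h_prev)]
    pre = vadd(matvec(p["W_xh"], x_t), matvec(p["W_hh"], gated), p["b_h"])
    h_tilde = [math.tanh(v) for v in pre]
    # (10) blend candidate and previous state with the update gate.
    return [z * c + (1.0 - z) * h for z, c, h in zip(z_t, h_tilde, h_prev)]


def zero_params(input_size, hidden_size):
    """Hypothetical helper: all-zero parameters of the right shapes."""
    def zeros(rows, cols):
        return [[0.0] * cols for _ in range(rows)]
    params = {}
    for gate in ("r", "z", "h"):
        params["W_x" + gate] = zeros(hidden_size, input_size)
        params["W_h" + gate] = zeros(hidden_size, hidden_size)
        params["b_" + gate] = [0.0] * hidden_size
    return params


# With zero parameters both gates equal 0.5 and the candidate state is 0,
# so the new state is half of the previous state.
h_next = gru_step([1.0, 2.0, 3.0], [1.0, -2.0], zero_params(3, 2))
print(h_next)
```

In the lip-reading pipeline, `x_t` would be the feature vector that Efficient-GhostNet produces for frame t, and the step would be applied to the frames of a sequence in order.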

3. Results and Discussion

3.1. Dataset

At present, most lip reading datasets [32,33,34,35] are recorded by European and American speakers, and few datasets exist for Asian speakers. Therefore, in this paper, we produced an Asian dataset for training that is more consistent with the lip reading characteristics of Asian speakers, so that the model recognizes Asian speakers better.

Meanwhile, since lip reading systems are ultimately intended for applications such as recognition in public places and lip reading for disabled people, this paper enhances the lip dataset by recording it from different angles: frontal, 15 degrees to the left, and 15 degrees to the right. This serves two purposes. First, it improves the robustness of the network and reduces the influence of other factors. Second, it better reflects real life, where videos are unlikely to be fully frontal.

Based on this, the dataset in this paper is our self-made lip-language video database. It was recorded by 10 speakers (five male and five female), covers the ten digits 0–9, and each digit is repeated 20 times. During recording, the camera and the speaker's head were kept still relative to each other, three cameras recorded simultaneously from different angles, and each video was edited independently. This yields frontal, left 15-degree, and right 15-degree videos, totaling 6000 samples. Adding Gaussian noise to these samples and changing their brightness expands the dataset to 12,000 samples. The videos have a resolution of 1920 × 1080 and a frame rate of 25 frames per second, and each lasts about 1–2 s. After obtaining the dataset, we located and cropped the mouth region in every video, extracted ten frames from each video, and resized the frames to 224 × 224 pixels. We used 80% of the dataset as the training set and 20% as the test set; part of the dataset is shown in Figure 8.
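The dataset bookkeeping above can be checked with a few lines of arithmetic, and the ten-frame extraction can be illustrated with a simple index sampler. The evenly spaced sampling strategy and the function name are assumptions: the text states only that ten frames are taken from each video, not how they are chosen.

```python
def sample_frame_indices(total_frames, num_samples=10):
    """Pick num_samples evenly spaced frame indices (assumed strategy)."""
    if total_frames < num_samples:
        raise ValueError("video is shorter than the requested frame count")
    step = total_frames / num_samples
    # Take the frame nearest the middle of each of the equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


# Dataset size as described in the text.
speakers, digits, repetitions, angles = 10, 10, 20, 3
recorded = speakers * digits * repetitions * angles
augmented = 2 * recorded  # after adding the noise/brightness variants
train_size = augmented * 80 // 100
test_size = augmented - train_size

print(recorded, augmented, train_size, test_size)
# A 1 s clip at 25 frames per second has 25 frames.
print(sample_frame_indices(25))
```

With these numbers the 80/20 split corresponds to 9600 training samples and 2400 test samples, and every clip, whether 1 s or 2 s long, is reduced to the same ten-frame input length.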

3.2. Results

To validate the effectiveness of Efficient-GhostNet, the lightweight network used in this paper, we compared its computational cost and Top-1 accuracy with those of other lightweight networks. We compared MobileNetV2, GhostNet, and our Efficient-GhostNet at three complexity levels, with limits of roughly 40, 150, and 300 MFLOPs. As can be seen from Table 1, Efficient-GhostNet improves on the original GhostNet at all three levels, and the improvement is more pronounced at the larger FLOPs levels.

The self-made dataset was fed into the Efficient-GhostNet network to obtain spatial features, and the resulting feature vectors were fed into the GRU network to obtain temporal features. The loss and accuracy of each training epoch were recorded to evaluate the convergence of the model during training and its feasibility; the results are shown in Figure 9.

In Figure 9, when training reaches epoch 70, convergence slows and the loss and accuracy both stabilize, indicating that the model has converged. The accuracy of the proposed network is 88.8%.

To evaluate the performance of the lip-reading system, we compared it experimentally with some current mainstream models; the results are shown in Table 2. The proposed model performs the lip-reading prediction task well: it reduces the number of network parameters and the prediction time while keeping accuracy comparable, so the proposed model is effective.

We tested the accuracy of the models before and after the improvement on each of the ten words; the results are shown in Figure 10. The improved model has better recognition accuracy. In addition, the model recognizes “two” and “four” better, owing to their relatively simple pronunciation and more obvious lip movements. The recognition of “zero” is worse, owing to its relatively complex pronunciation, the large variation in mouth shape, individual differences, and its reliance on tongue movement.

To test the recognition ability of our proposed model on phrases, we ran experiments on the existing GRID corpus and compared the results with some current models; the results are shown in Table 3.

These experiments show that our proposed model recognizes phrases accurately and performs better than some of the previously proposed models.

4. Conclusions

In this paper, we propose an optimized lightweight network based on an improved GhostNet, in which the original SE module is replaced with the ECA module to achieve better performance and reduce the number of parameters through a local cross-channel interaction strategy without dimensionality reduction. Since lip reading also needs to consider the temporal features of the frames, a GRU network with fewer parameters is used to extract temporal features. The lip reading dataset in this paper uses 10 speakers (five male and five female) and contains the ten digits from 0 to 9. The experiments show that the proposed model reduces the prediction time by 24.1% and the number of model parameters by 25.4% while keeping accuracy comparable, which validates its effectiveness. To demonstrate the recognition ability of the proposed model on phrases, we conducted a set of comparison experiments, and the results show that the model also recognizes phrases well. In future research, we will expand our dataset to improve the generality of the lip reading model. Since our recognition study processes lip movements in chronological order rather than targeting specific types of lip movements at specific times, its real-time performance is not good; future research will address how to improve real-time performance so that the model can be deployed on mobile devices.

Author Contributions

Conceptualization, G.Z. and Y.L.; methodology, G.Z.; software, G.Z.; validation, G.Z.; formal analysis, G.Z.; investigation, G.Z.; resources, Y.L.; data curation, G.Z.; writing—original draft preparation, G.Z.; writing—review and editing, Y.L. and G.Z.; visualization, G.Z.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

Not applicable.

Acknowledgments

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Figures and Tables

Figure 1. The structure of the lip-reading model.

Figure 2. Ghost module.

Figure 3. Visualization of partial feature maps generated by the ResNet network.

Figure 4. Ghost Bottleneck.

Figure 5. Improved feature extraction structure.

Figure 6. ECA structure diagram.

Figure 7. GRU structure diagram.

Figure 8. Lip motion map of the dataset.

Figure 9. (a) Loss at each epoch in our network; (b) accuracy at each epoch in our network.

Figure 10. Comparisons of recognition accuracy between two networks.

Table 1

Performance comparison of lightweight models under homemade datasets.

Models	FLOPs (M)	Top-1 Acc. (%)
MobileNetV2 0.3×	41	54.7
GhostNet 0.5×	42	55.6
Efficient-GhostNet 0.5×	40	56.2
MobileNetV2 0.6×	141	65.6
GhostNet 1.0×	141	66.4
Efficient-GhostNet 1.0×	138	67.8
MobileNetV2 1.0×	299	72.4
GhostNet 1.3×	266	74.7
Efficient-GhostNet 1.3×	251	76.3

Table 2

Performance comparison of each model with the homemade dataset.

Models	Params./M	Time/s	Acc./%
VGG16 + LSTM	139	16.3	91.4
Resnet50 + LSTM	25	5.7	87.8
MobileNet + GRU	5.4	3.1	88.6
GhostNet + GRU	5.1	2.9	88.7
Efficient-GhostNet + GRU	3.8	2.2	88.8

Table 3

Performance comparison of each model with the GRID dataset.

MODEL	Recognition Task	Rec. Rate/%
HOG + SVM [36]	Phrases	71.2
CNN [37]	Phrases	64.8
Feed-forward + LSTM [38]	Phrases	84.7
3D-CNN + highway + Bi-GRU + attention [39]	Phrases	97.1
STCNN [40]	Phrases	95.5
MobileNet + GRU	Phrases	92.8
Efficient-GhostNet + GRU	Phrases	94.1

References

1. Sumby, W.H. Visual Contribution to Speech Intelligibility in Noise. J. Acoust. Soc. Am.; 1954; 26, pp. 212-215. [DOI: https://dx.doi.org/10.1121/1.1907309]

2. Kastaniotis, D.; Tsourounis, D.; Koureleas, A.; Peev, B.; Theoharatos, C.; Fotopoulos, S. Lip Reading in Greek words at unconstrained driving scenario. Proceedings of the 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA); Patras, Greece, 15–17 July 2019; pp. 1-6. [DOI: https://dx.doi.org/10.1109/IISA.2019.8900757]

3. Abrar, M.A.; Islam, A.N.M.N.; Hassan, M.M.; Islam, M.T.; Shahnaz, C.; Fattah, S.A. Deep Lip Reading-A Deep Learning Based Lip-Reading Software for the Hearing Impaired. Proceedings of the 2019 IEEE R10 Humanitarian Technology Conference (R10-HTC) (47129); Depok, West Java, Indonesia, 12–14 November 2019; pp. 40-44. [DOI: https://dx.doi.org/10.1109/R10-HTC47129.2019.9042439]

4. Scanlon, P.; Reilly, R. Feature analysis for automatic speechreading. Proceedings of the 2001 IEEE Fourth Workshop on Multimedia Signal Processing (Cat. No.01TH8564); Cannes, France, 3–5 October 2001; pp. 625-630. [DOI: https://dx.doi.org/10.1109/MMSP.2001.962802]

5. Aleksic, P.S.; Katsaggelos, A.K. Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition. Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing; Montreal, QC, Canada, 17–21 May 2004; V-917. [DOI: https://dx.doi.org/10.1109/ICASSP.2004.1327261]

6. Minotto, V.P.; Lopes, C.B.O.; Scharcanski, J.; Jung, C.R.; Lee, B. Audiovisual Voice Activity Detection Based on Microphone Arrays and Color Information. IEEE J. Sel. Top. Signal Process.; 2013; 7, pp. 147-156. [DOI: https://dx.doi.org/10.1109/JSTSP.2012.2237379]

7. Assael, Y.M.; Shillingford, B.; Whiteson, S. LipNet: End-to-end sentence-level lipreading. Proceedings of the International Conference on Learning Representations (ICLR); Toulon, France, 24–26 April 2016; [DOI: https://dx.doi.org/10.48550/arXiv.1611.01599]

8. Burton, J.; Frank, D.; Saleh, M.; Navab, N.; Bear, H.L. The speaker-independent lipreading play-off; a survey of lipreading machines. Proceedings of the 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS); Sophia Antipolis, France, 12–14 December 2018; pp. 125-130. [DOI: https://dx.doi.org/10.1109/IPAS.2018.8708874]

9. Chen, X.; Du, J.; Zhang, H. Lipreading with DenseNet and resBi-LSTM. Signal Image Video Process.; 2020; 14, pp. 981-989. [DOI: https://dx.doi.org/10.1007/s11760-019-01630-1]

10. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C. GhostNet: More Features from Cheap Operations. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Seattle, WA, USA, 13–19 June 2020; pp. 1577-1586. [DOI: https://dx.doi.org/10.1109/CVPR42600.2020.00165]

11. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Seattle, WA, USA, 13–19 June 2020; pp. 11531-11539. [DOI: https://dx.doi.org/10.1109/CVPR42600.2020.01155]

12. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2020; 42, pp. 2011-2023. [DOI: https://dx.doi.org/10.1109/TPAMI.2019.2913372] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31034408]

13. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv; 2014; [DOI: https://dx.doi.org/10.48550/arXiv.1412.3555] arXiv: 1412.3555

14. Gabbay, A.; Shamir, A.; Peleg, S. Visual speech enhancement. arXiv; 2017; arXiv: 1711.08789

15. Hou, J.C.; Wang, S.S.; Lai, Y.H.; Tsao, Y.; Chang, H.W.; Wang, H.M. Audio-visual speech enhancement based on multimodal deep convolutional neural network. arXiv; 2017; arXiv: 1703.10893

16. Zhu, D.; Lu, S.; Wang, M.; Lin, J.; Wang, Z. Efficient Precision-Adjustable Architecture for Softmax Function in Deep Learning. IEEE Trans. Circuits Syst. II Express Briefs; 2020; 67, pp. 3382-3386. [DOI: https://dx.doi.org/10.1109/TCSII.2020.3002564]

17. Wang, Z.; Li, C.; Wang, X. Convolutional Neural Network Pruning with Structural Redundancy Reduction. Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Nashville, TN, USA, 20–25 June 2021; pp. 14908-14917. [DOI: https://dx.doi.org/10.1109/CVPR46437.2021.01467]

18. Liang, T.; Glossner, J.; Wang, L.; Shi, S.; Zhang, X. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing; 2021; 461, pp. 370-403. [DOI: https://dx.doi.org/10.1016/j.neucom.2021.07.045]

19. Yan, Y.; Liu, B.; Lin, W.; Chen, Y.; Li, K.; Ou, J.; Fan, C. MCCP: Multi-Collaboration Channel Pruning for Model Compression. Neural Process. Lett.; 2022; [DOI: https://dx.doi.org/10.1007/s11063-022-10984-6]

20. Xu, C.; Gao, W.; Li, T.; Bai, N.; Li, G.; Zhang, Y. Teacher-student collaborative knowledge distillation for image classification. Appl. Intel.; 2023; 53, pp. 1997-2009. [DOI: https://dx.doi.org/10.1007/s10489-022-03486-4]

21. Andrew, G.H.; Zhu, M.; Chen, B.; Kalenichenko, D. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv; 2017; arXiv: 1704.0486

22. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510-4520. [DOI: https://dx.doi.org/10.1109/CVPR.2018.00474]

23. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848-6856. [DOI: https://dx.doi.org/10.1109/CVPR.2018.00716]

24. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Honolulu, HI, USA, 21–26 July 2017; pp. 2261-2269. [DOI: https://dx.doi.org/10.1109/CVPR.2017.243]

25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Las Vegas, NV, USA, 27–30 June 2016; pp. 770-778. [DOI: https://dx.doi.org/10.1109/CVPR.2016.90]

26. Yang, Z.; Wang, L.; Luo, L.; Li, S.; Guo, S.; Wang, S. Bactran: A Hardware Batch Normalization Implementation for CNN Training Engine. IEEE Embed. Syst. Lett.; 2021; 13, pp. 29-32. [DOI: https://dx.doi.org/10.1109/LES.2020.2975055]

27. Liu, B.; Liang, Y. Optimal function approximation with ReLU neural networks. Neurocomputing; 2021; 435, pp. 216-227. [DOI: https://dx.doi.org/10.1016/j.neucom.2021.01.007]

28. Boob, D.; Dey, S.S.; Lan, G. Complexity of training ReLU neural network. Discret. Optim.; 2022; 44, 100620. [DOI: https://dx.doi.org/10.1016/j.disopt.2020.100620]

29. Miled, M.; Messaoud, M.A.B.; Bouzid, A. Lip reading of words with lip segmentation and deep learning. Multimed. Tools Appl.; 2023; 82, pp. 551-571. [DOI: https://dx.doi.org/10.1007/s11042-022-13321-0]

30. El-Bialy, R.; Chen, D.; Fenghour, S.; Hussein, W.; Xiao, P.; Karam, O.H.; Li, B. Developing phoneme-based lip-reading sentences system for silent speech recognition. CAAI Trans. Intell. Technol.; 2022; pp. 1-10. [DOI: https://dx.doi.org/10.1049/cit2.12131]

31. Wand, M.; Schmidhuber, J. Improving Speaker-Independent Lipreading with Domain-Adversarial Training. Proceedings of the INTERSPEECH 2017; Stockholm, Sweden, 20–24 August 2017; pp. 3662-3666. [DOI: https://dx.doi.org/10.21437/Interspeech.2017-421]

32. Zhao, G.Y.; Barnard, M.; Pietikäinen, M. Lipreading with local spatiotemporal descriptors. IEEE Trans. Multimed.; 2009; 11, pp. 1254-1265. [DOI: https://dx.doi.org/10.1109/TMM.2009.2030637]

33. Chung, J.S.; Zisserman, A. Lip Reading in the Wild; Springer: Cham, Switzerland, 2016; [DOI: https://dx.doi.org/10.1007/978-3-319-54184-6_6]

34. Zeng, D.; Yu, Y.; Oyama, K. Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-Modal Retrieval. ACM Trans. Multimed. Comput. Commun. Appl.; 2020; 16, pp. 1-23. [DOI: https://dx.doi.org/10.1145/3387164]

35. Sato, T.; Sugano, Y.; Sato, Y. Self-Supervised Learning for Audio-Visual Relationships of Videos with Stereo Sounds. IEEE Access; 2022; 10, pp. 94273-94284. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3204305]

36. Wand, M.; Koutník, J.; Schmidhuber, J. Lipreading with long short-term memory. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Shanghai, China, 20–25 March 2016; pp. 6115-6119.

37. Jha, A.; Namboodiri, V.P.; Jawahar, C.V. Word spotting in silent lip videos. Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV); Lake Tahoe, NV, USA, 12–15 March 2018.

38. Wand, M.; Schmidhuber, J.; Vu, N.T. Investigations on End-to-End Audiovisual Fusion. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Calgary, AB, Canada, 15–20 April 2018; pp. 3041-3045.

39. Xu, K.; Li, D.; Cassimatis, N.; Wang, X. LCANet: End-to-end lipreading with cascaded attention-CTC. Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018); Xi’an, China, 15–19 May 2018; pp. 548-555.

40. Liu, J.; Ren, Y.; Zhao, Z.; Zhang, C.; Yuan, J. FastLR: Nonautoregressive lipreading model with integrate-and-fire. Proceedings of the 28th ACM International Conference on Multimedia; Seattle, WA, USA, 12–16 October 2020.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Lip reading technology analyzes the visual information of a speaker’s mouth movements to recognize the content of the speech. As an important aspect of human–computer interaction, lip reading has gradually become popular with the development of deep learning in recent years. At present, most lip reading networks are very complex, with large numbers of parameters and a high computational cost, and the trained models occupy a large amount of memory, which makes them difficult to deploy on devices with limited storage capacity and computing power, such as mobile terminals. To address these problems, this paper improves the lightweight network GhostNet and proposes a more efficient variant, Efficient-GhostNet, which improves performance while reducing the number of parameters through a local cross-channel interaction strategy without dimensionality reduction. The improved Efficient-GhostNet extracts the spatial features of the lips; the extracted features are then fed into a GRU network to obtain the temporal features of the lip sequences, which are finally used for prediction. The dataset in this paper was recorded with Asian volunteers and augmented by angle transformation, with the recordings deflected by 15 degrees to the left and to the right. This augmentation is intended to enhance the robustness of the network, reduce the influence of other factors, and improve the generalization ability of the model so that it better matches real-life recognition scenarios. Experiments show that the improved Efficient-GhostNet + GRU model reduces the number of parameters while achieving comparable accuracy.
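The abstract does not spell out the local cross-channel interaction strategy. As a rough illustration only, the sketch below assumes it resembles an efficient-channel-attention style module: each channel is average-pooled, a small 1D convolution mixes each channel with its neighbors (so no dimensionality reduction is needed), and a sigmoid turns the result into per-channel weights. The function names, the fixed uniform kernel, and the kernel size of 3 are hypothetical choices, not values taken from the paper.

```python
import math


def eca_channel_weights(channel_means, kernel_size=3):
    """Return one attention weight per channel.

    Each weight depends only on the channel and its nearest neighbors,
    mixed by a 1D convolution over the channel axis (zero padded).
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    pad = kernel_size // 2
    # Hypothetical fixed kernel (uniform average). A trained module would
    # learn these kernel_size values instead.
    kernel = [1.0 / kernel_size] * kernel_size
    padded = [0.0] * pad + list(channel_means) + [0.0] * pad
    weights = []
    for c in range(len(channel_means)):
        mixed = sum(kernel[j] * padded[c + j] for j in range(kernel_size))
        weights.append(1.0 / (1.0 + math.exp(-mixed)))  # sigmoid
    return weights


def eca_attention(feature_map, kernel_size=3):
    """Rescale a feature map given as a list of channels (each H x W)."""
    # Global average pooling: one scalar per channel.
    means = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    weights = eca_channel_weights(means, kernel_size)
    # Multiply every value in a channel by that channel's weight.
    return [
        [[value * w for value in row] for row in channel]
        for channel, w in zip(feature_map, weights)
    ]
```

Under these assumptions the attention step has only `kernel_size` weights regardless of the number of channels, which is consistent with the abstract's aim of reducing parameters; how the paper integrates the module into GhostNet is not shown here.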

Details

Title

Research on a Lip Reading Algorithm Based on Efficient-GhostNet

Author

Zhang, Gaoyan; Lu, Yuanyao

First page

1151

Publication year

2023

Publication date

2023

Publisher

MDPI AG

e-ISSN

20799292

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/electronics12051151

ProQuest document ID

2785187073
