This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

With the rapid development of education and teaching, English listening training can no longer meet the diverse, comprehensive, and complex needs of the students. In such an environment, teachers must study the core literacy of the English curriculum in depth, grasp the learning situation and development trend of each stage, and plan and design classroom teaching according to students’ cognitive, thinking, and learning rules to promote students’ English learning and listening skills. Combined with teaching practice, we discuss how to cultivate the relationship between listening, speaking, reading, and writing in English teaching and teach from the creation of language environment, extracurricular extension, and listening, speaking, reading, and writing in English. In terms of teaching content, the development of artificial intelligence and information technology provides rich learning resources for English speaking learning. Teachers should integrate learning resources according to actual teaching needs and feedback from students and focus on the integration of knowledge and ability as well as the integration of language and culture in teaching content, build a spider web-type knowledge structure, broaden the breadth and depth of language materials, and improve students’ thinking and language use ability.

Many colleges are cutting back on college English teaching [1]. English classes in some colleges have been reduced from four to three hours a week. By the second semester of grade two, there were only two classes per week. There are no college English courses for juniors and seniors. In addition, the teaching of spoken English is classified as a college English audio-visual instruction, and classes are only offered every three weeks. Colleges do not have any requirement for students’ English speaking ability, nor do they offer any kind of English speaking test, or even include it in the scope of the college English syllabus. From the students’ point of view, many colleges do not set a threshold line for English grades when admitting students, thus leading to a wide range of students’ English proficiency. This also has a great impact on the teaching of spoken English in college. Learning motivation is the tendency to guide and maintain all kinds of learning activities and is a kind of internal motivation to directly promote students to study. Most of the students’ English foundation is poor, coupled with the fact that high school English teaching is mainly focused on grammar learning, with the aim of getting high marks, and the classes are boring and tedious. This results in most students not being interested in English and students lacking the intrinsic motivation to learn to speak. From a pedagogical point of view, speaking instruction is mainly based on traditional classroom teaching. Students have fewer opportunities for oral input, and their oral output naturally becomes less [2].

The classroom ecological view emphasizes the creation of a harmonious and sustainable classroom ecological environment, which provides new ideas to solve the problems of imbalance and stagnant development of professional English speaking classroom ecology in the context of information technology. On the basis of following the principles of wholeness, interaction, balance, and sustainable development of educational ecology, enhancing the synergy of English speaking classroom ecological factors and promoting interactive dialogue among ecological subjects by using speech recognition will help give full play to the advantageous role of information technology in English speaking teaching reform, maintain the dynamic balance of English speaking classroom ecology, and promote its sustainable development.

Intelligent teaching methods advocate the use of modern computer information technology to realize intelligent learning, improve students’ higher-order thinking and capacity for innovation, and meet students’ personalized development needs [3]. Instructional tools are the tools, media, or equipment that teachers and students use to transfer information in teaching. They have developed from traditional means such as oral language, written text, and printed textbooks to modern means characterized by electronic audio-visual equipment, multimedia networks, and current applications of big data, virtual reality, and artificial intelligence. The modern teaching approach has changed traditional teaching and learning methods, stimulated students’ curiosity and desire for knowledge, and revitalized classroom teaching. The impact of technological breakthroughs and updates on teaching is far-reaching and extensive.

In terms of teaching philosophy, teachers should adhere to the “student-centered” basis, make reasonable use of the advantages of information technology, prevent information technology from dominating the center of teaching or teachers from dominating classroom discourse, give students the opportunity to fully develop their potential, let students become the subject of designing learning activities, participate in the whole process of teaching, and be able to discuss, analyze, and formulate learning goals, learning plans, and learning strategies on their own and make full use of information technology tools and resources to promote the transformation of students’ learning into open, personalized, and inquiry-based learning, helping students master language knowledge, improve their comprehensive language skills, and develop sustainable learning abilities in the process of learning to learn. In terms of teaching environment, as mentioned above, the overlimited teaching environment has become a limiting factor for English speaking classroom teaching. If the number of students in the class cannot be changed, teachers can adopt group teaching, change the classroom tables and chairs according to the actual teaching needs, create open and simulated communicative scenarios with the help of Internet resources, optimize the classroom ecological environment, and increase students’ opportunities to use language. At the same time, teachers should make students fully aware of their main position in the classroom, maximize students’ learning enthusiasm and autonomy, allow students to adjust their learning methods according to their own learning situation, and collect timely learning feedback to further improve teaching. In terms of teaching content, the development of modern information technology provides rich learning resources for oral English learning.

Teachers should integrate learning resources according to the actual teaching needs and feedback from students and pay attention to the integration of knowledge and ability as well as the integration of language and culture in the teaching content, build a cobweb-type knowledge structure, broaden the breadth and depth of language materials, and improve students’ ability to think and use language. Finally, the evaluation subjects in the reform of English speaking informatics teaching include teachers and students, but the content of evaluation often revolves around the completion of students’ language output tasks. To enable students to construct knowledge and improve their comprehensive literacy in a purposeful and targeted manner, the evaluation of students can also be extended to include their learning attitudes, potential, learning habits, and management. In addition, teachers and students can also evaluate teachers’ teaching methods, teaching software, teaching content, and teaching environment to help teachers obtain feedback and improve their teaching in a more targeted and efficient way. Of course, the evaluation is not limited to simple score evaluation but can also be done in the form of interviews, questionnaires, and learning and resource usage data collection, in order to further optimize the teaching of spoken English. The communication and interaction between the subjects of the teaching ecology help transfer knowledge, information, emotion, and intellectual energy, which is the fundamental reason for the evolution of the English classroom ecology. Interaction in the ecological classroom includes group activities between teachers and students. The ecological subjects of teaching should actively communicate with each other and coordinate with each other in dialogue and interaction to achieve common development. The application of information technology is a double-edged sword for the interaction between ecological subjects.

The popularity and high-speed development of the Internet help subjects to communicate with each other in real time through the network platform, but due to the advantages and disadvantages of the network environment, the subjective willingness of subjects to communicate, and the differences in ideas and goals, this kind of communication relying on electronic screens often has a certain lag and is not conducive to the emotional interaction between subjects. The emotional interaction between teachers and students is a catalyst for cognitive activities and an important condition for successful teaching. According to the ecological view of the classroom, teachers and students should communicate with each other in a timely manner and have equal dialogue with each other. Active, effective, and regular communication helps teachers and students achieve subjective construction and development of both teachers and students while transferring and exchanging knowledge. The teachers should abandon the indoctrination teaching method and emphasize the communication, sharing, and feedback between teachers and students as well as between students, so that students can explore, choose, and construct their own knowledge in an open and free atmosphere. In addition, teachers can guide students to gradually form “learning communities” through regular cooperative learning and communication. This not only helps students to reduce their discomfort and anxiety in the face of information-based teaching reform but also enables learning members with different knowledge structures, thinking styles, cognitive styles, and learning habits to complement each other. In addition, the “learning community” can help students understand the relationship between the individual and the group and between the parts and the whole and cultivate their sense of responsibility and team consciousness.

In recent years, artificial intelligence based on information technology and college English education have entered the stage of integration and innovation, which is rapidly overturning the educational ideas and methods accumulated for thousands of years and reconstructing the educational ecology. Intelligent speech technology, English language assessment system, language translation, intelligent oral practice, adaptive system, personalized learning center, and intelligent tutor system are widely used in college English teaching, bringing unprecedented opportunities for college English teaching. It provides a solution to the decades-old problems of insufficient teaching resources, difficulty in practicing “teaching to students according to their abilities,” and unscientific course evaluation in the field of college English teaching. It is obvious that the traditional teaching objectives, curriculum and teaching mode, and teachers’ expertise are not sufficient to cope with the needs of the new generation of AI technology. We must actively seek changes to find the right fit between AI and college English education.

The main contributions of this paper are the following: (1) the application and current situation of computer technology in English teaching are analyzed. (2) In order to overcome the backwardness and low accuracy of pronunciation evaluation methods, this paper applies deep learning in computer information technology to English speech recognition and constructs an LSTM-based speech recognition model to improve recognition accuracy. (3) Based on this, multiple parameters are considered to establish a reasonable and objective English speech recognition and pronunciation quality evaluation model.

2. Related Works

2.1. Current Research on College English Education in the Information Age

The practice of English speaking course reform has for a long time ignored the new characteristics arising from the encounter between information technology and professional English, thus triggering conflicts among the ecological elements within the teaching system and hindering the sustainable development of English speaking classroom ecology in the context of information technology. Teachers should conform to the development of the informatization era, grasp the advantageous role of information technology, and follow the principles of wholeness, interaction, balance, and sustainable development of the educational ecology to target and reconstruct the dynamic balance of the English speaking classroom ecology and promote the dual sustainable development of English speaking teaching and students’ speaking ability.

As students improve their knowledge structure and their critical thinking skills, their overall English skills are expected to improve significantly. In fact, however, only reading skills have improved significantly for most modern students. Other skills, especially spoken English, stagnate for a long time after students enter school and remain at the level of everyday expression [4]. There is a lack of connection between the teaching and learning of English in schools and the professional needs of modern students. Most students regard English courses, English majors, and personal ideas as separate entities that are difficult to relate to one another, and they lack the ability to express professional knowledge and personal understanding in English. In terms of output, after a period of learning English, students still have a variety of significant problems with their speaking and writing skills, such as content confined to everyday topics and phrasing influenced by Chinese. Modern English language teaching requires students to learn language knowledge from the shallow to the deep, whereas everyday expressions remain at the shallowest level. If students stay at the level of everyday expression for a long time, they develop fixed expression habits and then lack the desire and courage to break the fixed pattern at the psychological level.

In the past, English classes relied on a variety of auxiliary education tools, such as courseware, videos, and modern media. Even though they could strengthen the amount of educational information in a fixed time and improve the perceived effectiveness, they also restricted the initiative of teaching and learning to a certain extent. Second, teachers and students continue to use the classroom as the primary formal site for teaching and learning. In the modern era of rapid development of artificial intelligence as well as the Internet, classroom formatting may lead to the loss of a wider scope of English education. Today, many schools in China are reducing the number of classrooms in their talent development programs. If you want to complete your school’s English language education in listening, speaking, reading, writing, and general skills, it is difficult to do so with only a fixed number of hours, and some students will seek higher levels of English learning. Therefore, the formatting of the English curriculum has significantly constrained the development of English language teaching in terms of specific educational tools, educational content, and educational effectiveness [5].

Artificial intelligence is an interdisciplinary frontier discipline that is gradually transforming human ways of thinking and traditional concepts and optimizing human knowledge and education. In the history of education development in China, computer information technology has provided great impetus for education reform, made education work more efficient and effective, and made education gradually fairer and more widely accessible. In the era of artificial intelligence, these problems will be solved with the widespread application of MOOCs, adaptive learning systems, personal learning centers, intelligent tutors, etc., spawned by big data technology, computer vision, intelligent speech technology, and natural language processing technology [6]. The highly intelligent application of artificial intelligence in college English teaching is mainly reflected in the following aspects:

Intelligent teaching assistant system: empirical research on robots that support foreign language learning is still largely lacking in China. Some learning systems create virtual interactive platforms that provide foreign language learners with interactive English listening and speaking courses [7, 8]. Apple’s Siri and Baidu’s Xiaodu, for example, are intelligent assistants based on big data. Machine translation systems can also be used to assist in teaching [9, 10].

Virtualized teaching: using holographic projection or VR technology, scenes from books, such as history and culture, can be presented in an immersive way to achieve true experiential immersion and improve student interest and learning [11]. “Second Life” is a free virtual 3D space developed by Linden Lab in 2003, which allows users to connect socially through speech and text. In addition to virtual social networking, “Second Life” can also be used for online teaching [12]. Teachers and students can conduct teaching activities in the various virtual spaces created by Second Life, simulating various scenarios in English listening and speaking instruction. This approach not only makes learning English considerably more enjoyable for students but also increases the interactivity of learning.

2.2. Status of Research on Speech Recognition Based on Deep Learning

Speech recognition is the study of how to convert speech information into text information. Speech research can be subdivided into speech recognition, speech synthesis, and vocal recognition, and it draws on signal processing, natural language processing, and related fields [13]. At the present stage, China’s scientific and technological strength has been greatly enhanced, and older speech recognition technology can no longer keep pace with the development of modern society. Although many intelligent terminal devices now have a speech recognition function that supports information exchange between human and machine, the accuracy and speed of speech recognition still need to be improved, and current speech recognition algorithms and related technology are difficult to develop further. In this context, deep learning, which can perform pattern learning and information perception in a manner similar to the human brain and has been studied extensively in theory, has become an important way to further develop speech recognition technology. However, deep learning is mostly in the theoretical stage and has not yet been widely applied to practical products. To solve this problem and promote the integration of theory and products, it is necessary to strengthen the research and development of key parts of the speech recognition function, such as speech signal generation and propagation, so as to promote the better development of speech recognition technology.

In 2006, Ghasemi et al. proposed a deep belief network (DBN) with a greedy layer-by-layer unsupervised learning algorithm as its core [14]. The multilayer perceptron was pretrained by the DBN and then fine-tuned by the backpropagation algorithm. This provided an effective way to address overfitting and vanishing gradients in deep network optimization. Novoa et al. contributed to the success of this practice. They used a deep neural network (DNN) instead of the GMM in the traditional Gaussian mixture model-hidden Markov model (GMM-HMM) system and proposed the DNN-HMM recognition method with phoneme states as the modeling unit, which significantly reduced the false recognition rate and brought it into the acceptable range for real users [15]. Compared with the GMM-HMM, the DNN replaces the GMM, and a deep neural network models the correspondence between the states of the speech signal and the observations. The dimensionality of the output vector corresponds to the number of states of the HMM.

The use of convolutional neural networks (CNNs) for speech recognition mainly consists of stacking convolutional and pooling layers to obtain higher-level features. These layers are topped with a standard fully connected layer, representing the HMM states, which integrates the features trained in the network. CNNs cope better with changes in speaker or mood. An increasing number of researchers have explored convolution on both the temporal and frequency axes [16].

These explorations and experiments show that the performance of the CNN in the DNN-HMM model is better than that of the fully connected DBN. This is because the DBN treats the input as having no particular order, whereas the features of speech are closely related to frequency and time, and weight sharing enables the CNN to capture these local correlations. In addition, weight sharing and pooling help the CNN capture invariance to small shifts and achieve better robustness and stability. For the DBN, capturing such invariance at small frequency and time offsets requires a large number of parameters. Sainath et al. showed that the CNN performs better than the DBN for large vocabulary tasks [17]. These experiments were carefully optimized by means of hyperparameter adjustment, limited weight sharing, and sequence training. Hsiao et al. studied CNN-based acoustic models for low-resource languages and concluded that the CNN could provide better robustness and generalization performance than the DBN under low-resource conditions [18].

Recurrent neural networks (RNNs) can be used to process temporal signals. By adding feedback connections to the hidden layer, the input at the current moment is divided into two parts: (1) the input generated by the input sequence at the current moment, which is passed through the network to obtain a feature representation in the same way as in an ordinary feedforward neural network, and (2) the input generated by the information retained in memory from the previous moment. Through this mechanism, the RNN can take advantage of previous information [19, 20]. This acoustic model is further studied in [21]. Some progress has been made by using context-dependent speech units, using context-dependent states as the LSTM output space, and using distributed training methods [22, 23]. Unlike the existing methods, the method proposed in this paper combines the Mel frequency cepstral coefficients and LSTM, while incorporating a multivariate model for evaluating the quality of English pronunciation.
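The two-part input described above can be illustrated with a minimal NumPy sketch of a simple recurrent layer. The function names, the tanh activation, and the weight shapes are assumptions chosen for illustration; they are not taken from the cited works.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_in, W_rec, b):
    """One time step of a simple recurrent layer (illustrative form).

    W_in @ x_t     : contribution of the current input (feedforward part)
    W_rec @ h_prev : contribution of the memory kept from the previous step
    """
    return np.tanh(W_in @ x_t + W_rec @ h_prev + b)

def rnn_forward(xs, W_in, W_rec, b):
    """Run the layer over a sequence xs of shape (T, n_inputs)."""
    h = np.zeros(W_rec.shape[0])  # no memory before the first frame
    outputs = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_in, W_rec, b)
        outputs.append(h)
    return np.stack(outputs)      # shape (T, n_hidden)
```

With `W_rec` set to zero the layer reduces to an ordinary feedforward layer applied frame by frame, which makes the role of the feedback connection easy to see.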

3. Algorithm Design

3.1. English Speech Signal Data Enhancement

Before the speech signal is analyzed and processed, it needs to be preprocessed, including preemphasis, windowing, endpoint detection, and noise filtering. This paper adds noise from natural scenes to the existing speech data at different signal-to-noise ratios to achieve data enhancement and data expansion. In this study, the Mel frequency cepstrum coefficient (MFCC) feature parameter, which is based on auditory characteristics, is used to transform speech from the time domain to the cepstrum domain and extract speech features. The extraction process of MFCC is shown in Figure 1. It mainly includes the steps of FFT transformation, Mel filtering, and logarithmic transformation. Noise and irrelevant content are filtered out by the nonlinear transformation.

[figure(s) omitted; refer to PDF]

The main extraction steps are the fast Fourier transform (FFT), Mel filtering, the logarithmic operation, and the discrete cosine transform (DCT). The MFCC feature parameters are used as input to the speech recognition model [24, 25]. The speech signal preprocessing is implemented with a first-order FIR high-pass digital filter from the MATLAB digital filter toolbox. The speech waveform is windowed with Hamming windows, which are implemented with the window function normalized DTFT amplitude function in the MATLAB speech toolbox. Speech endpoint detection is implemented with the Voicebox function in the MATLAB speech toolbox. The MFCC-based feature extraction process for the speech signal is implemented in MATLAB combined with speech toolbox programming.
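The chain of preemphasis, Hamming windowing, FFT, Mel filtering, logarithm, and DCT can be sketched as follows. This is a NumPy illustration, not the MATLAB implementation used in the paper. The 16 kHz sampling rate, the 0.97 preemphasis coefficient, and the 13 coefficients follow the experimental settings reported in Section 4.1; the 25 ms frame length, 10 ms hop, 512-point FFT, and 26 Mel filters are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the Mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):    # rising edge
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):   # falling edge
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_filters=26, n_ceps=13, preemph=0.97):
    """Return an (n_frames, n_ceps) MFCC matrix; the signal must be at least frame_len samples long."""
    signal = np.asarray(signal, dtype=float)
    # Preemphasis: y[n] = x[n] - 0.97 * x[n - 1]
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # Split into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # FFT -> power spectrum (frames are zero-padded to n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filtering followed by the logarithm (small constant avoids log(0))
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II over the filter axis, keeping the first n_ceps coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_mel @ basis.T
```

For a one-second signal sampled at 16 kHz, these assumed frame settings yield 98 frames of 13 coefficients each.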

3.2. LSTM-HMM-Based Model for Recognition of Spoken English

To improve the ability to fit the phoneme state distribution, LSTM is used instead of DNN and GMM. By training the LSTM, the posterior probabilities $y_{t}$ for the different acoustic features can be represented, and the transition probability from state $s_{1}$ to state $s_{2}$ is denoted by $a_{s_{1} s_{2}}$. The input feature of the LSTM is MFCC. The number of hidden layers and the number of nodes in each hidden layer can be determined according to the complexity of the task. The corresponding label data can be obtained through a DNN model. The specific flow is shown in Figure 2.
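One common way to combine frame-level network outputs with HMM transition probabilities is Viterbi decoding. The sketch below is an illustration under assumptions, not the decoding procedure of the paper: it uses the log posteriors directly as frame scores (hybrid systems often divide by state priors first), and the function name and argument layout are hypothetical.

```python
import numpy as np

def viterbi(log_post, log_trans, log_init):
    """Most likely state sequence.

    log_post  : (T, S) log scores of each state per frame (e.g. log y_t)
    log_trans : (S, S) log transition probabilities, log a_{s1 s2}
    log_init  : (S,)   log initial state probabilities
    """
    T, S = log_post.shape
    score = log_init + log_post[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[s1, s2]
        back[t] = np.argmax(cand, axis=0)      # best predecessor of each s2
        score = cand[back[t], np.arange(S)] + log_post[t]
    # Trace the best path backwards from the best final state
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In a left-to-right phoneme model, entries of `log_trans` for forbidden transitions can be set to negative infinity so that the decoded path never moves backwards.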

[figure(s) omitted; refer to PDF]

The LSTM-HMM model first fuses contextual information through a multilayer LSTM, while the speech features are further extracted semantically through deep learning. If the input time series is long, vanishing gradients inevitably occur; that is, the traditional RNN cannot model long-term information well. For this reason, and because LSTM is commonly used in speech recognition research, the hidden layer neurons of a traditional RNN are replaced with LSTM memory blocks. The output of the hidden layer neurons of the LSTM is mainly produced by the LSTM memory block [26, 27]. The memory block is composed of four parts: the memory cell, the forget gate, the input gate, and the output gate. Among them, the memory cell keeps the information from earlier in the input sequence that influences the present, and it is the core of the LSTM. The state of the memory cell at the previous moment, together with the output of the hidden layer at the previous moment, affects the memory and output of the memory block at the next moment. The forget gate deletes information in the memory cell that has no effect on the present, and the input gate keeps useful input information in the memory cell. These two gates control how long the memory is carried forward in time. The output gate controls how the output of the memory block is produced based on the current cell state.

The net input and output of the input gate of the $i$th LSTM memory block in the hidden layer are $\begin{matrix} (1) & \hat{a}_{i}^{t} = \sum_{j = 1}^{J} w_{j i} x_{j}^{t} + \sum_{k = 1}^{K} \bar{w}_{k i} h_{k}^{t - 1} + \hat{w}_{i i} s_{i}^{t - 1}, \\ (2) & \hat{b}_{i}^{t} = f\left( \hat{a}_{i}^{t} \right) . \end{matrix}$

Equation (1) gives the net input of the input gate, and the result is substituted into Equation (2) to obtain the gate output. The net input and output of the forget gate of the $i$th LSTM memory block in the hidden layer are $\begin{matrix} (3) & \breve{a}_{i}^{t} = \sum_{j = 1}^{J} w_{j i} x_{j}^{t} + \sum_{k = 1}^{K} \bar{w}_{k i} h_{k}^{t - 1} + \breve{w}_{i i} s_{i}^{t - 1}, \\ (4) & \breve{b}_{i}^{t} = f\left( \breve{a}_{i}^{t} \right) . \end{matrix}$

That is, the outputs of both gates are determined by the current input $x_{j}^{t}$, the output $h_{k}^{t - 1}$ of each memory block at the previous moment, and the state $s_{i}^{t - 1}$ of this memory cell at the previous moment, using the sigmoid function $f$ as the activation function. The state of the memory cell is then updated as follows, where $g$ is the activation function applied to the cell input $a_{i}^{t}$: $\begin{matrix} (5) & s_{i}^{t} = \hat{b}_{i}^{t} g\left( a_{i}^{t} \right) + \breve{b}_{i}^{t} s_{i}^{t - 1} . \end{matrix}$
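The gate and cell-state computations can be sketched in NumPy as below. The sketch assumes the standard LSTM form in which the input gate scales the new cell input and the forget gate scales the previous cell state. The parameter names in the dictionary `p`, the use of tanh for $g$, and the form of the cell input are assumptions for illustration; the output gate is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_state_step(x_t, h_prev, s_prev, p):
    """One update of the cell states of a layer of LSTM memory blocks.

    x_t    : (n_in,)     current input
    h_prev : (n_hidden,) block outputs at the previous time step
    s_prev : (n_hidden,) cell states at the previous time step
    p      : dict of weights (hypothetical names): W_* are (n_hidden, n_in),
             U_* are (n_hidden, n_hidden), peep_* are (n_hidden,)
    """
    # Input gate: net input, then sigmoid (Equations (1) and (2))
    a_in = p["W_in"] @ x_t + p["U_in"] @ h_prev + p["peep_in"] * s_prev
    b_in = sigmoid(a_in)
    # Forget gate: net input, then sigmoid (Equations (3) and (4))
    a_fg = p["W_fg"] @ x_t + p["U_fg"] @ h_prev + p["peep_fg"] * s_prev
    b_fg = sigmoid(a_fg)
    # Cell input (assumed form) and cell-state update (Equation (5))
    a_cell = p["W_c"] @ x_t + p["U_c"] @ h_prev
    s_t = b_in * np.tanh(a_cell) + b_fg * s_prev
    return s_t, b_in, b_fg
```

When the forget gate output is close to 1 and the input gate output is close to 0, the cell state is carried forward almost unchanged, which is how the block preserves long-term information.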

3.3. A Multi-Covariate Model for Evaluating the Quality of English Pronunciation

The block diagram of the multiparameter English pronunciation quality evaluation model is shown in Figure 3. In this study, the correlation coefficient between the MFCC feature parameters of standard utterances and the MFCC features output from the speech recognition model is used as the quantitative index of pronunciation accuracy to judge whether the pronunciation is clear and accurate. The speech rate evaluation is quantified using the ratio of the standard utterance duration to the test utterance duration. The rhythm evaluation uses the pairwise variability index (PVI) proposed by Low of Nanyang Technological University, Singapore, to calculate the respective rhythmic correlation between the standard utterance and the input utterance.
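The accuracy and speech rate indices described above can be sketched as follows. The sketch assumes that the two MFCC matrices have already been brought to the same shape (for example by length regularization), which the text does not spell out at this point; the function names are hypothetical.

```python
import numpy as np

def pronunciation_accuracy(mfcc_std, mfcc_test):
    """Pearson correlation coefficient between two equally shaped MFCC matrices."""
    a = np.asarray(mfcc_std, dtype=float).ravel()
    b = np.asarray(mfcc_test, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("MFCC matrices must have the same shape")
    return float(np.corrcoef(a, b)[0, 1])

def speech_rate_ratio(std_duration_s, test_duration_s):
    """Ratio of the standard utterance duration to the test utterance duration."""
    return std_duration_s / test_duration_s
```

A correlation close to 1 indicates that the test features closely follow the standard ones, and a rate ratio below 1 indicates that the test utterance took longer than the standard utterance.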

[figure(s) omitted; refer to PDF]

It is worth noting that, given the durational variability of English speech units, this paper uses the improved dPVI formula to compare the syllable unit segment durations of the standard and test utterances, and the converted parameters are used as the basis for systematic evaluation, as shown in $\begin{matrix} (6) & dPVI = 100 \times \frac{\sum_{k = 1}^{m - 1} \left| d1_{k} - d2_{k} \right| + \left| d1_{t} - d2_{t} \right|}{Len}, \end{matrix}$ where $d$ is the length of a speech unit segment of the sentence division (e.g., $d_{k}$ is the length of the $k$th speech unit segment, with $d1$ and $d2$ referring to the standard and test utterances), $m$ is min (Std. units, test units), and $Len$ is the length of the standard utterance. Since the test utterance length has been regularized to be comparable to the standard utterance length before the PVI operation, $Len$ alone can be used as the normalizing unit.
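A minimal sketch of the dPVI comparison is given below. It rests on two assumptions that the text does not confirm: the segment differences are taken as absolute values, and the extra term with subscript $t$ is read as the last of the $m$ compared segments, so that the numerator is the sum of absolute differences over the first $m$ segments.

```python
import numpy as np

def dpvi(std_durations, test_durations, std_length):
    """Duration-based variability index between a standard and a test utterance.

    std_durations, test_durations : segment durations of each utterance
    std_length                    : total length of the standard utterance (Len)
    """
    d1 = np.asarray(std_durations, dtype=float)
    d2 = np.asarray(test_durations, dtype=float)
    m = min(len(d1), len(d2))  # compare only segments present in both
    return 100.0 * float(np.sum(np.abs(d1[:m] - d2[:m]))) / std_length
```

Identical segment durations give a value of 0, and larger values indicate that the rhythm of the test utterance departs further from the standard.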

3.4. Sensor Control System

The sensor control system mainly includes the following functions: the voice acquisition and recognition unit is responsible for the conversion of external sound from analog signal to digital signal and digital signal to control command. The voice remote extension processing unit is used to process the remote voice information incoming through Wi-Fi and access the information to the voice acquisition and recognition unit for processing and subsequently return the feedback information of the voice processing result to the voice remote extension unit. For better interactive experience, this system implements a traditional interface interaction terminal based on voice interaction, which is responsible for information display and system configuration. The control and feedback unit uses a network composed of ZigBee wireless sensors to control all terminal devices online, while receiving their feedback information and making corresponding processing work.

4. Experiments

4.1. Experiment Preparation

In this paper, an English speech database constructed by retrieval from the web is used. The data were mainly obtained from websites related to English language education. The dataset consists of pronunciations of specific English utterances from which 13th-order MFCC feature parameters have been extracted. It includes a total of 8800 speech samples (88 persons pronouncing 10 utterances, each utterance repeated 10 times), with 44 men and 44 women between the ages of 18 and 26. Before MFCC feature parameter extraction, the sampling rate is set to 16 kHz, 16-bit coding is used, and a Hamming window is used for windowing, with a preemphasis filter of $1 - 0.97 z^{- 1}$. The dataset is divided into training, validation, and test sets in the ratio 60% : 15% : 25%. The experimental environment has 128 GB of RAM and an Nvidia 3090 GPU; the operating system and software platform are Ubuntu 20.04, TensorFlow 1.14, and Python 3.7 and 3.5.
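The 60% : 15% : 25% division can be sketched as a random split over sample indices. Whether the paper splits by utterance or by speaker is not stated, so a shuffle over individual samples with a fixed seed is an assumption, as is the function name.

```python
import numpy as np

def split_indices(n_items, fractions=(0.60, 0.15, 0.25), seed=0):
    """Shuffle indices and cut them into training, validation, and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = int(round(fractions[0] * n_items))
    n_val = int(round(fractions[1] * n_items))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

For the 8800 samples described above, this gives 5280 training, 1320 validation, and 2200 test samples.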

LSTM uses gate structures and memory cells to control the flow of information through the model, which lets information propagate over longer time spans and gives better modeling of contextual information. The LSTM-HMM model built in this paper contains three hidden layers of 256 memory cells each. All hidden layers in the model use the ReLU activation function with an initial learning rate of 0.008, and the output layer uses the softmax function for classification with an initial learning rate of 0.001. The model is trained for 100 epochs, and the loss curves and performance on the training and validation sets are shown in Figures 4 and 5. As Figures 4 and 5 show, model performance peaks at around epoch 80, where the loss also converges.
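To make the role of the gates and the memory cell concrete, the sketch below runs one time step of a standard LSTM cell with a single memory cell in plain Python. It is not the three-layer, 256-cell TensorFlow model used in this paper; the parameter layout and all weight values are hypothetical and serve only to show how the input, forget, and output gates control the cell state.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a standard LSTM cell with a scalar hidden state.

    params maps each gate name to (input weights, recurrent weight, bias).
    """
    def gate(name, squash):
        w_x, w_h, b = params[name]
        pre = sum(w * xi for w, xi in zip(w_x, x)) + w_h * h_prev + b
        return squash(pre)

    i = gate("input", sigmoid)         # how much new information to write
    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # candidate content for the cell
    c = f * c_prev + i * g             # updated memory cell
    h = o * math.tanh(c)               # updated hidden state
    return h, c


# Hypothetical weights: a strongly positive forget bias keeps the old memory.
params = {
    "input": ([0.0, 0.0], 0.0, -20.0),
    "forget": ([0.0, 0.0], 0.0, 20.0),
    "output": ([0.0, 0.0], 0.0, 0.0),
    "candidate": ([0.5, -0.5], 0.0, 0.0),
}
h, c = lstm_step([1.0, 2.0], h_prev=0.0, c_prev=2.0, params=params)
```

With the forget gate nearly open and the input gate nearly closed, the cell state is carried over almost unchanged, which is the mechanism that lets information persist across many frames.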

[figure(s) omitted; refer to PDF]

4.2. Analysis of the Effect of Data Enhancement Processing

This paper uses “babble” noise from natural scenes to enhance the data by means of additive noise. The waveforms of the speech data are shown in Figure 6. The original spoken training data is relatively clean, as shown in Figure 6(a), and can be regarded as data recorded in an ideal environment. Natural scenes, in contrast, contain various kinds of noise that strongly affect the speech data. This paper therefore enhances the data by adding noise, as shown in Figure 6(b); the noisy speech is closer to the data found in real scenes.
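Additive-noise enhancement of this kind can be sketched as follows. The text does not state the signal-to-noise ratio used for mixing, so the 10 dB value below is an assumption, as are the function name `mix_at_snr` and the synthetic signals, which merely stand in for a clean utterance and a babble recording.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so that the mix has the requested SNR in dB."""
    if not speech or len(noise) < len(speech):
        raise ValueError("noise must be at least as long as a non-empty speech signal")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise signal is silent")
    # Scale the noise so that p_speech / (scale^2 * p_noise) = 10^(snr_db / 10).
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for a clean utterance and a babble-noise recording.
rng = random.Random(0)
speech = [math.sin(0.1 * n) for n in range(1000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(1200)]
mixed = mix_at_snr(speech, noise, snr_db=10.0)  # 10 dB is an assumed value
```

Mixing each clean utterance at one or several SNR levels yields the enhanced training set while the transcriptions stay unchanged.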

[figure(s) omitted; refer to PDF]

Table 1 shows the word recognition error rates on the training, validation, and test sets for models trained on the original training data and on the augmented training data. Training with the augmented data lowers the error rate on all three sets (by 4.91, 6.20, and 5.07 percentage points, respectively) and makes the speech recognition model more robust in its representation of utterances. Figure 7 shows the change in accuracy on the validation set with and without enhancement during training. With enhancement, the model both reaches a higher accuracy and fits faster.

Table 1

Effect of data enhancement on speech recognition models.

Dataset	Error rate without enhancement (%)	Error rate with enhancement (%)
Training set	35.49	30.58
Validation set	34.92	28.72
Test set	31.03	25.96
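The text does not spell out how the word recognition error rate is computed. A common definition, assumed here, is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcription and the recognized text, divided by the number of reference words. The function name and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent, based on the word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("the reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") in six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In this example two of the six reference words are wrong, so the error rate is about 33.3%.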

[figure(s) omitted; refer to PDF]

4.3. Analysis of the Effect of Spoken English Recognition

Table 2 shows that, with data enhancement, the proposed model has a word recognition error rate of 25.96% on the test set, 5.07 percentage points lower than without enhancement (31.03%). With enhancement, the proposed model achieves the lowest error rate among the compared models.

Table 2

Comparison with commonly used methods on the test set.

Model	Error rate without enhancement (%)	Error rate with enhancement (%)
GMM-HMM	33.69	31.24
DNN-HMM	30.58	29.48
CNN-HMM	29.73	27.85
Method of this paper	31.03	25.96

4.4. Analysis of the Effect of English Pronunciation Evaluation

Speaking rate and pitch are relatively easy to evaluate; intonation is more difficult. Intonation evaluation aims to judge automatically, by computation, whether the intonation of a pronunciation is standard and to indicate how it differs from the standard pronunciation. In intonation studies, pitch is the most basic and important constituent of intonation. Physically, pitch is expressed as the variation in the fundamental frequency of the vocal folds, and the changes in fundamental frequency reveal the different rising and falling patterns of intonation. The key to intonation evaluation is therefore to extract the pitch corresponding to each frame of the speech signal in a sentence. Figures 8 and 9 show the intonation curves of the standard and test speech extracted by the autocorrelation function method.
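The autocorrelation function method can be illustrated with the following minimal sketch, which estimates the fundamental frequency of one frame from the lag at which its autocorrelation peaks. The search range of 60 Hz to 400 Hz, the 40 ms frame, and the function name are assumptions for illustration; a practical system would also need a voiced/unvoiced decision and smoothing of the resulting pitch curve.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency of one frame via autocorrelation.

    Returns 0.0 when no positive autocorrelation peak is found, which is
    treated here as an unvoiced or silent frame.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    if best_lag == 0:
        return 0.0
    return sample_rate / best_lag


# A synthetic 200 Hz tone sampled at 16 kHz stands in for a voiced frame.
sample_rate = 16000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(640)]
f0 = estimate_f0(frame, sample_rate)
```

Applying the estimate frame by frame gives the pitch contour of a sentence, from which the rising and falling patterns of the standard and test utterances can be compared.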

[figure(s) omitted; refer to PDF]

For groups of college students with different English speaking levels, and following the suggestions of English speech experts, grades were defined separately for each evaluation indicator (pitch, speed, rhythm, and intonation) and for the overall evaluation. Because the teachers' subjectivity in the manual evaluation process may influence the evaluation results, a reliability coefficient was used in this paper to test the reliability of the manual evaluation results. The evaluation results of the two teachers were then averaged (and rounded) to obtain, for each sentence and each student, the score on every indicator and the overall score, which serve as the final manual evaluation results. The experimental results are shown in Table 3 and Figure 10. The higher the values in the table, the better the model performs. The adjacent agreement rate of all four indicators is at least 98.33%, and pitch is evaluated most accurately, with the highest agreement rate (86.25%) and the highest Pearson correlation (0.8).

Table 3

Analysis of pronunciation evaluation effect (%).

Indicator	Agreement rate (%)	Adjacent agreement rate (%)	Pearson correlation
Pitch	86.25	99.58	0.8
Speed	82.08	100.00	0.493
Rhythm	85.00	98.75	0.543
Intonation	80.00	98.33	0.627
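The three measures in Table 3 can be computed as in the sketch below. The text does not define them formally, so the definitions used here are assumptions: the agreement rate is the share of sentences where the machine grade equals the manual grade, the adjacent agreement rate is the share where the two grades differ by at most one level, and the Pearson value is the correlation coefficient between the two grade sequences. The grades in the example are invented for illustration.

```python
import math


def agreement_rates(machine, manual):
    """Return (agreement rate, adjacent agreement rate) in percent."""
    if len(machine) != len(manual) or not machine:
        raise ValueError("grade lists must be non-empty and of equal length")
    n = len(machine)
    exact = sum(1 for a, b in zip(machine, manual) if a == b)
    adjacent = sum(1 for a, b in zip(machine, manual) if abs(a - b) <= 1)
    return 100.0 * exact / n, 100.0 * adjacent / n


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented grades on a five-level scale for five sentences.
machine_grades = [3, 4, 2, 5, 1]
manual_grades = [3, 3, 2, 5, 3]
exact_rate, adjacent_rate = agreement_rates(machine_grades, manual_grades)
correlation = pearson(machine_grades, manual_grades)
```

Under these definitions the adjacent agreement rate can never be lower than the agreement rate, which matches the pattern of the values in Table 3.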

[figure(s) omitted; refer to PDF]

5. Conclusions

With the rapid improvement of computer information technology in the Internet era, application algorithm platforms such as Baidu, Tencent, and Xunfei have continued to promote the deep deployment of AI applications. Intelligent applications have emerged on a rapidly growing scale and serve all levels of university English teaching. Today’s smart classrooms, smart translators, smart human-computer dialogue software, and smart writing correction software are just the beginning of intelligent language learning. In the age of information technology, school education should update its teaching concepts and use advances in artificial intelligence to achieve personalized, intelligent, and interactive learning. This paper explores the problems in college English speech recognition and in the evaluation of pronunciation quality. For spoken English learning, computer-assisted language learning systems at home and abroad focus mainly on words and grammar and base their evaluation on only one or two indexes; they therefore have functional defects and can only give learners an aggregate score for their pronunciation. For English speaking evaluation, speaking tests still rely on manual scoring, which is subjective, applies inconsistent standards, is slow, and is hard to repeat and keep stable. To address these problems, this paper proposes an LSTM-HMM-based spoken English recognition method and a multiparameter evaluation index method. In the experiments, the proposed speech recognition model reached a word error rate of 25.96% on the test set with data enhancement, the lowest among the compared models. The adopted evaluation methods are credible and can give learners timely, accurate, and objective evaluation and feedback, helping them find the differences between their own pronunciation and the standard pronunciation and thus improving the efficiency of learning spoken English.
In the future, we plan to conduct research on the integration of computer information technology and university English education using knowledge graphs and graph convolution-based methods.

Acknowledgments

This work was supported by the Teaching Reform in Hunan Province (HNJG-2021-0957): Research and practice of blended teaching mode of college English reading and writing course under the background of “gold course.”

References

[1] Z. Zhijie, "Analysis of college English teaching strategies in the context of globalization," Agro Food Industry Hi-Tech, vol. 28 no. 1, pp. 968-972, 2017.

[2] Z. Wu, H. Li, X. Zhang, Z. Wu, S. Cao, "Teaching quality assessment of college English department based on factor analysis," International Journal of Emerging Technologies in Learning, vol. 16 no. 23, pp. 158-170, DOI: 10.3991/ijet.v16i23.27827, 2021.

[3] Y. Dan-Ping, "College English interactive teaching mode under an information technology environment," Agro Food Industry Hi-Tech, vol. 28 no. 1, pp. 534-537, 2017.

[4] P. Zhou, X. Wu, H. Xu, G. Wang, "The college students’ oral English education strategy using human-computer interaction simulation system from the perspective of educational psychology," Frontiers in Psychology, vol. 12, 2021.

[5] M. Liu, "Research on college English teaching reform under 'Internet plus applied talent' training mode," Agro Food Industry Hi-Tech, vol. 28 no. 3, pp. 3363-3365, 2017.

[6] L. Hu, W. Yao, "Design and implementation of college English multimedia aided teaching resources," The International Journal of Electrical Engineering & Education, vol. 2021, article 002072092098351, DOI: 10.1177/0020720920983517, 2021.

[7] J. Li, H. Chen, "Construction of case-based oral English mobile teaching platform based on mobile virtual technology," International Journal of Continuing Engineering Education and Life Long Learning, vol. 31 no. 1, pp. 87-103, DOI: 10.1504/IJCEELL.2021.111837, 2021.

[8] P. Wang, S. Qiao, "Emerging applications of blockchain technology on a virtual platform for English teaching and learning," Wireless Communications and Mobile Computing, vol. 2020, DOI: 10.1155/2020/6623466, 2020.

[9] M. Yamada, "The impact of Google neural machine translation on post-editing by student translators," The Journal of Specialised Translation, vol. 31, pp. 87-106, 2019.

[10] S. Xie, Y. Xia, L. Wu, Y. Huang, Y. Fan, T. Qin, "End-to-end entity-aware neural machine translation," Machine Learning, vol. 111 no. 3, pp. 1181-1203, DOI: 10.1007/s10994-021-06073-9, 2022.

[11] M. Holly, J. Pirker, S. Resch, S. Brettschuh, C. Gütl, "Designing VR experiences–expectations for teaching and learning in VR," Educational Technology & Society, vol. 24 no. 2, pp. 107-119, 2021.

[12] S. C. Baker, R. K. Wentz, M. M. Woods, "Using virtual worlds in education: Second Life® as an educational tool," Teaching of Psychology, vol. 36 no. 1, pp. 59-64, DOI: 10.1080/00986280802529079, 2009.

[13] O. Z. Mamyrbayev, K. Alimhan, B. Amirgaliyev, B. Zhumazhanov, D. Mussayeva, F. Gusmanova, "Multimodal systems for speech recognition," International Journal of Mobile Communications, vol. 18 no. 3, pp. 314-326, DOI: 10.1504/IJMC.2020.107097, 2020.

[14] F. Ghasemi, A. Mehridehnavi, A. Fassihi, H. Pérez-Sánchez, "Deep neural network in QSAR studies using deep belief network," Applied Soft Computing, vol. 62, pp. 251-258, DOI: 10.1016/j.asoc.2017.09.040, 2018.

[15] J. Novoa, J. Fredes, V. Poblete, N. B. Yoma, "Uncertainty weighting and propagation in DNN-HMM-based speech recognition," Computer Speech & Language, vol. 47, pp. 30-46, DOI: 10.1016/j.csl.2017.06.005, 2018.

[16] J. Zhao, X. Mao, L. Chen, "Learning deep features to recognise speech emotion using merged deep CNN," IET Signal Processing, vol. 12 no. 6, pp. 713-721, DOI: 10.1049/iet-spr.2017.0320, 2018.

[17] T. N. Sainath, B. Kingsbury, G. Saon, H. Soltau, A. R. Mohamed, G. Dahl, B. Ramabhadran, "Deep convolutional neural networks for large-scale speech tasks," Neural Networks, vol. 64, pp. 39-48, DOI: 10.1016/j.neunet.2014.08.005, 2015.

[18] R. Hsiao, D. Can, T. Ng, R. Travadi, A. Ghoshal, "Online automatic speech recognition with listen, attend and spell model," IEEE Signal Processing Letters, vol. 27, pp. 1889-1893, DOI: 10.1109/LSP.2020.3031480, 2020.

[19] Y. F. Liao, W. T. Hong, W. J. Wang, Y. R. Wang, S. H. Chen, "An overview of RNN-based mandarin speech recognition approaches," Journal of the Chinese Institute of Engineers, vol. 22 no. 5, pp. 535-547, DOI: 10.1080/02533839.1999.9670492, 1999.

[20] S.-H. Chen, Y.-F. Liao, S.-M. Chiang, S. Chang, "An RNN-based preclassification method for fast continuous mandarin speech recognition," IEEE transactions on speech and audio processing, vol. 6 no. 1, pp. 86-90, DOI: 10.1109/89.650315, 1998.

[21] R. S. Arslan, N. Barışçı, "Development of output correction methodology for long short term memory-based speech recognition," Sustainability, vol. 11 no. 15, DOI: 10.3390/su11154250, 2019.

[22] M. C. Menard, S. Belleville, "Musical and verbal memory in Alzheimer's disease: a study of long-term and short-term memory," Brain & Cognition, vol. 71 no. 1, pp. 38-45, DOI: 10.1016/j.bandc.2009.03.008, 2009.

[23] M. Liwicki, H. Bunke, "Combining diverse on-line and off-line systems for handwritten text line recognition," Pattern Recognition, vol. 42 no. 12, pp. 3254-3263, DOI: 10.1016/j.patcog.2008.10.030, 2009.

[24] M. Qi, R. Zhou, Q. Zhang, Y. Yang, "Feature classification method of frequency cepstrum coefficient based on weighted extreme gradient boosting," IEEE Access, vol. 9, pp. 72691-72701, DOI: 10.1109/ACCESS.2021.3079286, 2021.

[25] I. Wibawa, I. Darmawan, "Implementation of audio recognition using mel frequency cepstrum coefficient and dynamic time warping in wirama praharsini," Journal of Physics Conference Series, vol. 1722 no. 1, article 012014, DOI: 10.1088/1742-6596/1722/1/012014, 2021.

[26] J. X. Bi, W. Z. Fan, S. B. Wang, "Remaining life prediction for aircraft turbine engines based on LSTM-RNN-HMM-approach," IOP Conference Series: Materials Science and Engineering, vol. 1043 no. 2, 2021.

[27] F. Gao, T. Huang, J. Wang, J. Sun, A. Hussain, H. Zhou, "A novel multi-input bidirectional LSTM and HMM based approach for target recognition from multi-domain radar range profiles," Electronics, vol. 8 no. 5, DOI: 10.3390/electronics8050535, 2019.

Copyright © 2023 Juan Guo. This is an open access article distributed under the Creative Commons Attribution License (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. https://creativecommons.org/licenses/by/4.0/

Abstract

English listening practice is an effective way to improve students’ English expression and oral communication. However, current English teaching methods are too uniform, and teachers do not focus on oral training in the classroom, which makes classroom teaching inefficient. Following the principles of wholeness, interaction, balance, and sustainable development of educational ecology, enhancing the synergy of the ecological elements of the English speaking classroom, promoting interactive dialogue among ecological subjects, and regulating classroom behaviors all help information technology play its full role in English speaking teaching reform and promote its sustainable development. This paper addresses the current situation of English listening teaching, especially the reduced recognition rate of spoken language in noisy environments, and proposes the use of a dual-sensor speech recognition system. We design a speech recognition method based on a recurrent neural network, in which sensors acquire both the weak vibration pressure speech signal of the jaw skin and the speech signal transmitted through the air during vocalization. A deep learning algorithm is used for speech recognition in English teaching. A reasonable frame sampling frequency is set to obtain the English speech signal, the feature parameters representing this signal are obtained from linear prediction coefficients to generate the speech feature vector, and the recurrent neural network is then trained on these speech features. In the experiments, comparison with commonly used speech recognition algorithms shows that the proposed algorithm achieves higher accuracy and faster convergence for speech recognition in English teaching.

Details

Title

Innovative Application of Sensor Combined with Speech Recognition Technology in College English Education in the Context of Artificial Intelligence

Author

Guo, Juan¹

¹ School of Foreign Languages, Hunan University of Science and Engineering, Yongzhou 425199, China

Editor

Sweta Bhattacharya

Publication year

2023

Publication date

2023

Publisher

John Wiley & Sons, Inc.

ISSN

1687725X

e-ISSN

16877268

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1155/2023/9281914

ProQuest document ID

2777922479
