As an essential means of communication among humans, speech plays a critical role in conveying information; however, speech-based interaction is susceptible to physiological constraints and environmental interference. Specifically, human vocalization is a complex process that requires precise coordination among multiple organs, including the tongue, lips, jaw, vocal cords and lungs. Any disruption of the relevant pathways can lead to vocal disorders. Such disorders are often caused by neurological diseases (such as stroke, cerebral palsy, Parkinson’s disease and dementia), cancer, medical treatment (for example, the surgical treatment of laryngeal cancer), trauma and genetic disorders1. Moreover, even for people with healthy vocal function, communication efficiency can be limited: speech-based communication is often disrupted or blocked by loud environmental noise (for example, noisy roadsides, fires and environments such as an aircraft cockpit) or by the absence of an acoustic medium (for example, outer space)2. In the well-known cocktail party scenario3, background noise makes it difficult to distinguish a specific voice message, rendering a conventional microphone or loudspeaker ineffective. Such constraints on the acoustic modality pose considerable challenges for all-purpose human language processing technologies.
To resist environmental noise and to enable subvocal or silent speech interaction, researchers have advanced human language processing technologies by improving the relevant algorithms and models and by upgrading the hardware. Improvements in algorithms rely on the high-quality, multidimensional signals provided by the hardware4. For example, speech signals have spatial characteristics (such as the spatial diversity induced by reverberation), whereas a single microphone can collect only the temporal and spectral information of these signals while ignoring the spatial information that they contain5. Microphone arrays are therefore needed for spatial selection and to obtain a high signal gain under adverse noise conditions (a minimal beamforming sketch is given below); however, current hardware is incapable of capturing subtle sounds (for example, whispers and the speech attempts of the vocally disabled). Some researchers have tried attaching a microphone directly to the human body to obtain high-quality raw speech data, but such a set-up is not user friendly. Other sensors that detect non-acoustic modalities perform well in scenarios involving subtle sounds. Building on well-established technologies for image acquisition, post-processing and recognition, non-contact lip-reading devices for speech recognition can achieve word-recognition accuracies of over 90%...
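To make the spatial selection performed by a microphone array concrete, the sketch below implements classical delay-and-sum beamforming in Python: the channels are time-aligned for a chosen look direction and then averaged. This is an illustrative example under a far-field (plane-wave) assumption, not a method from the cited works; the function delay_and_sum, the array geometry and all parameter values are hypothetical.

import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Delay-and-sum beamformer (illustrative sketch, hypothetical API).

    signals       : (M, N) array, one row of samples per microphone
    mic_positions : (M, 3) array of microphone coordinates in metres
    direction     : (3,) vector pointing from the array towards the source
    fs            : sampling rate in Hz
    c             : speed of sound in m/s
    """
    look = np.asarray(direction, dtype=float)
    look /= np.linalg.norm(look)
    # Far-field arrival-time offsets: a microphone further along `look`
    # is closer to the source and receives the wavefront earlier.
    delays = mic_positions @ look / c                    # seconds, shape (M,)
    M, N = signals.shape
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # Fractional-sample alignment via a linear phase shift per channel
    # (circular in time, which is acceptable for a sketch).
    phase = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = np.fft.irfft(spectra * phase, n=N, axis=1)
    return aligned.mean(axis=0)                          # coherent average

# Toy usage: a 4-microphone linear array sampled at 8 kHz, with a 440 Hz
# tone arriving from broadside plus independent sensor noise per channel.
fs = 8000
mics = np.array([[0.00, 0, 0], [0.05, 0, 0], [0.10, 0, 0], [0.15, 0, 0]])
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean[None, :] + 0.5 * np.random.randn(4, fs)
enhanced = delay_and_sum(noisy, mics, direction=[0, 1, 0], fs=fs)

Averaging M time-aligned channels attenuates spatially uncorrelated noise power by a factor of M (roughly 10 log10 M decibels of signal-to-noise gain, or about 6 dB for four microphones), which is precisely the array gain that a single microphone cannot provide.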