Abstract

Punctuation restoration plays an essential role in the postprocessing of automatic speech recognition, and model efficiency is a key requirement for this task. To that end, we present EfficientPunct, an ensemble method with a multimodal time-delay neural network that outperforms the current best model by 1.0 F1 point while using less than a tenth of its network parameters for inference. We streamline a speech recognizer and a BERT implementation to efficiently output hidden-layer acoustic embeddings and text embeddings for punctuation restoration. Forced alignment and temporal convolutions eliminate the need for attention-based fusion, greatly increasing computational efficiency and improving performance. EfficientPunct sets a new state of the art with an ensemble that weights BERT’s purely language-based predictions slightly more than the multimodal network’s predictions. Beyond efficiency, another important challenge in the field is that punctuation restoration models have to date been evaluated almost solely on well-structured, scripted corpora, whereas real-world ASR systems and postprocessing pipelines typically operate on spontaneous speech containing significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we also introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources that includes punctuation and casing information. In addition to releasing the dataset, we provide a filtering pipeline that can be used to generate more data; it examines the quality of both the speech audio and the transcription text. We also carefully construct a challenging test set, aimed at evaluating models’ ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is publicly available, together with all code for dataset building and model runs.
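The ensemble at the core of EfficientPunct can be pictured as a convex combination of the two models’ per-token punctuation distributions, with BERT’s weight slightly above one half. The following minimal NumPy sketch illustrates the idea; the label set, function names, and the 0.55 weight are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Hypothetical punctuation label set; the paper's actual classes may differ.
PUNCT_LABELS = ["<none>", ",", ".", "?"]

def ensemble_predict(p_bert: np.ndarray,
                     p_multimodal: np.ndarray,
                     w_bert: float = 0.55) -> np.ndarray:
    """Convex combination of two models' softmax outputs.

    p_bert, p_multimodal: (num_tokens, num_classes) per-token class
    probabilities from the text-only BERT tagger and the multimodal
    time-delay network. w_bert > 0.5 mirrors the abstract's statement
    that BERT's predictions are weighted slightly more; 0.55 is an
    illustrative guess, not the paper's tuned value.
    """
    p = w_bert * p_bert + (1.0 - w_bert) * p_multimodal
    return p.argmax(axis=-1)  # predicted punctuation class per token

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in probabilities for a 5-token utterance.
    p_bert = rng.dirichlet(np.ones(len(PUNCT_LABELS)), size=5)
    p_mm = rng.dirichlet(np.ones(len(PUNCT_LABELS)), size=5)
    print([PUNCT_LABELS[i] for i in ensemble_predict(p_bert, p_mm)])
```

In practice, the mixing weight would be tuned on a validation set, and the two probability arrays would come from the BERT tagger and the multimodal time-delay network, respectively.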

Details

Title
Efficient Ensemble of Deep Neural Networks for Multimodal Punctuation Restoration and the Spontaneous Informal Speech Dataset
Author
Beigi, Homayoon 1; Xing Yi Liu 2

1 Recognition Technologies, Inc., South Salem, NY 10590, USA; Department of Mechanical Engineering, Columbia University, New York, NY 10027, USA; Department of Electrical Engineering, Columbia University, New York, NY 10027, USA
2 Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada; [email protected]
First page
973
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
2079-9292
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3176377886
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.