Abstract

Punctuation restoration plays an essential role in the postprocessing of automatic speech recognition, and model efficiency is a key requirement for this task. To that end, we present EfficientPunct, an ensemble method with a multimodal time-delay neural network that outperforms the current best model by 1.0 F1 point while using less than a tenth of its network parameters for inference. We streamline a speech recognizer and a BERT implementation to efficiently output hidden-layer acoustic embeddings and text embeddings for punctuation restoration. Forced alignment and temporal convolutions eliminate the need for attention-based fusion, greatly increasing computational efficiency and improving performance. EfficientPunct sets a new state of the art with an ensemble that weights BERT’s purely language-based predictions slightly more than the multimodal network’s predictions. Beyond efficiency, another important challenge in the field is that punctuation restoration models have to date been evaluated almost solely on well-structured, scripted corpora, whereas real-world ASR systems and postprocessing pipelines typically operate on spontaneous speech containing significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we also introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources that includes punctuation and casing information. In addition to releasing the dataset, we provide a filtering pipeline that can be used to generate more data; it examines the quality of both the speech audio and the transcription text. We also carefully construct a challenging test set, aimed at evaluating models’ ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is publicly available, together with all code for dataset building and model runs.
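The ensemble at the core of EfficientPunct can be pictured as a convex combination of the two models’ per-token punctuation distributions, with BERT’s weight slightly above one half. The following minimal NumPy sketch illustrates the idea; the label set, function names, and the 0.55 weight are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Hypothetical punctuation label set; the paper's actual classes may differ.
PUNCT_LABELS = ["<none>", ",", ".", "?"]

def ensemble_predict(p_bert: np.ndarray,
                     p_multimodal: np.ndarray,
                     w_bert: float = 0.55) -> np.ndarray:
    """Convex combination of two models' softmax outputs.

    p_bert, p_multimodal: (num_tokens, num_classes) per-token class
    probabilities from the text-only BERT tagger and the multimodal
    time-delay network. w_bert > 0.5 mirrors the abstract's statement
    that BERT's predictions are weighted slightly more; 0.55 is an
    illustrative guess, not the paper's tuned value.
    """
    p = w_bert * p_bert + (1.0 - w_bert) * p_multimodal
    return p.argmax(axis=-1)  # predicted punctuation class per token

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in probabilities for a 5-token utterance.
    p_bert = rng.dirichlet(np.ones(len(PUNCT_LABELS)), size=5)
    p_mm = rng.dirichlet(np.ones(len(PUNCT_LABELS)), size=5)
    print([PUNCT_LABELS[i] for i in ensemble_predict(p_bert, p_mm)])
```

In practice, the mixing weight would be tuned on a validation set, and the two probability arrays would come from the BERT tagger and the multimodal time-delay network, respectively.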

Details

Title
Efficient Ensemble of Deep Neural Networks for Multimodal Punctuation Restoration and the Spontaneous Informal Speech Dataset
Author
Beigi, Homayoon 1; Xing Yi Liu 2

1 Recognition Technologies, Inc., South Salem, NY 10590, USA; Department of Mechanical Engineering, Columbia University, New York, NY 10027, USA; Department of Electrical Engineering, Columbia University, New York, NY 10027, USA
2 Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada; [email protected]
First page
973
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
2079-9292
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3176377886
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.