Advantages and Pitfalls of Dataset Condensation:

1. Introduction

In recent years, an unprecedented amount of data has been produced by the proliferation of digital gadgets, online platforms, and connected systems. This remarkable surge in data generation has been an essential factor in the advancement of deep learning, which has led to breakthroughs in several domains, such as natural language processing, computer vision, and speech recognition.

In the context of speech recognition, voice commands are becoming a natural way to interact with consumer electronic devices [1,2]. Systems with speech command recognition such as Amazon’s Alexa, Apple’s Siri, and Google’s Assistant are examples of this popularity. These smart devices often use some embedded systems (e.g., microcontrollers [3], microprocessors, field-programmable gate arrays, or dedicated devices [4]) with limited resources, making the implementation of speech recognition algorithms dependent on hardware limitations [5,6].

Given the hardware limitations, there is a need to obtain low-complexity solutions [7,8,9,10]. A major obstacle to training a deep learning model on an edge device is its high memory consumption, which must be decreased to fit within the device. Reducing the amount of training data to meet this constraint can affect the training of the model and, consequently, its performance metrics.

Handling such a massive volume of data requires substantial work in terms of gathering, storing, and preprocessing, among other things. Additionally, to achieve satisfactory performance, training on large datasets frequently demands high computational resources. As a result, applications that rely on training on large datasets require more time and computation for tasks such as hyperparameter optimization and neural architecture search.

One solution to the challenge of training with large datasets is to seek a smaller dataset that can adequately represent the distribution of the original data. A reasonably simple way to obtain such smaller datasets is to choose the most representative or useful samples from the original dataset. In general, models trained on these subsets can then perform as well as models trained on the original dataset. This strategy is referred to as coreset data reduction [11,12,13,14].

Another field of study that has received attention is called dataset distillation (DD) [15], also referred to as dataset condensation (DC) [16]. The general idea is to synthesize (i.e., distill) a small number of instances that, when employed as training data, yield a model performance close to that obtained when training on the original dataset. In other words, the method condenses the rich features of the original dataset into a compact dataset. Synthesized datasets expedite the process of neural architecture search and find diverse applications, including their use in continual learning.

DC also aligns with the few-shot concept [17], offering the benefit of providing only a few examples, yet rich in information compared to the original dataset. Few-shot learning focuses on training models to generalize from a small number of examples, while dataset distillation aims to condense large datasets into smaller, more manageable subsets without losing important information. Hence, both approaches aim to address challenges related to limited data availability.

A key indicator of the performance of a DC method is the generalization of the synthetic dataset across different network architectures. However, the evaluation results of the synthetic data generated by the existing DC methods in heterogeneous models have not reached homogeneous performance [18], and there is a significant performance drop for some DC methods when the condensed data are used to train unseen architectures [19]. This motivates further studies and experiments on the subject, especially with regard to audio data.

The novelty of this work can be summarized in three main contributions:

First, to explore the DC of time-frequency representations, a strategy is proposed for obtaining data and for training and testing models by combining different speech representations, mapping the effect of the number of samples per class on the overall accuracy of each model.
Second, a methodology is introduced for evaluating the effectiveness of the DC technique in a cross-model setup. In this context, the architecture used for condensing the data differs from the architecture trained on the condensed data.
Third, we evaluate the transfer of condensed datasets to unseen architectures, indicating, in some cases, whether the effective distillation of dataset knowledge occurs in an architecture-agnostic manner. This phase encompasses a discussion of the generalization capability as well as the bottlenecks identified in the experiments conducted.

This article is structured as follows. Section 2 presents theoretical aspects of DC and describes strategies for DC using gradient matching with Efficient Synthetic-Data Parameterization [20]. Section 3 and Section 4 detail and discuss the experimental setup and the results, respectively. Section 5 highlights the findings of this paper and summarizes its conclusion.

2. Strategies for Dataset Condensation

As mentioned above, the main goal of DC is to learn a small set of synthetic data from an original large-scale dataset so that models trained on the synthetic dataset can achieve performance comparable to that of models trained on the original one, as presented in Figure 1.

The structure of the dataset condensation problem introduced by Zhao et al. [16] is described as follows. Consider a large-scale real dataset consisting of $| T |$ pairs of training samples and their respective class labels, $T = \{(x_{i}, y_{i})\}|_{i = 1}^{| T |}$, where $x \in X \subset R^{d}$ and $y \in Y = \{0, \dots, C - 1\}$; $X$ is a d-dimensional input space and $C$ is the number of classes. The purpose is to learn a differentiable function $ϕ$ (e.g., a neural network) with parameters $θ$ that correctly predicts the labels of previously unseen data, $y = ϕ_{θ} (x)$. The parameters of this function can be learned by minimizing an empirical loss term over the training set as

(1) $θ^{T} = \underset{θ}{\arg\min}\, L^{T} (θ),$

where $L^{T} (θ) = \frac{1}{| T |} \sum_{(x, y) \in T} ℓ (ϕ_{θ} (x), y)$, $ℓ (\cdot, \cdot)$ is a defined loss function (e.g., cross-entropy for classification tasks), and $θ^{T}$ are the parameters that minimize the empirical loss over the observed dataset.

The primary objective is to generate a compact set of synthetic samples along with their corresponding labels, $S = \{(s_{i}, y_{i})\}|_{i = 1}^{| S |}$, where $s \in R^{d}$ and $| S | ≪ | T |$. Similar to Equation (1), once the condensed dataset is obtained, one can train $ϕ$ on it as follows

(2) $θ^{S} = \underset{θ}{\arg\min}\, L^{S} (θ),$

where $L^{S} (θ) = \frac{1}{| S |} \sum_{(s, y) \in S} ℓ (ϕ_{θ} (s), y)$ and $θ^{S}$ are the parameters that minimize the empirical loss over the synthetic dataset. In this condition, the generalization performance (e.g., classification metrics) of $ϕ_{θ^{S}}$ is desired to be close to that of $ϕ_{θ^{T}}$.

Different studies in DC propose alternative optimization objectives for creating synthetic datasets. According to [18], the optimization approaches used in DC can be grouped into performance matching, distribution matching, and parameter matching.

The performance matching strategy was first proposed by Wang et al. [15]. This methodology optimizes a synthetic dataset so that neural networks trained on it exhibit minimal loss on the original dataset. Consequently, the performance of models trained on synthetic and real datasets is aligned. The performance matching objective is a bi-level optimization problem. In the inner loops, the weights of a differentiable model with parameters $θ$ are updated with $S$ using gradient descent, and the recursive computation graph is cached. In the outer loops, the models trained in the inner loops are validated on $T$, and the validation loss is backpropagated through the unrolled computation graph to $S$.
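The bi-level structure can be illustrated with a deliberately small sketch. It is not the implementation of Wang et al. [15]: the one-parameter linear model, the single synthetic point, and all function names are assumptions made for illustration, and the gradient with respect to the synthetic point is estimated by finite differences instead of backpropagation through the unrolled graph. As a further simplification, both coordinates of the synthetic point are updated.

```python
def train_on_synthetic(s_x, s_y, steps=5, lr=0.1):
    """Inner loop: fit theta in y ~ theta * x on one synthetic point by gradient descent."""
    theta = 0.0
    for _ in range(steps):
        grad = 2.0 * (theta * s_x - s_y) * s_x
        theta -= lr * grad
    return theta


def real_loss(theta, real):
    """Mean squared error of the trained model on the real data."""
    return sum((theta * x - y) ** 2 for x, y in real) / len(real)


def outer_objective(s, real):
    """Real-data loss of the model obtained by training on the synthetic point s."""
    return real_loss(train_on_synthetic(s[0], s[1]), real)


def distill(real, s, outer_steps=100, outer_lr=0.01, eps=1e-5):
    """Outer loop: move the synthetic point so that the real-data loss decreases."""
    s = list(s)
    for _ in range(outer_steps):
        grad = []
        for i in range(len(s)):
            up, down = s.copy(), s.copy()
            up[i] += eps
            down[i] -= eps
            # Central finite difference stands in for unrolled backpropagation.
            grad.append((outer_objective(up, real) - outer_objective(down, real)) / (2 * eps))
        s = [v - outer_lr * g for v, g in zip(s, grad)]
    return s


real = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy real data with slope 2
s0 = [1.0, 1.0]                              # initial synthetic point (slope 1)
s1 = distill(real, s0)
print(outer_objective(s0, real), outer_objective(s1, real))
```

The second printed value is the loss on the real data after distillation; in this toy setting it is lower than the first because a model trained on the updated synthetic point fits the real data better.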

A second approach, known as distribution matching, aims to generate synthetic data that approximate the distribution of the real data. Instead of matching training effects (e.g., the performance of models trained on $S$), distribution matching directly minimizes the distance between the two distributions using a metric such as the Maximum Mean Discrepancy (MMD), as presented in Zhao et al. [16].
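As a minimal illustration of the distance being minimized, the sketch below computes the squared MMD with a linear kernel, which reduces to the squared Euclidean distance between the sample means. The choice of a linear kernel on raw features and the function names are assumptions; published methods typically compare the data in a learned embedding space.

```python
def mean_vector(samples):
    """Feature-wise mean of a list of equally sized vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def linear_mmd2(real, synthetic):
    """Squared MMD with a linear kernel: squared distance between the two sample means."""
    mu_real = mean_vector(real)
    mu_syn = mean_vector(synthetic)
    return sum((a - b) ** 2 for a, b in zip(mu_real, mu_syn))


real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
matched = [[2.0, 3.0]]   # one synthetic point placed at the real mean
shifted = [[3.0, 3.0]]   # one synthetic point placed away from the real mean
print(linear_mmd2(real, matched))  # 0.0
print(linear_mmd2(real, shifted))  # 1.0
```

A single synthetic point at the mean already drives this particular distance to zero, which is why richer kernels or embeddings are needed to capture more than first-order statistics.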

This work focuses on parameter matching, also referred to as gradient matching. The concept of parameter matching within the domain of DC was initially introduced by Zhao et al. [16]. Subsequent studies built upon this approach and extended its applications [20,21,22]. The main idea behind parameter matching is to train the same neural network on the synthetic and original datasets for a number of steps, enforcing consistency between the resulting gradients by updating the synthetic sample values.

Initially, $S$ is updated by minimizing the gradient distance. When updating the synthetic dataset $S$, Zhao et al. [16] sample a pair of synthetic and real batches, $S_{c}$ and $T_{c}$, from $S$ and $T$, respectively, containing samples from the c-th class, and each class of synthetic data is updated separately in each iteration. The distance $D$ between the gradients $\nabla ℓ (S_{c}; θ)$ and $\nabla ℓ (T_{c}; θ)$ is calculated as

(3) $D (S, T; θ) = \sum_{c = 0}^{C - 1} d [\nabla ℓ (S_{c}; θ), \nabla ℓ (T_{c}; θ)],$

where $d (\cdot, \cdot)$ measures the dissimilarity between two gradients and, as in the original approach, can be calculated using the negative cosine similarity [16]. After computing $D$, the update of the synthetic set is performed as

(4) $S_{t + 1} = S_{t} - η_{S} \nabla_{S_{t}} D (S, T; θ),$

where $η_{S}$ denotes the learning rate used for updating the synthetic set.

After each step of updating the synthetic data, the network used for computing gradients is trained on $T$ (some authors also take $S$ into account) for T steps. The parameters of the model are updated as

(5) $θ_{t + 1} = θ_{t} - η_{θ} \nabla_{θ_{t}} L (T_{T}; θ),$

where $η_{θ}$ denotes the learning rate employed in updating the parameters of the model, and $T_{T}$ is the T-th mini-batch drawn from the real set $T$.
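Equations (3) to (5) can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the implementation of Zhao et al. [16] or of [20]: the model is a two-class logistic regression, the gradient with respect to the synthetic samples is estimated by central finite differences instead of automatic differentiation, the whole real set is used as a single mini-batch, and all names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def class_gradient(theta, samples, label):
    """Gradient w.r.t. theta of the mean cross-entropy loss on samples of one class."""
    grad = [0.0] * len(theta)
    for x in samples:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j, xj in enumerate(x):
            grad[j] += (p - label) * xj / len(samples)
    return grad


def neg_cosine(a, b):
    """Negative cosine similarity, used here as the dissimilarity d."""
    dot = sum(u * v for u, v in zip(a, b))
    norm = math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b))
    return -dot / (norm + 1e-12)


def matching_distance(theta, synth, real):
    """Eq. (3): class-wise sum of d(gradient on S_c, gradient on T_c)."""
    return sum(
        neg_cosine(class_gradient(theta, synth[c], c), class_gradient(theta, real[c], c))
        for c in real
    )


def update_synthetic(theta, synth, real, lr=0.1, eps=1e-5):
    """Eq. (4), with the gradient w.r.t. S estimated by central finite differences."""
    new = {c: [list(s) for s in samples] for c, samples in synth.items()}
    for c, samples in synth.items():
        for i, s in enumerate(samples):
            for j in range(len(s)):
                up = {k: [list(v) for v in vs] for k, vs in synth.items()}
                down = {k: [list(v) for v in vs] for k, vs in synth.items()}
                up[c][i][j] += eps
                down[c][i][j] -= eps
                g = (matching_distance(theta, up, real)
                     - matching_distance(theta, down, real)) / (2 * eps)
                new[c][i][j] = s[j] - lr * g
    return new


def update_model(theta, real, lr=0.1):
    """Eq. (5): one gradient step on the real data (all classes as one mini-batch)."""
    n = sum(len(samples) for samples in real.values())
    grad = [0.0] * len(theta)
    for c, samples in real.items():
        g = class_gradient(theta, samples, c)
        grad = [a + b * len(samples) / n for a, b in zip(grad, g)]
    return [t - lr * g for t, g in zip(theta, grad)]


real = {0: [[1.0, 0.2], [0.8, -0.1], [1.2, 0.4]],
        1: [[-1.0, 0.5], [-0.7, 0.9], [-1.1, 0.3]]}
synth = {0: [[0.3, 1.0]], 1: [[0.2, -0.8]]}  # one synthetic sample per class
theta = [0.1, -0.2]
for _ in range(20):
    synth = update_synthetic(theta, synth, real)  # update S, Eq. (4)
    theta = update_model(theta, real)             # update the model, Eq. (5)
print(matching_distance(theta, synth, real))
```

With the negative cosine similarity, each class term lies in [-1, 1], so the printed distance is bounded below by minus the number of classes.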

The standard approach to dataset distillation in the image classification task is to distill the information in the dataset into a few synthetic images with the same shape and number of channels as the original ones. However, because of restricted storage space, the amount of information carried by a small number of samples is limited [18]. Furthermore, because synthetic data have the same format as the original data, it is unclear whether they contain worthless or superfluous information. A series of publications that focus on these problems, and that are orthogonal to the DC optimization objectives, provide several methods of synthetic data parameterization [16,23].

Based on empirical observations, Kim et al. [20] identified that the performance of the synthetic dataset is predominantly influenced by its size rather than its resolution. In response to this insight, they introduce a parameterization strategy for synthetic data known as Information-intensive Dataset Condensation (IDC). This strategy employs multiple formations to enhance the amount of synthetic data generated within the same storage constraints, achieved by lowering the data resolution. Essentially, the structure of synthetic data takes the shape of a downsampled version of an original image (e.g., downsampling by 2), leading to a fourfold increase in the potential number of synthetic samples.
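The storage argument can be made concrete with a short sketch: one stored array at the original resolution is read as four half-resolution samples, each of which is upsampled back to the training resolution. This is only an illustration of the idea; nearest-neighbour upsampling and the function names are assumptions, and the code is not taken from [20].

```python
def split_into_quadrants(grid):
    """Read one stored H x W array as four (H/2) x (W/2) low-resolution samples."""
    h, w = len(grid) // 2, len(grid[0]) // 2
    return [
        [row[c0:c0 + w] for row in grid[r0:r0 + h]]
        for r0 in (0, h)
        for c0 in (0, w)
    ]


def upsample_nearest(sample, factor=2):
    """Bring a low-resolution sample back to the training resolution."""
    out = []
    for row in sample:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


stored = [[r * 4 + c for c in range(4)] for r in range(4)]  # one 4 x 4 storage slot
samples = [upsample_nearest(q) for q in split_into_quadrants(stored)]
print(len(samples), len(samples[0]), len(samples[0][0]))  # 4 4 4
```

The same storage budget therefore yields four training samples instead of one, at the cost of halving the resolution along each axis.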

3. Materials and Methods

To evaluate DC in keyword spotting (KWS) applications, we used 8 of the 35 classes provided by Google’s Speech Commands dataset [24]. This subset, also known as Mini Speech Commands, contains 8000 one-second audio clips. The dataset was randomly split into training and validation/test subsets, with 80% of the samples allocated for training and the remaining 20% reserved for validation/testing. This partitioning allowed us to assess the generalization capability of the models.

Speech files were sampled at a rate of 16 kHz. For time-frequency representation, we employed the spectrogram or the Mel spectrogram (number of Mel bands equal to 32) as inputs for the models. The speech signals were segmented into frames of approximately 16 ms and overlapped by 50% (i.e., 8 ms). The discrete Fourier transform with a sliding Hann window was applied to the overlapping segments of the signal. After this, the time-frequency representations were obtained.
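The framing arithmetic can be checked with a few lines of code. The sketch assumes that "approximately 16 ms" corresponds to exactly 256 samples at 16 kHz and that no padding is applied at the clip boundaries; the frame count obtained with other padding conventions will differ. The naive DFT is used only to keep the example free of dependencies.

```python
import cmath
import math

SAMPLE_RATE = 16_000       # Hz, from the text
FRAME_LEN = 256            # samples; assumed exact value for "approximately 16 ms"
HOP = FRAME_LEN // 2       # 50% overlap, i.e. 128 samples (8 ms)


def num_frames(n_samples, frame_len=FRAME_LEN, hop=HOP):
    """Number of full frames when no padding is applied."""
    return 1 + (n_samples - frame_len) // hop


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n) for k in range(n)]


def frame_magnitudes(frame):
    """Magnitudes of the one-sided DFT of a Hann-windowed frame (naive O(N^2) DFT)."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hann(n))]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]


# One frame of a 1 kHz tone; its energy should peak at bin 1000 * 256 / 16000.
tone = [math.sin(2.0 * math.pi * 1000.0 * i / SAMPLE_RATE) for i in range(FRAME_LEN)]
mags = frame_magnitudes(tone)
print(num_frames(SAMPLE_RATE))   # frames in a 1 s clip without padding
print(mags.index(max(mags)))     # frequency bin with the largest magnitude
```

Stacking the magnitude vectors of all frames gives the spectrogram; the Mel spectrogram would additionally project each vector onto 32 Mel bands, which is not shown here.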

The audio files vary in duration and produce time-frequency representations with different dimensions. Thus, it was necessary to standardize the network input data for each set of parameters. The original signals were zero-padded to the longest duration in the training set, which is about 1 s. The speech representations were then standardized to adjust the input scale before being fed into the models.
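A minimal sketch of these two preprocessing steps follows. The text does not specify the form of the standardization, so zero mean and unit variance are assumed here, and the function names are illustrative.

```python
import math


def pad_to_length(signal, target_len):
    """Append zeros so that every clip has target_len samples (longer clips are cut)."""
    return list(signal[:target_len]) + [0.0] * max(0, target_len - len(signal))


def standardize(values, eps=1e-8):
    """Shift to zero mean and scale to unit variance (assumed form of standardization)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]


clip = [0.5, -0.5, 1.0]
padded = pad_to_length(clip, 6)
print(padded)
print(standardize(padded))
```

In the paper the padding is applied to the waveform and the standardization to the time-frequency representation; the sketch applies both to the same short list only to keep the example compact.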

To obtain the condensed time-frequency representations, a ConvNet with four layers was employed. To assess the quality of the condensed dataset, seven architectures were tested: a four-layer ConvNet, the same employed for generating the synthetic dataset; AlexNet [25], which includes convolutional layers, max-pooling, and dropout for effective feature extraction; SqueezeNet [26], a reduced model suitable for resource-constrained environments; VGG-11 [27], a simplified variant of the VGGNet architecture with 11 weight layers, serving as a robust baseline for image classification; EfficientNet-B0 [28], an optimized model from the EfficientNet family; MobileNet [29], designed for mobile and edge devices, utilizing depthwise separable convolutions to reduce computation while preserving accuracy; and DenseNet-121 [30], featuring dense connectivity, which enhances information flow between layers for improved gradient flow and parameter efficiency. These deep learning models have significantly contributed to the field, each addressing specific challenges and diverse application scenarios. All models were initialized with pre-trained weights.

To evaluate the impact of varying data availability on model performance, we tested each of these architectures with four distinct condensed time-frequency representations per class (RPC): 5, 10, 20, and 30 instances.

As the speech command classes are balanced, the overall accuracy was used as a comparison metric between models. This evaluation metric measures the proportion of correctly classified samples in the validation/test set.
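For reference, the metric amounts to the following computation; the function name is illustrative.

```python
def overall_accuracy(y_true, y_pred):
    """Proportion of samples whose predicted label equals the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


print(overall_accuracy([0, 1, 2, 3], [0, 1, 2, 0]))  # 0.75
```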

The preprocessing and feature extraction were conducted in Python with the PyTorch framework. We built our experiments upon the code provided by [20]. Experiments were carried out on a Google Colab Pro virtual machine, equipped with 16 GB of RAM, an Intel Xeon CPU running at approximately 2.20 GHz with two cores, and a Tesla T4 GPU accelerator boasting 16 GB of memory.

4. Results and Discussion

The first evaluation was conducted using the dataset in its original form, without any process of condensation applied. Accuracy was obtained for all models, taking into account both time-frequency representations. Next, we processed the speech data and obtained time-frequency representations, each of size 32 × 32. The synthetic data are initialized with random data from the original training dataset, resulting in a faster optimization process compared to random noise initialization [16].

To comprehensively assess the performance of the condensed dataset and understand the implications of cross-model generalization, we conducted two sets of experiments. First, we evaluated the condensed data using the same architecture from which it was obtained (i.e., the four-layer ConvNet model), aiming to assess how well the architecture could reproduce its own condensed knowledge. This intra-model evaluation provides insights into the effectiveness of the distillation process within the respective model.

In the second set of experiments, we tested the cross-model generalization. Here, we evaluated the condensed data using architectures distinct from the one utilized for distillation. These cross-model tests provide a measure of the extent of knowledge transfer between the models. This investigation also allows for an understanding of the versatility and adaptability of the condensed dataset, offering insights into its potential applicability in different speech recognition contexts.

4.1. Non-Condensed Data Evaluation

The seven models were trained using all available instances in the training subset and then assessed on the validation subset. Table 1 presents the accuracy obtained with the models evaluated. The results indicate that all models reach an accuracy of at least 86.34% when trained without any form of distillation.

To obtain the data shown in Table 1, pre-trained models were employed, followed by fine-tuning. The weights were adjusted using 875 representations per class during training and validated with 125 representations per class. This process was carried out over 30 epochs with a batch size of 32. The optimization algorithm used was SGD with a learning rate of $10^{- 2}$ . These models were trained with the entire dataset to provide a benchmark for comparison when trained with condensed data.

Using spectrograms as input yielded slightly better results in the models evaluated. Except for the ConvNet model, which reached an accuracy of 87.84% and 86.34% for the spectrogram and Mel spectrogram, respectively, all models exhibited similar accuracy.

4.2. Intra-Model Evaluation

The subsequent step involved evaluating intra-model distillation, in which the model used to condense the data is the same one that is afterward trained on the condensed data. First, the original data were distilled using the ConvNet model. Figure 2 presents 16 examples of spectrograms of the keyword “yes” uttered by different speakers and the respective condensed form. The initial process reduces the dimensions of the original time-frequency representation from 125 × 128 (as shown in Figure 2a) to 32 × 32 and normalizes the data (Figure 2b). In their original form, the spectrograms present a high variation in maximum amplitude. Additionally, it is possible to observe a variation in the duration of the utterance. On the other hand, the condensed form presents a normalized version of the spectrogram. Overall, the condensed representations maintain the general aspect of the spectrogram, providing a summarized form. Specifically for the keyword “yes”, the condensed version preserves the aspect of the first formant and, in certain instances, the second formant as well.

Combinations of values for typical neural network training parameters were tested. This study encompassed a range of configurations, including different types of neural networks (three-layer and four-layer convolutional neural networks), optimization algorithms (ADAM and SGD), learning rates spanning from $10^{- 4}$ to $10^{- 2}$, momentum values of 0.3, 0.6, and 0.9, different numbers of epochs (100, 200, and 300), varied batch sizes (64 and 96), and subsampling of the time-frequency representation with interpolation at resolutions of 32 × 32 and 64 × 64. The results presented in Table 2, Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8 correspond to the best setup found: a four-layer convolutional neural network, the SGD optimization algorithm, a learning rate of $10^{- 2}$, a momentum of 0.9, 100 epochs, a batch size of 64, and subsampling of the time-frequency representation with interpolation at a 32 × 32 resolution.
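The search space described above can be enumerated as follows. This is a sketch under assumptions: the text does not state whether every combination was run, and only the end points of the learning-rate range are given, so the intermediate value of 1e-3 is hypothetical.

```python
from itertools import product

search_space = {
    "conv_layers": [3, 4],
    "optimizer": ["adam", "sgd"],
    "learning_rate": [1e-4, 1e-3, 1e-2],  # 1e-3 is an assumed intermediate value
    "momentum": [0.3, 0.6, 0.9],
    "epochs": [100, 200, 300],
    "batch_size": [64, 96],
    "resolution": [32, 64],
}


def configurations(space):
    """Yield every combination of the listed values as a dictionary."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(configurations(search_space))

# Best configuration reported in the text.
best = {
    "conv_layers": 4,
    "optimizer": "sgd",
    "learning_rate": 1e-2,
    "momentum": 0.9,
    "epochs": 100,
    "batch_size": 64,
    "resolution": 32,
}
print(len(configs), best in configs)
```

Under these assumptions a full grid would contain 432 configurations, which gives a sense of the cost of an exhaustive search when each run involves a complete condensation.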

To establish a baseline for comparison, we selected random samples from the training set, trained the model on them, and evaluated it on the validation/test set. This process was performed 10 times, with accuracy calculated each time. From these runs, we computed the average accuracy over the 10 random sets (indicated in the tables as “Random Mean”), the standard deviation, the lowest accuracy (indicated as “Random Min.”), and the highest accuracy (indicated as “Random Max.”). This baseline provides a reference point for judging the performance of the condensed data relative to an equal number of randomly selected real samples.
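The baseline protocol can be sketched as below. The callback `train_and_score` is a hypothetical placeholder for training a model on the selected indices and returning its validation/test accuracy, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator was used.

```python
import random
import statistics


def sample_per_class(labels, rpc, rng):
    """Pick rpc example indices per class, uniformly at random without replacement."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], rpc))
    return chosen


def random_baseline(labels, rpc, train_and_score, runs=10, seed=0):
    """Repeat the random selection and summarise the resulting accuracies."""
    rng = random.Random(seed)
    scores = [train_and_score(sample_per_class(labels, rpc, rng)) for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation (assumed)
        "min": min(scores),
        "max": max(scores),
    }


# Toy usage: 8 classes with 50 examples each and a dummy scorer in place of training.
labels = [c for c in range(8) for _ in range(50)]
stats = random_baseline(labels, rpc=5, train_and_score=lambda idx: len(idx) / 100.0)
print(stats)
```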

Table 2 presents the accuracy for the validation/test set obtained from the four-layer ConvNet trained with condensed data generated using the same ConvNet architecture, employing both spectrograms and Mel spectrograms. To assess whether the condensed information is capable of improving the models’ performance, a comparison was conducted for different RPCs.

In the first case, the model was trained separately with five instances per class from the original training set and with five condensed instances per class. Additionally, 10, 20, and 30 instances were also tested. The “Condensed” column presents the accuracy values obtained with the distilled data. Note that in all cases, training with condensed data substantially outperformed the maximum results achieved when the same architecture was trained with an equal number of samples from the original dataset. This behavior holds true for both the spectrogram and the Mel spectrogram.

Because the models are presented with a restricted number of instances during training, a decrease in accuracy is expected when compared to the results from Section 4.1.

Notice that the results from the spectrogram are slightly superior to those from the Mel spectrogram. In addition, as the RPC increases, the accuracy of the models tends to improve.

With only five instances of condensed data per class, it was possible to achieve an accuracy of at least 51.73%. Likewise, with 30 condensed time-frequency representations per class, an accuracy of at least 77.47% was attained. The results indicate that condensed data can capture information from the original dataset that is useful for training the model.

For an RPC of 30, the accuracy achieved with condensed data is 39.42% (spectrogram) and 28.88% (Mel spectrogram) higher, in relative terms, than the best result in the “Random Max.” column. Similar gains are evident for the other RPC values.
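These percentages are relative gains over the “Random Max.” accuracy and can be reproduced from the values in Table 2; the function name is illustrative.

```python
def relative_gain(condensed, baseline):
    """Relative improvement of the condensed accuracy over a baseline, in percent."""
    return 100.0 * (condensed - baseline) / baseline


print(round(relative_gain(80.78, 57.94), 2))  # spectrogram, RPC = 30
print(round(relative_gain(77.47, 60.11), 2))  # Mel spectrogram, RPC = 30
```

In absolute terms, the same comparison corresponds to differences of 22.84 and 17.36 percentage points, respectively.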

4.3. Cross-Model Evaluation

The second condensation assessment was performed using the cross-model strategy, whereby one model is used to distill the data, and the distilled data are then employed to train other models. The condensed data were obtained using a ConvNet and used to train AlexNet, SqueezeNet, VGG-11, EfficientNet-B0, MobileNet, and DenseNet. Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8 present the results of applying dataset condensation to cross-model scenarios.

In our experiments, we observed that data condensed by simpler neural networks such as ConvNet tend to exhibit reasonable performance in various neural network architectures, such as SqueezeNet, AlexNet, and VGG-11. However, the same cannot be said for more complex architectures, such as EfficientNet-B0, MobileNet, and DenseNet. This result highlights a constraint in the respective strategy of DC.

For the SqueezeNet, AlexNet, and VGG-11 models, we consistently obtained higher accuracy with the condensed data than with the original data for each RPC. The AlexNet model achieved the best performance in this group. For an RPC equal to 30, the accuracy obtained with the condensed data was 77.97% with the spectrogram and 74.78% with the Mel spectrogram. A slight decrease in accuracy is noticeable when compared to the previous results from the intra-model evaluation, in which the accuracy reached 80.78%.

Conversely, the EfficientNet-B0, MobileNet, and DenseNet models show a decline in performance when trained with condensed data. In the worst case, observed with the EfficientNet-B0 model, the accuracy performance varied between 17.50% (when using the Mel spectrogram as input) and 24.84% (when using the spectrogram as input). In many cases, the accuracy obtained with condensed data is lower than that obtained with the original data. The same can also be observed in other models, such as MobileNet and DenseNet.

Certain insights can be derived based on these results. Models such as ConvNet, SqueezeNet, AlexNet, and VGG-11 tend to perform better when handling small amounts of data and low-resolution inputs. While these models reach mean accuracy greater than 53.74% for RPC equal to 30 in the case of the spectrogram, EfficientNet-B0 and DenseNet achieve mean accuracy of 48.46% and 32.09%, respectively. EfficientNet-B0 and DenseNet require more data to train and also inputs with a higher resolution compared to ConvNet, AlexNet, and VGG-11.

This pitfall demonstrates that data condensed by the evaluated algorithm suffers from issues related to generalizability. In this condition, it is not possible to assert that condensed data are agnostic to different models, as is typically expected when dealing with data. The distilled data from a simple ConvNet model may not generalize well for complex models, thus emphasizing the need for tailored strategies when working with different architectures.

While a simple ConvNet might offer a degree of cross-model generalization, using complex models for data distillation introduces a new set of challenges. The process of generating a condensed dataset using matching loss techniques can be particularly arduous when dealing with intricate models. The high dimensionality and intricacy of these models can make it challenging to ensure the effective convergence of distilled representations.

In our exploration, we conducted experiments to condense data using complex models. However, we encountered issues with convergence. These models, which often consist of numerous layers and parameters, require a large number of instances to be trained. In addition, distillation using complex models demands computational and time resources. The intricate architecture and numerous parameters may require significant resources during the distillation process, and results may not always meet expectations.

The same can be said regarding the assessment of the impact of certain hyperparameters, including learning rates, optimizers, and the number of layers in ConvNet. Despite our diligent investigation, we found that these adjustments did not lead to significant improvements that warranted their inclusion in this article.

Further studies will be conducted on diverse datasets, encompassing both general and comprehensive scenarios like speech commands, as well as more specific contexts such as isolated keywords or wake-words, similar to those presented in this work.

5. Conclusions

Dataset condensation is a valuable technique for reducing the size of training datasets, enabling more efficient model training and deployment. From the tests performed, we observed that data condensed by simpler neural networks, such as ConvNet, consistently delivers promising results when used to train a group of models including the ConvNet itself, the AlexNet, the SqueezeNet, and the VGG-11 models. These models appear to handle condensed data effectively, showcasing the efficacy of the technique as a means of enhancing their performance in a KWS use case. However, the results highlight the constraints of data condensation in cross-model generalization, especially when dealing with intricate neural network architectures, such as EfficientNet-B0, MobileNet, and DenseNet.

In the tests conducted, training models with condensed data yielded positive outcomes for models that do not require a large volume of data or high-resolution input. In these instances, the accuracy achieved by the model trained with condensed data exceeds the results obtained with the same number of representations per class from the original data. The ConvNet model achieved an accuracy of 80.78% with the spectrogram and 77.47% with the Mel spectrogram using only 30 condensed time-frequency representations per class. Similarly good results were obtained with SqueezeNet (68.79% for the spectrogram and 70.49% for the Mel spectrogram), AlexNet (77.97% for the spectrogram and 74.78% for the Mel spectrogram), and VGG-11 (69.97% for the spectrogram and 68.50% for the Mel spectrogram) under the same conditions.

However, complex models presented opposite results with the distilled data. The EfficientNet-B0 model, for example, is incapable of training with condensed data. The maximum accuracy obtained was 24.84% employing, as input of the model, 10 time-frequency representations extracted from spectrograms. The same pattern was observed with the MobileNet and DenseNet models, exhibiting poorer performance compared to when the model was trained with the original data. Therefore, the performance of the condensed data generated is not consistent across different models. Furthermore, determining the performance of condensed data representations generated by neural models in unseen architectures is not always clear. The challenge is to ensure that the condensed data remain representative and usable across various applications.

The results demonstrate that there is a pitfall in data condensation generated by the evaluated algorithm, which faces challenges related to generalizability. Although some authors may choose not to conduct cross-model evaluations, the results illustrate the necessity of such tests. The performed tests also demonstrate that efficient hyperparameter tuning is crucial for optimizing the distillation process.

Additionally, when attempting to distill the original dataset with complex models, issues related to non-convergence were noted. The use of complex models for data distillation can be computationally demanding, requiring careful consideration of the resources available. Therefore, a trade-off exists between the complexity of the model and its ability to distill (i.e., condense) the data.

By addressing these challenges and understanding their nuances, the field of data condensation can continue to evolve and provide valuable solutions for data-intensive machine learning applications.

Author Contributions

Conceptualization, P.H.P. and W.B.; methodology, P.H.P. and W.B.; software, P.H.P. and W.B.; validation, P.H.P., W.B. and M.A.R.; writing—original draft preparation, P.H.P., W.B. and M.A.R.; writing—review and editing, P.H.P., W.B. and M.A.R.; supervision, P.H.P., W.B. and M.A.R.; funding acquisition, W.B. and M.A.R. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors gratefully acknowledge the financial support of the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) and of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1. Example of DC applied to KWS. The goal of DC is to synthesize a limited support set of synthetic data for a model trained with these data to achieve a validation/test performance comparable to a model trained with the complete real dataset. In this example, the small synthetic set comprises only 30 keyword representations condensed from the large training dataset.

Figure 2. Representations of the keyword “yes”: (a) examples of spectrograms from the original dataset with a resolution of 125 × 128, and (b) samples from the condensed dataset with a resolution of 32 × 32.

Table 1

Overall accuracy of the models evaluated in the validation set, considering in the first case the spectrogram and, in the second case, the Mel spectrogram as input of the model.

Time-Frequency Representation	ConvNet	AlexNet	SqueezeNet	VGG-11	EfficientNet-B0	MobileNet	DenseNet-121
Spectrogram	87.84%	94.45%	95.50%	95.51%	93.86%	95.31%	94.30%
Mel Spectrogram	86.34%	93.48%	92.40%	94.76%	92.54%	93.57%	93.42%

Table 2

Accuracy of the ConvNet trained with data condensed by the same ConvNet model, considering the spectrogram in the first case and the Mel spectrogram in the second case as input of the model.

	Spectrogram				Mel Spectrogram
RPC	Random Mean ¹	Random Min.	Random Max.	Condensed ²	Random Mean ¹	Random Min.	Random Max.	Condensed
5	27.30 ± 0.85%	25.71%	30.13%	57.52%	30.11 ± 0.67%	28.72%	31.22%	51.73%
10	38.81 ± 0.73%	37.72%	39.84%	69.14%	46.89 ± 1.24%	45.37%	48.92%	64.47%
20	50.93 ± 1.05%	49.94%	53.41%	76.66%	56.47 ± 0.91%	55.61%	59.52%	71.89%
30	55.87 ± 0.87%	54.77%	57.94%	80.78%	59.38 ± 0.56%	58.27%	60.11%	77.47%