Background & Summary
The field of acoustic scene recognition and Environmental Sound Recognition (ESR) has gained significant attention over the last decades due to its practical applications across various domains. Initiatives like the Detection and Classification of Acoustic Scenes and Events (DCASE) competition have propelled advancements in computational methods for analyzing environmental sounds, providing a collaborative platform for researchers to enhance audio processing technologies1. In urban planning, ESR aids traffic management and noise pollution control by combining inexpensive hardware components like Raspberry Pi devices with deep learning algorithms for event detection2. ESR supports public safety by monitoring noise levels and triggering emergency alerts, as demonstrated by Sharma, Granmo, and Goodwin3, who developed a system using audio feature extraction techniques and deep CNN 2D models for classification. In smart homes, ESR offers automation by detecting commonly overlooked sounds such as water leaks or television noise, reducing energy wastage and potential accidents4. ESR also finds applications in assistive technologies for the visually impaired5, wildlife monitoring6, patient monitoring in healthcare7, elderly care8, pest detection in agriculture9, and equipment monitoring10. Reviews by Alías, Socoró, and Sevillano11 and Abayomi-Alli et al.12 focus on essential sound recognition techniques, including audio feature extraction and data augmentation methods to overcome small dataset challenges.
Researchers have explored innovative approaches such as feature aggregation from spectrograms for CNN 2D models, investigating combinations of MFCC, gammatone, log-mel, chroma, spectral contrast, and tonnetz features13,14. In embedded device applications, Abreha15 developed a smartphone-based context recognition system using machine learning classifiers with MFCC inputs. Nordby16 contributed to city noise monitoring with low-cost microcontrollers processing mel-spectrograms. While high-end platforms like FPGAs remain underexplored, Silva et al.17 evaluated machine learning classifiers on Raspberry Pi devices. Lamrini, Chkouri, and Touhafi18 proposed utilizing pre-trained YAMNet models on resource-constrained devices by modifying dense layers to achieve state-of-the-art results.
In the automotive industry, ESR enhances autonomous vehicles’ environmental awareness through university-industry partnerships like I-SPOT (https://i-spot-project.eu/) that integrate acoustic sensing into sensor fusion networks. Bosch is developing smart acoustic sensors to detect sirens for autonomous vehicles using extensive audio data training (https://www.bosch.com/stories/embedded-siren-detection/), and the research of Marchegiani and Newman19 focuses on identifying emergency vehicle sounds using microphone arrays and image processing techniques.
The motivation behind creating the US8K_AV dataset20 was to address the lack of specific datasets tailored for autonomous vehicle applications within urban environments, more specifically, in the context of smart cities. Although there are numerous datasets available for environmental sound recognition in a broader context, none were explicitly designed for real-world applications in autonomous vehicles. Among the most cited research datasets in the literature are: UrbanSound8K (US8K)21, Primary DB, ESC-5022, ESC-1022, DCASE23, and others such as BDLib224. Among these datasets, the scarcity of labeled data, a problem common to many application domains, was one of the key factors in selecting the US8K dataset as primary data, as its size made augmentation techniques unnecessary. Additionally, many classes of the aforementioned datasets, although relevant to environmental sounds overall, were irrelevant when considered in the context of autonomous vehicle applications, as illustrated in Fig. 1.
[See PDF for image]
Fig. 1
Subjective evaluation of the classes within the datasets ESC-10, BDLib2, and US8K related to autonomous vehicles. Proposal for the classes within the US8K_AV dataset based on the US8K classes.
To tackle these limitations, we developed the dataset US8K_AV by merging less relevant classes and introducing a ‘silence’ class to improve the training and deployment of predictive algorithms in real-world scenarios, particularly within embedded systems. The proposed US8K_AV dataset provides a streamlined taxonomy that better suits the needs of autonomous vehicle research, improving the applicability of sound recognition algorithms in embedded systems. Its standardized protocols ensure compatibility with the primary data, offering a replicable methodology for researchers.
Methods
In this study, we addressed the limitations of existing datasets by developing a tailored dataset, US8K_AV, specifically designed for optimizing the performance of autonomous vehicles in urban sound environments. The methodological approach involved four key stages: primary data evaluation, strategic data merging, the addition of a new class, and the final data release.
Primary data
The US8K dataset created by Salamon, Jacoby, and Bello21, available on a dedicated website (https://urbansounddataset.weebly.com/urbansound8k.html), was developed as a baseline for academic research, focusing on real-world urban sounds. This dataset was created to support the training of scalable algorithms capable of analyzing data from sensor networks or multimedia repositories. The process began by utilizing Freesound25, an online repository of over 160,000 user-uploaded recordings under a Creative Commons license. Freesound’s extensive collection of field recordings, particularly from urban environments, provided a rich source for this endeavor. By leveraging the Freesound API, the authors efficiently searched and downloaded relevant audio files using specific class names as queries (e.g., ‘jackhammer’). These initial searches yielded over 3,000 recordings totaling just over 60 hours of audio. Through meticulous manual verification, only authentic field recordings with the desired sound class were retained, resulting in 1,302 recordings and over 27 hours of audio. Each recording was further annotated using Audacity (https://www.audacityteam.org/) to mark the start and end times of every sound occurrence, along with a salience description indicating its prominence (foreground or background) in the recording. This detailed annotation process resulted in 3,075 labeled occurrences amounting to 18.5 hours of labeled audio. To create the US8K, the authors focused on short audio snippets essential for sound source identification. They adopted a maximum occurrence duration limit of 4 seconds and segmented longer sounds into 4-second slices using a sliding window with a hop size of 2 seconds. To maintain balanced class distribution, they limited each class to 1,000 slices, producing a total of 8,732 labeled slices (audio samples), amounting to 8.75 hours.
The authors validated the dataset using 10-fold cross-validation (the reason for this particular choice of k was not explained). When creating the data folds, it is crucial to ensure that multiple slices from the same original recording are not inadvertently used for both training and testing, as this could result in artificially inflated classification accuracies. To prevent this issue, the authors implemented a random allocation process that assigns all slices from a single Freesound recording to the same fold. This approach also strives to balance the number of slices per fold across different sound classes. The US8K dataset available online includes audio slices organized into 10 folds, created using this methodology. This ensures that researchers who wish to compare their results with the original baseline can achieve unbiased and comparable outcomes.
Strategic data merging
The US8K dataset (primary data) comprises 10 classes: ‘air_conditioner’, ‘car_horn’, ‘children_playing’, ‘dog_bark’, ‘drilling’, ‘engine_idling’, ‘gun_shot’, ‘jackhammer’, ‘siren’, and ‘street_music’. While the dataset is generally well-balanced, the number of audio samples for the classes ‘car_horn’ and ‘gun_shot’ is less than half of those in other classes. Figure 2 illustrates the number of audio samples per class and their distribution across the 10 folds.
[See PDF for image]
Fig. 2
Class distribution among the folds of the primary data (US8K).
In the context of autonomous vehicles for smart cities, and after rigorous listening to a significant number of samples, two classes (‘air_conditioner’, 1,000 samples, and ‘gun_shot’, 374 samples) within the dataset were deemed irrelevant and were thus removed. Four classes (‘drilling’, ‘engine_idling’, ‘jackhammer’, and ‘street_music’) were considered less relevant and were merged into a new class named ‘background’. Each of these classes originally contained 1,000 audio samples; 750 samples from each class were randomly deleted, and the remaining 250 samples from each class were combined to form the new class (‘background’, 1,000 samples). The audio samples in this new class were retained in their original folds to prevent any potential data leakage from the same audio source. The classes identified as relevant, ‘car_horn’ (429 samples), ‘children_playing’ (1,000 samples), ‘dog_bark’ (1,000 samples), and ‘siren’ (929 samples), were preserved. At this stage, together with the newly created class ‘background’, the dataset comprised a total of 4,358 samples.
Additionally, as illustrated in Fig. 1, the other datasets investigated in this study (ESC-10 and BDLib2) each contain only two classes of interest. Furthermore, both datasets suffer from a lack of samples and real-world representativeness (the audio samples are clean, without any background noise), which supported the decision to use the US8K dataset as primary data for creating the new tailored dataset US8K_AV.
Despite the perceived simplicity of the data curation process at this stage, it is important to underscore the practical implications of these modifications. Specifically, the decision to merge certain classes into a new category was driven by the objective of enhancing the dataset’s applicability in the context of autonomous vehicles for smart cities. This strategic reclassification resulted in a notable increase in the F1-score for the relevant classes, demonstrating improved model performance and accuracy in real-world applications (additional details in subsection Classification results).
Adding a new class
In the third stage of our data preparation process, we focused on integrating a new class labeled ‘silence’ into the dataset. This involved curating and processing silent audio files from diverse environments to enhance the dataset’s versatility and realism, rather than simply applying a sound pressure level threshold in the predictive algorithm below which the input would be treated as silence. This integration aligns with strategies used in other neural network domains to improve robustness and calibration. For example, techniques such as out-of-distribution (OOD) detection in image classification and special tokens for unknown words in natural language processing allow models to handle inputs not belonging to predefined classes. These methods are crucial for managing unexpected inputs, thereby enhancing model reliability and safety in critical applications like autonomous vehicles.
Initially, 33 audio files were meticulously selected and downloaded from Freesound.org. These files were chosen based on their documentation as recordings of silence in various settings, ensuring a wide range of environmental contexts. Each audio file exceeded 4 seconds in duration and was labeled as foreground in the salience description. Collectively, these files amounted to 1.67 GB and 2.14 hours of audio data. To maintain consistency across the dataset, all audio files with varying encodings (e.g., ‘.wav’, ‘.flac’, ‘.aiff’, ‘.mp3’) were converted to the .wav format using 16-bit PCM encoding while preserving their original sampling rates. This standardization step was crucial for ensuring uniformity in audio quality and facilitating subsequent processing steps. The files were then renamed to retain only their unique Freesound.org IDs, simplifying identification and traceability within the dataset. Following renaming, each audio file was segmented into 4-second slices utilizing a sliding window approach with a hop size of 4 seconds. In alignment with the methodology applied during primary data processing, these audio slices were distributed into 10 distinct folds. A random allocation process was employed to ensure that all slices originating from a single Freesound recording were assigned to the same fold, thereby preserving contextual consistency. This procedure yielded a total of 550 segmented audio samples. It was decided to retain the ‘silence’ class with fewer than 1,000 samples to closely match the class distribution observed in the primary dataset, thereby maintaining a balanced representation across datasets. Subsequently, the metadata corresponding to these samples was organized into a pandas dataframe and exported as a .CSV file (US8K_AV_silence.csv). This structured approach not only facilitated easy access and use of the silence samples but also promoted replicability by eliminating manual intervention during fold assignment. The dataframe containing these 550 silence samples was then merged with the existing dataframe from the previous stage, resulting in a comprehensive dataset comprising 4,908 audio samples in total.
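A minimal sketch of this segmentation and fold-assignment step, assuming the Librosa and SoundFile libraries, is given below; the file paths, helper name, and random fold draw are illustrative, and the published procedure is the one in the 03_New_dataset_US8K_AV.ipynb notebook (see section Code availability).

```python
import os
import random

import librosa
import soundfile as sf

SLICE_S, HOP_S = 4.0, 4.0   # 4-second slices, non-overlapping (hop = 4 s)

def slice_silence_recording(path, out_dir, class_id=10, fold=None):
    """Cut one Freesound 'silence' recording into 4-second 16-bit PCM .wav slices.

    The fold is drawn once per recording so that every slice of the same
    source file lands in the same fold, preventing data leakage."""
    fs_id = os.path.splitext(os.path.basename(path))[0]   # keep only the Freesound ID
    y, sr = librosa.load(path, sr=None, mono=True)        # preserve the original rate
    fold = fold or random.randint(1, 10)
    n_len, n_hop = int(SLICE_S * sr), int(HOP_S * sr)
    rows = []
    for i, start in enumerate(range(0, len(y) - n_len + 1, n_hop)):
        name = f"{fs_id}-{class_id}-0-{i}.wav"
        sf.write(os.path.join(out_dir, f"fold{fold}", name),
                 y[start:start + n_len], sr, subtype="PCM_16")
        rows.append({"slice_file_name": name, "fsID": fs_id, "fold": fold,
                     "classID": class_id, "class": "silence"})
    return rows   # these rows feed the US8K_AV_silence.csv metadata
```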
Finally, the audio samples from the primary data were integrated into this new dataset structure according to their original fold specifications outlined in the initial metadata. Figure 3 provides detailed insights into the US8K_AV metadata content as follows:
slice_file_name: the name of the audio file which takes the following format: {fsID}-{classID}-{occurrenceID}-{sliceID}.wav, where:
{fsID} = the Freesound ID of the recording from which this sample (slice) is taken;
{classID} = a numeric identifier of the sound class;
{occurrenceID} = a numeric identifier to distinguish different occurrences of the sound within the original recording;
{sliceID} = a numeric identifier to distinguish different slices taken from the same occurrence.
fsID: the Freesound ID of the recording from which this sample (slice) is taken;
start: the start time of the slice in the original Freesound recording;
end: the end time of the slice in the original Freesound recording;
salience: a (subjective) salience rating of the sound. 1 = foreground, 2 = background;
classID: a numeric identifier of the sound class (to allow further analysis, the samples in the class ‘background’ kept the original IDs of their parent classes):
0 = not used (‘air_conditioner’ was removed);
1 = ‘car_horn’;
2 = ‘children_playing’;
3 = ‘dog_bark’;
4 = ‘background’ (‘drilling’ in US8K);
5 = ‘background’ (‘engine_idling’ in US8K);
6 = not used (‘gun_shot’ was removed);
7 = ‘background’ (‘jackhammer’ in US8K);
8 = ‘siren’;
9 = ‘background’ (‘street_music’ in US8K);
10 = ‘silence’.
class: ‘car_horn’, ‘children_playing’, ‘dog_bark’, ‘background’, ‘siren’, and ‘silence’.
[See PDF for image]
Fig. 3
Example of audio sample identification within the metadata.
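As a quick illustration of this naming scheme and of the classID mapping above, the metadata can be inspected with pandas; the file path below is illustrative and the dictionary simply mirrors the list above.

```python
import pandas as pd

# classID -> class mapping of US8K_AV; 'background' keeps its parent IDs (4, 5, 7, 9)
ID_TO_CLASS = {1: "car_horn", 2: "children_playing", 3: "dog_bark",
               4: "background", 5: "background", 7: "background",
               8: "siren", 9: "background", 10: "silence"}

meta = pd.read_csv("US8K_AV_metadata.csv")          # path is illustrative

# Recover the {fsID}-{classID}-{occurrenceID}-{sliceID} components of the file name
parts = (meta["slice_file_name"].str.replace(".wav", "", regex=False)
                                 .str.split("-", expand=True))
meta[["fsID_name", "classID_name", "occurrenceID", "sliceID"]] = parts

# Sanity check: the numeric IDs map onto the textual class labels
assert meta["classID"].map(ID_TO_CLASS).equals(meta["class"])
print(meta.groupby(["fold", "class"]).size().unstack(fill_value=0))
```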
Dataset release
In this section, Table 1 summarizes the dataset specifications available in the repository.
Table 1. Dataset release specifications.
Topic | Description |
---|---|
Subject | Computer and Information Sciences |
Specific subject area | Recognition of environmental sounds in the context of autonomous vehicles using embedded systems for smart cities. |
Type of data | Audio (processed from online source). Additionally, a table (.CSV format) describing the audio metadata from the online source. |
Data collection | The primary data comes from the dataset US8K. Classes irrelevant to the purpose of the final dataset were removed, and less relevant classes were merged into a new class, keeping the same balance ratio as the primary dataset. An additional class was incorporated using audio samples sourced online, adhering to the same methodology as the original data collection (audio must be real-field recordings). The resultant dataset comprises 4,908 WAV files, totaling 4.94 hours of annotated audio samples, distributed among 6 classes and partitioned into 10 folds. Special attention was given during the partitioning process to prevent data leakage from audio samples originating from the same online source. |
Data source location | Online source (Freesound.org) of real-field recordings around the world. |
Data accessibility | Dataset is hosted in the Harvard Dataverse. Repository name: Harvard Dataverse. Data identification number: 10.7910/DVN/4D8WPK. Direct URL to data: https://doi.org/10.7910/DVN/4D8WPK. Public access is granted by consenting to the usual License/Data Use Agreement and Terms of Use (CC-BY-NC 4.0). |
Data Records
The dataset US8K_AV is available at the Harvard Dataverse20.
Classes description
The dataset US8K_AV comprises six classes, five of which pertain to urban sounds and one designated for silence. Four classes constitute the primary data: ‘dog_bark,’ ‘children_playing,’ ‘siren,’ and ‘car_horn.’ The fifth class, ‘background,’ was created by merging random samples from the primary data, while the ‘silence’ class was constructed using audio signals sourced from Freesound.org. A detailed description of each of these classes is provided below:
‘dog_bark’: characterized by a sharp and abrupt sound, produced by canines, often indicating territorial defense or alertness to intrusions. Contains high variation in pitch and frequency over time. 0.87 h within 1,000 audio samples;
‘children_playing’: this class encompasses a range of sounds such as laughter, shouting, and running, typically generated by groups of children engaged in recreational activities in outdoor or indoor environments. 1.10 h within 1,000 audio samples;
‘siren’: a loud, high-pitched, oscillating tone used primarily by emergency vehicles such as ambulances, fire trucks, and police cars to signal their presence and urgency in traffic. 1.01 h within 929 audio samples;
‘car_horn’: produced by motor vehicles, car horns generate a short, high-decibel honk intended to alert other drivers and pedestrians to potential hazards or to communicate driver intentions. 0.29 h within 429 audio samples;
‘background’: comprising a mixture of drilling, engine idling, jackhammer operations, and street music, this class represents ambient urban noise often encountered in metropolitan areas. The diverse soundscape serves as a backdrop that may mask or interfere with the detection of other distinct auditory events. 1.05 h within 1,000 audio samples;
‘silence’: characterized by a near-absence of discernible sound waves, this class represents periods of minimal or no auditory activity. Silence is essential for calibrating baseline noise levels and for distinguishing between meaningful audio events and background noise. Moreover, silence detection is useful for event segmentation. It allows the neural network to identify the boundaries between different sound events, making it easier to segment and analyze the audio data. It improves the overall performance of the ESR by providing clearer distinctions between different sound events. 0.61 h within 550 audio samples.
Dataset statistics
In this section, we provide detailed information about the dataset’s characteristics. The dataset, located in the folder labeled “data” (see the next subsection), comprises 4,908 WAV files with a total size of 3.27 GB and a cumulative duration of 4.94 hours of annotated audio samples. Table 2 presents the number of files, total size, minimum and maximum sampling rates, and average sample duration for each fold. It is important to note that while the sampling rates range from 8,000 Hz to 192,000 Hz, this diversity serves only to illustrate the variety in the audio recordings. All samples are resampled to 22,050 Hz when loaded for feature extraction, ensuring consistency and standardization.
Table 2. Number of files, total size, minimum and maximum sampling rates, and average sample duration for each fold of the US8K_AV dataset.
Fold | Folder name | Nr. files | Size (MB) | Min, Max sampling rates (Hz) | Av. sample duration (s) |
---|---|---|---|---|---|
1 | fold1 | 478 | 350.26 | 8,000 to 96,000 | 3.59 |
2 | fold2 | 485 | 334.86 | 11,025 to 96,000 | 3.58 |
3 | fold3 | 536 | 332.36 | 11,025 to 96,000 | 3.68 |
4 | fold4 | 599 | 422.08 | 11,025 to 96,000 | 3.63 |
5 | fold5 | 529 | 334.56 | 16,000 to 96,000 | 3.62 |
6 | fold6 | 460 | 276.28 | 11,025 to 96,000 | 3.62 |
7 | fold7 | 465 | 299.23 | 11,025 to 96,000 | 3.69 |
8 | fold8 | 441 | 279.12 | 8,000 to 192,000 | 3.58 |
9 | fold9 | 447 | 312.98 | 11,025 to 96,000 | 3.61 |
10 | fold10 | 468 | 328.68 | 11,025 to 96,000 | 3.62 |
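The resampling mentioned above can be reproduced directly with Librosa, whose loader resamples to a target rate on the fly; a minimal sketch, with an illustrative file path, is shown below.

```python
import librosa

TARGET_SR = 22050  # all samples are brought to this rate before feature extraction

def load_sample(wav_path):
    """Load one US8K_AV slice, resampling to 22,050 Hz and mixing down to mono."""
    y, sr = librosa.load(wav_path, sr=TARGET_SR, mono=True)
    return y, sr

# Example (path is illustrative):
# y, sr = load_sample("data/fold1/100032-3-0-0.wav")
# print(len(y) / sr, "seconds at", sr, "Hz")
```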
File structure
After downloading and unzipping the files from the repository, the dataset structure in its root directory will be as illustrated in Fig. 4, which provides a detailed depiction of the organization of the data folders. Complementarily, Fig. 5 demonstrates the equitable distribution of classes across the various folds, ensuring balanced representation within the dataset. The metadata file, US8K_AV_metadata.csv, provides the information needed to recreate this dataset from its original source at Freesound.org, following the procedure described in section Methods.
[See PDF for image]
Fig. 4
Data structure layout.
[See PDF for image]
Fig. 5
Class distribution among the folds of the US8K_AV dataset.
Technical Validation
The dataset and the following experiments were developed using Python 3.9, making extensive use of the Librosa library for audio processing26. The scripts for feature extraction, normalization, construction of the predictive models, quantization of the neural networks, and deployment on the Raspberry Pi are publicly available in a GitHub repository as part of our Master’s thesis project (see section Code availability).
Experimental design
We evaluated two distinct methodological approaches to identify the predictive algorithm that offers the best trade-off between average accuracy, standard deviation, and total prediction time (including feature extraction of the testing samples). To establish a benchmark of accuracy results for future researchers, we also conducted experiments with all the predictive models by considering the inference results for the entire audio signal (4 s) and for a sliding window approach that clips the audio signal into chunks of approximately 1 s, overlapping each other by 50%. The latter experiment is crucial for understanding the dataset’s usability in real-life applications, where minimizing total prediction time is essential while maintaining a high average accuracy and low standard deviation. Finally, we deployed the predictive models from experiment 1 (sliding window), both on the notebook and on the Raspberry Pi, to evaluate the best predictive algorithm using a spider web chart. These models were trained with the audio samples from folds 2 to 10 and tested with the data from fold 1. This choice was based on fold 1 appearing among the top five folds in the 10-fold cross-validation results.
Hardware and software
The algorithms for the predictive models were written using Python, either in PyCharm or in Jupyter notebooks, along with the scikit-learn library for machine learning techniques and ensemble method (https://scikit-learn.org/stable/), as well as the Keras library for neural networks construction (https://keras.io/) which is built on the Tensorflow framework (https://www.tensorflow.org/).
The training and testing of the models were performed on a notebook equipped with an Intel® Core™ i7-10850H CPU @ 2.70 GHz, 80 GB of RAM, and a Quadro T2000 graphics card with 4 GB of memory. The operating system was Windows 10 Enterprise. Notably, the GPU resources of the Quadro T2000 were instrumental in training the neural networks, leading to training speeds approximately four times faster than when exclusively utilizing the CPU.
The subject evaluation was carried out on a Raspberry Pi 4 Model B, with 8 GB of SDRAM and a quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.8 GHz. To capture live audio as input for the predictive model, we utilized an external microphone commonly used in the aftersales market (https://www.ckmova.com/lum2-p00120p1.html), model USB Lavalier LUM2 series, condenser type, omnidirectional, with a frequency response of 50 Hz to 20 kHz, a signal-to-noise ratio of 50 dB SPL, and a sensitivity of -34 dB ± 3 dB.
Experiments training the models
Here, we delineate the training configuration employed for the experiments, encompassing two distinct methodological approaches. The initial approach entails the development of handcrafted features14,27 as input for a predictive model using the classifiers most frequently cited in the environmental sound recognition literature, namely: Support Vector Machine (SVM), Logistic Regression (LR), the ensemble method of Random Forest (RF), Artificial Neural Network (ANN), and Convolutional Neural Network 1D (CNN 1D). The second approach leverages the technique of aggregated features13 as input for a Convolutional Neural Network 2D (CNN 2D).
In both approaches, we initially employed a normalization process28 comprising duration adjustment to set each original audio sample to 4 s, together with trimming of silence or near-silence at a 60 dB threshold. The dataset was evaluated using 10-fold cross-validation with predefined folds, as illustrated graphically in Fig. 6. In this method, we trained our models on data from 9 of the 10 predefined folds and tested them on the remaining fold. This process was repeated 10 times, each time using a different set of 9 folds for training and the remaining fold for testing. The classification metrics were reported as the average accuracy and standard deviation. By following this method, besides adhering to the primary data specifications21, we ensured that our models were evaluated on completely unseen data in each iteration, providing a more reliable measure of their true performance.
[See PDF for image]
Fig. 6
Stratified 10-fold cross-validation of the dataset.
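A minimal sketch of this normalization step, assuming Librosa's trim function with a 60 dB threshold and zero-padding or truncation to 4 s (the exact order of operations follows the repository scripts), is shown below.

```python
import numpy as np
import librosa

SR = 22050
TARGET_LEN = 4 * SR  # 4-second samples

def normalize_duration(y, sr=SR, top_db=60):
    """Trim leading/trailing near-silence (60 dB below peak) and
    pad or truncate the signal to exactly 4 seconds."""
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    if len(y_trimmed) < TARGET_LEN:                       # pad short clips with zeros
        y_trimmed = np.pad(y_trimmed, (0, TARGET_LEN - len(y_trimmed)))
    return y_trimmed[:TARGET_LEN]                         # truncate long clips
```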
Audio features
In the realm of audio analysis, audio features are typically categorized into two primary types: physical features and perceptual features. Physical features are quantitative metrics derived directly from the sound wave, such as the energy function, spectrum, cepstral coefficients, and fundamental frequency. In contrast, perceptual features pertain to how humans perceive sound, including characteristics like loudness, brightness, pitch, timbre, and rhythm11,29. Moreover, physical features can be further divided into temporal features and spectral features based on their signal domain. The selection of specific physical features for the ESR application presented below was based on an extensive literature review, with emphasis on the spectral features utilized by Lhoest et al.27, Silva et al.17, and Bountourakis, Vrysis, and Papanikolaou24.
Handcrafted features
In this approach, the audio features Zero-Crossing Rate (ZCR), Root Mean Square (RMS), Spectral Centroid (SC), Spectral Bandwidth (SB), Spectral Roll-off Point (SRP), Spectral Contrast (SCT), Tonnetz, Chroma feature, Mel spectrogram, Mel-Frequency Cepstral Coefficients (MFCC), and two feature manipulations, namely Delta (Δ) MFCC and Delta-Delta (ΔΔ) MFCC, were extracted using the Python library Librosa. Features with a single coefficient had their per-frame values summarized by the mean; features with more than one coefficient had the per-frame values of each coefficient summarized across time using five statistics, namely mean, median, standard deviation (std), skewness, and kurtosis. The exceptions were the Mel spectrogram, for which only the mean was calculated, and the Delta MFCC and Delta-Delta MFCC, for which the mean and std were calculated. The result was a vector containing 375 features, as shown in Table 3.
Table 3. Composition of the features array utilized in the machine learning, ANN, and CNN 1D classifiers after the feature extraction process.
Feature | Coefficients | Mean | Mean, std | Mean, std, median, skewness, kurtosis | ∑ |
---|---|---|---|---|---|
RMS | 1 | 1 | — | — | 1 |
ZCR | 1 | 1 | — | — | 1 |
SC | 1 | 1 | — | — | 1 |
SB | 1 | 1 | — | — | 1 |
SRP | 1 | 1 | — | — | 1 |
Mel | 128 | 1 | — | — | 128 |
Δ MFCC | 13 | — | 2 | — | 26 |
ΔΔ MFCC | 13 | — | 2 | — | 26 |
MFCC | 13 | — | — | 5 | 65 |
SCT | 7 | — | — | 5 | 35 |
Chroma | 12 | — | — | 5 | 60 |
Tonnetz | 6 | — | — | 5 | 30 |
Total | 375 |
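Following Table 3, the sketch below illustrates how such a 375-dimensional vector can be assembled with Librosa and SciPy; the exact extraction parameters and the ordering of the statistics follow the repository scripts, so this is an approximation of the published pipeline rather than a verbatim copy.

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

N_FFT, HOP = 1024, 512

def five_stats(m):
    """mean, std, median, skewness, kurtosis of each coefficient over time."""
    return np.concatenate([m.mean(axis=1), m.std(axis=1), np.median(m, axis=1),
                           skew(m, axis=1), kurtosis(m, axis=1)])

def handcrafted_vector(y, sr=22050):
    """Build a 375-dimensional feature vector following Table 3."""
    stft = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))
    mel = librosa.feature.melspectrogram(S=stft**2, sr=sr, n_mels=128)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)
    d1 = librosa.feature.delta(mfcc)            # Δ MFCC
    d2 = librosa.feature.delta(mfcc, order=2)   # ΔΔ MFCC
    scalars = [                                 # single-coefficient features (mean only)
        librosa.feature.rms(S=stft).mean(),
        librosa.feature.zero_crossing_rate(y, frame_length=N_FFT, hop_length=HOP).mean(),
        librosa.feature.spectral_centroid(S=stft, sr=sr).mean(),
        librosa.feature.spectral_bandwidth(S=stft, sr=sr).mean(),
        librosa.feature.spectral_rolloff(S=stft, sr=sr).mean(),
    ]
    vec = np.hstack([
        scalars,                                                       # 5
        mel.mean(axis=1),                                              # 128 Mel means
        d1.mean(axis=1), d1.std(axis=1),                               # 26
        d2.mean(axis=1), d2.std(axis=1),                               # 26
        five_stats(mfcc),                                              # 65
        five_stats(librosa.feature.spectral_contrast(S=stft, sr=sr)),  # 35
        five_stats(librosa.feature.chroma_stft(S=stft**2, sr=sr)),     # 60
        five_stats(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)),  # 30
    ])
    assert vec.shape == (375,)
    return vec
```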
The 10-fold cross-validation for the machine learning and ensemble method models utilized the standard parameters of the Python library scikit-learn, except for the following hyperparameters (a configuration sketch follows the list):
SVM: kernel = ‘linear’, C = 0.50 (Regularization parameter);
LR: solver = ‘saga’, C = 0.50 (Regularization parameter), max_iter = 500;
RF: criterion = ‘gini’, n_estimators = 500 (The number of trees in the forest), bootstrap = True (Whether bootstrap samples are used when building trees).
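A configuration sketch of these three classifiers and of the predefined-fold cross-validation is given below; variable names and the absence of any feature scaling are assumptions, with the authoritative settings in the repository scripts.

```python
import numpy as np
from sklearn.base import clone
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters as listed above; everything else is left at scikit-learn defaults.
classifiers = {
    "SVM": SVC(kernel="linear", C=0.50),
    "LR": LogisticRegression(solver="saga", C=0.50, max_iter=500),
    "RF": RandomForestClassifier(criterion="gini", n_estimators=500, bootstrap=True),
}

def cross_validate_predefined(X, y, folds, estimator):
    """10-fold cross-validation with the predefined US8K_AV folds.

    X: (n_samples, 375) handcrafted feature matrix, y: class labels,
    folds: per-sample fold numbers (1..10) taken from the metadata."""
    accuracies = []
    for k in range(1, 11):
        train, test = folds != k, folds == k
        model = clone(estimator).fit(X[train], y[train])
        accuracies.append(model.score(X[test], y[test]))
    return np.mean(accuracies), np.std(accuracies)
```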
Turning to the neural networks, the ANN was built using the Keras sequential model to define a Multilayer Perceptron, with the first dense layer receiving the 375-feature vector as input and using the ReLU activation function. The second and third dense layers were hidden, with 375 neurons and 750 neurons respectively, both using the ReLU activation function. A dropout layer with a 20% rate was added after each hidden layer to improve generalization, prevent overfitting, and make the model more robust by introducing randomness during training. The final dense layer, with 6 neurons (one for each class), performs the classification task, using the softmax activation function to assign probabilities to each class in the output layer. The Keras hyperparameters utilized in this model configuration (Fig. 7) were as follows (a Keras sketch follows the cross-validation parameters below):
loss = ‘categorical_crossentropy’;
optimizer = Adam;
learning_rate = 0.0001;
beta_1 = 0.5;
beta_2 = 0.999;
epsilon = 1e-0;
amsgrad = True.
[See PDF for image]
Fig. 7
Representation of the ANN architecture.
Cross-validation was performed with the following parameters after several iterations of grid-search evaluation based on the model’s accuracy results:
batch_size = 20;
epochs = 350;
EarlyStopping with monitor = ‘val_accuracy’, min_delta = 0.0001, patience = 150;
restore_best_weights = True.
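A Keras sketch of this ANN and its training configuration is shown below; the unit count of the input dense layer and the epsilon value (taken literally from the list above) are assumptions to be checked against the repository model.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_ann(n_features=375, n_classes=6):
    """Multilayer Perceptron of Fig. 7: input dense layer, two hidden layers
    (375 and 750 neurons) each followed by 20% dropout, softmax output."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(n_features, activation="relu"),
        layers.Dense(375, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(750, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(n_classes, activation="softmax"),
    ])
    optimizer = keras.optimizers.Adam(learning_rate=0.0001, beta_1=0.5, beta_2=0.999,
                                      epsilon=1e-0, amsgrad=True)  # epsilon as listed above
    model.compile(loss="categorical_crossentropy", optimizer=optimizer,
                  metrics=["accuracy"])
    return model

early_stop = keras.callbacks.EarlyStopping(monitor="val_accuracy", min_delta=0.0001,
                                           patience=150, restore_best_weights=True)
# model = build_ann()
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=20, epochs=350, callbacks=[early_stop])
```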
The architecture of the CNN 1D model built with Keras, as illustrated in Fig. 8, consisted of several layers and was compiled with a categorical cross-entropy loss function, the Adamax optimizer, and accuracy as the evaluation metric (a Keras sketch follows the cross-validation parameters below):
The first convolutional layer has 28 filters with a kernel size of 7 x 1, and it uses the ReLU activation function, receiving as input the 375 features vector;
The second convolutional layer follows, with 34 filters and a kernel size of 5 x 1. It also uses the ReLU activation function and applies L2 regularization to the kernel (0.001) and bias (0.01). Padding is set to ‘same’ with the default stride (1) to ensure the output size matches the input size;
The last convolutional layer has 56 filters with a kernel size of 3 x 1. The same parameters of the second convolutional layer are used for the activation function, and L2 regularization for the kernel and bias;
After the convolutional layers, max pooling is performed with a pool size of 2 x 1, reducing the dimensionality of the data by taking the maximum value within each window of size 2 x 1;
To reduce overfitting, a dropout layer is added to randomly set 20% of the input units to 0 at each update during training;
To prepare the data for the next step, a flatten layer reshapes the output of the previous layer into a 1-dimensional vector;
A hidden dense layer follows, with 50 neurons fully connected to the previous layer;
Finally, the classification task is performed in the last layer. It has a number of neurons equal to the number of output classes (6), and it uses the softmax activation function to output probabilities for each class.
[See PDF for image]
Fig. 8
Representation of the CNN 1D architecture.
In the same way as the ANN, the cross-validation was performed with the following parameters:
batch_size = 20;
epochs = 150;
EarlyStopping with monitor = ‘val_accuracy’, min_delta = 0.0001, patience = 50;
restore_best_weights = True.
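The corresponding Keras sketch of the CNN 1D is below; the 375-feature vector is treated as a one-channel sequence, and both the activation of the 50-neuron dense layer and the padding of the third convolutional layer are assumptions not stated above.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_cnn1d(n_features=375, n_classes=6):
    """CNN 1D of Fig. 8, fed with the handcrafted vector reshaped to (375, 1)."""
    model = keras.Sequential([
        keras.Input(shape=(n_features, 1)),
        layers.Conv1D(28, 7, activation="relu"),
        layers.Conv1D(34, 5, activation="relu", padding="same",
                      kernel_regularizer=regularizers.l2(0.001),
                      bias_regularizer=regularizers.l2(0.01)),
        layers.Conv1D(56, 3, activation="relu", padding="same",
                      kernel_regularizer=regularizers.l2(0.001),
                      bias_regularizer=regularizers.l2(0.01)),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(50, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adamax",
                  metrics=["accuracy"])
    return model

early_stop = keras.callbacks.EarlyStopping(monitor="val_accuracy", min_delta=0.0001,
                                           patience=50, restore_best_weights=True)
# model = build_cnn1d()
# model.fit(X_train[..., None], y_train, validation_data=(X_val[..., None], y_val),
#           batch_size=20, epochs=150, callbacks=[early_stop])
```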
Aggregated features
The second approach leverages the renowned capabilities of Convolutional Neural Networks (CNNs) in two-dimensional image classification tasks by utilizing a pseudo-image format based on log-mel spectrograms, along with their first and second derivatives, known as Delta (Δ) and Delta-Delta (ΔΔ), respectively. The log-mel spectrogram provides a visual representation of an audio signal’s frequency content over time, computed using the Short-Time Fourier Transform (STFT) and triangular filters. In this representation, the frequency axis (Y-axis) is scaled according to the Mel scale, which approximates human auditory perception30, while the amplitude values are converted to a logarithmic scale (dB). The shape of the pseudo-image depends on the audio duration, represented by the number of frames containing the audio samples. The number of frames (X-axis) is defined in experiments 1 and 2; for clarity, we consider here 44 frames (equivalent to 1 s of audio; 4 s of audio results in 173 frames) and 60 Mel bands, resulting in a log-mel spectrogram of dimensions 60 x 44.
The first and second derivatives are computed to capture temporal dynamics and the acceleration of spectral changes, respectively, while retaining identical dimensions as the original log-mel spectrogram. These three arrays are subsequently stacked along the frequency band axis, producing a composite array with dimensions 180 x 44 (Fig. 9). By feeding this pseudo-image into the CNN, the resultant feature map encapsulates both static spectral properties and dynamic temporal variations, thereby enhancing the discriminative power of the classifier.
[See PDF for image]
Fig. 9
Log-mel spectrogram aggregated with first and second derivatives.
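A minimal Librosa sketch of this aggregation, assuming 60 Mel bands and the frame parameters defined in experiments 1 and 2, is shown below.

```python
import numpy as np
import librosa

SR, N_FFT, HOP, N_MELS = 22050, 1024, 512, 60

def logmel_stack(y, sr=SR):
    """Log-mel spectrogram stacked with its first and second derivatives.
    A 1-second window yields a (180, 44) pseudo-image; a 4-second sample
    yields (180, 173)."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=N_FFT,
                                         hop_length=HOP, n_mels=N_MELS)
    logmel = librosa.power_to_db(mel)                  # amplitudes on a dB scale
    delta = librosa.feature.delta(logmel)              # Δ: temporal dynamics
    delta2 = librosa.feature.delta(logmel, order=2)    # ΔΔ: acceleration
    return np.vstack([logmel, delta, delta2])          # stack along the band axis

# logmel_stack(y[:SR]).shape  -> (180, 44) for one 1-second clip
```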
For the CNN 2D architecture, we adapted the model from Su et al.13, employing four convolutional layers in contrast to the six layers proposed in the original design. We also made a few modifications to the stride values in specific layers and implemented additional dropouts. The model illustrated in Fig. 10 was compiled using a categorical cross-entropy loss function, a stochastic gradient descent optimizer (learning rate = 0.001 and momentum = 0.9), and accuracy as the evaluation metric (a Keras sketch follows the cross-validation parameters below):
The first convolutional layer has 32 filters with a kernel size of 3 x 3 and stride of 2 x 2 using ReLU activation function, followed by batch normalization;
The second convolutional layer is identical to the first one, but followed by a max-pooling layer with a pool stride of 2 x 2 to reduce the dimensions of the convolutional feature maps and a dropout of 50% to reduce overfitting;
The third convolutional layer follows, with 64 filters, a kernel size of 3 x 3, and a stride of 1 x 1, using the ReLU activation function and followed by batch normalization;
Similar to the second layer, the fourth convolutional layer is identical to the third one, but followed by a max-pooling layer with the pool stride of 2 x 2 and a dropout of 50%;
In preparation for the next step, a flatten layer reshapes the output of the previous layer into a 1-dimensional vector;
Before the final output, a fully connected dense layer with 1,024 neurons is added followed by another dropout layer set to 50%;
Finally, the classification task is performed in the last layer. It has a number of neurons equal to the number of output classes (6), and it uses the softmax activation function to output probabilities for each class.
[See PDF for image]
Fig. 10
Representation of the CNN 2D architecture.
The 10-fold cross-validation was performed with the following parameters:
batch_size = 32;
epochs = 100;
EarlyStopping with monitor = ‘val_accuracy’, min_delta = 0.0001, patience = 20;
restore_best_weights = True.
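A Keras sketch of this adapted CNN 2D is given below; the ‘same’ padding and the ReLU activation of the 1,024-neuron dense layer are assumptions (they let the sketch accept both the 1-second, 180 x 44, and 4-second, 180 x 173, pseudo-images), and the authoritative layer settings are those in the repository.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn2d(input_shape=(180, 44, 1), n_classes=6):
    """CNN 2D of Fig. 10, adapted from Su et al., fed with the stacked
    log-mel pseudo-image (bands x frames x 1 channel)."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), strides=(2, 2), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv2D(32, (3, 3), strides=(2, 2), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
        layers.Dropout(0.5),
        layers.Conv2D(64, (3, 3), strides=(1, 1), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv2D(64, (3, 3), strides=(1, 1), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    sgd = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
    model.compile(loss="categorical_crossentropy", optimizer=sgd, metrics=["accuracy"])
    return model
```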
Experiment 1
Evaluation of the predictive models using a sliding window, aiming, first, to embed the best model in the Raspberry Pi (more relevant in the context of autonomous vehicles due to the reduced prediction time), and second, to establish a benchmark of results for future works. This experiment leveraged the technique of frame-level normalization17, which involves dividing the audio samples into n short audio clips. Each clip lasts around 1 s and overlaps the previous clip by 50%, comprising 44 frames per window (each frame approximately 46 ms) at a sampling rate of 22,050 Hz. The feature extraction process utilized a frame size of 1,024 samples, a hop length of 512 samples, and the Hann windowing technique. Given that the duration of the audio sample is adjusted to 4 s and the aforementioned overlap, the number of audio clips or windows remains constant at 7, as depicted in Fig. 11.
[See PDF for image]
Fig. 11
Frame level normalization and feature extraction for a random sample.
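The windowing itself reduces to simple slicing; a minimal sketch, reusing the logmel_stack helper from the aggregated-features sketch above, is shown below.

```python
import numpy as np

SR = 22050
WIN = SR              # ~1-second window (44 frames with the parameters above)
HOP_WIN = WIN // 2    # 50% overlap

def sliding_windows(y):
    """Split a 4-second sample into 1-second clips with 50% overlap;
    a 4 s signal at 22,050 Hz yields exactly 7 windows."""
    starts = range(0, len(y) - WIN + 1, HOP_WIN)
    return np.stack([y[s:s + WIN] for s in starts])

# windows = sliding_windows(y_4s)                              # shape (7, 22050)
# X = np.stack([logmel_stack(w) for w in windows])[..., None]  # (7, 180, 44, 1) for the CNN 2D
```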
Experiment 2
Evaluation of the predictive models using a single window to establish a benchmark of results for future works. This experiment uses the complete audio sample (Fig. 12) as input for the feature extraction process, also with a frame size of 1,024 samples, a hop length of 512 samples, and the Hann windowing technique. With these parameters, Librosa outputs 173 frames.
[See PDF for image]
Fig. 12
Single window or complete audio feature extraction for a random sample.
Experiment 3
In this section, we focused on deploying the predictive models onto a Raspberry Pi to evaluate their accuracy and real-time processing capabilities. Predictive models were saved in specific file formats: .PKL for SVM, LR, and RF, and .HDF5 for the neural networks (ANN, CNN 1D, and CNN 2D). The evaluation involved several key metrics, including response time (total prediction time) and processing memory (RAM), which are crucial for real-time audio signal processing. Total prediction time is the sum of the feature extraction time and the classification time; it does not include the time needed to digitize the audio in real-time inference. The models were trained using folds 2 through 10, with fold 1 reserved for testing. This approach was validated by fold 1’s performance, which in our experiments showed an accuracy closely matching the overall cross-validation accuracy. Neural network models underwent quantization using TensorFlow Lite without significant accuracy loss.
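A minimal sketch of this conversion, assuming TensorFlow Lite's default dynamic-range optimization (the exact quantization settings are those in the repository scripts), is shown below.

```python
import tensorflow as tf

def quantize_keras_model(model, out_path="cnn2d_quant.tflite"):
    """Convert a trained Keras model to a quantized TensorFlow Lite model
    suitable for deployment on the Raspberry Pi."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]   # dynamic-range quantization
    tflite_model = converter.convert()
    with open(out_path, "wb") as f:
        f.write(tflite_model)
    return out_path
```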
For comparative analysis between the notebook and the Raspberry Pi, both devices were kept under similar operating conditions to ensure consistent results across different processor architectures. Although this was not a quantitative comparison per se, essential factors such as processor architecture and resource management were verified at a high level. Additionally, the flash memory requirements, representing the size of the predictive models, were also evaluated: LR (19.0 kB), CNN 1D (6,218.0 kB), ANN (8,928.0 kB), CNN 2D (11,920.0 kB), SVM (32,761.0 kB), and RF (249,522.0 kB). These sizes were normalized alongside RAM and total prediction time to inform the choice of the predictive algorithm embedded in the Raspberry Pi for real-world testing.
Classification results
This subsection presents the results encompassing both methodological approaches. Experiments 1 and 2 utilized the sliding window process and the entire audio signal, while experiment 3 encompasses only the sliding window process. Table 4 summarizes the results of the 10-fold cross-validation, establishing a benchmark of average accuracy results for the US8K_AV dataset given the defined classifiers.
Table 4. Compilation of the feature extraction processes and top-performing classifiers on the US8K_AV dataset.
Features - Classifier | Experiment 1: Sliding window (1 s) accuracy | (Std) | Experiment 2: Single window (4 s) accuracy | (Std) |
---|---|---|---|---|
Handcrafted - SVM | 78.96% | (1.82%) | 83.14% | (2.21%) |
Handcrafted - LR | 78.87% | (1.49%) | 82.27% | (2.74%) |
Handcrafted - RF | 75.59% | (2.72%) | 79.94% | (2.63%) |
Handcrafted - ANN | 77.93% | (3.29%) | 83.35% | (2.91%) |
Handcrafted - CNN 1D | 76.69% | (3.74%) | 82.40% | (2.72%) |
Aggregated - CNN 2D | 80.03% | (2.71%) | 82.78% | (3.60%) |
Table 5 compiles the results of the predictive algorithms deployed on the Raspberry Pi. These results, along with the flash memory values, were normalized to create the spider web chart depicted in Fig. 13. This chart substantiated the decision to deploy the predictive algorithm with a CNN 2D for the next phase of experiments involving real-world inferences.
Table 5. Processing memory (RAM) and total prediction time.
Classifier | Notebook RAM (MB) | (Std) | Notebook total pred. time (ms) | (Std) | Raspberry Pi RAM (MB) | (Std) | Raspberry Pi total pred. time (ms) | (Std) |
---|---|---|---|---|---|---|---|---|
SVM | 34.1 | (4.7) | 223.4 | (11.4) | 32.4 | (8.2) | 764.4 | (39.6) |
LR | 1.1 | (0.1) | 220.7 | (11.8) | 0.2 | (0.03) | 749.2 | (14.6) |
RF | 401.1 | (65.4) | 227.2 | (12.9) | 456.0 | (11.3) | 1,020.9 | (31.1) |
ANN | 20.6 | (9.2) | 250.1 | (16.0) | 0.1 | (0.02) | 752.5 | (12.5) |
CNN 1D | 7.5 | (0.5) | 252.1 | (20.0) | 4.1 | (1.9) | 750.0 | (13.6) |
CNN 2D | 11.7 | (0.2) | 30.2 | (7.2) | 19.4 | (2.5) | 47.6 | (22.0) |
Comparison between the predictive algorithms in the notebook and Raspberry Pi.
[See PDF for image]
Fig. 13
Spider web chart of the predictive algorithms.
Table 6 supports the approaches described in the "Strategic data merging" and "Adding a new class" subsections of the Methods section by comparing the classification metrics of the relevant classes between the US8K and US8K_AV datasets. The columns labeled ‘Imp.’ (improvement) and ‘Wtd.’ (weighted average of improvement) were calculated based on the values of the F1-score column.
Table 6. Classification metrics comparison of the relevant classes between the US8K and US8K_AV datasets using fold 1 as validation set.
Model | Class | US8K Prec. | US8K Recall | US8K F1-score | US8K_AV Prec. | US8K_AV Recall | US8K_AV F1-score | Suppt. | Imp. | Wtd. |
---|---|---|---|---|---|---|---|---|---|---|
SVM | car_horn | 64% | 88% | 74% | 81% | 96% | 88% | 252 | 19% | 9.58% |
children_playing | 58% | 82% | 68% | 71% | 81% | 76% | 700 | 12% | ||
dog_bark | 76% | 78% | 77% | 85% | 78% | 81% | 700 | 5% | ||
siren | 82% | 65% | 73% | 85% | 75% | 79% | 602 | 8% | ||
LR | car_horn | 67% | 85% | 75% | 84% | 86% | 82% | 252 | 9% | 6.36% |
children_playing | 63% | 80% | 70% | 73% | 83% | 78% | 700 | 11% | ||
dog_bark | 85% | 77% | 81% | 84% | 79% | 81% | 700 | 0% | ||
siren | 82% | 71% | 76% | 85% | 77% | 81% | 602 | 7% | ||
RF | car_horn | 88% | 76% | 82% | 89% | 76% | 82% | 252 | 0% | 16.61%
children_playing | 54% | 79% | 64% | 72% | 85% | 78% | 700 | 22% | ||
dog_bark | 54% | 81% | 65% | 85% | 80% | 82% | 700 | 26% | ||
siren | 81% | 78% | 79% | 91% | 77% | 84% | 602 | 6% | ||
ANN | car_horn | 72% | 92% | 81% | 84% | 94% | 89% | 252 | 10% | 15.55% |
children_playing | 66% | 71% | 68% | 75% | 82% | 79% | 700 | 16% | ||
dog_bark | 52% | 81% | 63% | 70% | 83% | 76% | 700 | 21% | ||
siren | 69% | 74% | 71% | 89% | 71% | 79% | 602 | 11% | ||
CNN 1D | car_horn | 68% | 85% | 76% | 87% | 90% | 88% | 252 | 16% | 12.31% |
children_playing | 61% | 71% | 65% | 71% | 73% | 72% | 700 | 11% | ||
dog_bark | 56% | 84% | 67% | 73% | 81% | 77% | 700 | 15% | ||
siren | 76% | 70% | 73% | 87% | 74% | 80% | 602 | 10% | ||
CNN 2D | car_horn | 84% | 75% | 79% | 80% | 82% | 83% | 252 | 5% | 8.86% |
children_playing | 64% | 84% | 73% | 72% | 87% | 81% | 700 | 11% | ||
dog_bark | 72% | 79% | 76% | 91% | 76% | 83% | 700 | 9% | ||
siren | 87% | 73% | 79% | 87% | 84% | 85% | 602 | 8% |
Applications and limitations
The dataset’s limitations are intrinsically linked to its intended application within the scope of the study: a swarm-intelligent, multifunctional, fully autonomous robot vehicle with a maximum speed of 30 km/h, specifically designed for deployment in smart city environments. The selection of relevant classes was guided by the project’s current maturity level and its experimental testing in a controlled laboratory environment at Deutsche Bank Park. Several use cases for the autonomous vehicles, such as street cleaning, garden maintenance, and people transportation (taxis), were considered, with an emphasis on the most significant scenarios:
Children playing: in this context, where children might be playing behind fences, bushes, or trees, all visual sensors may be obstructed. Acoustic detection of ‘children_playing’ is critical for the vehicle to exercise caution or reduce speed in such areas, enhancing safety precautions, especially for the autonomous applications of street cleaning or garden maintenance;
Car horn and dog bark: when approaching a small crossing or emerging from behind a building, the autonomous vehicle may lack visual contact with other road users coming from the left or right. Detecting sounds like a ‘car_horn’ or ‘dog_bark’ enables the vehicle to anticipate and respond to these unseen situations, ensuring safer navigation. The use case of ‘children_playing’ is also valid for ‘dog_bark’;
Siren: acoustic detection of a ‘siren’ allows the vehicle to be alerted to emergency vehicles from a greater distance than visual sensors can provide. This capability sets the vehicle into an attention mode, facilitating compliance with traffic rules and enhancing overall situational awareness.
By presenting the US8K_AV dataset, we aim to encourage future researchers and practitioners to augment it with additional classes related to the aforementioned context and other types of autonomous vehicles, and to collect data in different environmental conditions. This expansion could further enhance the dataset’s applicability to diverse and complex environments.
Drawing from the EU-funded I-SPOT project (https://i-spot-project.eu/) and related works by Yin et al.31, integrating acoustic sensing into autonomous vehicles, for example through efficient sensor placement and the development of sound event detection algorithms, can significantly enhance environmental awareness. Nevertheless, the regulatory context must be considered. European efforts currently focus mainly on Level 1 and 2 automation according to SAE J3016, such as ADAS features that support driver assistance and partial automation (Automatic Emergency Braking (AEB), Lane Keeping Assistance (LKA), Intelligent Speed Assistance (ISA), Driver Drowsiness and Attention Warning, Event Data Recorders, and Reversing Detection Systems). Even at the current regulatory levels of automation, we see potential application of our dataset in Reversing Detection Systems (technologies such as reversing cameras or sensors to improve safety when backing up), integrating the feedback of the ESR algorithm into a sensor fusion scheme to improve the reliability of this feature with respect to blind spots (use case: a dog hidden under a car barks while the autonomous vehicle is reversing; the system proceeds with additional caution, slower than originally set, and a message informs the occupants about the potential hazard to road users).
While there is some regulatory movement toward Level 3, comprehensive regulations for Levels 4 and 5 are still evolving. Our dataset aims to contribute to this advancement by providing data relevant to these higher automation levels, recognizing the potential for acoustic information to support safety in more automated systems, such as the ADAS features for:
Traffic Jam Pilot:
Car horn: helps detect traffic congestion or frustrated drivers, signaling the system to anticipate slowdowns and adapt speed accordingly;
Highway Pilot:
Siren: recognizes emergency vehicles approaching from behind or from a distance, enabling the system to perform timely lane changes or adjust speed.
Urban Pilot and Intersection Handling:
Children playing: identifies sounds of children near roads, prompting increased caution and potentially reducing speed in residential areas or near schools;
Dog bark: detects the presence of animals near streets, alerting the system to be cautious of potential road crossings;
Remote Parking Applications:
Car horn and dog bark: alerts to nearby vehicles or the presence of pedestrians/animals when parking, improving safety;
Siren: ensures the system does not obstruct emergency routes during parking maneuvers.
Redundancy Feature: integrating acoustic sensing provides an essential redundancy feature for other sensors like radar, lidar, and cameras. In scenarios where visual or radar sensors might be obstructed or fail, acoustic detection can ensure that the vehicle maintains a high level of safety and reliability. Acoustic data can complement these traditional sensors by filling in observational gaps, offering a robust and versatile sensing solution that enhances overall system performance in real-world applications.
Moreover, there is potential for acoustic information to enhance autonomous systems within complex mixed-traffic environments, such as those in India or Southeast Asia. Prior work, such as the study by Veeraraghavan and Ranga Charan32, explored using acoustic signals with a CNN 2D model to support semi-autonomous vehicles, highlighting the potential of acoustic-based architectures where visual sensor limitations are common. This underscores our intention for the dataset to serve as a foundation for further research, potentially extending its applicability to diverse and heterogeneous traffic scenarios.
It is noteworthy that, while synthetic data can be valuable for augmenting datasets, especially in situations where real data is scarce or specific scenarios need to be simulated, it often fails to fully capture the intricacies and unpredictability of real-world environments. In the context of this study, this limitation is significant because synthetic data might not reflect the authentic variations and noise present in urban settings. Real recordings, on the other hand, provide a more accurate representation of these complexities, leading to models that are more robust and generalizable. By utilizing real-world data, we ensure that predictive models are better equipped to handle the diverse challenges they will encounter in practical applications, thereby enhancing their reliability and effectiveness in ensuring safety and performance.
Conclusion (Value of the data)
Based on the results of experiments 1, 2, and 3, we have determined the following value for the data:
Unique dataset for autonomous vehicle research: the US8K_AV dataset offers a unique combination of environmental sounds specifically tailored for autonomous vehicle applications. By including classes such as ‘car_horn’, ‘siren’, ‘dog_bark’, and ‘children_playing’, along with a newly created ‘silence’ class, this dataset addresses real-world scenarios that autonomous vehicles in the context of smart cities are likely to encounter. Moreover, recognizing silence can contribute to energy efficiency. For instance, in autonomous vehicles, detecting silence can enable the system to reduce power consumption or enter a low-power state when no relevant sounds are present, acting in redundancy with other sensors and thereby extending battery life. Other researchers can leverage this improved structure to enhance their own sound recognition models, improving the auditory perception capabilities of autonomous systems;
Improved algorithm usability and accuracy: the inclusion of a ‘silence’ class and the merging of less relevant classes into a ‘background’ category improved the accuracy and usability of sound recognition algorithms. This methodological enhancement ensures that algorithms trained on this dataset can more reliably distinguish between critical sounds, background noise, and silent periods among the relevant classes, reducing false positives and increasing the overall accuracy of the algorithm, which is crucial for developing robust autonomous vehicle systems;
Standardized protocols: the dataset creation process follows a rigorous scientific protocol that ensures the integrity and compatibility of the new classes with the original US8K dataset. Detailed documentation of this protocol was provided in this article, offering a replicable methodology for other researchers;
Broad applicability beyond autonomous vehicles: while specifically designed for autonomous vehicle applications, the methodology utilized to create the US8K_AV dataset has broader implications for various fields within artificial intelligence and robotics. Researchers working on environmental sound recognition in different contexts—such as smart cities, surveillance systems, or personal assistants—can also benefit from this approach.
Usage Notes
This dataset has been specifically designed to aid in the development and evaluation of Environmental Sound Recognition (ESR) algorithms for embedded systems in autonomous vehicles. It was tailored for applications within smart cities, focusing on early warning systems that can operate efficiently under hardware constraints, such as Raspberry Pi. Researchers are encouraged to utilize Python-based libraries such as TensorFlow or PyTorch for training and testing ESR models on this dataset as well as Librosa for audio signal manipulations. The dataset has been extensively tested with neural network models, particularly with CNN 2D using aggregated features, demonstrating superior performance in our benchmarks. Nevertheless, given more constrained resource applications, machine learning techniques such as logistic regression (LR) can also achieve similar accuracy results, albeit with a higher total prediction time.
Currently, there is no specific dataset in the literature designed for autonomous vehicle applications, especially in the context of smart cities. Nevertheless, when comparing this dataset with other ESR datasets, it is essential to align the feature extraction methods, and it is highly recommended to consider retraining classifiers using transfer learning techniques.
To achieve real-time inference on embedded devices, it is recommended to normalize audio samples to a consistent amplitude range to ensure uniformity across different recording conditions. The feature extraction process is critical for the total prediction time; therefore, researchers should keep it as simple as possible and balance the trade-offs of more complex architectures, since we have demonstrated that using log-mel spectrograms as input features is effective in capturing relevant sound characteristics with low total prediction time and good accuracy rates. Notably, it is vital to implement a sliding window approach with a window size that balances accuracy and latency based on the hardware capabilities and end application.
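As an illustration of such an embedded inference step, the sketch below runs one sliding-window clip through a quantized TensorFlow Lite model; the model path and the (180, 44) input shape are illustrative and depend on the chosen architecture and window size.

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="cnn2d_quant.tflite")  # illustrative path
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def predict_window(pseudo_image):
    """Classify one 1-second window (log-mel + Δ + ΔΔ, shape (180, 44))."""
    x = pseudo_image[np.newaxis, ..., np.newaxis].astype(inp["dtype"])
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])[0]   # per-class probabilities
```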
Acknowledgements
We would like to thank the entire staff at FEI and the work colleagues at EDAG GmbH. Special thanks to Mr. Martin Vollmer and Mr. Alexandre Sberveglieri for their invaluable contributions to our professional development, and to Mr. Johannes Barckmann, Mr. Michael Jahn, and Mr. Maximilan Happel for their initial support in this project. Additionally, we confirm that this research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author contributions
A.L.F. led the conceptualization and methodology development, curated and processed the data, developed the software, and wrote the original draft of the manuscript. E.L.D. conducted software validation and contributed to reviewing and editing the manuscript. P.T.A.J. supervised the project and contributed to reviewing and editing the manuscript. All authors have read and approved the final manuscript for publication.
Code availability
This dataset was created as part of a master’s thesis project, with the entire project accessible on GitHub at:
• https://github.com/alf2001br/Master_thesis_Andre_Luiz_Florentino_project/.
Sample code snippets and processing workflows are available upon request to assist researchers in understanding and leveraging the dataset effectively. These include scripts for data preprocessing, model training, and evaluation metrics calculation. Among the resources available, the most relevant for creating the US8K_AV dataset are:
• The Jupyter notebook written in Python 03_New_dataset_US8K_AV.ipynb;
• The CSV file US8K_AV_silence.csv for mapping the folds for the silence class.
The Jupyter notebook above provides a detailed, step-by-step guide on merging classes and adding new samples, enabling other researchers to easily replicate or adapt the methodology for their own research needs.
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Mesaros, A et al. Sound event detection in the dcase 2017 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing; 2019; 27, pp. 992-1006. [DOI: https://dx.doi.org/10.1109/TASLP.2019.2907016]
2. Vidaña-Vila, E; Navarro, J; Borda-Fortuny, C; Stowell, D; Alsina-Pagès, RM. Low-cost distributed acoustic sensor network for real-time urban sound monitoring. Electronics; 2020; 9, article 2119. [DOI: https://dx.doi.org/10.3390/electronics9122119]
3. Sharma, J., Granmo, O.-C. & Goodwin, M. Emergency detection with environment sound using deep convolutional neural networks. In Proceedings of Fifth International Congress on Information and Communication Technology, 1, 144–154, https://doi.org/10.1007/978-981-15-5859-7_14 (Springer Singapore, Singapore, 2021).
4. Pandya, S; Ghayvat, H. Ambient acoustic event assistive framework for identification, detection, and recognition of unknown acoustic events of a residence. Advanced Engineering Informatics; 2021; 47, 101238. [DOI: https://dx.doi.org/10.1016/j.aei.2020.101238]
5. Huang, J. Z., Chhabria, H. & Jain, D. “not there yet”: Feasibility and challenges of mobile sound recognition to support deaf and hard-of-hearing people. In Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility, 25, article 15, https://doi.org/10.1145/3597638.3608431 (Association for Computing Machinery, New York, NY, USA, 2023).
6. Jeantet, L; Dufourq, E. Improving deep learning acoustic classifiers with contextual information for wildlife monitoring. Ecological Informatics; 2023; 77, 102256. [DOI: https://dx.doi.org/10.1016/j.ecoinf.2023.102256]
7. Fukuyama, K et al. Identification of respiratory sounds collected from microphones embedded in mobile phones. Advanced Biomedical Engineering; 2022; 11, pp. 58-67. [DOI: https://dx.doi.org/10.14326/abe.11.58]
8. Saraubon, K., Anurugsa, K. & Kongsakpaibul, A. A smart system for elderly care using iot and mobile technologies. In Proceedings of the 2018 2nd International Conference on Software and E-Business, 2, 59–63, https://doi.org/10.1145/3301761.3301769 (Association for Computing Machinery, New York, NY, USA, 2018).
9. Branding, J; von Hörsten, D; Wegener, JK; Böckmann, E; Hartung, E. Towards noise robust acoustic insect detection: from the lab to the greenhouse. KI - Kunstliche Intelligenz; 2023; 37, pp. 157-173. [DOI: https://dx.doi.org/10.1007/s13218-023-00812-x]
10. Jeong, G., Ahn, C. R. & Park, M. Constructing an audio dataset of construction equipment from online sources for audio-based recognition. In Proceedings - Winter Simulation Conference, 2022-December, 2354–2364, https://doi.org/10.1109/WSC57314.2022.10015388 (IEEE, New York, NY, USA, 2022).
11. Alías, F; Socoró, JC; Sevillano, X. A review of physical and perceptual feature extraction techniques for speech, music and environmental sounds. Applied Sciences; 2016; 6, article 143. [DOI: https://dx.doi.org/10.3390/app6050143]
12. Abayomi-Alli, OO; Damaševičius, R; Qazi, A; Adedoyin-Olowe, M; Misra, S. Data augmentation and deep learning methods in sound classification: A systematic review. Electronics; 2022; 11, article 3795. [DOI: https://dx.doi.org/10.3390/electronics11223795]
13. Su, Y; Zhang, K; Wang, J; Zhou, D; Madani, K. Performance analysis of multiple aggregated acoustic features for environment sound classification. Applied Acoustics; 2020; 158, 107050. [DOI: https://dx.doi.org/10.1016/j.apacoust.2019.107050]
14. Luz, JS; Oliveira, MC; Araújo, FH; Magalhães, DM. Ensemble of handcrafted and deep features for urban sound classification. Applied Acoustics; 2021; 175, 107819. [DOI: https://dx.doi.org/10.1016/j.apacoust.2020.107819]
15. Abreha, G. T. An environmental audio-based context recognition system using smartphones. Master’s thesis (master of science in embedded systems), University of Twente, Enschede, Netherlands (2014).
16. Nordby, J. O. Environmental sound classification on microcontrollers using Convolutional Neural Networks. Master’s thesis (master of science in data science), Norwegian University of Life Sciences, Ås, Oslo, Norway (2019).
17. da Silva, B; Happi, AW; Braeken, A; Touhafi, A. Evaluation of classical machine learning techniques towards urban sound recognition on embedded systems. Applied Sciences (Switzerland); 2019; 9, 3885. [DOI: https://dx.doi.org/10.3390/app9183885]
18. Lamrini, M; Chkouri, MY; Touhafi, A. Evaluating the performance of pre-trained convolutional neural network for audio classification on embedded systems for anomaly detection in smart cities. Sensors; 2023; 23, article 6227. [DOI: https://dx.doi.org/10.3390/s23136227]
19. Marchegiani, L; Newman, P. Listening for sirens: Locating and classifying acoustic alarms in city scenes. IEEE Transactions on Intelligent Transportation Systems; 2022; 23, pp. 17087-17096. [DOI: https://dx.doi.org/10.1109/TITS.2022.3158076]
20. Florentino, A. L. Us8k_av: Dataset for environmental sound recognition in embedded systems for autonomous vehicles. Harvard Dataverse, https://doi.org/10.7910/DVN/4D8WPK (2024).
21. Salamon, J., Jacoby, C. & Bello, J. P. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM international conference on Multimedia, 22, 1041–1044, https://doi.org/10.1145/2647868.2655045 (Association for Computing Machinery, New York, NY, USA, 2014).
22. Piczak, K. J. Esc: Dataset for environmental sound classification. Harvard Dataverse, https://doi.org/10.7910/DVN/YDEPUT (2015).
23. Mesaros, A. et al. Dcase 2017 challenge setup: Tasks, datasets and baseline system, https://inria.hal.science/hal-01627981 (2017).
24. Bountourakis, V., Vrysis, L. & Papanikolaou, G. Machine learning algorithms for environmental sound recognition: Towards soundscape semantics. In Proceedings of the Audio Mostly 2015 on Interaction With Sound, article 5, 1–7, https://doi.org/10.1145/2814895.2814905 (Association for Computing Machinery, New York, NY, USA, 2015).
25. Font, F., Roma, G. & Serra, X. Freesound technical demo. In Proceedings of the 21st ACM International Conference on Multimedia, 21, 411–412, https://doi.org/10.1145/2502081.2502245 (Association for Computing Machinery, New York, NY, USA, 2013).
26. McFee, B. et al. librosa: Audio and music signal analysis in python. In Huff, K. & Bergstra, J. (eds.) Proceedings of the 14th Python in Science Conference, 18–24, https://doi.org/10.25080/Majora-7b98e3ed-003 (scipy.org, Austin, Texas, USA, 2015).
27. Lhoest, L et al. Mosaic: A classical machine learning multi-classifier based approach against deep learning classifiers for embedded sound classification. Applied Sciences (Switzerland); 2021; 11, article 8394. [DOI: https://dx.doi.org/10.3390/app11188394]
28. Mushtaq, Z; Su, SF. Efficient classification of environmental sounds through multiple features aggregation and data enhancement techniques for spectrogram images. Symmetry; 2020; 12, article 1822. [DOI: https://dx.doi.org/10.3390/sym12111822]
29. Zhang, T. & Kuo, C.-C. J. Content-based audio classification and retrieval for audiovisual data parsing (Springer, New York, NY, 2010), 1 edn.
30. Moore, B. An introduction to the psychology of hearing (Brill, Leiden, Netherlands, 2013), 1 edn.
31. Yin, J., Damiano, S., Verhelst, M., van Waterschoot, T. & Guntoro, A. Real-time acoustic perception for automotive applications. In Proceedings of 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1, 1–6, https://doi.org/10.23919/DATE56975.2023.10137209 (IEEE, New York, NY, USA, 2023).
32. Veeraraghavan, A. K. & Ranga Charan, S. Soc based acoustic controlled semi autonomous driving system. In Pandian, A. P., Senjyu, T., Islam, S. M. S. & Wang, H. (eds.) Proceeding of the International Conference on Computer Networks, Big Data and IoT (ICCBI - 2018), 1, 591–599, https://doi.org/10.1007/978-3-030-24643-3_71 (Springer International Publishing, Cham, Switzerland, 2020).
© The Author(s) 2025. This work is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Abstract
Environmental sound recognition might play a crucial role in the development of autonomous vehicles by mimicking human behavior, particularly in complementing sight and touch to create a comprehensive sensory system. Just as humans rely on auditory cues to detect and respond to critical events such as emergency sirens, honking horns, or the approach of other vehicles and pedestrians, autonomous vehicles equipped with advanced sound recognition capabilities may significantly enhance their situational awareness and decision-making processes. To promote this approach, we extended the UrbanSound8K (US8K) dataset, a benchmark in urban sound classification research, by merging some classes deemed irrelevant for autonomous vehicles into a new class named ‘background’ and adding the class ‘silence’ sourced from Freesound.org to complement the dataset. This tailored dataset, named UrbanSound8K for Autonomous Vehicles (US8K_AV), contains 4.94 hours of annotated audio samples with 4,908 WAV files distributed among 6 classes. It supports the development of predictive models that can be deployed in embedded systems like Raspberry Pi.
Author affiliations
1 Electric Engineering, Centro Universitário FEI - Fundação Educacional Inaciana Pe. Saboia de Medeiros, São Bernardo do Campo, Brazil (GRID:grid.440589.4) (ISNI:0000 0000 8607 7447)
2 Computer Science, UTFPR - Universidade Tecnológica Federal do Paraná, Cornélio Procópio, Brazil (GRID:grid.474682.b) (ISNI:0000 0001 0292 0044)