Abstract
Locating honking vehicles is crucial for controlling arbitrary honking and reducing environmental noise. However, traditional methods for honking vehicle localization, which utilize sound source localization technology, suffer from inaccuracies and limited detection range due to the multipath effects of sound propagation and environmental noise interference. To address these challenges, an auditory-visual cooperative perception (AVCP) method for honking vehicle localization is proposed, and a detailed workflow of this method is presented. In the AVCP method workflow, the Emphasized Channel Attention, Propagation, and Aggregation in Time-Delay Neural Network (ECAPA-TDNN) is used to recognize honking vehicle models from captured audio signals, as different vehicle models exhibit distinct horn sound characteristics. Subsequently, YOLO v9 is employed to detect vehicles and recognize their corresponding models in the images captured by the camera. Thus, among the vehicles detected and identified using YOLO v9, the honking vehicle is determined as the one whose model matches the vehicle model recognized by ECAPA-TDNN. Additionally, experiments with simulated and public datasets were conducted to evaluate the performance of the AVCP method for honking vehicle localization. The experimental results show that the AVCP method is less susceptible to environmental noise and can more accurately identify and locate vehicles from greater distances compared to traditional methods based on sound source localization technology.
Introduction
With the continuous growth of the automobile industry, the global vehicle population has been steadily increasing, encompassing a wide range of models including trucks, cars, and other vehicles. While this expansion has significantly enhanced the efficiency of goods transportation and facilitated personal mobility, it has also contributed to widespread traffic congestion, particularly in urban settings. In such congested conditions, irresponsible drivers often resort to excessive honking, exacerbating noise pollution. This issue is especially pronounced in densely populated urban areas and sensitive locations such as schools and hospitals, where excessive noise can have profound health implications [1–3].
Emerging evidence highlights the adverse health impacts of transportation-related noise, particularly from road traffic. Chronic exposure has been linked to increased cardiovascular morbidity and mortality through mechanisms such as sleep fragmentation, elevated stress hormone levels, and oxidative stress, all of which contribute to vascular dysfunction and hypertension [4]. According to estimates by the World Health Organization, traffic noise accounts for over 1.6 million healthy life-years lost annually in Western Europe. Supporting this, a large-scale UK Biobank study involving more than 370,000 participants reported that noise levels exceeding 65 dB(A) are significantly associated with elevated blood pressure, triglyceride levels, and glycated hemoglobin, independent of air pollution exposure. Additionally, excessive noise has been shown to exacerbate anxiety and stress, underscoring the urgent need for effective noise mitigation strategies [5].
Beyond cardiovascular health, excessive noise from honking can heighten anxiety and stress levels among drivers and pedestrians alike, potentially leading to broader psychological and social consequences [6,7]. Given these multifaceted impacts, implementing effective measures to monitor and mitigate vehicle honking is imperative to reduce arbitrary noise pollution and safeguard public health.
There are three primary approaches to controlling vehicle honking. The first involves placing no-honking notice boards along roadsides as reminders for drivers to use their horns responsibly. The second entails developing a honking control system that regulates honking behavior, such as by counting the number of honks within a certain period to alert drivers [8] or by automatically switching to a low-volume horn in silent zones [9]. The third involves installing a honking monitoring system that identifies vehicles engaged in honking [10,11]. The first approach lacks enforcement and is therefore less effective, and the second proves impractical and costly; the third approach is more efficient, as it actively detects honking vehicles in noise-sensitive areas.
The pivotal technologies within a honking monitoring system are sound recognition and sound source localization (SSL), which utilize a microphone array to capture multi-channel sound signals. The sound signals are processed using sound recognition technology to identify the presence of a honking sound. Upon successful identification of a honking sound, its location is determined by applying SSL techniques. Specifically, SSL techniques estimate the direction and distance of a sound source relative to a microphone array by calculating parameters such as the Direction of Arrival (DOA) and source range [12,13]. Current SSL research encompasses three predominant approaches: Time Difference of Arrival (TDOA) [14,15], beamforming algorithms [16,17], and high-resolution spectral estimation techniques [18].
The TDOA method estimates the source position by analyzing temporal differences in the arrival times of acoustic signals across spatially distributed microphones. While computationally efficient, it is notably susceptible to environmental reverberation. Beamforming techniques implement spatial filtering by coherently combining multi-channel signals, enabling real-time directional enhancement primarily through delay-and-sum operations. However, their spatial resolution is inherently constrained by the physical dimensions, or aperture, of the microphone array. High-resolution spectral estimation methods, such as MUSIC, rely on eigen-based subspace decomposition, effectively separating signal and noise components through their inherent orthogonality. However, the accuracy of these methods strongly depends on the precision of microphone array configurations and sensor calibration; even minor deviations can significantly degrade their performance. Furthermore, localization effectiveness substantially deteriorates under low signal-to-noise ratio (SNR) conditions, reducing reliability in noisy environments.
Although SSL technology has been extensively studied and successfully applied in various research areas, such as sound source separation [19,20], automatic speech recognition [21,22], speech enhancement [23,24], and human-robot interaction [25,26], its localization effectiveness remains significantly influenced by multipath effects of sound propagation and environmental noise interference. These limitations are particularly pronounced in open and dynamically changing outdoor environments, restricting the practical applicability of SSL technologies. In studies addressing the localization of honking vehicles, researchers have commonly employed conventional SSL technologies implemented through meticulously designed microphone arrays [10,11]. In [10], a honking monitoring system was designed using a 3D microphone structure in a spiral distribution. Based on the far-field model, the TDOA is calculated using the Generalized Cross-Correlation (GCC) algorithm to estimate the direction of the sound source. Experimental results demonstrate that localization is accurate at distances of less than 25 meters, but localization accuracy decreases notably as the distance increases within the range of 25 to 40 meters. In [11], a planar array of 32 MEMS microphones was designed to implement the steered response power phase transform (SRP-PHAT) localization algorithm, supporting the localization and tracking of vehicles. Observations indicated that localization effectiveness remains nearly 100% at distances of 5 and 10 meters on the x-axis, decreasing with greater distances.
Obviously, traditional honking vehicle localization methods based on SSL technology become increasingly inaccurate as the distance increases due to the greater multipath effects of sound propagation and environmental noise interference over larger distances. To expand the detection range of honking vehicles, this paper reports an auditory-visual cooperative perception (AVCP) method for honking vehicle localization, accompanied by a comprehensive elucidation of its workflow.
Given the unique honking sound characteristics exhibited by different vehicle models, the Emphasized Channel Attention, Propagation, and Aggregation in Time-Delay Neural Network (ECAPA-TDNN) is employed to discern the model of honking vehicle from collected honk sound signals. Subsequently, YOLO v9 is utilized to detect all vehicles and identify each vehicle’s model in the captured image, benefiting from its rapid computation and high accuracy. Hence, among the vehicles detected and identified through YOLO v9, the honking vehicle is identified as the one whose model corresponds to the vehicle model recognized by ECAPA-TDNN. To evaluate the performance of the proposed method, experiments are carried out under different scenarios, and several key factors that influence the performance of the proposed method are analyzed based on simulated and public datasets.
Compared to traditional honking vehicle localization methods based on SSL technology, the AVCP method proposed in this paper does not rely on acoustic propagation characteristics, such as signal arrival times or directional angles. Instead, it exclusively exploits the acoustic fingerprint features inherent in honking sounds, which are identified using the ECAPA-TDNN model. This approach effectively mitigates localization errors caused by multipath propagation and environmental noise interference. Additionally, by integrating vehicle appearance recognition through YOLOv9, the proposed method precisely determines vehicle positions, thereby overcoming inaccuracies commonly associated with traditional SSL-based honking vehicle localization techniques.
The remainder of this paper is organized as follows. Section 2 presents the workflow of the AVCP method for honking vehicle localization in detail. In Section 3, comprehensive performance evaluation experiments with simulated and public datasets are presented, including the impact of background noise of different intensities, the impact of audio signal input modes, and the impact of the distance between the vehicles and the camera. Section 4 presents the conclusion of this study and our future work. In summary, the main contributions of our work are as follows.
1. An auditory-visual cooperative perception (AVCP) method for honking vehicle localization is proposed to address the drawbacks in existing methods that rely solely on SSL technology, which are susceptible to environmental noise and cannot accurately identify and locate honking vehicles from greater distances.
2. A detailed workflow of the AVCP method for honking vehicle localization is presented. With the constructed workflow, the honking vehicle’s model and location can be accurately determined in the captured image. Moreover, state-of-the-art audio or image recognition methods can be conveniently applied in the proposed workflow to obtain more accurate performance.
3. Performance evaluation experiments specific to the AVCP method for honking vehicle localization are designed, so that the performance of a real-world honking vehicle monitoring system employing the AVCP method can be conveniently evaluated.
Methodology
The methodology presented in this paper comprises two key components: the honking vehicle model recognition process based on ECAPA-TDNN and the vehicle localization and model recognition process based on YOLO v9. The workflow of the AVCP method for honking vehicle localization is illustrated in Fig 1. As depicted in Fig 1, the audio signal captured by the microphone is segmented into 2-second digital audio files. Each 2-second audio file serves as the input for the honking vehicle model recognition process based on ECAPA-TDNN, with the model of the honking vehicle being the output of this recognition process.
[Figure omitted. See PDF.]
While the microphone records a 2-second digital audio file, the camera captures the traffic scene image and inputs it into the vehicle localization and model recognition process based on YOLO V9. The output of this process includes the locations of the vehicles and their corresponding models.
It is then straightforward to determine that, among the vehicles identified and located with YOLO V9, the honking vehicle is the one whose model matches the vehicle model recognized by ECAPA-TDNN.
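The cooperative decision itself reduces to a simple matching step. The following Python sketch illustrates this logic under stated assumptions: the helpers `recognize_model_from_audio` (the ECAPA-TDNN branch) and `detect_vehicles` (the YOLO v9 branch) are hypothetical stand-ins for the two recognition processes described above, not part of the original implementation.

```python
def localize_honking_vehicle(audio_clip, image,
                             recognize_model_from_audio, detect_vehicles):
    """Return the bounding box of the honking vehicle, or None if no match.

    recognize_model_from_audio(): ECAPA-TDNN branch, returns a model label or None.
    detect_vehicles(): YOLO v9 branch, returns a list of (bounding_box, model_label) pairs.
    """
    honking_model = recognize_model_from_audio(audio_clip)   # e.g. "Toyota Corolla I"
    if honking_model is None:                                # no horn sound detected
        return None
    for box, model_label in detect_vehicles(image):
        if model_label == honking_model:                     # model match -> honking vehicle
            return box
    return None                                              # honking vehicle not in view
```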
In the following two subsections, the honking vehicle model recognition process based on ECAPA-TDNN and the vehicle localization and model recognition process based on YOLO V9 are presented in detail.
Vehicle model recognition process based on ECAPA-TDNN
Because vehicle horns of different models exhibit distinctive sound characteristics and timbre, an ECAPA-TDNN neural network can classify vehicle models from the spectral features of their horn signals. Many of the features that facilitate honking vehicle model recognition lie in the high-frequency band, which decays rapidly during transmission. Amplifying the energy of the high-frequency components of the audio signal and extracting an effective feature representation is therefore crucial for honking vehicle model identification.
As shown in Fig 1, the 2-second digital audio file is preprocessed to enhance the high-frequency band, and Mel-frequency cepstral coefficients (MFCC) are extracted to obtain feature vectors for improved separability and recognizability.
Audio signal preprocessing.
To enhance the high-frequency band of the audio signal, a preprocessing operation is conducted on the 2-second audio file before extracting MFCC features, as depicted in Fig 2. The first step of audio signal preprocessing is pre-emphasis, which filters the audio signal to amplify its high-frequency components. This procedure mitigates potential high-frequency attenuation and distortion during recording and transmission, thereby improving the overall audio signal quality and its high-frequency resolution. Typically, the pre-emphasis filter is configured as a first-order digital filter. Denoting the sampled input signal by s[n], the pre-emphasized signal q[n] is calculated according to (1).
q[n] = s[n] − α · s[n − 1]        (1)
[Figure omitted. See PDF.]
where α represents the filter coefficient, typically ranging from 0.9 to 1; commonly used values for α are 0.95 and 0.97.
The second step of audio signal preprocessing is framing, which divides the signal into short-time windows to facilitate feature extraction within each window. The typical window length ranges from 20 to 40 milliseconds, with more than 50% overlap between adjacent windows to maintain signal continuity. As illustrated in Fig 2, q[n] is subdivided into M frames, each comprising N samples, thereby forming an M × N frame matrix.
After framing, each frame undergoes windowing to emphasize the sampled portion while attenuating the remainder. The Hamming window is a commonly used windowing function, known for its favorable pass-band characteristics, its ability to reduce spectral leakage, and its suitability for smoothing short-duration signals. The Hamming window function W(n) is defined as follows.
W(n) = 0.54 − 0.46 · cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1        (2)
where n indexes the N samples within each frame.
Through the audio signal preprocessing, the original 2-second audio signal is transformed into a multi-frame digital signal, represented by the M × N frame matrix in Fig 2. This transformed signal serves as the input for MFCC feature extraction.
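As a minimal illustration of the preprocessing chain, the following NumPy sketch applies pre-emphasis (1), framing, and Hamming windowing (2). The frame length and shift mirror the configuration reported later in the experimental section; the function is an illustrative sketch under these assumptions, not the authors' implementation.

```python
import numpy as np

def preprocess(signal, alpha=0.97, frame_len=1024, hop=320):
    """Pre-emphasis, framing, and Hamming windowing of a 1-D audio signal."""
    # Pre-emphasis, Eq. (1): q[n] = s[n] - alpha * s[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing: M overlapping frames of N = frame_len samples, shifted by `hop`.
    num_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(num_frames)])        # shape (M, N)

    # Windowing, Eq. (2): apply a Hamming window to every frame.
    return frames * np.hamming(frame_len)

# Example: a placeholder 2-second clip sampled at 22,050 Hz.
clip = np.random.randn(2 * 22050)
frames = preprocess(clip)       # frame matrix of shape (M, 1024)
```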
MFCC features extraction.
MFCCs are designed based on knowledge of the human auditory system and constitute a set of features comparable to chroma or spectral features. Fig 3 illustrates the process of MFCC feature extraction.
[Figure omitted. See PDF.]
Firstly, each windowed frame, denoted q_i[n] for the i-th frame (i = 1, 2, …, M), undergoes a Fast Fourier Transform (FFT) to be converted into a frequency-domain signal X_i(k). The FFT of the i-th frame, which has a length of N, is calculated using (3).
X_i(k) = Σ_{n=0}^{N−1} q_i[n] · e^(−j2πkn/N),   k = 0, 1, …, N − 1        (3)
Meanwhile, the power spectrum P_i(k) of X_i(k) is obtained using (4), and the power spectra of all M frames form the power-spectrum matrix depicted in Fig 3.
P_i(k) = |X_i(k)|² / N        (4)
To obtain the Mel spectrum of the i-th frame, the power spectrum P_i(k) is passed through a set of Mel-scale triangular filters, with P_i(k) weighted by each filter at every frequency bin. Let the filter bank comprise L triangular filters. The frequency response of the l-th triangular filter, H_l(k), is calculated using (5), and together the L responses constitute the Mel filter bank.
H_l(k) = 0,                                        k < f(l − 1)
H_l(k) = (k − f(l − 1)) / (f(l) − f(l − 1)),       f(l − 1) ≤ k ≤ f(l)
H_l(k) = (f(l + 1) − k) / (f(l + 1) − f(l)),       f(l) < k ≤ f(l + 1)
H_l(k) = 0,                                        k > f(l + 1)        (5)
where f(l) represents the center frequency of the l-th Mel triangular filter, which depends on the number of Mel filters. The Mel spectrum E_i(l) is then obtained according to (6); that is, it is the result of passing the power spectrum through the Mel filter bank.
E_i(l) = Σ_k P_i(k) · H_l(k),   l = 1, 2, …, L        (6)
Subsequently, a logarithmic operation followed by a Discrete Cosine Transform (DCT) is performed, as expressed by (7). This process yields what are referred to as the MFCC features.
MFCC_i(m) = Σ_{l=1}^{L} log(E_i(l)) · cos(πm(l − 0.5) / L),   i = 1, 2, …, M        (7)
where M represents the number of frames and L denotes the number of filters in the Mel filter bank.
The obtained MFCC features serve as the input to the ECAPA-TDNN, which is employed to detect the presence of honking sounds within the audio signal. If a honking sound is detected, a second ECAPA-TDNN is used to identify the corresponding vehicle model. The following subsection introduces the ECAPA-TDNN.
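A compact sketch of Eqs. (3)-(7) is shown below, assuming librosa is used only to build the triangular Mel filter bank and SciPy for the DCT; normalization constants vary between MFCC implementations, so this is illustrative rather than a reproduction of the authors' code. It reuses the `frames` matrix from the preprocessing sketch above.

```python
import numpy as np
import librosa
from scipy.fft import dct

def mfcc_from_frames(frames, sr=22050, n_fft=1024, n_mels=40):
    """Sketch of Eqs. (3)-(7): FFT, power spectrum, Mel filtering, log, DCT."""
    # (3)-(4): FFT of each windowed frame and its power spectrum.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)                   # (M, n_fft//2 + 1)
    power = (np.abs(spectrum) ** 2) / n_fft

    # (5)-(6): weight the power spectrum with L triangular Mel filters.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # (L, n_fft//2 + 1)
    mel_energies = power @ mel_fb.T                                   # (M, L)

    # (7): logarithm followed by a DCT yields the MFCC matrix (M frames x L coefficients).
    return dct(np.log(mel_energies + 1e-10), type=2, axis=1, norm="ortho")

mfcc = mfcc_from_frames(frames)   # `frames` from the preprocessing sketch above
```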
ECAPA-TDNN for honk sound detection and vehicle model recognition.
The ECAPA-TDNN is a time-delay neural network model that has demonstrated remarkable performance on the VoxCeleb public dataset. This network architecture is specifically designed for sound recognition tasks, accounting for both the time- and frequency-domain characteristics as well as the complex feature composition of audio signals. The ECAPA-TDNN aims to achieve precise and swift sound recognition across diverse scenarios by incorporating methodologies such as channel attention, residual connections, and multi-layer feature aggregation. The network structure of the ECAPA-TDNN [27] is illustrated in Fig 4.
[Figure omitted. See PDF.]
The ECAPA-TDNN takes MFCC feature vectors as input and processes them through a series of TDNN layers, each with a distinct time context range, enabling the capture of audio information at different scales. The key to achieving honking vehicle model recognition lies in the model’s ability to statistically pool audio information of varying scales through the attentive statistics pooling (ASP) layer and to enhance channel attention using Squeeze-and-Excitation (SE) blocks. Subsequently, a fully connected layer and a Softmax layer convert the output into a probability distribution, which is then mapped to the target vehicle category, yielding the recognized honking vehicle model.
In the proposed workflow, depicted in Fig 1, two ECAPA-TDNN models with identical structures are trained. The first ECAPA-TDNN is trained using traffic environment sound classification datasets. After this training, it is utilized to detect the presence of horn sounds within the audio signal. The second ECAPA-TDNN is trained using vehicle horn sound classification datasets, enabling the identification of the associated vehicle model. The training process for both ECAPA-TDNN models is depicted in Fig 5. As depicted in Fig 5, the MFCC features within the training data need to be extracted before being input into the ECAPA-TDNN models. This process is elaborated on in the previous two subsections. After the training process, the two ECAPA-TDNN models and their corresponding weights, denoted as Weight I and Weight II in Fig 5, can be acquired for the purposes of honking sound identification and vehicle model recognition.
[Figure omitted. See PDF.]
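Training either classifier in Fig 5 follows a standard supervised recipe. The sketch below shows a generic PyTorch loop using the hyper-parameters reported in the experimental configuration (Adam, learning rate 0.001, weight decay 1e-6, batch size 16, 30 epochs); the `model` argument is assumed to be an ECAPA-TDNN implementation (for example, the one provided by SpeechBrain), and the dataset is assumed to yield (MFCC, label) pairs.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train_classifier(model, dataset, epochs=30):
    """Generic training loop for either ECAPA-TDNN classifier in Fig 5.

    `model` is assumed to map an MFCC tensor of shape (batch, frames, coeffs)
    to class logits; `dataset` is assumed to yield (mfcc, label) pairs.
    Hyper-parameters follow the reported configuration: Adam, lr = 0.001,
    weight decay = 1e-6, batch size 16, 30 epochs.
    """
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for mfcc, label in loader:
            loss = criterion(model(mfcc), label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model.state_dict()   # Weight I (honk detection) or Weight II (model recognition)
```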
Meanwhile, the localization of the honking vehicle also requires determination through visual recognition methods. Therefore, the proposed AVCP method for honking vehicle localization utilizes YOLO V9 for vehicle localization and vehicle model recognition in captured images. The following subsection will provide an introduction to the vehicle localization and model recognition process based on the YOLO V9.
Vehicle localization and model recognition process based on the YOLO V9
In Fig 1, two YOLO V9 networks are included in the vehicle localization and model recognition process. The first YOLO V9 network is used to recognize and locate all vehicles in the captured image. Then, the second YOLO V9 network is used to recognize each localized vehicle’s model. The first YOLO V9 network utilizes pre-trained weights derived from the COCO dataset, which encompasses 80 categories. The second YOLO V9 network is pre-trained with vehicle model classification datasets, such as the VRID dataset utilized in this study.
With these two trained YOLO V9 networks, the first YOLO V9 generates multiple bounding boxes for vehicles, each defined by the x and y coordinates of the box center along with its width and height. The input image is then cropped based on the resulting bounding boxes, and the cropped images are fed into the second YOLO V9 network, which assigns a label and confidence value to each cropped image, identifying the model of the vehicle. Finally, the position of the honking vehicle is pinpointed as the one whose model corresponds to the model recognized from the captured audio signal, as depicted in Fig 1.
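The detect-crop-classify pipeline can be sketched as follows, assuming the Ultralytics `YOLO` wrapper (which ships YOLOv9 weights such as `yolov9c.pt`) rather than the authors' exact code; the fine-tuned classifier weight file name is hypothetical.

```python
from ultralytics import YOLO
import cv2

# COCO-pretrained detector; the classifier weight file name is a hypothetical
# stand-in for a network fine-tuned on vehicle-model classes.
detector = YOLO("yolov9c.pt")
classifier = YOLO("yolov9c_vehicle_models.pt")

VEHICLE_CLASS_IDS = {2, 5, 7}   # COCO indices: car, bus, truck

def locate_vehicle_models(image_path):
    """Detect vehicles, crop each bounding box, and classify its model."""
    image = cv2.imread(image_path)
    result = detector(image)[0]
    detections = []
    for box, cls in zip(result.boxes.xyxy.int().tolist(),
                        result.boxes.cls.int().tolist()):
        if cls not in VEHICLE_CLASS_IDS:            # keep vehicle detections only
            continue
        x1, y1, x2, y2 = box
        crop = image[y1:y2, x1:x2]
        pred = classifier(crop)[0]
        if len(pred.boxes) == 0:                    # no confident model prediction
            continue
        model_label = pred.names[int(pred.boxes.cls[0])]
        detections.append(((x1, y1, x2, y2), model_label))
    return detections
```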
The utilized YOLO V9 network builds upon the foundations of the YOLO v7 network, introducing several enhancements to improve performance. A notable innovation in YOLO V9 is the Generalized Efficient Layer Aggregation Network (GELAN), which melds features from two distinct neural network architectures, the Cross Stage Partial Network (CSPNet) and the Efficient Layer Aggregation Network (ELAN), creating a robust framework that prioritizes lightweight design, inference speed, and accuracy. Additionally, YOLO V9 incorporates Programmable Gradient Information (PGI), designed to mitigate the information loss that occurs as network depth increases. This approach allows for more effective learning and enhanced error back-propagation through deeper network structures. The synergistic integration of PGI and GELAN in YOLO V9 yields significant performance advancements, positioning it well above existing real-time object detectors across various metrics on the MS COCO dataset. Fig 6 illustrates the network structure of YOLO V9 [28].
[Figure omitted. See PDF.]
Experiments
The proposed AVCP method for honking vehicle localization comprises two main processes: the vehicle model recognition process based on ECAPA-TDNN and the vehicle localization and model recognition process based on YOLO V9. Consequently, the experiments in this section focus on assessing the recognition accuracy of each process. Before conducting the performance evaluation experiments, an introduction to the experimental configuration, the audio and image datasets used for training and testing, and the performance evaluation metrics is provided.
Experimental configuration
The hardware configuration for the evaluation experiments includes an Intel(R) Core(TM) i7-13700 CPU, an NVIDIA GeForce RTX 4070 GPU, and 12 GB of RAM. The software environment comprises Windows 10, Python 3.8, CUDA 10.1, and the deep learning framework PyTorch 1.8.1+cu101.
In the audio signal preprocessing stage, the microphone’s sampling frequency is set to 22,050 Hz. Each 2-second audio signal is framed into 134 frames, each consisting of 1024 sample points, with a frame shift of 320 points. For the FFT, a window size of 1024 is utilized, and the Mel filter bank comprises 40 filters. Consequently, the dimension of the MFCC feature matrix is 134 × 40.
Regarding the ECAPA-TDNN configuration, the batch size is 16, utilizing the Adam optimizer with an initial learning rate of 0.001. The weight decay coefficient is set to 10^-6, and the training process spans 30 epochs.
Audio and image datasets
Vehicle horn sound classification datasets.
To train the ECAPA-TDNN to identify specific vehicle models from their emitted horn sounds, a comprehensive vehicle horn sound classification dataset was compiled using FORZA HORIZON 4 as a vehicle horn sound simulator. The simulated dataset consists of 800 audio signals for each distinct vehicle model, categorized into four groups: 200 honking audio samples free from extraneous noise, 200 samples overlaid with white noise, 200 samples accompanied by simulated wind and rain sounds, and 200 samples with controlled pitch and tempo modifications. Each honking audio sample in the dataset has a consistent duration of two seconds.
Traffic environment sound classification datasets.
To utilize the ECAPA-TDNN model for vehicle honking sound detection within audio signals, a comprehensive traffic environment sound classification dataset has been meticulously constructed. This dataset includes two distinct classes: “audio signal with honk sound” and “audio signal without honk sound”.
The “audio signal without honk sound” class is assembled by excluding the sound category related to vehicle horns from both the UrbanSound8K dataset [29] and the Acoustic Event dataset [30]. The UrbanSound8K dataset comprises 8,732 audio samples spanning 10 urban sound categories, while the Acoustic Event dataset encompasses 5,223 audio samples covering 28 sound categories, such as engine noise, footsteps, and speaker sounds.
Conversely, the “audio signal with honk sound” class comprises 8,000 audio samples, as detailed in the previous dataset description of the “Vehicle Horn Sound Classification Dataset.”
This construction of the traffic environment sound classification dataset facilitates a rigorous evaluation of the ECAPA-TDNN model’s capability for honking sound detection, laying a solid foundation for further advancements in audio-based vehicle classification systems within the context of traffic environments.
The VRID dataset [31].
For vehicle model recognition, the YOLO V9-c model is trained using the VRID dataset, which is composed of vehicle images captured by 326 high-definition cameras positioned along a city bayfront over 14 days of daylight hours. With resolutions ranging from 400×424 to 990×1134 pixels, the VRID dataset includes 10,000 images focusing on the ten most prevalent vehicle models, each depicted in 1,000 unique images, as listed in Table 1. The images of each vehicle model are taken from diverse road curbs, capturing a variety of lighting conditions, scales, and poses. It is also pertinent to note that images of the same vehicle model may still differ noticeably in appearance.
[Figure omitted. See PDF.]
The CompCars dataset [32].
The Comprehensive Cars (CompCars) dataset is a large-scale vehicle image dataset comprising two subsets: a web-nature set with 1,716 car models and a surveillance-nature subset containing images captured by real-world traffic cameras. In this work, we focus on the surveillance subset, which includes approximately 50,000 front-view vehicle images spanning 281 vehicle models. These models cover a wide range of vehicle types, such as sedans, SUVs, vans, and jeeps, offering substantially greater inter-class diversity than the sedan-only VRID dataset. Each vehicle model in the surveillance subset is represented by hundreds of images captured under varying conditions. Importantly, unlike the VRID dataset’s narrowly defined sedan categories, the broader diversity of vehicle types in the surveillance subset of the CompCars dataset presents a more coarse-grained recognition task, as vehicles often differ significantly in shape and size. This generally makes it easier for vision models to distinguish between classes.
Performance evaluation index
The emphasis lies on recognition accuracy; thus, precision and recall rate are selected as the metrics for gauging recognition performance in the evaluation experiments.
Here are the definitions and formulas for recall rate (true positive rate, TPR) and precision:
Recall (True Positive Rate, TPR): It is the percentage of data samples that a machine learning model correctly identifies as belonging to the positive class out of all samples that actually belong to the positive class.
Recall (TPR) = TP / (TP + FN)        (8)
Precision: It is the percentage of data samples that a machine learning model correctly identifies as belonging to the positive class out of all samples predicted to belong to the positive class.
Precision = TP / (TP + FP)        (9)

where TP, FN, and FP denote the numbers of true positives, false negatives, and false positives, respectively.
These metrics are crucial for evaluating the performance of a machine learning model, especially in tasks where correctly identifying positive instances is essential.
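For one class, the two metrics can be computed directly from the counts in (8) and (9), as in the following sketch; the label values are illustrative only.

```python
def precision_recall(y_true, y_pred, positive_class):
    """Compute precision and recall for one class, per Eqs. (8) and (9)."""
    tp = sum(t == positive_class and p == positive_class for t, p in zip(y_true, y_pred))
    fp = sum(t != positive_class and p == positive_class for t, p in zip(y_true, y_pred))
    fn = sum(t == positive_class and p != positive_class for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative labels only.
y_true = ["Corolla", "Focus", "Corolla", "Sylphy"]
y_pred = ["Corolla", "Corolla", "Corolla", "Sylphy"]
print(precision_recall(y_true, y_pred, "Corolla"))   # (0.666..., 1.0)
```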
Evaluation experiments on vehicle model recognition process based on ECAPA-TDNN
Impact of background noise with different intensities on vehicle model recognition.
To investigate the impact of environmental noise on vehicle model recognition performance using ECAPA-TDNN, an experiment was conducted comparing recognition performance for horn audio signals mixed with environmental noise at different signal-to-noise ratios (SNR). SNRs of 10 dB, 0 dB, -1 dB, -3 dB, and -10 dB were used, corresponding to horn signal powers of 10, 1, 0.8, 0.5, and 0.1 times the power of the environmental noise, respectively. Each condition included 200 horn audio samples per vehicle model, randomly selected from the vehicle horn sound classification dataset. Table 2 presents the recall rates observed under varying intensities of background noise. In the absence of environmental noise, vehicle model recognition achieves a mean recall rate of 99.65%. At an SNR of 10 dB, the mean recall rate remains high at 96.20%. As the SNR decreases to 0 dB, -1 dB, and -3 dB, the mean recall rates are 94.95%, 93.55%, and 92.90%, respectively. A significant decrease is observed at -10 dB SNR, where the mean recall rate drops to 75.90%. These results suggest that as long as the power of the vehicle horn sound is at least half that of the environmental noise, the mean recall rate of vehicle model recognition exceeds 92%. This indicates that the distance between the honking vehicle and the audio capture equipment has only a minimal impact on vehicle model recognition performance, provided the horn remains sufficiently audible above the background noise.
[Figure omitted. See PDF.]
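One way to generate such mixtures, not necessarily the authors' exact procedure, is to scale the noise so that the resulting power ratio matches the target SNR, as sketched below; at -3 dB the horn power is roughly half the noise power, consistent with the ratios listed above. The random arrays are placeholders for real horn and noise recordings.

```python
import numpy as np

def mix_at_snr(horn, noise, snr_db):
    """Mix a horn clip with noise so that the signal-to-noise ratio equals snr_db."""
    noise = noise[:len(horn)]                          # align lengths
    p_horn = np.mean(horn ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_horn / p_noise_scaled) == snr_db.
    target_noise_power = p_horn / (10 ** (snr_db / 10))
    noise_scaled = noise * np.sqrt(target_noise_power / p_noise)
    return horn + noise_scaled

# Placeholders for a 2-second horn clip and environmental noise at 22,050 Hz.
sr = 22050
horn = np.random.randn(2 * sr)
noise = np.random.randn(2 * sr)
mixture = mix_at_snr(horn, noise, snr_db=-3)   # horn power ~0.5x noise power
```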
Impact of audio signals’ input modes on vehicle model recognition process.
During real-time recording of traffic environmental sounds, a microphone consistently captures audio signals. Prior to the vehicle model recognition process based on ECAPA-TDNN, this continuous audio signal is segmented into 2-second segments. Due to the random nature of this segmentation, it does not guarantee that each segment contains an entire horn sound. To evaluate the impact of this random segmentation on ECAPA-TDNN’s performance in vehicle model recognition, a comparative experiment was conducted. This experiment compared two audio input modes: manual audio recording mode, ensuring each segment contains a complete horn sound, and continuous audio recording mode, which simulates real-world audio acquisition where segments are randomly partitioned.
To ensure experimental rigor, multiple horn sound samples were randomly selected from the vehicle horn sound classification dataset and concatenated on a timeline with random intervals between samples. These concatenated samples were used for recognition, and the performance of the ECAPA-TDNN-based vehicle model recognition process was evaluated by continuously capturing the audio with a microphone. Importantly, the collected audio samples were free of ambient noise so that the effect of the input mode on vehicle model recognition could be isolated.
The experiment recorded the recall rate and precision of vehicle model recognition across 200 occurrences of horn sounds, with detailed results presented in Table 3. Table 3 reveals that in the manual recording mode, the average precision and recall rates for vehicle model recognition are 99.47% and 99.45%, respectively. In contrast, the continuous recording mode yields average precision and recall rates of 96.44% and 95.85%, respectively, a decrease of 3.03 percentage points in precision and 3.60 percentage points in recall compared to the manual recording mode. Hence, the choice of input mode has only a minor effect on vehicle model recognition performance.
[Figure omitted. See PDF.]
Evaluation experiments on vehicle localization and model recognition based on YOLO V9-c
The YOLO V9 model is available in four variants differentiated by their parameter counts: v9-s, v9-m, v9-c, and v9-e. The YOLO V9-c model, notable for significant architectural enhancements, is specifically utilized in vehicle localization and model recognition processes. It operates with 42% fewer parameters and demands 21% less computation compared to YOLO v7, while achieving comparable accuracy.
To simulate the performance of vehicle localization and model recognition processes using YOLO V9-c, we consider the setup of a traffic surveillance camera as depicted in Fig 7. In Fig 7, the camera is positioned on a crossbar at a height of 5 meters, with its field of view covering the zone ABCD, spanning horizontally from 5 to 45 meters. The images captured by the camera have a resolution of 3840 × 2160 pixels. Points A and B are 9 meters apart, spanning 3 vehicle lanes, each 3 meters wide. Using the similarity of triangles depicted in Fig 7, we can derive the following equation.
(10)
[Figure omitted. See PDF.]
Given AB=9 meters and OE=5 meters, (10) can be rewritten as follows.
(11)
Meanwhile, suppose a car with a width of 1.6 meters occupies w horizontal pixels in the captured image, located at point F at a distance L from point O. According to the perspective relationship of the camera imaging system, the following equation can be established.
(12)
By combining (11) and (12), w can be calculated as follows.
(13)
It is evident that w decreases as the distance L increases. This relationship suggests that the distance between a vehicle and the camera can be estimated from the vehicle’s pixel count in the captured image. Therefore, using (13), the pixel count of vehicles at different distances L can be calculated; the resulting pixel counts are presented in Table 4. To assess the influence of distance on the performance of vehicle localization and model recognition using YOLO V9-c, 100 randomly selected images from each vehicle model in the VRID dataset were downscaled according to Table 4. The experimental outcomes are detailed in Table 5.

According to Table 5, within the distance range of 5 to 25 meters, both the mean precision and mean recall rate exceed 0.99. Between 30 and 35 meters, the mean recall rate fluctuates between 0.908 and 0.945, while the mean precision ranges from 0.855 to 0.951. Beyond 35 meters, there is a notable further decline, with the mean recall rate starting at 0.908 and the mean precision at 0.855. This suggests that the vehicle localization and model recognition process using YOLO V9-c performs well at distances of up to 25 meters. However, recognition performance occasionally decreases for specific vehicle models, such as the Toyota Corolla I, Toyota Corolla II, and Ford Focus, as the distance extends from 25 to 30 meters. Furthermore, as the distance increases from 35 to 40 meters, more instances of diminished recognition performance become apparent, particularly for models such as the Volkswagen Magotan and Nissan Sylphy. This decrease is likely due to the increasing visual similarity of these vehicle models as the distance between the vehicle and the camera grows.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
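The distance simulation amounts to downscaling each image so that the vehicle spans the pixel width prescribed by Table 4 for the target distance. The OpenCV sketch below illustrates this; the pixel widths and file name in the example are hypothetical placeholders, since the Table 4 values are not reproduced here.

```python
import cv2

def simulate_distance(image, target_vehicle_px, current_vehicle_px):
    """Downscale an image so that a vehicle currently spanning current_vehicle_px
    pixels shrinks to target_vehicle_px pixels, emulating a larger camera-to-vehicle
    distance. Target widths would be taken from Table 4."""
    scale = target_vehicle_px / current_vehicle_px
    new_size = (max(1, int(image.shape[1] * scale)),   # new width
                max(1, int(image.shape[0] * scale)))   # new height
    return cv2.resize(image, new_size, interpolation=cv2.INTER_AREA)

# Hypothetical usage: shrink a VRID image so the vehicle spans 120 pixels.
img = cv2.imread("vrid_sample.jpg")                    # placeholder file name
small = simulate_distance(img, target_vehicle_px=120, current_vehicle_px=480)
```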
Meanwhile, we conducted additional vehicle model recognition experiments on the surveillance subset of the CompCars dataset. Three experimental settings were considered:
1. a 10-class subset consisting of five popular car models overlapping with the VRID dataset and five other common vehicle models;
2. a 50-class experiment using the 50 most data-rich car models in the surveillance subset;
3. the full 281-class experiment covering all car models in the surveillance subset.
In each setting, the vehicle localization and model recognition process based on YOLOv9-c was applied to identify vehicle models at varying camera-to-vehicle distances, simulated by downscaling images as described in Table 4. Table 6 presents detailed precision and recall results for the 10-class subset across distances ranging from 5 m to 45 m, while Table 7 summarizes the mean performance for the 50-class and 281-class experiments, presented side by side for comparison.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
As shown in Tables 6 and 7, mean precision and recall on the 10-class subset remain above 0.96 up to 35 m, with recall declining moderately to about 0.87 at 45 m. Even in the full 281-class experiment, the recall rate remains around 0.66 at 45 m, considerably higher than the corresponding performance on the VRID dataset at the same distance (see Table 5). These results indicate that the YOLOv9-c-based vehicle model recognition process performs better on the CompCars surveillance subset than on the VRID dataset.
The superior results on the CompCars surveillance subset are largely attributable to its broader inter-class diversity. Unlike the VRID dataset, which is limited to sedans with subtle appearance differences (e.g., distinguishing between a Toyota Corolla I and Corolla II, or between a Nissan Sylphy and a Toyota Camry), the surveillance subset encompasses a wide variety of vehicle types, for instance compact sedans, large SUVs, and jeeps, that are much easier to differentiate visually. As a result, the CompCars surveillance subset constitutes a more coarse-grained recognition task, allowing YOLOv9-c to make fewer errors even when image resolution decreases at longer distances. In contrast, the fine-grained sedan categories in the VRID dataset present a much more challenging benchmark for vehicle model recognition, where the high visual similarity among models increases the likelihood of confusion under adverse conditions.
Conclusion and future work
In this paper, we proposed the AVCP method for honking vehicle localization and detailed its workflow. The method recognizes the honking vehicle’s model from captured audio signals using ECAPA-TDNN, and localizes all vehicles in captured images while identifying their models using YOLO V9. The honking vehicle is then determined as the detected vehicle whose model matches the model recognized from the audio signal, thereby localizing it. From the experimental results, we conclude that the AVCP method for honking vehicle localization is minimally affected by the SNR of the captured audio signal, achieving a mean recall rate of 92.90% even at an SNR of -3 dB, which indicates effective recognition of honking vehicles located at greater distances. Additionally, honking vehicles within 35 meters of the camera can be effectively located and recognized using YOLO V9. Furthermore, the continuous audio recording mode has only a slight effect on the recognition performance of the ECAPA-TDNN-based vehicle model recognition process compared to the manual recording mode.
The proposed AVCP method simulates human auditory and visual cooperative perception processes, offering significant advantages over techniques relying solely on sound source localization technology. This approach effectively addresses challenges common in honking vehicle detection methods using sound source localization, such as susceptibility to background noise interference, complexity of sound acquisition devices, and distance sensitivity that can reduce localization accuracy.
In our work, the performance of the proposed AVCP method for honking vehicle localization is dependent on the effectiveness of the vehicle model recognition process based on ECAPA-TDNN and the vehicle localization and model recognition process based on YOLO V9. Therefore, in future work, we plan to explore methods to reduce background noise using microphone array technology to enhance the performance of ECAPA-TDNN-based vehicle model recognition. Additionally, we will investigate applying super-resolution reconstruction methods to improve the performance of YOLO V9-based vehicle localization and model recognition.
Moreover, in scenarios where multiple vehicles of the same model appear in the captured image, with one of them honking, our future work will explore a combined approach. This approach will employ sound source localization techniques to determine the general direction of the honking vehicle within the captured image. Based on this orientation, the image will be segmented to retain the sub-image containing the honking vehicle. YOLO V9 will then be applied to this sub-image to accurately recognize and localize the honking vehicle.
References
1. Qzar IA, Azeez NM, Al-Kinany SM. The impact of noise pollution on schools’ students of Basra city, Iraq: A health study. EurAsian Journal of BioSciences. 2020;14:5197–201.
2. Rahman MM, Ali MA, Khatun R, Tama RAZ. Effect of noise pollution on patients in hospitals and health clinics of Mymensingh Sadar Upazila. International Journal of Innovation and Applied Studies. 2016;18(1):97.
3. Cai Y, Hansell AL, Blangiardo M, Burton PR, BioSHaRE, de Hoogh K, et al. Long-term exposure to road traffic noise, ambient air pollution, and cardiovascular risk factors in the HUNT and Lifelines cohorts. Eur Heart J. 2017;38(29):2290–6. pmid:28575405
4. Münzel T, Sørensen M, Daiber A. Transportation noise pollution and cardiovascular disease. Nat Rev Cardiol. 2021;18(9):619–36. pmid:33790462
5. Kupcikova Z, Fecht D, Ramakrishnan R, Clark C, Cai YS. Road traffic noise and cardiovascular disease risk factors in UK Biobank. Eur Heart J. 2021;42(21):2072–84. pmid:33733673
6. Shinar D. Aggressive driving: the contribution of the drivers and the situation. Transportation Research Part F: Traffic Psychology and Behaviour. 1998;1(2):137–60.
7. Sagar R, Mehta M, Chugh G. Road rage: an exploratory study on aggressive driving experience on Indian roads. Int J Soc Psychiatry. 2013;59(4):407–12. pmid:23749655
8. Dey A, Majumdar A, Pratihar R, Majumder BD. Design of a smart real-time excessive honking control system. IOSR Journal of Electrical and Electronics Engineering (IOSR-JEEE). 2019;14(6):8–12.
9. Munaf S, Monika S, Monika G, Gopika V, Arthi S. Auto honking control system for vehicles. Int J Eng Sci Res Technol. 2020;9.
10. Zhao Z, Chen W, Semprun KA, Chen PCY. Design and evaluation of a prototype system for real-time monitoring of vehicle honking. IEEE Trans Veh Technol. 2019;68(4):3257–67.
11. Zhang F, Li J, Meng W, Li X, Zheng C. A vehicle whistle database for evaluation of outdoor acoustic source localization and tracking using an intermediate-sized microphone array. Applied Acoustics. 2022;201:109113.
12. Rascon C, Meza I. Localization of sound sources in robotics: A review. Robotics and Autonomous Systems. 2017;96:184–210.
13. Desai D, Mehendale N. A review on sound source localization systems. Arch Computat Methods Eng. 2022;29(7):4631–42.
14. Kirmaz A, Şahin T, Michalopoulos DS, Gerstacker W. ToA and TDoA estimation using artificial neural networks for high-accuracy ranging. IEEE J Select Areas Commun. 2023;41(12):3816–30.
15. Alexandri T, Walter M, Diamant R. A time difference of arrival based target motion analysis for localization of underwater vehicles. IEEE Trans Veh Technol. 2022;71(1):326–38.
16. Bahingayi EE, Lee K. Low-complexity beamforming algorithms for IRS-aided single-user massive MIMO mmWave systems. IEEE Trans Wireless Commun. 2022;21(11):9200–11.
17. Nakatani T, Ikeshita R, Kinoshita K, Sawada H, Kamo N, Araki S. Switching independent vector analysis and its extension to blind and spatially guided convolutional beamforming algorithms. IEEE/ACM Trans Audio Speech Lang Process. 2022;30:1032–47.
18. Lee SY, Chang J, Lee S. Deep learning-enabled high-resolution and fast sound source localization in spherical microphone array system. IEEE Trans Instrum Meas. 2022;71:1–12.
19. Schulze-Forster K, Richard G, Kelley L, Doire CSJ, Badeau R. Unsupervised music source separation using differentiable parametric source models. IEEE/ACM Trans Audio Speech Lang Process. 2023;31:1276–89.
20. Virtanen T. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans Audio Speech Lang Process. 2007;15(3):1066–74.
21. Subramanian AS, Weng C, Watanabe S, Yu M, Yu D. Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition. Computer Speech & Language. 2022;75:101360.
22. Subramanian AS, Weng C, Watanabe S, Yu M, Xu Y, Zhang S-X, et al. Directional ASR: a new paradigm for E2E multi-speaker speech recognition with source localization. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021. p. 8433–7. https://doi.org/10.1109/icassp39728.2021.9414243
23. Xenaki A, Bünsow Boldt J, Græsbøll Christensen M. Sound source localization and speech enhancement with sparse Bayesian learning beamforming. J Acoust Soc Am. 2018;143(6):3912. pmid:29960460
24. Tao T, Zheng H, Yang J, Guo Z, Zhang Y, Ao J, et al. Sound localization and speech enhancement algorithm based on dual-microphone. Sensors (Basel). 2022;22(3):715. pmid:35161469
25. Gala D, Lindsay N, Sun L. Realtime active sound source localization for unmanned ground robots using a self-rotational bi-microphone array. J Intell Robot Syst. 2018;95(3–4):935–54.
26. Korayem MH, Azargoshasb S, Korayem AH, Tabibian Sh. Design and implementation of the voice command recognition and the sound source localization system for human–robot interaction. Robotica. 2021;39(10):1779–90.
27. Sigona F, Grimaldi M. Validation of an ECAPA-TDNN system for forensic automatic speaker recognition under case work conditions. Speech Communication. 2024;158:103045.
28. Wang CY, Yeh IH, Liao HY. YOLOv9: learning what you want to learn using programmable gradient information. In: European Conference on Computer Vision, 2024. p. 1–21.
29. Salamon J, Jacoby C, Bello JP. A dataset and taxonomy for urban sound research. In: Proceedings of the 22nd ACM International Conference on Multimedia, 2014. p. 1041–4. https://doi.org/10.1145/2647868.2655045
30. Font F, Roma G, Serra X. Freesound technical demo. In: Proceedings of the 21st ACM International Conference on Multimedia, 2013. p. 411–2. https://doi.org/10.1145/2502081.2502245
31. Li X, Yuan M, Jiang Q, Li G. VRID-1: a basic vehicle re-identification dataset for similar vehicles. In: 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), 2017. p. 1–8. https://doi.org/10.1109/itsc.2017.8317817
32. Yang L, Luo P, Loy CC, Tang X. A large-scale car dataset for fine-grained categorization and verification. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. p. 3973–81. https://doi.org/10.1109/cvpr.2015.7299023
Citation: Yuan F, Kang J, Yin J, Cao J (2025) An auditory-visual cooperative perception method for honking vehicle localization. PLoS One 20(11): e0337352. https://doi.org/10.1371/journal.pone.0337352
About the Authors:
Fei Yuan
Roles: Methodology, Writing – review & editing
Affiliation: School of Automation, Guangdong Polytechnic Normal University, Guangzhou, Guangdong, China
ORCID: https://orcid.org/0000-0002-4755-9245
Junxi Kang
Roles: Data curation, Validation, Writing – original draft
E-mail: [email protected]
Affiliation: School of Automation, Guangdong University of Technology, Guangzhou, Guangdong, China
ORCID: https://orcid.org/0009-0002-1603-5921
Jiao Yin
Roles: Formal analysis, Investigation, Writing – original draft
Affiliation: Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Victoria, Australia
ORCID: https://orcid.org/0000-0002-0269-2624
Jinli Cao
Roles: Writing – review & editing
Affiliation: Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Victoria, Australia
ORCID: https://orcid.org/0000-0002-0221-6361
© 2025 Yuan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.