Advancements in live audio processing, specifically in sound classification and audio captioning technologies, have widespread applications ranging from surveillance to accessibility services. However, traditional methods encounter scalability and energy-efficiency challenges. To overcome these, Triboelectric Nanogenerators (TENGs) are explored for energy harvesting, particularly in live-streaming sound monitoring systems. This study introduces a sustainable methodology that integrates TENG-based sensors into live sound monitoring pipelines, enhancing energy-efficient sound classification and captioning through model selection and fine-tuning strategies. Our cost-effective TENG sensor harvests ambient sound vibrations and background noise, producing up to 1.2 µW cm−2 of output power and successfully charging capacitors, demonstrating its capability for sustainable energy harvesting. The system achieves 94.3% classification accuracy using the Hierarchical Token-Semantic Audio Transformer (HTS-AT) model, identified as optimal for live sound event monitoring. Additionally, continuous audio captioning using EnCLAP, a model combining the EnCodec neural audio codec with audio-text joint embedding for automated audio captioning, showcases rapid and precise processing suitable for live-streaming environments. The Bidirectional Encoder representation from Audio Transformers (BEATs) model also demonstrated exceptional performance, achieving an accuracy of 97.25%. These models were fine-tuned on the TENG-recorded ESC-50 dataset, ensuring the system's adaptability to diverse sound conditions. Overall, this research contributes significantly to the development of energy-efficient sound monitoring systems with wide-ranging implications across various sectors.
Introduction
The escalating need for continuous audio processing in recent years has propelled the development of sound classification and audio captioning technologies. These technologies are integral to various sectors, including surveillance, environmental monitoring, human-computer interaction, and accessibility services for the hearing-impaired.[1] For example, in surveillance, continuous audio monitoring in urban or remote environments can detect unusual sounds, enabling early-warning systems for security and safety purposes. Alternatively, captioning and real-time sound classification can be used in wearable sensors for accessibility purposes, assisting individuals with hearing impairments in real-time communication. Traditional methodologies for sound classification and audio captioning often depend on centralized processing systems, which face challenges related to scalability, energy efficiency, and adaptability in dynamic environments. These limitations are particularly evident in remote areas, where access to power infrastructure may be limited, efficient energy management is crucial, and there is a growing need for sustainable systems. Among the various types of acoustic sensors, the electret microphone is the most commonly used in acoustic detection applications. However, the limited choice of vibrating-film materials imposed by the polarization of electret materials, along with the complex fabrication process, results in higher production costs.[2]
Researchers have explored innovative approaches to continuous audio processing to overcome these challenges, including incorporating low-cost energy harvesting technologies.[3] One promising technology in this regard is the Triboelectric Nanogenerator (TENG), which has garnered significant attention due to its ability to convert mechanical energy from ambient sources into electrical energy.[4] In a TENG, two different materials, known as triboelectric layers, come into contact and then separate, causing a transfer of electrical charge. This contact and separation generate an electrical current from mechanical energy, such as vibrations, movement, or sound waves.[5] TENGs offer the potential to power autonomous sensor systems and wearable devices, facilitating sustainable and self-sufficient operation. In contrast to electret materials used in electret acoustic sensors, triboelectric materials are more cost-effective and require simpler processing methods.
Machine learning algorithms can be leveraged for feature extraction and model training on vast speech datasets. Smart devices and robots with advanced speech recognition and intelligent dialogue capabilities have already been commercialized; examples include Amazon Echo, Apple HomePod, and Mi AI.[6,7] The convergence of TENG technology with live-streaming pipelines for sound tasks offers an exciting opportunity for continuous audio processing. Leveraging their superior voltage response over conventional electret condenser microphones, TENGs can also be used in energy-harvesting contexts.[8]
Unlike commercial microphones that require a battery bias for operation, TENG-based microphones operate purely on sound-triggered contact electrification, eliminating the need for external power sources. This makes TENG devices more sustainable: they not only function independently of batteries but also generate energy during operation, which can be stored in capacitors or batteries and used to power other circuits, further enhancing the system's efficiency and sustainability.[9] Because TENGs generate electricity from ambient mechanical energy, they can operate continuously without battery replacement. Moreover, they can harness energy from background noise or environmental vibrations, and this harvested energy can be collected and stored in dedicated storage devices, further enhancing the sustainability and autonomy of TENG-powered systems.[10] This ensures the uninterrupted operation of sound monitoring systems in remote or harsh environments where access to power infrastructure may be limited.
Furthermore, recent advancements in microphone technology face several key challenges. One major issue is the high production cost of high-quality microphones, which restricts their accessibility and broader use. Additionally, the precision of the data captured by microphones and their sensitivity are often inadequate for accurate classification and analysis.[2]
Issues like background noise, signal interference, and inherent limitations in microphone design contribute to inaccuracies in data collection, which can diminish the effectiveness of various applications. TENG-based sensors are cheap, compact, lightweight, and easy to integrate into wearable devices, IoT (Internet of Things) platforms, and other embedded systems. Integrating these sensors with machine learning models can enhance predictive accuracy, making them an even more valuable tool for real-time applications.
However, traditional approaches to processing TENG data for sound classification have encountered challenges such as long operation cycles and low efficiency. For instance, Wang et al. developed a self-sustaining wireless sensing system consisting of a TENG sensor for energy harvesting, a wake-up circuit, a Micro-Electro-Mechanical System (MEMS) microphone for sound collection, and a microcontroller unit (MCU) for wireless transmission. In their setup, the TENG harvests ambient sound energy and triggers the wake-up circuit, while sound recording is handled solely by the MEMS microphone, leaving the TENG sensor underutilized for sound detection.[14]
This research directly addresses such inefficiencies by employing a TENG-based sensor for both energy harvesting and real-time sound monitoring. This dual functionality eliminates the need for a separate microphone, simplifying the system architecture and enhancing overall efficiency. The integrated sound sensing and energy-harvesting capabilities offer a streamlined, sustainable solution for real-time sound classification, reducing system complexity and power consumption.
Numerous machine learning methods have been developed for audio classification, encompassing both supervised and unsupervised approaches.[15] Notable examples include the Audio Spectrogram Transformer (AST) and the Self-Supervised Audio Spectrogram Transformer (SSAST) models.[16,17] The AST relies entirely on attention mechanisms, avoiding convolutions, and is inspired by the success of purely attention-based models in computer vision; it takes audio spectrograms as input, treating them like images.[17] The SSAST builds upon the AST framework by adopting a self-supervised learning approach, which reduces the need for extensive labeled datasets during pre-training by instead using unlabeled data. Both models have demonstrated strong performance in audio classification tasks on the Environmental Sound Classification (ESC-50) dataset. However, this paper explores two other transformer-based models that have shown superior performance: the Hierarchical Token-Semantic Audio Transformer (HTS-AT) and Bidirectional Encoder representation from Audio Transformers (BEATs). While AST and SSAST have shown promising results, they come with limitations such as longer training times, the need for large labeled datasets, and high computational costs. These constraints make them less suitable for real-time applications, particularly in resource-constrained environments.
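As noted above, these transformer classifiers consume a log-mel spectrogram rendered as a two-dimensional "image" of the waveform. The snippet below is a minimal sketch of such a frontend using torchaudio; the parameters (n_fft, hop length, number of mel bins) are illustrative assumptions rather than the exact settings of AST, HTS-AT, or BEATs, and "clip.wav" is a placeholder file name.

```python
import torchaudio

# Load a short clip and convert it to a log-mel spectrogram "image".
waveform, fs = torchaudio.load("clip.wav")                  # placeholder path
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=fs, n_fft=1024, hop_length=320, n_mels=64   # illustrative frontend settings
)(waveform)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)        # shape: (channels, n_mels, frames)
```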
In contrast, HTS-AT and BEATs models offer superior performance due to their efficient handling of complex audio signals and reduced computational overhead. The HTS-AT model is selected due to its hierarchical approach to token semantics, which enables it to capture both local and global features of audio signals more effectively than traditional transformers. This hierarchical structure is especially beneficial for complex audio classification tasks that require multiple levels of abstraction to understand the audio content.[11] Meanwhile, the BEATs model utilizes bidirectional encoding, allowing it to account for context from preceding and following audio frames, enhancing its classification and prediction accuracy. However, despite these strengths, its relatively heavy architecture leads to longer prediction times, making it less suitable for real-time applications. Similarly, the HTS-AT model, while efficiently handling complex audio signals, requires significant computational power, which also limits its real-time usability on lower-powered systems.
In contrast, the EnCodec Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning model (EnCLAP) plays a crucial role in this study by providing a lighter and faster alternative for real-time audio processing and captioning on systems with weaker computational resources. With its optimized design for audio signal compression and embedding, EnCLAP is better suited for rapid output generation compared to BEATs and HTS-AT. By efficiently compressing audio data while retaining essential features, EnCLAP supports scalable, real-time processing without the computational delays typically associated with larger models.[12]
EnCLAP not only accelerates processing but also enhances interpretability. Its joint audio-text embedding capability allows the automated generation of accurate and meaningful captions, translating detected audio events into descriptive text in real-time. This functionality is particularly advantageous in scenarios requiring immediate interpretation of audio events, such as environmental hazard detection, surveillance, and accessibility applications. By enabling the TENG system to convert detected sounds into accessible and interpretable text descriptions, EnCLAP adds significant value to the pipeline, making it a practical solution for dynamic, real-world applications where prompt audio captioning and understanding are essential.
The principal aim of this research is to develop a TENG-based sound sensor that is cost-effective and made from well-known materials such as Aluminum (Al) and Teflon, which are both inexpensive and lightweight. The sensor is integrated with advanced machine learning models, namely HTS-AT, BEATs, and EnCLAP, and is intended for live-streaming sound classification and audio captioning, promoting the development of energy-efficient sound monitoring systems. The TENG sensor serves a dual role as both an energy harvester and a sound signal collector, replacing the need for traditional microphones. To the best of our knowledge, this study is the first to apply the HTS-AT and BEATs models in the context of TENG-based sound monitoring systems, introducing a novel approach to enhance real-time sound classification and captioning.[18] The pipeline encompasses the entire workflow from data acquisition to final output, integrating signal processing, machine learning algorithms, and caption generation modules. At the core of this approach lies the integration of TENG-based energy harvesting units, which provide the necessary power for sustained operation in real-world scenarios. Additionally, the performance of the pipeline is assessed using reliable datasets and compared with existing non-live methods. This technology holds great promise for enabling innovative applications in fields such as environmental wireless monitoring, healthcare, smart cities, self-powered sensors, IoT, Human-Robot Interaction (HRI), and assistive technologies, thereby paving the way for a more sustainable and interconnected future.[19–21]
Results and Discussion
System Design
The live streaming system for sound classification and audio captioning, depicted in Figure 1, comprises the TENG sensor and machine learning modules. The TENG device continuously monitors environmental sounds and begins recording when a sound event exceeds a predefined threshold. It captures 5 s of audio starting from the moment the threshold is exceeded and then passes the audio clip to the HTS-AT, BEATs, or EnCLAP model to classify the sound event or provide a contextual description of it.[11,12]
[IMAGE OMITTED. SEE PDF]
Regarding the design of the TENG device, as illustrated in Figure 2a, the TENG features layers of Al-deposited paper and Al-deposited Polytetrafluoroethylene (PTFE), each with a 300 nm-thick Al coating, serving as the positive and negative triboelectric components, respectively. The Al coating enhances charge collection on the PTFE layer and serves as both the electrode and the tribo-positive layer on the paper. The paper layer has a thickness of 100 µm, while the PTFE layer is 50 µm thick, providing a balanced structure for effective triboelectric generation. To optimize performance, the tribo-positive paper layer incorporates patterned holes made with an Ultraviolet (UV) laser for an improved resonance response. These holes are strategically designed to interact with acoustic waves by permitting air pressure to pass through, which enhances the vibration of the triboelectric membrane. By reducing air resistance that could dampen vibrations, the holes form small resonant cavities that amplify vibrations as sound waves move through them.[22] The Polydimethylsiloxane (PDMS) separator ensures uniform coverage and facilitates electrical signal collection; it is delicately sliced into lines and positioned during assembly to match the device's perimeter.[22] Copper tape wiring connects the Al sections, finalizing assembly and enabling signal collection. Figure 2b shows the cascade device, which consists of units connected in series with dimensions of 2 cm × 2 cm, 3 cm × 3 cm, and 4 cm × 4 cm, ensuring comprehensive sound frequency coverage from 20 Hz to 20 kHz.[23] Inspired by the basilar membrane, as employed in prior acoustic sensors, this design enhances sensitivity and accuracy across the frequency spectrum.[10,24–26]
[IMAGE OMITTED. SEE PDF]
Acoustic Analysis and Characteristics
The sound sensing performance of the TENG was evaluated using a linear frequency sweep ranging from 20 Hz to 20 kHz over a 20-s duration. All tests were conducted under consistent conditions, with a fixed background noise level and a sound pressure level of 90 dB. The voltage output from the device was recorded to analyze the correlation between acoustic inputs and the device's electrical response. One of the key analyses involved generating spectrograms, which visually represent the signal's frequency content over time, helping to observe how the TENG responds to different frequencies. The power spectral density (PSD) was also plotted to evaluate the output power at each frequency. The PSD provided insight into how the output power varied across the frequency spectrum, enabling a more detailed assessment of the device's sensitivity to various sound frequencies.
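As a minimal sketch of this analysis, the following Python snippet computes a spectrogram and a Welch power spectral density from a digitized voltage trace; the sampling rate and the synthetic placeholder signal are assumptions, not values from the measurement setup.

```python
import numpy as np
from scipy import signal

fs = 48_000                                  # assumed sampling rate of the digitized trace
voltage = np.random.randn(20 * fs)           # placeholder for the 20 s recorded sweep

# Spectrogram: frequency content of the TENG output over time.
f_spec, t_spec, Sxx = signal.spectrogram(voltage, fs=fs, nperseg=2048)

# Welch PSD: average output power at each frequency across the sweep.
f_psd, Pxx = signal.welch(voltage, fs=fs, nperseg=4096)
psd_db = 10 * np.log10(Pxx + 1e-20)          # express the PSD in dB for comparison
```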
The frequency response data are presented in Figure 3a, which visually represents the linear frequency sweep captured. A comparison is made between the frequency sweep recorded using the TENG and the original playback sound. Notably, the power reached levels as high as −30 dB, surpassing findings from similar studies.[22,27] Furthermore, using cascaded devices has enhanced frequency detection, extending it close to the upper limit of 20 kHz. These findings demonstrate the potential of TENG technology for integration with sound monitoring systems, as it can detect a wide range of frequencies while maintaining high output performance. This capability is illustrated in Figure 3c, which shows the power spectral density of the recorded frequency sweep, demonstrating the device's ability to capture signals up to ≈20 kHz. The figure also indicates that the device can effectively record various frequencies, as evidenced by the high power levels across different frequency bands. Furthermore, its sensitivity is comparable to that of commercial electret microphones, which are more costly and energy-consuming, with a sensitivity of ≈−40 dB for frequencies below 5 kHz.[27] Additionally, the structure of the device was optimized for performance, as shown in Figure 3b, where a hole size of 0.8 mm was found to yield the highest output of ≈26 volts. These holes create small resonant cavities that amplify vibrations as sound waves pass through them.[22]
[IMAGE OMITTED. SEE PDF]
Model Selection
The audio classification study employed the HTS-AT and BEATs models.[13] The HTS-AT model is a supervised learning approach designed to overcome the AST's limitations of lengthy training periods and high GPU usage.[11,17,28] The BEATs model, on the other hand, utilizes an iterative pre-training framework that incorporates effective acoustic tokenizers and a self-supervised learning strategy. Both models were selected for their proven effectiveness and robustness in audio classification tasks: HTS-AT, with its hierarchical transformer structure, excels at capturing long-range dependencies in audio signals, which is crucial for accurate classification, while BEATs leverages a combination of convolutional and transformer-based components, balancing local feature extraction with global context understanding. Although AST and SSAST are also strong contenders, HTS-AT and BEATs were chosen based on their superior performance in preliminary experiments and their ability to handle the specific characteristics of our dataset. Model fine-tuning was carried out on both the original ESC-50 dataset and a dataset recorded with the TENG device.[29] The evaluation utilized the test split from the TENG dataset and was conducted in a non-live mode, in which the ESC-50 recordings were captured with the TENG device before classification, as a substantial amount of data was required for accurate results. The ESC-50 dataset consists of 2000 environmental sound clips organized into five major categories of ten classes each, with 40 audio clips per class, totaling ≈2.7 h of audio. It encompasses a wide range of real-world sound conditions, including varying background noise, making it particularly challenging for training transformer architectures, which typically require large and diverse datasets to perform optimally. To address this limitation, fine-tuning pre-trained models, rather than training from scratch, is employed to improve performance. This approach uses pretrained models as a foundational framework, allowing knowledge from a related task to compensate for the limited data in the ESC-50 dataset. In this study, a model pre-trained on the AudioSet-2M dataset serves as the base model, which is then fine-tuned on the TENG-recorded ESC-50 dataset to optimize performance.[29]
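The transfer-learning strategy described above can be sketched as follows: an AudioSet-pretrained encoder is kept as the backbone and a new 50-class head is trained on the TENG-recorded clips. The backbone, input shapes, and hyperparameters shown here are stand-in assumptions so the sketch runs end to end; the actual HTS-AT and BEATs encoders are loaded from the respective authors' repositories.

```python
import torch
import torch.nn as nn

NUM_ESC50_CLASSES = 50

class FineTunedClassifier(nn.Module):
    """Wrap a pretrained audio encoder with a fresh classification head."""
    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int = NUM_ESC50_CLASSES):
        super().__init__()
        self.backbone = backbone                       # AudioSet-2M pretrained encoder (assumption)
        self.head = nn.Linear(embed_dim, num_classes)  # new task-specific head for ESC-50

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        features = self.backbone(spectrogram)          # pooled embedding, shape (batch, embed_dim)
        return self.head(features)                     # class logits

# Stand-in encoder so the sketch runs; the real encoders come from their checkpoints.
backbone = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1, 768))
model = FineTunedClassifier(backbone, embed_dim=768)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a fake batch of log-mel "images".
spectrograms = torch.randn(8, 1, 64, 431)              # (batch, channel, mels, frames) — assumption
labels = torch.randint(0, NUM_ESC50_CLASSES, (8,))
loss = criterion(model(spectrograms), labels)
loss.backward()
optimizer.step()
```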
For the task of audio captioning, this work selects the EnCLAP model, which shows state-of-the-art results on two commonly used audio captioning benchmarks, AudioCaps and Clotho. Prior to caption generation, spectral subtraction is applied to remove background noise from the recorded audio, thereby enhancing sound quality. To achieve such high performance, EnCLAP employs two different encoders to obtain representations at two different levels: EnCodec, a neural codec model, provides time-step-level, discrete representations, while CLAP provides sequence-level, continuous representations. By concatenating the two representations, EnCLAP obtains the input to the decoder, which is a pre-trained BART model.
The training objective is to minimize the cross-entropy between the ground-truth caption and the EnCLAP-generated caption. Moreover, EnCLAP proposes an auxiliary task, Masked Codec Modeling (MCM), which span-masks the EnCodec representations and is shown to improve the acoustic awareness of pretrained language models.
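The spectral-subtraction denoising step applied before captioning can be sketched as below; it assumes that a short leading segment of each clip contains only background noise, which is an illustrative assumption rather than a detail specified in the pipeline.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(audio: np.ndarray, fs: int, noise_seconds: float = 0.5) -> np.ndarray:
    """Subtract an estimated noise magnitude spectrum from the clip (simple sketch)."""
    f, t, Z = stft(audio, fs=fs, nperseg=1024)
    noise_mag = np.abs(Z[:, t < noise_seconds]).mean(axis=1, keepdims=True)  # noise estimate
    clean_mag = np.maximum(np.abs(Z) - noise_mag, 0.0)                       # subtract, floor at 0
    Z_clean = clean_mag * np.exp(1j * np.angle(Z))                           # keep original phase
    _, cleaned = istft(Z_clean, fs=fs, nperseg=1024)
    return cleaned
```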
Audio Classification and Captioning
Qualitative and quantitative analyses were conducted to evaluate audio quality, utilizing human listening assessments. Additionally, random samples were selected, and their spectrograms were compared with the original recordings to verify high-quality sound capture.
As shown in Figure 4, the TENG's capability to accurately capture a broad range of frequencies is highlighted by the comparison between the audio recorded by the TENG and the original playback sound. When the models were evaluated on datasets, their performance was similar for both the ESC-50 and TENG ESC-50 recorded datasets.
[IMAGE OMITTED. SEE PDF]
After confirming the device's strong performance in recording the various sounds in the ESC-50 dataset, machine-learning models were employed for classification. The audio classification results revealed unsatisfactory performance for HTS-AT and BEATs fine-tuned on the original ESC-50, achieving 46.50% and 53.25% accuracy, respectively, on the TENG-recorded ESC-50 test set. Notably, fine-tuning on the TENG-recorded dataset significantly improved performance, with HTS-AT achieving 94.30% accuracy and BEATs reaching 97.25%. This underscores the necessity of fine-tuning on the recorded dataset rather than relying solely on the original. The HTS-AT model achieved comparable accuracy in live streaming system tests, demonstrating its effectiveness in real-time audio processing. These results are detailed in Table 1.
Table 1 Model performance on different fine-tuning datasets.
| Model name | Fine-tuned on original ESC-50 [%] | Fine-tuned on TENG-recorded ESC-50 [%] |
| HTS-AT | 46.50 | 94.30 |
| BEATs | 53.25 | 97.25 |
The difference in accuracy between BEATs and HTS-AT can be attributed to their distinct architectures, with BEATs, a heavier, self-supervised model, demonstrating superior performance. In the live pipeline, consistency in structure and accuracy is maintained, as the same model and recording device are employed. However, for live sound event monitoring, the HTS-AT model is chosen for classification despite its lower accuracy compared to BEATs. This decision is driven by the fact that HTS-AT is a lighter model than BEATs, resulting in shorter classification times in the live system.[13] The token-semantic architecture used in HTS-AT also allows it to identify the timestamps of sounds, which is desirable for localizing environmental sound events in time. HTS-AT computes attention within windows of the input spectrogram throughout the network and uses a patch-merge layer to reduce the sequence size. The model was found to scale well with higher-resolution spectrograms, enabling the encoding of longer audio segments.[11] Thus, in the live pipeline, efficiency and latency are prioritized, making HTS-AT more suitable despite its slightly lower accuracy. The entire process, from sound event detection and recording to passing the clip to the model and displaying the prediction result, was completed within a time frame of 12 s.
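The latency trade-off driving this choice can be quantified with a simple timing helper like the one below; the model and example input are placeholders supplied by the caller, and the measurement reflects only inference time, not recording or preprocessing.

```python
import time
import torch

@torch.no_grad()
def mean_latency_ms(model: torch.nn.Module, example_input: torch.Tensor, runs: int = 20) -> float:
    """Average single-clip inference latency in milliseconds (simple CPU timing sketch)."""
    model.eval()
    model(example_input)                      # warm-up pass, excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        model(example_input)
    return (time.perf_counter() - start) / runs * 1e3
```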
The study conducted continuous audio captioning using the EnCLAP model, achieving remarkable results with a processing time of 6 s (see example captions generated in Table 2). This duration encompasses audio recording, sound event detection, spectral subtraction for quality enhancement, and, ultimately, captioning by the EnCLAP model. Its ability to swiftly process audio data highlights its suitability for applications requiring rapid and accurate captioning, particularly in live-streaming sound monitoring systems.
Table 2 Example captions generated from the EnCLAP model.
| Human labels | Model output |
| keyboard typing | Mechanical humming and tapping |
| laughing | A person is laughing |
| church bell ringing | A large bell rings |
| alarm clock | A phone is ringing |
| frog croaking | High pitched screeching |
| siren | An emergency siren is triggered several times |
Live Pipeline
The live pipeline comprises four components: the oscilloscope that picks up the TENG output, sound event detection, audio classification, and audio captioning. The oscilloscope captures the voltage output of the device and streams it continuously to the laptop, where the rest of the models run on the data. Sound event detection is implemented simply with a voltage threshold, on the reasoning that sound events produce a higher voltage response than the ambient background. Once a voltage value exceeding the threshold is detected, that segment of data is clipped and then passed to the two machine-learning models.
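A minimal sketch of this threshold-triggered loop is shown below; the sampling rate, threshold value, and streaming interface are assumptions for illustration, not the measured settings of the oscilloscope setup.

```python
import numpy as np

FS = 16_000              # assumed sampling rate of the digitized TENG signal
THRESHOLD_V = 0.5        # assumed voltage threshold marking a sound event
CLIP_SECONDS = 5         # clip length passed to the models, as in the pipeline

def monitor(stream, model_fn):
    """Watch the voltage stream, clip 5 s after a threshold crossing, and run the model."""
    stream = iter(stream)                          # iterator yielding NumPy sample chunks
    buffer = np.empty(0, dtype=np.float32)
    for chunk in stream:
        start = len(buffer)                        # index where the triggering chunk begins
        buffer = np.concatenate([buffer, chunk])
        if np.abs(chunk).max() > THRESHOLD_V:      # primitive sound-event detection
            while len(buffer) - start < CLIP_SECONDS * FS:
                buffer = np.concatenate([buffer, next(stream)])   # gather the full 5 s clip
            clip = buffer[start:start + CLIP_SECONDS * FS]
            yield model_fn(clip)                   # classification or caption for the event
            buffer = np.empty(0, dtype=np.float32) # reset between events
```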
The audio classification model chosen is the HTS Audio Transformer (HTS-AT). This model is built upon the Swin Transformer but implements a hierarchical structure and attention calculations within input windows to reduce the total number of parameters used. Ideally, the model would load once and then run inference on sound events as they are clipped, but due to the current setup, the model is reloaded with the dataset each time. As a result, the model must be reloaded with each new sound event before running inference, adding to the latency of the pipeline.
Table 2 presents the comparison between human-labeled sound categories and the corresponding outputs generated by the EnCLAP model. While the overall performance of the EnCLAP model in the live pipeline was satisfactory, there were instances where discrepancies occurred, resulting in false negatives. Such discrepancies, however, remain within acceptable margins in the context of this study.
Energy Harvesting
To explore and compare the output performance of different materials used as tribo-positive layers, various eco-friendly and cost-effective materials, including chitosan film, cellulose nanofibrils (CNF) aerogel, an aerogel with CNF and reduced graphene oxide (CNF@rGO), and lignin-containing CNF (LCNF) films, were fabricated with identical thicknesses and structures. These materials were then tested under the same linear frequency sweep conditions as Al. The maximum output voltage results for each material are presented in Figure 5a, where Al achieved the highest output of ≈30 volts. In our comparative study of materials used in the proposed device, Al also showed the highest Signal-to-Noise Ratio (SNR) among all tested options. This superior SNR and the highest maximum output voltage suggest that Al is particularly effective in reducing background noise while accurately capturing sound signals. Thus, it is an excellent choice for real-time sound classification applications and energy harvesting due to its high output performance. The improved performance of the Al-based TENG microphone highlights the critical role of material selection in ensuring high-precision acoustic sensing.[24]
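The SNR comparison between candidate materials can be computed along the lines of the helper below, assuming a recorded event segment and a background-only segment are available; this is the generic definition rather than the exact evaluation protocol used in the study.

```python
import numpy as np

def snr_db(event: np.ndarray, background: np.ndarray) -> float:
    """Signal-to-noise ratio in dB from an event segment and a background-only segment."""
    signal_power = np.mean(event.astype(np.float64) ** 2)
    noise_power = np.mean(background.astype(np.float64) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)
```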
[IMAGE OMITTED. SEE PDF]
The sensor was also tested using various frequencies to evaluate the performance of the proposed acoustic sensor across different sound frequencies. As shown in Figure 5b, the TENG acoustic sensor delivered strong output across a broad frequency range from 50 to 3000 Hz, indicating its ability to detect a wide range of frequency sounds with minimal noise interference. The results indicate that the device's resonant frequency is 200 Hz under a constant sound pressure of 90 dB.
This behavior is attributed to the inherent limitations in the deformability of the vibrating membrane, which causes variations in the contact between the Al layer and the PTFE layer as the sound frequency changes at a constant sound pressure.[30,31] These results highlight the TENG's potential for integration into sound monitoring systems, as it effectively detects a wide range of frequencies and maintains high output performance. They also show that the device harvests maximum energy at a frequency of 200 Hz. Due to their inherent energy harvesting capabilities, TENG devices can be repurposed as energy harvesters within a system. Energy captured from background noise can be stored in a power unit, such as a rechargeable battery or capacitor, potentially transforming the live sound monitoring system into a sustainable, self-powered setup. To illustrate this potential, the device was used to charge a 1 µF capacitor, which reached ≈2.3 V after 90 s at the resonant frequency, as depicted in Figure 5c.
Furthermore, output voltage and current measurements at various load resistances were conducted at the resonant frequency. Figure 5d shows that peak output power reached up to 1.2 µW cm−2. This level of power generation is adequate for the continuous harvesting of background energy. The combination of these findings, alongside the functionality demonstrated in the live sound monitoring system, underscores the potential of integrating the device into a sustainable sound monitoring system.[26]
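As a quick consistency check of these figures, the stored capacitor energy and the areal power density can be computed as below; the load resistance and active area arguments in the helper are placeholders, since only the resulting 1.2 µW cm−2 peak value is reported in the text.

```python
# Back-of-the-envelope check using values reported above.
C = 1e-6                               # 1 µF storage capacitor
V = 2.3                                # voltage reached after 90 s at the 200 Hz resonance
energy_J = 0.5 * C * V**2              # E = 1/2 C V^2 ≈ 2.6 µJ stored
avg_charging_power_W = energy_J / 90   # ≈ 29 nW average charging power

def areal_power_density_uW_cm2(v_load: float, r_load: float, area_cm2: float) -> float:
    """Peak areal power density P/A = V^2 / (R * A), returned in µW cm^-2."""
    return (v_load**2 / r_load) / area_cm2 * 1e6
```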
Conclusion
In conclusion, this research underscores the transformative potential of TENG-based sensors in advancing continuous audio processing technologies. By harnessing ambient mechanical energy, these devices offer sustainable and energy-efficient solutions, addressing challenges in scalability and energy efficiency. Ultimately, this enables the deployment of self-sustaining devices in low-energy environments, allowing intelligent live sound monitoring within natural surroundings.
The TENG-based sensor not only functions as a microphone but also as an energy harvester, effectively eliminating the need for traditional commercial microphones. It can detect sound frequencies up to 20 kHz, showcasing its broad range and high sensitivity. The integration of TENG devices into live-streaming sound monitoring pipelines presents a novel approach to audio processing, enabling efficient sound classification and captioning in dynamic environments. The selected transformer-based models achieve high test accuracy with the recorded data. The study's discoveries, such as the rapid classification and captioning times, along with the demonstrated ability of these sensors to harvest background noise and ambient sound vibrations, significantly contribute to advancing energy-efficient sound monitoring systems. These systems are ideal for applications such as urban surveillance, where continuous sound monitoring is essential for detecting incidents like car accidents, alarms, or glass breaking. Similarly, in remote environments, these sensors can support continuous monitoring by capturing and storing background noise on storage drives for future usage, helping monitor wildlife activities or environmental changes over time. Additionally, in healthcare facilities, continuous audio monitoring could serve as an early-warning system. Other potential applications include industrial sites, where early detection of machinery malfunctions through sound can enhance safety and prevent accidents.
The system's robustness was further demonstrated by similar high accuracy in both live-streaming and non-live settings, indicating consistent performance across different operational environments. These advancements, including a 1.2 µW cm−2 output power and 94.30% classification accuracy with HTS-AT, demonstrate significant potential for applications in environmental monitoring, healthcare, and assistive technologies. Additionally, the BEATs model achieved 97.25% accuracy, further solidifying the system's performance in real-time audio classification. Moving forward, further research and development in this area hold promise for realizing a more sustainable and interconnected future.
Experimental Section
Device Fabrication
The TENG device was fabricated by layering Al-deposited paper as the tribo-positive layer and PTFE as the tribo-negative layer. Aluminum was deposited onto the paper by sputtering at a rate of 0.5 Å s−1, a chamber pressure of 10−5 Torr, and a substrate temperature of 100 °C, ensuring a uniform coating with high electrical conductivity. The PDMS separator, prepared at a 5:1 ratio with the curing agent, was cured and then sliced into lines ≈0.5 mm in width and 10 µm in thickness to match the device's perimeter. This separator was applied between the Al and PTFE layers to maintain consistent spacing, enhancing triboelectric performance by ensuring optimal contact and separation during operation.
To improve acoustic response, the tribo-positive paper layer was patterned with 0.8 mm holes created using a UV laser. The laser operated at half power (0.8 W) and used two passes to ensure complete penetration through the 100 µm-thick paper. For device assembly, the Al-coated PTFE was placed on an acrylic substrate with the PTFE side facing up. A PDMS spacer was then positioned on top of the PTFE, followed by the Al-coated paper layer, creating the layered structure shown in Figure 2a.
Each device was securely mounted on a 1 mm-thick acrylic substrate, with square holes precisely cut by a CO2 laser cutter to match each unit's shape, ensuring a secure and stable fit for all devices. The assembled devices were connected using commercial copper tape for effective signal conduction.
Testing Conditions
The TENG device's acoustic response was evaluated under controlled sound conditions. A speaker was positioned at a distance of 20 cm from the device, and sounds at various frequencies (20 Hz–20 kHz) were played at a consistent sound pressure level (SPL) of 90 dB. The audio players were anchored to prevent displacement from sound-induced vibrations, and 3D-printed waveguide enclosures were positioned in front of the Faraday cage to further isolate the audio players and devices from external influences. Data were collected using a high-sensitivity oscilloscope, and model training, testing, and performance evaluation were carried out on a Lenovo laptop featuring a 7th-generation Intel Core i5 CPU, Nvidia GeForce graphics, and 16 GB of RAM.
Model Training and Fine-Tuning
The ESC-50 dataset, containing 2000 environmental sound clips across five major categories and 50 specific classes, was used for training and evaluation. Each clip is five seconds long, and the dataset is organized into ten classes per category with 40 audio clips per class.[29] For efficient data handling, clips of the same class were concatenated to produce 50 audio files of 200 s each; these concatenated files were subsequently split back into individual clips. To maintain consistent training and testing conditions, an 80–20% train-test split was applied, with the training data further divided into 80% for training and 20% for validation. This resulted in 1280 clips for training, 320 for validation, and 400 for testing. Training used a batch size of 32 (16 for evaluation) and was conducted over a maximum of 50 epochs with early stopping based on validation accuracy to prevent overfitting. Gradient clipping with a threshold of 1.0 was applied to avoid gradient explosion. All statistical analyses were conducted using Python, with libraries such as NumPy, SciPy, and pandas. For the BEATs,[13] HTS-AT,[11] and EnCLAP[12] pretrained models, we followed the parameters outlined in their respective papers.
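The split and training configuration described above can be reproduced with a sketch like the following; per-class stratification is not enforced here, so the helper should be read as an illustration of the 1280/320/400 partition and of the stated hyperparameters rather than the exact preprocessing code.

```python
import random

def split_esc50(clip_paths, seed: int = 0):
    """Shuffle and split 2000 ESC-50 clips into 1280 train / 320 val / 400 test."""
    rng = random.Random(seed)
    clips = list(clip_paths)
    rng.shuffle(clips)
    n_test = int(0.2 * len(clips))               # 20% held-out test set (400 clips)
    n_val = int(0.2 * (len(clips) - n_test))     # 20% of the remainder for validation (320)
    test, rest = clips[:n_test], clips[n_test:]
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

# Fine-tuning configuration, with values taken from the text.
CONFIG = {
    "train_batch_size": 32,
    "eval_batch_size": 16,
    "max_epochs": 50,
    "early_stopping_metric": "val_accuracy",
    "grad_clip_norm": 1.0,
}
```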
Conflict of Interest
The authors declare no conflict of interest.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
K. Drossos, S. Adavanne, T. Virtanen, in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), IEEE 2017, 374.
H. Yang, J. Lai, Q. Li, X. Zhang, X. Li, Q. Yang, Y. Hu, Y. Xi, Z. L. Wang, Nano Energy. 2022, 104, 107932.
X. Mei, X. Liu, M. D. Plumbley, W. Wang, DCASE, arXiv 2022, arXiv:2205.05949v2.
J. Li, C. Wu, I. Dharmasena, X. Ni, Z. Wang, H. Shen, S.‐L. Huang, W. Ding, Intell. Converg. Netw. 2020, 1, 115.
W.‐G. Kim, D.‐W. Kim, I.‐W. Tcho, J.‐K. Kim, M.‐S. Kim, Y.‐K. Choi, ACS Nano. 2021, 15, 33427457.
N. Cui, L. Gu, J. Liu, S. Bai, J. Qiu, J. Fu, X. Kou, H. Liu, Y. Qin, Z. L. Wang, Nano Energy. 2015, 15, 321.
H. Sun, X. Gao, L.‐Y. Guo, L.‐Q. Tao, Z. H. Guo, Y. Shao, T. Cui, Y. Yang, X. Pu, T.‐L. Ren, InfoMat 2023, 5, e12385.
H. Yao, Z. Wang, Y. Wu, Y. Zhang, K. Miao, M. Cui, T. Ao, J. Zhang, D. Ban, H. Zheng, Adv. Funct. Mater. 2022, 32, 2112155.
X. Pu, C. Zhang, Z. L. Wang, Natl. Sci. Rev. 2022, 10, nwac170.
J. H. Han, J.‐H. Kwak, D. J. Joe, S. K. Hong, H. S. Wang, J. H. Park, S. Hur, K. J. Lee, Nano Energy. 2018, 53, 198.
K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg‐Kirkpatrick, S. Dubnov, ICASSP 2022 –2022 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Singapore 2022.
J. Kim, J. Jung, J. Lee, S. H. Woo, arXiv 2024, arXiv:2401.17690.
S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, F. Wei, arXiv 2022, arXiv:2212.09058.
Z. Wang, J. Zhao, L. Gong, Z. Wang, Y. Qin, Z. Wang, B. Fan, J. Zeng, J. Cao, C. Zhang, ACS Appl. Mater. Interfaces. 2023, 15, 37158268.
Z. Zhang, L. Wang, C. Lee, Adv. Sens. Res. 2023, 2, 2200072.
Y. Gong, C.-I. Lai, Y.-A. Chung, J. R. Glass, arXiv 2021, arXiv:2110.09784.
Y. Gong, Y.‐A. Chung, J. R. Glass, Interspeech 2021.
M. H. Bagheri, J. Li, E. Gu, K. Habashy, M. M. Rana, A. Abdullah Khan, Y. Zhang, G. Xiao, P. Xi, D. Ban, in 2024 IEEE Canadian Conf. on Electrical and Computer Engineering (CCECE), IEEE 2024, 881.
S. L. Ullo, G. R. Sinha, Sensors. 2020, 20, 11.
Q. Xu, Y. Fang, Q. Jing, N. Hu, K. Lin, Y. Pan, L. Xu, H. Gao, M. Yuan, L. Chu, Y. Ma, Y. Xie, J. Chen, L. Wang, Biosens. Bioelectron. 2021, 187, 113329.
Y. Zou, Y. Gai, P. Tan, D. Jiang, X. Qu, J. Xue, H. Ouyang, B. Shi, L. Li, D. Luo, Y. Deng, Z. Li, Z. L. Wang, Fundam. Res. 2022, 2, 619.
N. Arora, S. L. Zhang, F. Shahmiri, D. Osorio, Y.‐C. Wang, M. Gupta, Z. Wang, T. Starner, Z. L. Wang, G. D. Abowd, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2018, 60, 1.
E. Verreycken, R. Simon, B. Quirk‐Royal, W. Daems, J. Barber, J. Steckel, Commun. Biol. 2021, 4, 1275.
X. Hui, L. Tang, D. Zhang, S. Yan, D. Li, J. Chen, F. Wu, Z. L. Wang, H. Guo, Adv. Mater. 2024, 36, 2401508.
H. S. Lee, J. Chung, G.‐T. Hwang, C. K. Jeong, Y. Jung, J.‐H. Kwak, H. Kang, M. Byun, W. D. Kim, S. Hur, S.‐H. Oh, K. J. Lee, Adv. Funct. Mater. 2014, 24, 6914.
X. Fan, J. Chen, J. Yang, P. Bai, Z. Li, Z. L. Wang, ACS Nano. 2015, 9, 25790372.
J. Hillenbrand, S. Haberzettl, G. M. Sessler, J. Acoust. Soc. Am. 2013, 134, EL499.
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, 2021 IEEE/CVF Int. Conf. on Computer Vision (ICCV), IEEE 2021, 9992.
K. J. Piczak, Proceedings of the 23rd ACM International Conference on Multimedia, 2015.
Y. Li, C. Liu, S. Hu, P. Sun, L. Fang, S. Lazarouk, Acoust. Austr. 2022, 50, 383.
M. Qu, X. Chen, D. Yang, D. Li, K. Zhu, X. Guo, J. Xie, J. Micromech. Microeng. 2021, 32, 014001.
© 2025. This work is published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).