SATRN: Spiking Audio Tagging Robust Network

1. Introduction

Spiking Neural Networks (SNNs), as bio-inspired computational models that emulate the information processing mechanisms of biological neural systems, have emerged as a promising paradigm in neuromorphic computing. By implementing spike-based information transmission and processing, SNNs achieve remarkable energy efficiency through the conversion of dense matrix operations into sparse, event-driven computations [1,2]. This inherent efficiency has driven increasing research interest in applying SNNs to various temporal data processing tasks, including Dynamic Vision Sensor (DVS) processing, video analysis, and acoustic signal processing [3].

Audio tagging, a fundamental challenge in acoustic signal processing, involves the automated identification and classification of multiple acoustic events within audio segments. Unlike conventional single-label classification tasks, audio tagging often requires the simultaneous detection and classification of multiple overlapping sound events [4], such as a concurrent “dog bark” and “vehicle noise” in urban environments. While recent progress in deep learning has improved the audio tagging performance, there are still significant challenges to address in real-world applications.

1.1. Key Challenges in Audio Processing

The deployment of audio processing systems in real-world scenarios presents three major challenges. First, energy consumption remains a significant concern. Traditional approaches typically consume 10–100 times more power than biologically inspired networks. This issue is particularly critical in real-time applications, where sequential data processing introduces a considerable computational overhead, often exceeding the capabilities of edge computing devices and limiting practical deployment options.

Second, conventional neural networks exhibit inherent limitations when dealing with time-varying data. The requirement for fixed-length inputs creates artificial barriers for real-time processing, while traditional architectures often underperform with continuous streaming data. Furthermore, when processing long sequences, these networks frequently encounter error accumulation and vanishing gradient problems, compromising their ability to capture long-term dependencies.

Third, current approaches show considerable vulnerabilities in real-world environments, where environmental noise and acoustic perturbations can cause significant performance degradation. Their limited adaptability to varying input durations and temporal characteristics further reduces their reliability in diverse deployment conditions.

1.2. Advantages of SNN-Based Approach

Drawing inspiration from the biological auditory system’s efficient frequency-selective processing mechanisms [5], SNNs offer a compelling solution to these challenges through their unique architectural characteristics. The transformation of dense matrix multiplications into sparse additions significantly reduces energy consumption, while biologically inspired membrane potential dynamics enable the effective extraction of temporal features from time-varying audio signals. Additionally, discrete spike processing provides natural regularization through sparse activation patterns and enhanced noise resistance, complemented by superior streaming capabilities that eliminate the need for fixed-length inputs.

Recent architectural innovations in SNNs have demonstrated promising results across various domains, including the adaptation of deep learning architectures with membrane potential residual mechanisms [6] and the integration of self-attention mechanisms in spiking neurons [7]. Notably, the Spikeformer architecture combines transformer-based processing with spike-based information transmission, achieving state-of-the-art performance on neuromorphic datasets while maintaining competitive results on traditional computer vision benchmarks.

While existing research has explored various aspects of SNN-based audio processing, including spike encoding/decoding mechanisms and neuron model optimization [8], the potential of dynamic spike encoding for audio processing remains largely unexplored. Previous approaches often relied on conventional preprocessing methods, such as Mel spectrogram features [9], which limit the temporal processing capabilities of SNNs.

1.3. Our Contributions

In response to these challenges, this paper presents a comprehensive investigation of SNN architectures optimized for audio tagging. We introduce a novel SNN architecture specifically designed for multi-label audio classification, demonstrating competitive performance on the UrbanSound8K [10] and Freesound Dataset 50K [11] benchmarks. A key innovation of our work is the development of an event-driven audio encoding framework that processes short-duration segments, significantly enhancing temporal processing capabilities compared to traditional feature extraction methods such as Mel frequency cepstral coefficients (MFCCs) [12] or Mel spectrograms. Through extensive experimental validation, we provide a quantitative analysis demonstrating the enhanced robustness achieved through our event-based encoding approach in SNN-based audio tagging systems.

2. Related Work

2.1. Advances in Spiking Neural Networks

Recent advances in neuromorphic computing have established Spiking Neural Networks (SNNs) as a promising paradigm for energy-efficient information processing, characterized by their bio-inspired architecture and spike-based computation [3,7]. SNNs utilize spiking neurons as fundamental computational units, which transform continuous signals into discrete spike sequences through various mechanisms, including Leaky Integrate-and-Fire (LIF) neurons [13] and Parametric Leaky Integrate-and-Fire Spiking Neurons (PLIF) [14]. The discrete nature of spike-based information processing presents unique challenges for traditional gradient-based learning approaches [15,16], which has led to the development of three primary training methodologies.

Training Methodologies

The first methodology, the ANN-to-SNN conversion approach, transforms pre-trained conventional neural networks into their spiking counterparts [17]. Although computationally efficient, this approach requires extended time steps for an accurate activation approximation [18] and often suffers performance degradation due to conversion artifacts. Moreover, its applicability to neuromorphic datasets is limited by rate-coding constraints [19].

The second approach employs bio-plausible learning methods, such as Spike Timing-Dependent Plasticity (STDP) rules [20], which modify synaptic strengths based on temporal correlations between pre- and post-synaptic spikes. Despite their biological relevance, these approaches face significant challenges in scalability and robustness [19].

The third and most promising approach utilizes surrogate gradient methods [21,22] to address the non-differentiability of spike functions through carefully designed surrogate gradients [23]. These methods have demonstrated performance comparable to that of traditional ANNs [19,24] and serve as the foundation of our training framework.

2.2. Audio Event Detection and Classification

The exponential growth in multimedia content has intensified the demand for efficient audio event detection and classification systems, especially in resource-constrained environments. These systems enable applications ranging from environmental monitoring to security surveillance [25]. Although robust to visual limitations, current approaches face significant challenges in their resilience to noise and computational efficiency. Environmental noise substantially impacts detection accuracy, while resource requirements often limit deployment in edge computing scenarios. Following established practices [26,27,28,29], we employed signal-to-noise ratio (SNR) metrics to evaluate the model robustness and compare different spike encoding strategies.

2.3. Neural Architectures for Audio Processing

Inspired by neuroscientific findings that the human left hemisphere processes local details while the right hemisphere analyzes global content, modern neural architectures have sought to incorporate similar complementary processing mechanisms. While architectures like Res2Net have attempted to implement multi-scale processing through split-and-concatenation strategies, they often fail to achieve the effective integration of local information interactions and global perspectives across multiple temporal and frequency scales.

Contemporary audio scene classification has witnessed significant advances through deep learning approaches, with architectures like CNN14 [30] and transformer-based models [31] achieving state-of-the-art performance on benchmark datasets. However, these models typically process fixed-duration spectral representations, creating challenges for real-time deployment and edge computing scenarios.

Integration of Attention Mechanisms

Recent developments in attention mechanisms, originally pioneered in natural language processing [32], have been successfully adapted for SNNs [7]. These adaptations have led to significant architectural innovations in temporal–spatial integration, combining temporal convolutions with attention for spike-based filtering [33]. Additionally, enhanced spike processing has been achieved through multi-dimensional attention modules [34], while the Spikeformer architecture [7] has demonstrated the successful integration of transformer mechanisms with spike-based computation. Our work builds upon these advances by incorporating spike-based self-attention with hierarchical feature fusion (HFF) blocks, optimizing both spatial and temporal information processing.

2.4. Research Opportunities and Our Approach

Despite these advances, the application of SNNs to audio tagging presents several unexplored opportunities. Current approaches primarily rely on conventional preprocessing techniques, which impose limitations on temporal processing capabilities, introduce computational inefficiencies in feature extraction, and reduce the adaptability to streaming data. Our work addresses these limitations through two key innovations: the development of an efficient spike encoding framework specifically optimized for streaming audio processing and the integration of novel SNN architectures for multi-label audio classification. These innovations enable our system to achieve competitive performance while maintaining the energy efficiency inherent in spike-based computation.

3. Materials and Methods

This section presents our comprehensive approach to SNN-based audio tagging. Drawing inspiration from neuroscience research on hemispheric specialization, we propose a novel architecture that integrates both local detail processing and global content analysis. We first introduce the fundamental spiking neuron model, followed by our innovative time flow coding strategy. We then detail our SATRN architecture with its unique feature fusion mechanisms and propose an enhanced spike-based attention mechanism for temporal–spatial information integration.

3.1. Spiking Neuron Model

The Leaky Integrate-and-Fire (LIF) neuron model served as our fundamental computational unit, chosen for its optimal balance between biological plausibility and computational efficiency. The model implements three essential phases: membrane potential integration, spike generation, and reset mechanisms.

The membrane potential dynamics are governed by

(1) $τ \frac{dV(t)}{dt} = X(t) - (V(t) - V_{rest})$

where $V(t)$ represents the membrane potential, $X(t)$ denotes the input stimulus, $V_{rest}$ is the resting potential, and $τ$ controls the membrane potential decay rate.

Spike generation follows:

(2) $Spike(t) = \begin{cases} 1 & \text{if } V(t) \geq V_{th} \\ 0 & \text{otherwise} \end{cases}$

Post-spike, neurons reset to $V_{rest}$, while inactive neurons undergo potential decay according to $τ$. While advanced variants such as PLIF [24] and KLIF [35] neurons offer additional parametric flexibility, we adopted the standard LIF model for its proven efficiency and reliability in neuromorphic computing tasks.
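As an illustration of Eqs. (1) and (2), the following is a minimal sketch of a discrete-time LIF neuron obtained by forward-Euler discretization. The step size `dt`, the default parameter values, and the function name `lif_simulate` are assumptions made for this example and are not taken from our implementation.

```python
def lif_simulate(inputs, tau=2.0, v_th=1.0, v_rest=0.0, dt=1.0):
    """Simulate one LIF neuron over a sequence of input values.

    Forward-Euler discretisation of Eq. (1); all defaults are illustrative.
    """
    v = v_rest
    spikes = []
    for x in inputs:
        # Eq. (1): tau * dV/dt = X(t) - (V(t) - V_rest)
        v = v + (dt / tau) * (x - (v - v_rest))
        if v >= v_th:
            # Eq. (2): emit a spike, then hard-reset to the resting potential
            spikes.append(1)
            v = v_rest
        else:
            spikes.append(0)
    return spikes


# A constant supra-threshold input produces a regular spike train.
print(lif_simulate([1.5] * 6))  # -> [0, 1, 0, 1, 0, 1]
```

With a constant input of 1.5, the potential reaches 0.75 after one step and 1.125 after two, so the neuron fires on every second step.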

3.2. Architectural Comparison with Recurrent Networks

Our spike-based approach introduces fundamental innovations that differentiate it from traditional recurrent networks across three critical dimensions [36]. In terms of the connection topology, we departed from the conventional RNN architecture by implementing self-recurrent connections within individual neurons, rather than relying on cross-neuron recurrent pathways. This architectural choice incorporates intra-neuron connections with fixed leakage factors ($-e^{-\frac{dt}{τ}}$) for recurrent weights, significantly reducing the parameter space while preserving robust temporal processing capabilities.

The neuron dynamics in our model represent another key advancement through the implementation of a sophisticated state representation system. This system employs membrane potential dynamics with an integrated reset mechanism, enabling efficient event-driven computation through binary spike-based activation. The temporal integration occurs naturally through membrane potential evolution, providing functionality analogous to but more computationally efficient than that of the forgetting gate mechanism in LSTM architectures.

For the learning framework, we implemented a novel rate-coding-based loss function that integrates temporal information across all time steps:

(3) $L = ∥ Y_{label} - \frac{1}{T} \sum_{t = 1}^{T} o_{t, N} ∥_{2}^{2}$

This approach fundamentally differs from traditional RNN loss computation:

(4) $L = ∥ Y_{label} - W_{y} h_{T, N} ∥_{2}^{2}$

where $h_{T,N}$ represents the final hidden state and $W_{y}$ denotes trainable weights. Our rate-coding mechanism enables more effective temporal information integration while maintaining computational efficiency.
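The difference between Eqs. (3) and (4) can be made concrete with a small sketch. The helper names `rate_coding_loss` and `last_state_loss`, and the toy values, are hypothetical and serve only to show that Eq. (3) averages the output-layer spikes over all T time steps, whereas Eq. (4) reads out only the final hidden state.

```python
def rate_coding_loss(label, outputs):
    """Eq. (3): squared L2 distance between the label and the mean output.

    `outputs` is a list of T per-step output vectors o_{t,N}.
    """
    T = len(outputs)
    rate = [sum(step[i] for step in outputs) / T for i in range(len(label))]
    return sum((y - r) ** 2 for y, r in zip(label, rate))


def last_state_loss(label, h_last, w_y):
    """Eq. (4): squared L2 distance between the label and W_y h_{T,N}."""
    pred = [sum(w * h for w, h in zip(row, h_last)) for row in w_y]
    return sum((y - p) ** 2 for y, p in zip(label, pred))


# Toy example: T = 4 time steps, 2 output neurons.
outputs = [[1, 0], [1, 0], [0, 0], [1, 1]]
print(rate_coding_loss([1.0, 0.0], outputs))  # -> 0.125
```

Here the firing rates are 0.75 and 0.25, so each output contributes 0.25² to the loss.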

3.3. Time Flow Coding

While MFCC and Mel spectrogram techniques dominate conventional audio processing, we advance the state of the art through an enhanced Mel spectrogram approach [31,37]. Our framework integrates optimized short-time Fourier transform computation with advanced Mel filter bank processing, enabling dynamic feature map generation that better captures the temporal characteristics of audio signals.

Novel Encoding Strategies

As illustrated in Figure 1, our framework introduces two complementary encoding strategies for efficient audio processing. The baseline Time-Static Coding (TSC) approach establishes a foundation through fixed-length audio segment processing, standard Mel spectrogram generation, and temporal feature replication. Building upon this, we introduce time flow coding (TFC) as our primary contribution, implementing adaptive audio segmentation. In TFC, an audio clip of length $T$ is divided into smaller segments according to

$unitsec = \frac{T}{step}$

where $step$ (written $timestep$ elsewhere in this paper) is the number of segments and $unitsec$ is the duration of each segment in seconds.

In Time-Static Coding (TSC), we ensure that the total audio duration remains the same as in TFC. For example, if TFC uses an 8 s audio clip with $step = 4$ and $unitsec = 2$, TSC will process the entire 8 s audio as well. However, since TSC does not dynamically split the input, its spectrogram length will be a multiple of the “step” used in TFC. Consequently, the number of neurons required for processing in TSC will differ from that of TFC.

Based on the average clip lengths, we chose a fixed time step that balanced the need for sufficient temporal resolution against computational efficiency. For clips that did not align perfectly with this fixed time step, we employed a padding strategy in which the last segment of a clip (which may be shorter than the fixed time step) is padded with the values from the last time slice. This ensures that all segments have a consistent size, allowing the model to process them uniformly and extract features across the entire clip, as illustrated in Algorithm 1, and it handles boundary cases where the audio duration is not an integer multiple of the time step. Overall, TFC incorporates the independent Mel filtering of segments, dynamic temporal feature concatenation, and buffer-based supplementation for short segments, enhancing the system’s ability to process variable-length audio inputs.

To enhance the robustness of our encoding framework, we implemented a comprehensive augmentation strategy. This included frequency and time masking augmentation [38] combined with a novel time-based random Mixup approach. Additionally, we employed adaptive feature normalization to ensure consistent processing across diverse audio inputs. These augmentation techniques work synergistically to improve the model’s generalization capabilities and resistance to various forms of audio distortion.

Algorithm 1: Time flow coding algorithm.

Input: Audio sequence X of length $timestep \times unitsec$
Parameters: $n_{mels}$, window size, buffer Y
Output: Feature sequence $Y = \{y_{b,t} \mid b \in (1, B), t \in (1, T)\}$

1: for $t = 1$ to $timestep$ do
2:   $x_{t} = X[(t - 1) \times samplerate : t \times samplerate]$
3:   $segment_{t} = STFT(x_{t}, windowsize)$
4:   $output_{t} = Melspectrogram(segment_{t}, n_{mels})$
5:   $y_{t} = output_{t}$
6: end for
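The segmentation and padding steps of TFC can be sketched as follows. This is a minimal illustration under stated assumptions rather than our implementation: it assumes each segment spans `unitsec * sample_rate` samples, it pads at the waveform level with the last available sample as one reading of the padding rule, and it takes the per-segment feature extractor (STFT followed by Mel filtering in Algorithm 1) as a caller-supplied function. The names `time_flow_segments`, `time_flow_coding`, and `feature_fn` are hypothetical.

```python
def time_flow_segments(samples, sample_rate, unitsec, timestep):
    """Split a waveform into `timestep` equal-length segments."""
    seg_len = int(unitsec * sample_rate)  # samples per segment (assumed)
    segments = []
    for t in range(timestep):
        seg = list(samples[t * seg_len:(t + 1) * seg_len])
        if len(seg) < seg_len:
            # Pad a short segment with the last available value (assumed
            # reading of the padding rule); use 0.0 for an empty clip.
            fill = seg[-1] if seg else (samples[-1] if len(samples) else 0.0)
            seg.extend([fill] * (seg_len - len(seg)))
        segments.append(seg)
    return segments


def time_flow_coding(samples, sample_rate, unitsec, timestep, feature_fn):
    """Apply `feature_fn` to each segment and stack the results over time."""
    segments = time_flow_segments(samples, sample_rate, unitsec, timestep)
    return [feature_fn(seg) for seg in segments]


# Seven samples at 2 Hz split into four 1 s segments; the last one is padded.
print(time_flow_segments(list(range(7)), sample_rate=2, unitsec=1, timestep=4))
# -> [[0, 1], [2, 3], [4, 5], [6, 6]]
```

In practice `feature_fn` would compute a Mel spectrogram for one segment, so that each time step of the SNN receives the features of its own segment.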

3.4. SATRN Architecture

Inspired by biological hemispheric specialization, we propose the SATRN architecture, which utilizes a dual-stream processing approach for feature extraction. Figure 2 illustrates its structure. The architecture consists of three key components, each designed to enhance computational efficiency and representation learning.

The first component, the Spiking Potential Layer (SPL), performs the initial spike encoding of the input data. This layer ensures that the temporal dynamics of spiking signals are preserved, enabling efficient event-driven processing.

The second component, hierarchical feature fusion (HFF), integrates global and local feature representations. It incorporates Potential Feature Fusion (PFF) modules to merge features from both pathways effectively. The global pathway captures broad contextual patterns, while the local pathway employs an inverted bottleneck structure for fine-grained feature extraction in different stages. The multi-scale deployment of PFF modules further enhances feature integration across different abstraction levels.

The third component, Spatio-Temporal Self-Attention (STSA), refines feature representations by capturing dependencies across both spatial and temporal domains. This mechanism enables the network to model long-range relationships in event-driven data, improving its ability to process complex audio signals.

This integrated three-part architecture leverages biological principles to achieve both computational efficiency and strong feature extraction capabilities, and it is designed to optimize the performance of spike-based neural networks in audio processing tasks.

Potential Feature Fusion

The Potential Feature Fusion (PFF) module adaptively integrates local and global features through

(5) $X_{a} = Concat(X, Y)$

(6) $A = 1 + \tanh(F_{att}(X_{a}))$

(7) $O = X ⊙ A + Y ⊙ (2 - A)$

where

$X_{a} \in R^{2C \times H \times W}$ denotes the concatenated feature representation;
$F_{att}(\cdot)$ represents the attention network;
⊙ denotes element-wise multiplication.

The fusion mechanism implements adaptive feature selection:

$O = \begin{cases} 2X, & \text{if } A = 2 \text{ (X feature dominance)} \\ 2Y, & \text{if } A = 0 \text{ (Y feature dominance)} \\ X + Y, & \text{if } A = 1 \text{ (balanced contribution)} \end{cases}$

The attention network $F_{att}$ is defined as follows:

(8) $F_{att} (X) = BN ({Conv}_{1 \times 1} (Spike (BN ({Conv}_{1 \times 1} (X))))) .$

Here, the two ${Conv}_{1 \times 1}$ blocks represent $1 \times 1$ convolution operations, BN denotes batch normalization, and Spike serves as the activation function of the neuron. To optimize the computational efficiency, we introduced a reduction ratio, r, in the intermediate layer, reducing the channel dimension from $2C$ to $C / r$ (where we set $r = 2$) before restoring it back to $2C$.
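The gating behaviour of Eqs. (6) and (7) can be checked with a short element-wise sketch. The attention network $F_{att}$ is not modelled here; its output is passed in directly as `att_logits`, and the function name `pff_fuse` is hypothetical.

```python
import math


def pff_fuse(x, y, att_logits):
    """Fuse two flat feature lists following Eqs. (6) and (7).

    `att_logits` stands in for the output of F_att(X_a), which is
    assumed to be given rather than computed here.
    """
    fused = []
    for xi, yi, f in zip(x, y, att_logits):
        a = 1.0 + math.tanh(f)                 # Eq. (6): gate in (0, 2)
        fused.append(xi * a + yi * (2.0 - a))  # Eq. (7)
    return fused


# A zero logit gives A = 1, i.e. the balanced case O = X + Y.
print(pff_fuse([1.0, 2.0], [3.0, 4.0], [0.0, 0.0]))  # -> [4.0, 6.0]
```

Large positive logits drive A towards 2 (the output approaches 2X), and large negative logits drive A towards 0 (the output approaches 2Y), matching the three cases listed above.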

We implemented membrane potential residual connections rather than spike-based residuals, ensuring preserved binary characteristics, crucial for neuromorphic hardware deployment. Our architecture incorporates parallel processing streams for comprehensive feature extraction. The global pathway integrates features from different stages for hierarchical information fusion, while the local pathway employs an inverted bottleneck structure similar to that of MobileNet [39], which expands low-dimensional features to higher dimensions for efficient feature extraction while reducing the computational complexity as illustrated in Table 1. This dual-stream approach ensures the robust capture of both fine-grained and contextual audio features.

3.5. Spike Time–Space Attention

We introduced an innovative Spike Time–Space Attention (STSA) mechanism that advances temporal–spatial feature integration through parallel processing streams, as illustrated in Figure 2. This mechanism decomposes attention computation into distinct temporal and spatial pathways, enabling specialized processing while maintaining computational efficiency.

3.5.1. Architectural Design

The STSA module implements a dual-path attention mechanism optimized for spike-based processing. We employed decoupled attention computation with separate temporal and spatial attention paths, each initialized with an independent scaling factor ($α_{t} = α_{s} = 0.25$) to ensure training stability.

The “SSA” (spike self-attention) module, as originally designed [7], works effectively for static image data (e.g., ImageNet), where the focus is primarily on spatial relationships. For dynamic audio data, however, especially in the context of time flow coding (TFC), temporal information plays a crucial role: the temporal behavior of audio signals, such as the occurrence of spikes at specific time steps, requires a model that can capture these variations. We therefore introduced “STSA”, which handles both spatial and temporal dependencies and is thus better suited to modeling dynamic audio data.

The architecture incorporates efficient feature transformation through a channel reduction strategy (reduction ratio of $r = 8$) and utilizes 1 × 1 convolutions for Q/K/V transformations with a shared reduction ratio of 2.

3.5.2. Attention Mechanism Analysis

The proposed STSA module implements a dual-path attention mechanism that processes temporal and spatial information independently. For an input tensor

$X \in R^{B \times T \times C \times t_{map} \times F},$

the attention computation proceeds as follows, where

B represents the batch size;
T denotes the temporal length;
C indicates the channel dimension;
$t_{map}$ represents the temporal map;
F denotes the frequency.

Temporal Attention Path: We first reshaped the input to isolate temporal relationships:

(9) $X_{temporal} = reshape(X) \in R^{B \times C \times T \times (t_{map} \times F)}$

The temporal Query, Key, and Value transformations were applied with channel reduction:

(10) $Q_{t} = W_{q} X_{temporal} \in R^{B \times (C / r) \times T \times (t_{map} \times F)}$

(11) $K_{t} = W_{k} X_{temporal} \in R^{B \times (C / r) \times T \times (t_{map} \times F)}$

(12) $V_{t} = W_{v} X_{temporal} \in R^{B \times C \times T \times (t_{map} \times F)}$

where r is the reduction ratio.

The temporal attention was computed independently at each frequency–time location:

(13) $A_{temporal} = spike(K_{t}^{T} Q_{t}) \cdot α_{t} \in R^{B \times (t_{map} \times F) \times T \times T}$

Spatial Attention Path: Similarly, for spatial attention, we reshaped the input to focus on spatial relationships:

(14) $X_{spatial} = reshape(X) \in R^{(B \times T) \times C \times t_{map} \times F}$

The spatial attention transformations were

(15) $Q_{s} = W_{q}^{'} X_{spatial} \in R^{(B \times T) \times (C / r) \times t_{map} \times F}$

(16) $K_{s} = W_{k}^{'} X_{spatial} \in R^{(B \times T) \times (C / r) \times t_{map} \times F}$

(17) $V_{s} = W_{v}^{'} X_{spatial} \in R^{(B \times T) \times C \times t_{map} \times F}$

The spatial attention focused on channel–frequency relationships:

(18) $A_{spatial} = spike(K_{s}^{T} Q_{s}) \cdot α_{s} \in R^{(B \times T) \times F \times t_{map} \times t_{map}}$

Feature Integration: The final output combined both attention paths:

(19) $O = reshape(V_{t} A_{temporal}) + reshape(V_{s} A_{spatial}) + X$
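To make the temporal path more tangible, the following sketch evaluates Eq. (13) and the product $V_{t} A_{temporal}$ for a single frequency–time location, using channels-first nested lists. It is a simplified illustration only: the spike function is assumed to be a fixed-threshold Heaviside step, the threshold value is hypothetical, and the spatial path and the residual term of Eq. (19) are omitted.

```python
def spike(value, threshold=1.0):
    """Heaviside step used as a stand-in for the spike activation (assumed)."""
    return 1.0 if value >= threshold else 0.0


def temporal_attention(q, k, v, alpha=0.25, threshold=1.0):
    """Binary temporal attention at one frequency-time location.

    q, k: lists of shape [d][T] (reduced channels x time steps).
    v:    list of shape [c][T] (channels x time steps).
    Returns (attn, out) with attn of shape [T][T] and out of shape [c][T].
    """
    d, T = len(q), len(q[0])
    # Eq. (13): A = spike(K^T Q) * alpha, so A[i][j] compares steps i and j.
    attn = [[alpha * spike(sum(k[m][i] * q[m][j] for m in range(d)), threshold)
             for j in range(T)]
            for i in range(T)]
    # V A: each channel row of V is re-weighted across time steps.
    out = [[sum(row[i] * attn[i][j] for i in range(T)) for j in range(T)]
           for row in v]
    return attn, out


q = [[1, 0], [1, 1]]
k = [[1, 1], [0, 1]]
v = [[2, 4]]
attn, out = temporal_attention(q, k, v)
print(attn)  # -> [[0.25, 0.0], [0.25, 0.25]]
print(out)   # -> [[1.5, 1.0]]
```

Because the attention map contains only the values 0 and $α_{t}$, the product with $V_{t}$ reduces to a scaled sum over the selected time steps.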

Key Advantages:

Complementary Processing: The temporal path captures sequence patterns, while the spatial path focuses on channel–frequency relationships.
Computational Efficiency: Decomposed attention reduces the complexity from
(20) $O ({(T \times t_{map} \times F)}^{2} \times C)$
to
(21) $O (B \times (t_{map} \times F) \times T^{2}) + O (B \times T \times F \times t_{map}^{2}) .$
Binary Attention: The spike activation function maintains SNN characteristics while providing natural regularization.
Feature Enhancement: The residual connection preserves original information while enriching feature representation.

This dual-path design enables efficient parallel processing while maintaining the spike-based nature of SNNs, providing a balance between computational efficiency and biological plausibility.
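The two complexity expressions in Eqs. (20) and (21) can be compared numerically for a given tensor shape. The sketch below evaluates the expressions exactly as written; the dimensions in the example are hypothetical and were chosen only for illustration, and the results are asymptotic operation counts rather than measured costs.

```python
def full_attention_cost(T, t_map, F, C):
    """Eq. (20): joint attention over all T * t_map * F positions."""
    return (T * t_map * F) ** 2 * C


def decomposed_attention_cost(B, T, t_map, F):
    """Eq. (21): temporal term plus spatial term of the decomposed attention."""
    temporal = B * (t_map * F) * T ** 2
    spatial = B * T * F * t_map ** 2
    return temporal + spatial


# Hypothetical shape: B = 1, T = 4, t_map = 8, F = 16, C = 32.
print(full_attention_cost(T=4, t_map=8, F=16, C=32))       # -> 8388608
print(decomposed_attention_cost(B=1, T=4, t_map=8, F=16))  # -> 6144
```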

3.5.3. Spike-Based Processing

Our architecture implements spike-based computation through two complementary activation mechanisms that balance efficiency with biological plausibility. The basic spike activation mechanism employs threshold-based binary activation to directly convert attention scores to spikes, enabling efficient hardware deployment through simplified computational paths. This approach optimizes resource utilization while maintaining the essential characteristics of spike-based processing.

Complementing this, we introduced a LIF-based activation mechanism that incorporates membrane potential integration with a decay factor of $τ$. This biologically inspired approach implements a dynamic threshold mechanism for spike generation and preserves temporal dependencies through continuous potential accumulation. The integration of membrane potential dynamics enables more nuanced temporal processing while maintaining the computational advantages of spike-based operations.

Through the combination of these activation mechanisms, our design achieves advantages in three dimensions. First, it substantially reduces computational complexity through the implementation of decoupled attention paths and binary operations. Second, the architecture enhances feature representation capabilities through parallel temporal and spatial attention processing, enabling a more comprehensive capture of audio characteristics. Third, the system maintains robust attention computation through the combination of normalized membrane potentials and adaptive scaling, ensuring stable performance across varying input conditions. This integrated approach creates a balanced framework that maintains computational efficiency while preserving the biological plausibility inherent to spike-based neural processing.

3.6. Computational Efficiency Analysis

Our network achieves significant computational efficiency through several synergistic mechanisms. To demonstrate these improvements, we analyzed the system’s performance using an 8 s audio input as a representative example.

Memory-Efficient Feature Extraction: Our approach fundamentally reimagines how audio data are processed:

Traditional approach: Processing the entire 8 s segment (25 ms hop length) generates a large Mel spectrogram of size $T \times F$ ($T \approx 320$ frames for 8 s of audio).
TFC approach: Dividing the clip into 4 segments of 2 s each yields smaller individual spectrograms of size $(T / 4) \times F$, with reduced peak memory usage during processing and more efficient cache utilization.

The architectural design incorporates several sophisticated innovations that synergistically optimize the computational efficiency. At its core, the design leverages an inverted bottleneck structure, which strategically minimizes the parameter count while maintaining robust feature expressiveness [39]. This is complemented by a hierarchical feature fusion mechanism that facilitates efficient information flow throughout the network. Additionally, the architecture employs segmented processing capabilities, enabling enhanced parallel computation and throughput. Together, these architectural components form a cohesive framework that maximizes computational performance while maintaining algorithmic effectiveness.

The integration of TFC with spike-based processing establishes a powerful synergistic framework that delivers remarkable computational efficiency through multiple mechanisms. This approach achieves enhanced performance by leveraging temporal segmentation to reduce the input dimensionality, while simultaneously replacing traditional floating-point calculations with more efficient binary spike operations. Furthermore, the system benefits from the inherent sparsity of temporal activations, typically maintaining an activation rate of only 15–20%, which significantly reduces the computational overhead. This multi-faceted optimization strategy creates a highly efficient computational paradigm that maximizes resource utilization while minimizing processing demands.

Quantitative Analysis: The efficiency improvements can be quantified for an 8 s audio input:

Traditional Processing:
- Memory requirement: $O(T \times F)$ for the full spectrogram.
- Computation: $O(N \times M \times K)$ operations on the full feature map.
TFC Processing:
- Memory requirement: $O(T / n \times F)$ per segment, where n is the number of segments.
- Computation: $O(s \times N \times M \times K / n)$ operations per segment.
- Additional benefit from better cache locality.

Here, N, M, and K represent the input, output, and kernel dimensions, respectively, and s denotes the average spike rate ( $s \approx 0.2$ in our implementation). This optimization translates to approximately an 80% reduction in computational operations compared to equivalent ANN architectures [3]. This segmented approach not only reduces the peak memory usage by approximately 75% but also enables more efficient parallel processing and better hardware utilization. When combined with the inherent efficiency of spike-based computation, the overall system achieves a significant reduction in both the memory footprint and computational cost compared to traditional approaches.
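The figures quoted in this section follow from simple arithmetic, which the sketch below reproduces. The helper names are hypothetical, and the calculation is a first-order estimate that mirrors the expressions above rather than a measurement: it assumes the peak memory scales with the spectrogram length and the operation count scales with the spike rate.

```python
def num_frames(duration_s, hop_s):
    """Number of spectrogram frames for a clip at a given hop length."""
    return round(duration_s / hop_s)


def peak_memory_reduction(n_segments):
    """Relative saving when only one of n equal segments is held at a time."""
    return 1.0 - 1.0 / n_segments


def operation_reduction(spike_rate):
    """Relative saving when only a fraction `spike_rate` of units is active."""
    return 1.0 - spike_rate


print(num_frames(8, 0.025))         # 8 s clip at a 25 ms hop -> 320
print(peak_memory_reduction(4))     # four 2 s segments -> 0.75
print(operation_reduction(0.2))     # spike rate s = 0.2
```

With n = 4 segments and s ≈ 0.2, these estimates correspond to the approximately 75% peak-memory reduction and 80% operation reduction stated above.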

4. Experiments

To rigorously evaluate our proposed approach, we conducted extensive experiments on two widely adopted audio classification benchmarks: FSD50K and UrbanSound8K. All experiments were implemented using the SpikingJelly framework [40] on an NVIDIA RTX 3090 GPU.

4.1. Experimental Setup

4.1.1. Datasets

We evaluated our approach on two representative datasets with distinct characteristics:

FSD50K [11]: A large-scale multi-label dataset featuring the following:
- 200 distinct audio categories.
- 108 h of total content.
- Training set: 80.4 h (average duration of 7.1 s).
- Testing set: 27.9 h (average duration of 9.8 s).

UrbanSound8K [10]: A single-label dataset comprising the following:
- 8.8 h of urban environmental sounds.
- 10 distinct categories (air conditioner, car horn, children playing, etc.).
- 10-fold cross-validation following the standard protocol [37,41].

4.1.2. Implementation Details

Our preprocessing pipeline implemented the following steps:

Audio resampling to 16 kHz for standardization.
Segmentation into units of length $timestep \times unitsec$ seconds.
Short-time Fourier transform with a 10 ms window size.
128-dimensional Mel filtering, producing features of the shape $[timestep, c, n, d]$.
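The steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the hop length, the FFT size, the Hann window, the zero-padding of short clips, and the function names are all hypothetical choices; only the 16 kHz rate, the 10 ms window, and the 128 Mel bins come from the text. The input waveform is assumed to be already resampled to 16 kHz.

```python
import numpy as np

SR = 16_000      # sampling rate after resampling (from the text)
WIN = 160        # 10 ms window at 16 kHz (from the text)
HOP = 160        # assumption: non-overlapping frames
N_FFT = 512      # assumption: each window is zero-padded to 512 points
N_MELS = 128     # from the text


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SR):
    """Triangular Mel filters, shape [n_mels, n_fft // 2 + 1]."""
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (freqs - lower) / (center - lower)
    falling = (upper - freqs) / (upper - center)
    return np.clip(np.minimum(rising, falling), 0.0, None)


def time_flow_features(wave, timestep, unitsec):
    """Return log-Mel features of shape [timestep, 1, n_frames, N_MELS]."""
    unit_len = int(unitsec * SR)
    total = timestep * unit_len
    # Assumption: truncate long clips and zero-pad short ones.
    wave = np.pad(wave[:total], (0, max(0, total - len(wave))))
    units = wave.reshape(timestep, unit_len)           # one row per time step
    n_frames = 1 + (unit_len - WIN) // HOP
    idx = np.arange(WIN)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = units[:, idx] * np.hanning(WIN)           # [timestep, n_frames, WIN]
    power = np.abs(np.fft.rfft(frames, n=N_FFT, axis=-1)) ** 2
    mel = power @ mel_filterbank().T                   # [timestep, n_frames, N_MELS]
    return np.log(mel + 1e-6)[:, None, :, :]           # add the channel axis c = 1


features = time_flow_features(np.random.randn(3 * SR), timestep=8, unitsec=0.5)
print(features.shape)
```

Here the four axes play the roles of $timestep$, $c$, $n$, and $d$ in the feature shape given above, with a single channel and one row of Mel energies per frame.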

For comparative analysis, we implemented both time flow coding and static encoding approaches.

4.2. Performance Analysis

4.2.1. Evaluation on UrbanSound8K

Table 2 and Table 3 present performance comparisons on the UrbanSound8K dataset.

The SATRN-STSA architecture reached 95.5% accuracy, approaching that of the state-of-the-art ANN model CAT (95.9% [37]), while retaining the inherent efficiency advantages of spike-based computation. Our STSA mechanism outperformed traditional spike self-attention [7] (95.5% versus 95.1% for SATRN-TFC-SSA; Table 6), supporting the effectiveness of our temporal–spatial integration approach. Additionally, the time flow coding strategy improved accuracy with a reduced computational overhead compared to that of static encoding. As Table 2 shows, TFC outperformed TSC at both temporal step settings, with a larger margin at 8 steps (95.5% versus 95.0%) than at 4 steps (93.0% versus 92.7%).

4.2.2. Evaluation on FSD50K

The multi-label classification task on FSD50K introduced unique challenges for spike-based processing. To address these challenges effectively, we implemented several key preprocessing strategies:

Adaptive Segmentation: The audio was divided into segments of a predefined time step, where each segment length was determined by $unitsec = T / step$ . This ensured consistency in the segment size across different audio clips.
Padding and Truncation: When an audio clip was shorter than the required segment length, we applied padding by repeating the last valid frame to maintain the required size. Conversely, if an audio clip was longer, we truncated it to the maximum length allowed in the dataset to ensure uniform processing.
Buffer-Based Supplementation: For segments that did not align perfectly due to non-integer multiples of the time step, we introduced a buffer mechanism to extend the last segment using relevant feature information from the preceding segments.
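One possible reading of these three strategies, operating on a sequence of feature frames, is sketched below. The function names are hypothetical, and the treatment of the buffer mechanism (letting the last segment reach back into the preceding frames so that every segment has the same length) is an assumption, since the text does not specify it in detail.

```python
import numpy as np


def fit_length(frames, target_len):
    """Truncate, or pad by repeating the last valid frame, to target_len frames."""
    if len(frames) >= target_len:
        return frames[:target_len]
    pad = np.repeat(frames[-1:], target_len - len(frames), axis=0)
    return np.concatenate([frames, pad], axis=0)


def adaptive_segments(frames, step):
    """Split [n_frames, d] features into `step` segments of equal length.

    The segment length is ceil(n_frames / step). When n_frames is not an
    integer multiple of step, late segments would run past the end, so their
    start is moved back and they reuse frames from the preceding segment
    (assumed interpretation of the buffer mechanism).
    """
    n = len(frames)
    seg_len = -(-n // step)  # ceil(n / step)
    starts = [min(i * seg_len, n - seg_len) for i in range(step)]
    return np.stack([frames[s:s + seg_len] for s in starts])


frames = np.arange(20).reshape(10, 2)      # 10 frames, 2 features each
padded = fit_length(frames, 12)            # last frame repeated twice
segments = adaptive_segments(frames, 4)    # 4 segments of 3 frames each
print(padded.shape, segments.shape)
```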

The results in Table 4, Table 5 and Table 6 demonstrate several significant achievements. The SATRN architecture achieved state-of-the-art performance, with our approach reaching an mAP of 0.455, substantially outperforming established ANN architectures, including the CRNN (0.417) and VGG-like models (0.434) [11]. This performance gain is particularly noteworthy as it combined superior accuracy with the inherent efficiency advantages of spike-based computation.

The implementation of time flow coding with 2 s units proved particularly effective in the multi-label setting (an mAP of 45.5% versus 44.5% for TSC; Table 4), which we attribute to its enhanced ability to capture and process temporal relationships in complex audio signals. Furthermore, the STSA mechanism improved performance on both single-label and multi-label tasks (Table 6), supporting its generalization capability.
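For readers unfamiliar with the metric, the sketch below shows one common way to compute mAP for multi-label tagging: the average precision is computed per class and then averaged over classes. It is an assumption that the reported values follow this macro-averaged definition; the function names are hypothetical, and tied scores are broken by sample order, which may differ from library implementations.

```python
import numpy as np


def average_precision(y_true, y_score):
    """Average precision for one class from binary labels and scores."""
    order = np.argsort(-y_score, kind="stable")     # rank clips by score
    hits = y_true[order].astype(float)
    n_pos = hits.sum()
    if n_pos == 0:
        return float("nan")                         # class absent from the set
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision at the rank of each positive clip.
    return float((precision_at_k * hits).sum() / n_pos)


def mean_average_precision(Y_true, Y_score):
    """Macro-average of per-class AP over the columns of [n_clips, n_classes]."""
    aps = [average_precision(Y_true[:, c], Y_score[:, c])
           for c in range(Y_true.shape[1])]
    return float(np.nanmean(aps))


Y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
Y_score = np.array([[0.9, 0.2], [0.8, 0.9], [0.7, 0.1], [0.1, 0.3]])
print(mean_average_precision(Y_true, Y_score))
```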

Overall, while TFC proved effective in reducing the data size for improved performance, as shown in Table 4, we observed that smaller unit durations did not necessarily guarantee better performance on the FSD50K dataset. In fact, a slight decrease in performance was observed with shorter units. Upon further analysis, we hypothesize that smaller data sizes lead to a reduction in the number of neurons in the network, which may impact the network’s ability to effectively capture and process features. This suggests that finding an optimal balance between the data size and network accuracy is crucial. In future work, we aim to explore whether combining longer time steps with shorter durations, in conjunction with the unique characteristics of Spiking Neural Networks, can yield more satisfactory results.

4.3. Robustness Analysis

A distinctive advantage of our approach lies in its resilience to acoustic perturbations. To thoroughly evaluate this aspect, we conducted comprehensive noise resistance experiments across both datasets.

4.3.1. Experimental Setup

Our robustness evaluation framework implemented systematic testing through controlled acoustic perturbations applied to 50% of test samples. We examined the performance across varying noise levels with an SNR ranging from −10 dB to 10 dB, considering both white noise and environmental noise conditions. This comprehensive testing approach enabled the detailed analysis of our model’s behavior under diverse acoustic challenges.
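A perturbation of this kind can be sketched as follows for the white-noise case. The helper names and the random selection of clips are assumptions; the scaling follows the standard definition $\mathrm{SNR} = 10 \log_{10}(P_{signal} / P_{noise})$.

```python
import numpy as np


def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so that the mixture has the requested SNR in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise


def perturb_half(batch, snr_db, rng):
    """Add white noise to a randomly chosen 50% of the clips in `batch`."""
    noisy = batch.copy()
    chosen = rng.permutation(len(batch))[: len(batch) // 2]
    for i in chosen:
        white = rng.standard_normal(batch[i].shape)
        noisy[i] = add_noise_at_snr(batch[i], white, snr_db)
    return noisy, chosen


rng = np.random.default_rng(0)
batch = rng.standard_normal((10, 16_000))       # stand-in for 10 one-second clips
for snr_db in (-10, -5, 0, 5, 10):              # range examined in the text
    noisy, chosen = perturb_half(batch, snr_db, rng)
```

Environmental noise can be handled by the same scaling function by passing a recorded noise clip of matching length in place of the white noise.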

4.3.2. Key Findings

As illustrated in Figure 3, our approach demonstrates superior noise resilience through several synergistic mechanisms. Time flow coding’s segment-wise processing inherently constrains noise propagation, preventing error accumulation across temporal dimensions. Additionally, the binary nature of spike-based processing provides natural robustness against analog noise, effectively filtering out low-amplitude perturbations. Our STSA mechanism further enhances this resilience by maintaining effective feature extraction even under severe noise conditions through adaptive temporal–spatial attention allocation.
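The filtering effect of binary spikes can be illustrated with a single generic leaky integrate-and-fire neuron. This is a toy sketch under assumed parameters (time constant, threshold, hard reset, and noise amplitude are all hypothetical) and is not the neuron configuration used in SATRN. It shows that a perturbation too small to move the membrane potential across the threshold leaves the spike train unchanged; larger perturbations would not be filtered in this way.

```python
import numpy as np


def lif_spikes(current, tau=2.0, v_threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron with a hard reset to 0."""
    v, spikes = 0.0, []
    for x in current:
        v = v + (x - v) / tau        # leaky integration of the input
        fired = v >= v_threshold
        spikes.append(int(fired))
        if fired:
            v = 0.0                  # hard reset after a spike
    return np.array(spikes)


rng = np.random.default_rng(0)
clean = np.array([0.0] * 10 + [3.0] * 10 + [0.0] * 10)       # silence, event, silence
noisy = clean + rng.uniform(-0.1, 0.1, size=clean.shape)     # low-amplitude noise

print(lif_spikes(clean))
print(lif_spikes(noisy))
```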

4.3.3. Performance Analysis

The experimental results reveal consistent patterns in noise resilience and adaptability. Our time flow coding maintained superior performance across all noise levels, showing particularly significant advantages (15–20% improvement) in high-noise scenarios where the SNR fell below 0 dB. The model exhibited notably more gradual performance deterioration with increasing noise levels compared to static encoding approaches, demonstrating robust stability under challenging conditions. This enhanced robustness was consistently observed across both single-label (UrbanSound8K) and multi-label (FSD50K) scenarios, validating the generalizability of our approach (see Figure 3 and Figure 4). These findings confirm that our approach not only achieves competitive accuracy under ideal conditions but also maintains superior performance in challenging acoustic environments, a crucial advantage for real-world deployments, where noise interference is inevitable.

As illustrated in Figure 3, our approach achieves superior noise resilience through three complementary mechanisms. The time flow coding employs segment-wise processing that inherently constrains noise propagation, effectively preventing error accumulation across temporal dimensions. This is complemented by the binary nature of spike-based processing, which provides natural robustness against analog noise by filtering out low-amplitude perturbations. Additionally, our STSA mechanism enhances resilience by maintaining effective feature extraction even under severe noise conditions through adaptive temporal–spatial attention allocation. These mechanisms work in concert to create a robust framework capable of maintaining performance integrity in challenging acoustic environments.

The experimental results reveal compelling patterns in our model’s robustness and adaptability. Time flow coding demonstrated consistently superior performance across all noise levels, achieving particularly significant improvements in challenging high-noise scenarios where the SNR dropped below 0 dB. Notably, our model exhibited more gradual degradation as noise levels increased compared to static encoding approaches, maintaining stability under adverse conditions. This enhanced robustness was not limited to a single context but was consistent across both single-label (UrbanSound8K) and multi-label (FSD50K) scenarios, validating the broad generalizability of our approach in diverse acoustic environments.

5. Conclusions

This work presents substantial advancements in spike-based audio processing through systematic innovations in neuromorphic computing, making fundamental contributions to both the theoretical foundations and practical applications of SNNs in audio processing.

5.1. Methodological Innovations

Our research introduced two significant technical breakthroughs. First, we presented time flow coding (TFC), the first comprehensive temporal encoding framework specifically designed for SNN-based audio processing. This novel approach demonstrates marked advantages over conventional methods, enabling the efficient processing of continuous audio streams while maintaining temporal coherence. Our extensive experimentation provided strong empirical evidence of its enhanced training efficiency and model robustness. Second, we developed the Spatio-Temporal Self-Attention (STSA) mechanism, optimized for spike-based information processing. This innovation effectively integrates temporal and spatial features in the spike domain, consistently improving performance across multiple benchmarks while advancing the theoretical framework for attention mechanisms in neuromorphic computing.

5.2. Practical Implications

Our approach successfully addresses three critical challenges in edge computing deployment. In terms of robustness, the system demonstrates superior resistance to acoustic perturbations across diverse noise conditions, maintaining high performance even under severe noise (SNR < 0 dB) and showing consistent accuracy across both single-label and multi-label tasks. Regarding computational efficiency, our event-driven processing significantly reduces the computational overhead while achieving state-of-the-art performance through optimized resource utilization and sparse spike-based computation. The system also excels in adaptive processing, handling variable-length inputs without a padding overhead and demonstrating enhanced suitability for streaming audio applications with reduced latency.

5.3. Future Research Directions

This work opens several promising avenues for future investigation. In application extensions, our framework can be adapted to broader acoustic processing tasks, including real-time speech recognition, environmental sound monitoring, and acoustic event detection. Cross-domain applications present exciting opportunities in video analysis, multi-modal sensor fusion, and time series data analysis. Theoretical developments could explore advanced attention mechanisms specifically optimized for spike-based computation, enhanced temporal encoding strategies, and the integration of biological learning principles. From an industrial perspective, future work should focus on edge computing deployment optimization, real-time processing frameworks, and hardware–software co-design strategies. These contributions establish a comprehensive framework for deploying SNNs in practical audio processing applications, both advancing theoretical understanding and providing concrete solutions for real-world implementation challenges. The demonstrated success in combining energy efficiency with robust performance paves the way for the widespread adoption of neuromorphic computing in edge applications.

Author Contributions

Conceptualization, S.G. and X.F.; methodology, X.D.; software, P.Y.; validation, S.G., X.F. and Z.Z.; formal analysis, S.G.; investigation, S.G.; resources, S.G.; data curation, S.G.; writing—original draft preparation, S.G.; writing—review and editing, X.D.; visualization, H.Z.; supervision, X.D.; project administration, X.D.; funding acquisition, X.F. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to confidentiality requirements associated with the project funding.

Conflicts of Interest

The authors unequivocally declare the absence of any conflicts of interest. Moreover, it is explicitly stated that the funders exerted no influence on the study’s design, the gathering and analysis of data, the interpretation of results, the drafting of the manuscript, or the determination to disseminate the findings.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1. Overview of our proposed spike-based audio processing pipeline. The framework illustrates the complete processing flow from the raw audio input through time flow coding to the final classification, highlighting the integration of temporal feature extraction and spike-based neural processing.

Figure 2. Overview of the proposed SATRN architecture. The model consists of three key components: (1) the Spiking Potential Layer (SPL), which encodes input data into spike representations; (2) hierarchical feature fusion (HFF), which integrates global and local features using Potential Feature Fusion (PFF) modules; and (3) Spatio-Temporal Self-Attention (STSA), which captures spatial and temporal dependencies for enhanced feature representation. These components work together to enable efficient and effective spike-based neural network processing.

Figure 3. Robustness evaluation on FSD50K across different SNR levels. Results demonstrate consistent performance advantage of time flow coding over static encoding, especially in challenging noise conditions.

Figure 4. Performance comparison under varying noise levels on UrbanSound8K. Our time flow coding approach maintained higher accuracy across all SNR values, with particularly significant advantages in high-noise conditions (SNR < 0 dB).

Table 1

Network architecture details. Each stage follows MobileNet’s inverted bottleneck structure with a parallel SNN path. Both paths maintain the same output dimensions for PFF.

Block	Structure	Output Size
Input Conv1	conv $3 \times 3$, 32	$B \times T \times 32 \times H \times W$
Stage 1	conv $1 \times 1$, 128; spiking neuron; conv $3 \times 3$, 128, $s = 2$; spiking neuron; conv $1 \times 1$, 64	$B \times T \times 64 \times H / 2 \times W / 2$
Stage 2	conv $1 \times 1$, 256; spiking neuron; conv $3 \times 3$, 256, $s = 2$; spiking neuron; conv $1 \times 1$, 128	$B \times T \times 128 \times H / 4 \times W / 4$
Stage 3	conv $1 \times 1$, 512; spiking neuron; conv $3 \times 3$, 512, $s = 2$; spiking neuron; conv $1 \times 1$, 256	$B \times T \times 256 \times H / 8 \times W / 8$
Stage 4	conv $1 \times 1$, 1024; spiking neuron; conv $3 \times 3$, 1024, $s = 2$; spiking neuron; conv $1 \times 1$, 512	$B \times T \times 512 \times H / 16 \times W / 16$
Stage 5	conv $1 \times 1$, 2048; spiking neuron; conv $3 \times 3$, 2048, $s = 2$; spiking neuron; conv $1 \times 1$, 1024	$B \times T \times 1024 \times H / 32 \times W / 32$
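The output sizes in Table 1 follow from halving the spatial resolution once per stage. The short sketch below reproduces them for a given input size; the function name is hypothetical, and the rounding rule for odd sizes assumes a $3 \times 3$ convolution with stride 2 and padding 1, which the table does not state.

```python
def satrn_stage_shapes(B, T, H, W):
    """Return the output shape of each block listed in Table 1."""
    shapes = {"Conv1": (B, T, 32, H, W)}
    channels = [64, 128, 256, 512, 1024]          # output channels per stage
    for i, c in enumerate(channels, start=1):
        # Each stage has one stride-2 convolution; with a 3x3 kernel and
        # padding 1 (assumed), the spatial size becomes ceil(size / 2).
        H, W = (H + 1) // 2, (W + 1) // 2
        shapes[f"Stage {i}"] = (B, T, c, H, W)
    return shapes


for block, shape in satrn_stage_shapes(B=2, T=4, H=64, W=128).items():
    print(block, shape)
```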

Table 2

Ablation study comparing TSC and TFC on “UrbanSound8K”. The accuracy on “UrbanSound8K” is presented.

Related Works	Model	Step	Unit (s)	UrbanSound8K (Accuracy)
[7]	Spikformer-TFC	8	0.5	93.8%
[7]	Spikformer-TSC	8	4	92.9%
This work	SATRN-TFC-STSA	8	0.5	95.5%
This work	SATRN-TSC-STSA	8	4	95.0%
This work	SATRN-TFC-STSA	4	0.5	93.0%
This work	SATRN-TSC-STSA	4	2	92.7%

Table 3

Comparison with state-of-the-art models on UrbanSound8K.

Model	Acc (%)
ESResNeXt [41]	89.14
CAT [37]	95.90
MCLNN [42]	73.30
SATRN-TSC-STSA	95.0
SATRN-TFC-STSA	95.5

Table 4

Ablation study comparing TSC and TFC on “FSD50K”. The mAP on “FSD50K” is presented.

Related Works	Model	Step	Unit (s)	FSD50K (mAP)
[7]	Spikformer-TFC	4	2	43.3%
[7]	Spikformer-TSC	4	8	43.0%
This work	SATRN-TFC-STSA	4	2	45.5%
This work	SATRN-TSC-STSA	4	8	44.5%
This work	SATRN-TFC-STSA	8	1	45.0%
This work	SATRN-TSC-STSA	8	8	44.5%

Table 5

Comparison with state-of-the-art ANN models on “FSD50K”. The mAP on “FSD50K” is presented.

Related Works	Model	FSD50K (mAP)
[11]	CRNN	0.417
[11]	VGG-like	0.434
[11]	ResNet-18	0.373
[11]	DenseNet-121	0.425
[43]	Wav2CLIP	0.431
This work	SATRN-TFC-STSA	0.455

Table 6

The performance of our proposed method was compared with other techniques in the field of SNNs. The accuracy on “UrbanSound8K” and mAP on “FSD50K” are presented.

Related Works	Model	UrbanSound8K (Accuracy)	FSD50K (mAP)
[7]	Spikformer	93.8%	43.3%
This work	SATRN-TFC-SSA	95.1%	44.7%
This work	SATRN-TFC-STSA	95.5%	45.5%

References

1. Liu, K.; Cui, X.; Ji, X.; Kuang, Y.; Zou, C.; Zhong, Y.; Xiao, K.; Wang, Y. Real-Time Target Tracking System with Spiking Neural Networks Implemented on Neuromorphic Chips. IEEE Trans. Circuits Syst. II Express Briefs; 2023; 70, pp. 1590-1594. [DOI: https://dx.doi.org/10.1109/TCSII.2022.3227121]

2. Hu, S.G.; Qiao, G.C.; Liu, X.K.; Liu, Y.H.; Zhang, C.M.; Zuo, Y.; Zhou, P.; Liu, Y.A.; Ning, N.; Yu, Q. et al. A Co-Designed Neuromorphic Chip with Compact (17.9K F2) and Weak Neuron Number-Dependent Neuron/Synapse Modules. IEEE Trans. Biomed. Circuits Syst.; 2022; 16, pp. 1250-1260. [DOI: https://dx.doi.org/10.1109/TBCAS.2022.3209073] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36150001]

3. Barchid, S.; Allaert, B.; Aissaoui, A.; Mennesson, J.; Djéraba, C. Spiking-Fer: Spiking Neural Network for Facial Expression Recognition with Event Cameras. arXiv; 2023; arXiv: 2304.10211

4. Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); New Orleans, LA, USA, 5–9 March 2017; pp. 776-780. [DOI: https://dx.doi.org/10.1109/ICASSP.2017.7952261]

5. Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process.; 1980; 28, pp. 357-366. [DOI: https://dx.doi.org/10.1109/TASSP.1980.1163420]

6. Sengupta, A.; Ye, Y.; Wang, R.; Liu, C.; Roy, K. Going Deeper in Spiking Neural Networks: VGG and Residual Architectures. Front. Neurosci.; 2019; 13, 95. [DOI: https://dx.doi.org/10.3389/fnins.2019.00095]

7. Zhou, Z.; Zhu, Y.; He, C.; Wang, Y.; Yan, S.; Tian, Y.; Yuan, L. Spikformer: When Spiking Neural Network Meets Transformer. Proceedings of the International Conference on Learning Representations (ICLR); Kigali, Rwanda, 1–5 May 2023.

8. Dellaferrera, G.; Martinelli, F.; Cernak, M. A Bin Encoding Training of a Spiking Neural Network Based Voice Activity Detection. Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Barcelona, Spain, 4–8 May 2020.

9. Ustubioglu, B.; Tahaoglu, G.; Ulutas, G. Detection of audio copy-move-forgery with novel feature matching on Mel spectrogram. Expert Syst. Appl.; 2023; 213, 118963. [DOI: https://dx.doi.org/10.1016/j.eswa.2022.118963]

10. Salamon, J.; Jacoby, C. A Dataset and Taxonomy for Urban Sound Research. 2014; Available online: https://zenodo.org/record/1203745/ (accessed on 13 February 2025).

11. Fonseca, E.; Favory, X.; Pons, J.; Font, F.; Serra, X. FSD50K: An Open Dataset of Human-Labeled Sound Events. IEEE/ACM Trans. Audio Speech Lang. Process.; 2022; 30, pp. 829-852. [DOI: https://dx.doi.org/10.1109/TASLP.2021.3133208]

12. Qawaqneh, Z.; Mallouh, A.A.; Barkana, B.D. Deep neural network framework and transformed MFCCs for speaker’s age and gender classification. Knowl.-Based Syst.; 2017; 115, pp. 5-14. [DOI: https://dx.doi.org/10.1016/j.knosys.2016.10.008]

13. Izhikevich, E.M. Simple model of spiking neurons. IEEE Trans. Neural Netw.; 2003; 14, pp. 1569-1572. [DOI: https://dx.doi.org/10.1109/TNN.2003.820440]

14. Na, B.; Mok, J.; Park, S.; Lee, D.; Choe, H.; Yoon, S. AutoSNN: Towards Energy-Efficient Spiking Neural Networks. Proceedings of the 39th International Conference on Machine Learning; Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K.; Jegelka, S.; Song, L.; Szepesvári, C.; Niu, G.; Sabato, S. 2022; Volume 162, pp. 16253-16269.

15. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature; 1986; 323, pp. 533-536. [DOI: https://dx.doi.org/10.1038/323533a0]

16. Kim, Y.; Panda, P. Optimizing deeper spiking neural networks for dynamic vision sensing. Neural Netw.; 2021; 144, pp. 686-698. [DOI: https://dx.doi.org/10.1016/j.neunet.2021.09.022] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34662827]

17. Ding, J.; Yu, Z.; Tian, Y.; Huang, T. Optimal ANN-SNN Conversion for Fast and Accurate Inference in Deep Spiking Neural Networks. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI); Montreal, QC, Canada, 19–27 August 2021; pp. 2328-2336.

18. Han, B.; Srinivasan, G.; Roy, K. Rmp-snn: Residual membrane potential neuron for enabling deeper high-accuracy and low-latency spiking neural network. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA, 14–19 June 2020; pp. 13558-13567.

19. Deng, L.; Wu, Y.; Hu, X.; Liang, L.; Ding, Y.; Li, G.; Zhao, G.; Li, P.; Xie, Y. Rethinking the performance comparison between SNNS and ANNS. Neural Netw.; 2020; 121, pp. 294-307. [DOI: https://dx.doi.org/10.1016/j.neunet.2019.09.005] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31586857]

20. Wu, Y.; Deng, L.; Li, G.; Zhu, J.; Shi, L. Spatio-Temporal Backpropagation for Training High-Performance Spiking Neural Networks. Front. Neurosci.; 2018; 12, 331. [DOI: https://dx.doi.org/10.3389/fnins.2018.00331] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29875621]

21. Lee, J.H.; Delbruck, T.; Pfeiffer, M. Training Deep Spiking Neural Networks Using Backpropagation. Front. Neurosci.; 2016; 10, 508. [DOI: https://dx.doi.org/10.3389/fnins.2016.00508] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27877107]

22. Shrestha, S.B.; Orchard, G. Slayer: Spike layer error reassignment in time. Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS); Montreal, QC, Canada, 3–8 December 2018; Volume 31.

23. Zenke, F.; Ganguli, S. Superspike: Supervised learning in multilayer spiking neural networks. Neural Comput.; 2018; 30, pp. 1514-1541. [DOI: https://dx.doi.org/10.1162/neco_a_01086]

24. Fang, W.; Yu, Z.; Chen, Y.; Masquelier, T.; Huang, T.; Tian, Y. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. Proceedings of the IEEE/CVF International Conference on Computer Vision; Virtual, 11–17 October 2021; pp. 2661-2671.

25. Xu, Y.; Huang, Q.; Wang, W.; Foster, P.; Sigtia, S.; Jackson, P.J.B.; Plumbley, M.D. Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging. IEEE/ACM Trans. Audio Speech Lang. Process.; 2017; 25, pp. 1230-1241. [DOI: https://dx.doi.org/10.1109/TASLP.2017.2690563]

26. Berouti, M.; Schwartz, R.; Makhoul, J. Enhancement of speech corrupted by acoustic noise. Proceedings of the ICASSP ’79. IEEE International Conference on Acoustics, Speech, and Signal Processing; Washington, DC, USA, 2–4 April 1979; Volume 4, pp. 208-211. [DOI: https://dx.doi.org/10.1109/ICASSP.1979.1170788]

27. Cohen, I.; Berdugo, B. Speech enhancement for non-stationary noise environments. Signal Process.; 2001; 81, pp. 2403-2418. [DOI: https://dx.doi.org/10.1016/S0165-1684(01)00128-1]

28. Fang, H.; Carbajal, G.; Wermter, S.; Gerkmann, T. Variational Autoencoder for Speech Enhancement with a Noise-Aware Encoder. Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Toronto, ON, Canada, 6–11 June 2021; [DOI: https://dx.doi.org/10.1109/icassp39728.2021.9414060]

29. Le Roux, J.; Wisdom, S.; Erdogan, H.; Hershey, J.R. SDR—Half-baked or well done?. Proceedings of the 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP); Brighton, UK, 12–17 May 2019; pp. 626-630. [DOI: https://dx.doi.org/10.1109/ICASSP.2019.8682822]

30. Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Trans. Audio Speech Lang. Process.; 2020; 28, pp. 2880-2894. [DOI: https://dx.doi.org/10.1109/TASLP.2020.3030497]

31. Koutini, K.; Schlüter, J.; Eghbal-zadeh, H.; Widmer, G. Efficient Training of Audio Transformers with Patchout. Proceedings of the Interspeech 2022, ISCA; Incheon, Republic of Korea, 18–22 September 2022; [DOI: https://dx.doi.org/10.21437/interspeech.2022-227]

32. Desimone, R.; Duncan, J. Neural mechanisms of selective visual attention. Annu. Rev. Neurosci.; 1995; 18, pp. 193-222. [DOI: https://dx.doi.org/10.1146/annurev.ne.18.030195.001205]

33. Yu, C.; Gu, Z.; Li, D.; Wang, G.; Wang, A.; Li, E. STSC-SNN: Spatio-Temporal Synaptic Connection with temporal convolution and attention for spiking neural networks. Front. Neurosci.; 2022; 16, 1079357. [DOI: https://dx.doi.org/10.3389/fnins.2022.1079357]

34. Yao, M.; Zhao, G.; Zhang, H.; Hu, Y.; Deng, L.; Tian, Y.; Xu, B.; Li, G. Attention Spiking Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2023; 45, pp. 9393-9410. [DOI: https://dx.doi.org/10.1109/TPAMI.2023.3241201] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37022261]

35. Jiang, C.; Zhang, Y. KLIF: An optimized spiking neuron unit for tuning surrogate gradient slope and membrane potential. arXiv; 2023; arXiv: 2302.09238

36. He, W.; Wu, Y.; Deng, L.; Li, G.; Wang, H.; Tian, Y.; Ding, W.; Wang, W.; Xie, Y. Comparing SNNs and RNNs on neuromorphic vision datasets: Similarities and differences. Neural Netw.; 2020; 132, pp. 108-120. [DOI: https://dx.doi.org/10.1016/j.neunet.2020.08.001] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32866745]

37. Liu, X.; Lu, H.; Yuan, J.; Li, X. CAT: Causal Audio Transformer for Audio Classification. Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Rhodes, Greece, 4–10 June 2023; pp. 1-5. [DOI: https://dx.doi.org/10.1109/ICASSP49357.2023.10096787]

38. Gong, Y.; Chung, Y.A.; Glass, J. PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation. IEEE/ACM Trans. Audio Speech Lang. Process.; 2021; 29, pp. 3292-3306. [DOI: https://dx.doi.org/10.1109/TASLP.2021.3120633]

39. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510-4520. [DOI: https://dx.doi.org/10.1109/CVPR.2018.00474]

40. Fang, W.; Chen, Y.; Ding, J.; Yu, Z.; Masquelier, T.; Chen, D.; Huang, L.; Zhou, H.; Li, G.; Tian, Y. SpikingJelly: An open-source machine learning infrastructure platform for spike-based intelligence. Sci. Adv.; 2023; 9, eadi1480. [DOI: https://dx.doi.org/10.1126/sciadv.adi1480]

41. Guzhov, A.; Raue, F.; Hees, J.; Dengel, A. ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio. Proceedings of the International Joint Conference on Neural Networks (IJCNN); Shenzhen, China, 18–22 July 2021; pp. 1-8. [DOI: https://dx.doi.org/10.1109/IJCNN52387.2021.9533654]

42. Medhat, F.; Chesmore, D.; Robinson, J. Masked Conditional Neural Networks for Environmental Sound Classification. Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 21-33. [DOI: https://dx.doi.org/10.1007/978-3-319-71078-5_2]

43. Wu, H.H.; Seetharaman, P.; Kumar, K.; Bello, J.P. Wav2CLIP: Learning Robust Audio Representations From CLIP. Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Singapore, 7–13 May 2022.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Audio tagging, as a fundamental task in acoustic signal processing, has demonstrated significant advances and broad applications in recent years. Spiking Neural Networks (SNNs), inspired by biological neural systems, exploit event-driven computing paradigms and temporal information processing, enabling superior energy efficiency. Despite the increasing adoption of SNNs, the potential of event-driven encoding mechanisms for audio tagging remains largely unexplored. This work presents a pioneering investigation into event-driven encoding strategies for SNN-based audio tagging. We propose the SATRN (Spiking Audio Tagging Robust Network), a novel architecture that integrates temporal–spatial attention mechanisms with membrane potential residual connections. The network employs a dual-stream structure combining global feature fusion and local feature extraction through inverted bottleneck blocks, specifically designed for efficient audio processing. Furthermore, we introduce an event-based encoding approach that enhances the resilience of Spiking Neural Networks to disturbances while maintaining performance. Our experimental results on the Urbansound8k and FSD50K datasets demonstrate that the SATRN achieves comparable performance to traditional Convolutional Neural Networks (CNNs) while requiring significantly less computation time and showing superior robustness against noise perturbations, making it particularly suitable for edge computing scenarios and real-time audio processing applications.

Details

Title: SATRN: Spiking Audio Tagging Robust Network
Author: Gao, Shouwei; Deng, Xingyang; Fan, Xiangyu; Yu, Pengliang; Zhou, Hao; Zhu, Zihao
First page: 761
Publication year: 2025
Publication date: 2025
Publisher: MDPI AG
e-ISSN: 20799292
Source type: Scholarly Journal
Language of publication: English
DOI: https://doi.org/10.3390/electronics14040761
ProQuest document ID: 3171006522
