The increasing popularity of multimedia applications, such as video classification, has underscored the need for efficient methods to manage and categorize vast video datasets. Video classification simplifies video categorization, enhancing searchability and retrieval by leveraging distinctive features extracted from textual, audio, and visual components. This paper introduces an automated video recognition system that classifies video content into motion types (low, medium, and high) derived from visual component characteristics. The proposed system combines advanced artificial intelligence techniques with four feature extraction methods: (1) MFCC alone, (2) MFCC after applying DWT, (3) denoised MFCC, and (4) MFCC after applying denoised DWT, together with seven classification algorithms to optimize accuracy. A novel aspect of this study is the application of Mel Frequency Cepstral Coefficients (MFCC) to extract features from the video domain rather than their traditional use in audio processing, demonstrating the effectiveness of MFCC for video classification. Seven classification techniques, including K-Nearest Neighbors (KNN), Support Vector Machines with a Radial Basis Function kernel (SVM-RBF), the Parzen Window method, Neighborhood Components Analysis (NCA), Multinomial Logistic Regression (ML), Support Vector Machines with a Linear kernel (SVM-Linear), and Decision Trees (DT), are evaluated to establish a robust classification framework. Experimental results indicate that this denoising-enhanced system significantly improves classification accuracy, providing a comprehensive framework for future applications in multimedia management and other fields.
Introduction
The revolution in internet and multimedia technologies has significantly expanded the use of video content in daily life across various domains, including video archiving, surveillance, digital libraries, video analytics, indexing, and retrieval systems [1, 2]. This exponential growth necessitates efficient mechanisms to manage and search through large video collections to meet user demands. Consequently, video classification has emerged as a critical tool, simplifying the identification of relevant content by organizing it into meaningful categories [3, 4]. Such classification enables faster and more effective searches within extensive video databases.
A comprehensive understanding of video stream structure is essential for developing efficient classification algorithms. A video stream typically consists of three hierarchical elements: frames, shots, and scenes [5, 6], as shown in Fig. 1. A frame—the fundamental unit of video—is grouped into shots, which represent continuous sequences captured by a single camera. A new shot begins when the camera angle or scene changes. Transitions between shots are generally categorized as either hard cuts (abrupt transitions) or gradual fades (smooth transitions) [7, 8]. Scenes, which are composed of related shots, usually reflect a shift in time or location [9].
Smart video classification refers to advanced methodologies that automatically categorize video content using artificial intelligence (AI), machine learning (ML), and deep learning. This area has gained substantial traction due to the explosive growth in video data, making automated classification essential for applications such as surveillance, content management, and autonomous systems. Recent studies have explored techniques such as convolutional neural networks (CNNs), Long Short-Term Memory (LSTM) networks, and 3D CNNs for effective spatial and temporal analysis [10, 11].
[See PDF for image]
Fig. 1
Video stream components
Additionally, multimodal approaches that integrate visual and audio modalities have demonstrated strong potential for improving classification accuracy in complex scenarios such as event detection and emotion recognition [12, 13].
Despite these advances, several challenges persist in smart video classification, including scalability, video quality variability, and the resource-intensive task of generating labeled datasets for supervised learning [14]. Emerging methods such as transfer learning and Transformer-based architectures have begun to mitigate these issues, offering improvements in efficiency and generalization [12].
This study focuses on motion-intensity-based video classification, categorizing videos into three levels: low, medium, and high motion. Unlike prior works that rely heavily on semantic features derived from textual, audio, or visual elements, the proposed method introduces a hybrid visual-audio pipeline using advanced AI techniques [15]. Specifically, the framework employs Mel Frequency Cepstral Coefficients (MFCCs) and Discrete Wavelet Transform (DWT) for robust feature extraction.
Although MFCCs are traditionally used in speech and audio signal processing, their effectiveness lies in analyzing frequency components of one-dimensional (1D) time-dependent signals. In the proposed system, 3D video sequences are transformed into a 1D motion signal by aggregating pixel-level variations across frames. This transformation allows motion patterns to be analyzed in a similar fashion to audio signals, making MFCCs highly suitable for video feature extraction.
MFCCs offer a compact and discriminative representation of temporal dynamics by capturing the spectral envelope of the signal. The resulting motion signal reveals frequency-like patterns indicative of motion speed, intensity, and regularity—essential for distinguishing between low, medium, and high motion categories. To enhance the discriminative power of the extracted features, the system integrates DWT, which captures multi-resolution temporal-frequency characteristics. Furthermore, denoising is applied to suppress irrelevant variations and improve feature robustness. The final feature set includes MFCCs extracted directly from the motion signal, as well as those derived after applying DWT and denoising.
The extracted features are evaluated using seven well-established classification models: K-Nearest Neighbors (KNN), Support Vector Machines with Radial Basis Function kernel (SVM-RBF), Parzen Window, Neighborhood Components Analysis (NCA), Multinomial Logistic Regression (ML), Support Vector Machines with Linear kernel (SVM-Linear), and Decision Trees (DT). This comprehensive comparison allows identification of the optimal feature-classifier combination for accurate motion-intensity classification.
The remainder of this paper is structured as follows: Sect. 2 reviews related work, Sect. 3 describes the proposed methodology, Sect. 4 presents the experimental results and analysis, and Sect. 5 concludes the study with key findings and directions for future research.
Related work
The field of video classification has undergone substantial advancements with the integration of deep learning techniques and multimodal data analysis, supporting applications such as action recognition, anomaly detection, and sentiment analysis. This section provides an overview of key contributions and recent developments in the domain.
Recent studies have explored 3D Convolutional Neural Networks (3D CNNs) for modeling spatio-temporal dependencies in video streams. For example, Daoud et al. (2024) demonstrated the effectiveness of 3D CNNs for dynamic pattern recognition through a one-stream architecture applied to fire detection in video surveillance scenarios [16]. These models capture both spatial and temporal features, which are essential for recognizing complex motion dynamics.
To address challenges related to limited annotated data, few-shot learning has gained attention. Wei et al. (2024) introduced a semantic-guided multimodal fusion framework, SVMFN-FSAR, which enables action classification from minimal samples by incorporating semantic priors [17]. Similarly, Gan et al. (2024) applied transfer learning for micro-expression recognition using dense connectivity and attention mechanisms to enhance classification accuracy with sparse data [18].
Multimodal sentiment analysis represents another prominent direction. Do et al. (2024) combined deep learning with fuzzy logic to integrate textual, auditory, and visual data for robust sentiment classification [19]. Further extending multimodal exploration, Brás et al. (2024) investigated the use of physiological signals—such as electromyography (EMG) and electrodermal activity (EDA)—for emotion and pain recognition, highlighting the potential of physiological cues in video-based applications [20].
Benchmark development has also propelled progress in the field. Chen et al. (2024) presented YourSkatingCoach, a domain-specific benchmark designed for fine-grained analysis of figure skating elements, supporting specialized model training and evaluation [21]. In parallel, Narayani (2024) emphasized the importance of anomaly detection datasets in advancing surveillance video analysis through deep learning [22].
Innovative model architectures have further strengthened video classification. Chappa et al. (2024) proposed the FLAASH framework, leveraging attention mechanisms for multimodal video question answering and achieving state-of-the-art performance across standard benchmarks [22]. Ruhi et al. (2024) extended attention-based transfer learning to the medical domain, demonstrating high accuracy in classifying bone fractures from radiographic video sequences [21].
In summary, recent advancements in video classification are characterized by robust deep learning architectures, few-shot learning, multimodal fusion strategies, and the emergence of task-specific benchmarks. These contributions collectively provide a solid foundation for future research targeting real-time, low-resource, and application-specific video understanding tasks.
Proposed algorithms
In response to the growing need for efficient and scalable video classification systems, this study proposes a lightweight yet robust framework that classifies video segments based on motion intensity. Unlike traditional methods that rely heavily on deep semantic analysis or large-scale training datasets, the proposed system focuses on low-level motion dynamics derived from visual information, offering a computationally efficient solution suitable for real-time or resource-constrained environments.
The design of the algorithm is centered around a three-phase pipeline: motion signal extraction, multi-level feature engineering, and supervised classification. A distinctive aspect of this approach lies in the adaptation of audio-domain techniques—specifically Mel-Frequency Cepstral Coefficients (MFCC)—to analyze motion trends in video data. By transforming three-dimensional video sequences into a one-dimensional motion signal, the system enables the application of temporal signal processing techniques for effective feature extraction and classification.
Automatic recognition system
The proposed automatic recognition system operates in three integrated stages: feature extraction, model training and testing, and classification. As illustrated in Fig. 2, the system begins by converting the input video stream into a one-dimensional motion signal that captures frame-wise pixel-level variations over time. This transformation enables the system to interpret motion patterns in a manner analogous to temporal signals in audio processing.
[See PDF for image]
Fig. 2
The proposed automatic recognition system
Key features are initially extracted using the Mel-Frequency Cepstral Coefficients (MFCC) technique, which analyzes the frequency content of the motion signal. To enhance the discriminative power of these features, the system further incorporates the Discrete Wavelet Transform (DWT), enabling the capture of both high- and low-frequency components of motion dynamics. Additionally, a wavelet-based denoising process is applied to improve feature robustness by mitigating the impact of signal noise. Consequently, the system generates four distinct feature configurations: (1) MFCC from the raw motion signal, (2) MFCC after applying DWT, (3) denoised MFCC, and (4) MFCC after applying denoised DWT.
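To make these four configurations concrete, the sketch below shows one way they could be assembled in Python. It is an illustrative approximation rather than the authors' implementation: it assumes the librosa and PyWavelets packages, a precomputed 1D `motion_signal`, a nominal sampling rate, that the DWT branch is represented by its approximation coefficients, and that "denoised DWT" means applying the DWT to the denoised signal.

```python
# Illustrative sketch (not the paper's exact code) of the four feature
# configurations, assuming `motion_signal` is the 1D signal derived from the
# video and that a nominal sampling rate is acceptable for MFCC analysis.
import numpy as np
import librosa
import pywt

def mfcc_vector(signal, sr=100, n_mfcc=13):
    """13 MFCCs averaged over time into a fixed-length feature vector."""
    m = librosa.feature.mfcc(y=np.asarray(signal, dtype=float), sr=sr, n_mfcc=n_mfcc)
    return m.mean(axis=1)

def dwt_approx(signal, wavelet="db4", level=3):
    """Low-frequency approximation coefficients of a multi-level DWT."""
    return pywt.wavedec(signal, wavelet, level=level)[0]

def denoise(signal, wavelet="db4", level=3):
    """Soft-threshold wavelet denoising (see the wavelet-based denoising subsection)."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    thr = np.median(np.abs(coeffs[-1])) / 0.6745 * np.sqrt(2 * np.log(len(signal)))
    coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

# Feature configurations (1)-(4)
features = {
    "mfcc":              mfcc_vector(motion_signal),
    "mfcc_dwt":          mfcc_vector(dwt_approx(motion_signal)),
    "mfcc_denoised":     mfcc_vector(denoise(motion_signal)),
    "mfcc_denoised_dwt": mfcc_vector(dwt_approx(denoise(motion_signal))),
}
```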
Figure 3 shows the internal structure of the feature extraction pipeline, illustrating the transformation from 3D video frames to a 1D motion signal, followed by successive stages of MFCC extraction, wavelet decomposition, and denoising.
In the final stage, the extracted feature vectors are fed into a classification model that assigns the video to one of three predefined motion classes: low, medium, or high motion intensity [15]. This classification step supports various downstream applications, such as video indexing, retrieval, and content-based filtering, particularly in domains where motion intensity serves as a key descriptor.
[See PDF for image]
Fig. 3
Video classification automated system
Feature extraction and classification
The proposed video classification framework employs a comprehensive feature extraction pipeline to enhance the representation of motion information. Four primary strategies are adopted: (1) Mel-Frequency Cepstral Coefficients (MFCC) alone, (2) MFCC applied after the Discrete Wavelet Transform (DWT), (3) denoised MFCC, and (4) MFCC applied after denoised DWT. These methods collectively enrich the feature vector, allowing for more robust classification by capturing complementary aspects of the input signal.
Mel-frequency cepstral coefficients (MFCC)
The MFCC method is widely used in signal processing due to its ability to replicate human auditory perception. It translates the real frequency scale into the perceptually motivated Mel scale, as detailed in references [22, 23]. The conversion from frequency f in Hz to the Mel scale s is performed using the following formula:
$$ s = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right) \qquad (1) $$
To minimize signal distortion at frame boundaries, a Hamming window Wn(s) is applied:
$$ Y(s) = X(s)\,W_n(s), \qquad W_n(s) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N_s - 1}\right), \; 0 \le n \le N_s - 1 \qquad (2) $$
Here, Y(s) is the windowed signal, X(s) is the original framed signal, and N_s denotes the number of samples per frame. Following windowing, the signal is transformed into the frequency domain using the Fast Fourier Transform (FFT) [22]. The resulting spectrum is filtered through a triangular Mel filter bank to simulate auditory band-pass filtering. The output is then logarithmically scaled and transformed using the Discrete Cosine Transform (DCT), resulting in the MFCC coefficients that serve as the final features:
$$ c_n = \sum_{z=1}^{Z}\log(s_z)\,\cos\!\left[\frac{n\,(z - \tfrac{1}{2})\,\pi}{Z}\right], \qquad n = 1, 2, \ldots, N \qquad (3) $$
where Z is the number of filters, s_z is the output of the z-th filter, N is the number of coefficients (typically N = 13), and c_n denotes the cepstral coefficients that characterize the 1D motion signal. In summary, the MFCC extraction process consists of converting the 3D video into a 1D motion signal, applying a Hamming window to reduce distortion at frame boundaries, performing the FFT to move to the spectral domain, applying Mel-scale filtering, and finally applying logarithmic scaling and the DCT to produce the MFCC vector.
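For readers who prefer to see Eqs. (1)-(3) in operation, the following NumPy/SciPy sketch walks through the steps on a single frame of the motion signal. The frame length, filter-bank size, and number of coefficients here are illustrative choices, not values reported in the paper.

```python
# From-scratch sketch of the MFCC steps in Eqs. (1)-(3); parameter values
# (n_filters, n_coeffs) are illustrative assumptions.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)          # Eq. (1)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=26, n_coeffs=13):
    Ns = len(frame)
    windowed = frame * np.hamming(Ns)                   # Eq. (2): Y = X * Wn
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2       # power spectrum via FFT

    # Triangular Mel filter bank between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((Ns + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for z in range(1, n_filters + 1):
        l, c, r = bins[z - 1], bins[z], bins[z + 1]
        fbank[z - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[z - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_energy = np.log(fbank @ spectrum + 1e-12)       # log of filter outputs s_z
    return dct(log_energy, type=2, norm="ortho")[:n_coeffs]   # Eq. (3): c_n
```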
Discrete wavelet transform (DWT)
The Discrete Wavelet Transform is introduced to complement MFCC by providing a multi-resolution analysis of the signal. DWT decomposes the signal into high-pass and low-pass components, representing detailed and approximated versions of the signal respectively. This hierarchical representation enables localized analysis of both frequency and time, which is particularly effective for video signals with varying temporal dynamics.
DWT applies recursive decomposition, known as analysis, to separate the signal into approximation and detail coefficients. These components can later be reconstructed without loss through synthesis. The ability of DWT to extract features at multiple frequency levels makes it an effective tool to capture transient changes in motion, which may be overlooked by MFCC alone. Therefore, combining DWT-derived features with MFCC results in a more descriptive and discriminative feature set for video classification.
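As a brief illustration of the analysis and synthesis steps just described, the following sketch uses the PyWavelets package; the choice of the 'db4' wavelet and three decomposition levels is an assumption for demonstration only.

```python
# Multi-level DWT analysis and lossless synthesis with PyWavelets.
import numpy as np
import pywt

signal = np.random.randn(1024)            # stands in for the 1D motion signal

# Analysis: recursive split into one approximation band and several detail bands
coeffs = pywt.wavedec(signal, wavelet="db4", level=3)
cA3, cD3, cD2, cD1 = coeffs               # low-pass approximation + high-pass details

# Synthesis: reconstruction from the same coefficients
reconstructed = pywt.waverec(coeffs, wavelet="db4")
print(np.allclose(signal, reconstructed[: len(signal)]))   # True (up to float error)
```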
Wavelet-based denoising
To enhance noise robustness, especially in scenarios involving low-quality or compressed video content, wavelet-based denoising is incorporated into the feature extraction process [24]. The denoising process begins with wavelet decomposition using a selected filter, followed by thresholding—either soft or hard—applied to the wavelet coefficients to suppress noise components. Finally, the denoised signal is reconstructed using the inverse DWT. MFCCs are then re-computed from the denoised signal, resulting in enhanced feature stability and robustness against noise interference.
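A compact sketch of this decompose, threshold, and reconstruct procedure is given below, again using PyWavelets. The universal (MAD-based) threshold and the soft-thresholding default are common choices assumed here, not necessarily those used by the authors.

```python
# Wavelet denoising: decompose, threshold the detail coefficients (soft or
# hard), then reconstruct with the inverse DWT.
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=3, mode="soft"):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Noise level estimated from the finest detail band (median absolute deviation)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(signal)))
    denoised = [coeffs[0]] + [pywt.threshold(c, thr, mode=mode) for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(signal)]

# MFCCs would then be re-computed from wavelet_denoise(motion_signal).
```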
Classification and selection
This phase involves classifying the motion intensity of the input video. The classification process assigns each video to one of three motion categories: high, medium, or low motion intensity. As illustrated in Fig. 4, the system takes the extracted one-dimensional feature vector—derived from the motion signal—and feeds it into the selected classification model. The classifier then analyzes the temporal and spectral patterns within the feature vector to determine the corresponding motion class.
This stage plays a crucial role in the overall pipeline, as accurate classification depends heavily on the quality and discriminative power of the extracted features. The integration of denoising and MFCC-based feature representation ensures robustness against noise and enhances the classifier’s ability to distinguish between different motion levels. Figure 4 visually encapsulates this process, demonstrating the transformation from raw 3D video data to a simplified 1D signal representation, followed by feature extraction and final classification into motion intensity labels.
[See PDF for image]
Fig. 4
Converting 3D video data into 1D motion signals with subsequent feature extraction and classification into motion intensity levels
Experimental results and discussion
The main principle behind the proposed classification framework is based on aggregating sequential video frames into a single representative frame, denoted as F. This is accomplished by averaging the corresponding pixel values across all frames using the following formula:
$$ F(x, y) = \frac{1}{N_F}\sum_{i=1}^{N_F} I_i(x, y) \qquad (4) $$
where F(x, y) is the pixel value located at (x, y) in the new frame F, I_i(x, y) is the pixel value at the same position in frame i, and N_F is the total number of frames. Finally, the new frame is converted into a one-dimensional signal, as illustrated in Fig. 4.
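A minimal NumPy sketch of this aggregation step is shown below; it assumes the decoded video is available as an array of grayscale frames of shape (N_F, H, W).

```python
# Eq. (4): average the frames into one representative frame F, then flatten
# F into the 1D signal used for feature extraction.
import numpy as np

def video_to_1d_signal(frames: np.ndarray) -> np.ndarray:
    """frames: (N_F, H, W) array of grayscale frames."""
    F = frames.astype(float).mean(axis=0)   # F(x, y) = (1/N_F) * sum_i I_i(x, y)
    return F.ravel()                        # row-major scan yields the 1D signal
```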
The dataset used to validate the proposed system comprises a total of 741 videos, categorized into three motion classes: 354 high-motion, 205 medium-motion, and 182 low-motion videos. A 10-fold cross-validation scheme was employed for training and evaluation, whereby 90% of the data was allocated for training and the remaining 10% for testing in each fold. The video sequences were sourced from the Xiph.org Video Test Media database, a well-established and publicly available collection used widely in multimedia research. This dataset includes video content with a diverse range of temporal and spatial characteristics, available in multiple resolutions and frame rates, such as 4CIF at 60 frames per second (fps), CIF at 30 fps, and QCIF at 15 fps. File sizes range from 5.5 MB to 349 MB, and many sequences are also provided in monochrome QCIF versions. The varied nature of the video content—encompassing natural landscapes, human activities, surveillance environments, and urban settings—ensures a comprehensive representation of motion complexity and scene dynamics. This heterogeneity plays a key role in demonstrating the generalizability and robustness of the proposed classification model across multiple motion intensity levels without relying on semantic labels.
Building on the methodology outlined in the previous section, we conducted a comprehensive evaluation of four distinct feature extraction strategies designed to capture motion characteristics from video sequences: (1) direct extraction of Mel-Frequency Cepstral Coefficients (MFCC) from the raw motion signal, (2) application of the Discrete Wavelet Transform (DWT) prior to MFCC computation, (3) signal denoising followed by MFCC extraction, and (4) a combined approach involving DWT and denoising prior to MFCC computation.
[See PDF for image]
Fig. 5
Recognition rate to SNR for video motion type
These methods were selected to assess how preprocessing impacts feature robustness, particularly in noisy environments.
The effectiveness of each method was examined under varying levels of signal-to-noise ratio (SNR), simulating realistic scenarios where motion signals may be corrupted by background noise. As shown in Fig. 5, the denoised MFCC method consistently delivered the highest recognition accuracy, especially in the low SNR range (0–20 dB), where noise severely degrades classification performance. Although the denoised DWT-MFCC method also showed relatively strong performance, it remained slightly inferior to the denoised MFCC. These findings emphasize the critical role of noise reduction in enhancing the quality of extracted motion features, thereby improving overall system robustness.
Following feature extraction, the resulting feature vectors were input into seven well-established classification algorithms: K-Nearest Neighbors (KNN), Support Vector Machines with radial basis function (SVM-RBF) and linear (SVM-Linear) kernels, Neighborhood Components Analysis (NCA), Parzen Window, Decision Trees (DT), and Multinomial Logistic Regression (ML). The classification accuracy for each combination of feature extraction method and classifier is summarized in Table 1. These results were further visualized across a series of comparative plots (Figs. 6, 7, 8 and 9) to highlight trends in classifier performance.
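The following scikit-learn sketch shows how such a comparison could be scripted with 10-fold cross-validation. The hyperparameters are defaults, the feature matrix `X` and label vector `y` are assumed to be available, and the Parzen window classifier is approximated by a radius-based neighbors vote, which is a stand-in rather than the paper's exact implementation.

```python
# Comparison of the seven classifiers under 10-fold cross-validation.
# X: feature matrix (one row per video), y: motion-intensity labels (assumed given).
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import (KNeighborsClassifier, RadiusNeighborsClassifier,
                               NeighborhoodComponentsAnalysis)
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "KNN":        KNeighborsClassifier(n_neighbors=5),
    "SVM-RBF":    SVC(kernel="rbf"),
    "SVM-Linear": SVC(kernel="linear"),
    "NCA":        make_pipeline(NeighborhoodComponentsAnalysis(),
                                KNeighborsClassifier(n_neighbors=5)),
    # Radius-based vote used as a rough Parzen-window stand-in
    "Parzen":     RadiusNeighborsClassifier(radius=5.0, outlier_label="most_frequent"),
    "DT":         DecisionTreeClassifier(),
    "ML":         LogisticRegression(max_iter=1000),   # multinomial logistic regression
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    model = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name:10s} mean accuracy = {scores.mean():.4f}")
```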
In terms of classifier performance, the combination of denoised MFCC features with the KNN classifier achieved the highest overall accuracy, reaching 98.65%, followed closely by the NCA classifier using the same feature set. Across nearly all classifier configurations, models that utilized denoised MFCC features consistently outperformed those that relied on other feature extraction methods. This consistency reinforces the utility of noise suppression prior to feature computation as a key design choice in motion classification tasks.
Additional insights are provided in Figs. 6, 7, 8 and 9, which illustrate the classifier-wise accuracy for each feature extraction method. Specifically, Fig. 6 demonstrates the clear advantage of denoised MFCC across all classifiers. Figure 7 shows that while standard MFCC achieves moderate performance, its sensitivity to noise leads to noticeable drops in accuracy. Figures 8 and 9 further confirm that while DWT-based enhancements improve robustness to some extent, they do not surpass the performance gains achieved through pure denoising.
To consolidate the overall comparative analysis, Fig. 10 presents a unified performance comparison of all classifiers across all feature extraction methods. This figure confirms the superiority of the denoised MFCC strategy, particularly when paired with simple yet effective classifiers such as KNN. The findings demonstrate that denoising not only boosts classification accuracy but also improves consistency across model architectures.
In conclusion, the experimental results establish that the denoised MFCC approach offers a robust and computationally efficient feature extraction pipeline for motion-intensity classification. It excels across a wide range of classifiers and proves resilient in noisy conditions, making it particularly well-suited for real-time and embedded applications in video analysis domains.
Table 1. Classification accuracies (%) of the feature extraction methods

| Classification method | Denoised MFCC | MFCC | MFCC of DWT | MFCC of denoised DWT |
|---|---|---|---|---|
| KNN | 98.65 | 93.24 | 87.84 | 95.95 |
| NCA | 97.30 | 91.89 | 89.19 | 94.59 |
| SVM-RBF | 95.95 | 91.89 | 93.24 | 94.59 |
| SVM-Linear | 94.59 | 93.24 | 87.84 | 93.24 |
| Parzen | 94.59 | 94.59 | 89.19 | 94.59 |
| DT | 95.95 | 91.89 | 86.49 | 91.89 |
| ML | 93.24 | 90.54 | 89.19 | 91.89 |
[See PDF for image]
Fig. 6
Classification accuracies of Denoised MFCC Method
[See PDF for image]
Fig. 7
Classification accuracies of MFCC Method
[See PDF for image]
Fig. 8
Classification accuracies of MFCC of DWT Method
[See PDF for image]
Fig. 9
Classification accuracies of MFCC of Denoised DWT Method
[See PDF for image]
Fig. 10
Comparison of classification accuracies for the seven classifiers
Class-wise performance evaluation
To further validate the reliability and robustness of the proposed motion-intensity classification framework, a detailed class-wise performance evaluation was performed. The evaluation was based on an imbalanced dataset composed of 354 high-motion, 205 medium-motion, and 182 low-motion video sequences. Such class imbalance necessitates deeper analysis beyond overall accuracy, as reliance solely on global metrics may obscure class-specific weaknesses.
In addition to the overall classification accuracy of 98.65%, standard evaluation metrics—Precision, Recall, and F1-score—were calculated separately for each motion intensity class. As reported in Table 2, the proposed model maintained high consistency across all categories. For high-motion sequences, the model achieved a precision of 98.87%, a recall of 98.59%, and an F1-score of 98.73%. Meanwhile, for medium-motion sequences, it recorded 97.57% precision, 98.05% recall, and a 97.81% F1-score. In the case of low-motion sequences, the model exhibited 97.80% precision, 97.80% recall, and an F1-score of 97.80%.
The average F1-score across all classes was 98.11%, which confirms the model’s ability to perform uniformly well despite the imbalance in data distribution. This level of performance indicates strong generalization capabilities and high discrimination between classes, even when class boundaries are subtle.
To visually support the numerical evaluation, the confusion matrix in Fig. 11 illustrates the predicted versus true labels for each motion intensity class. The matrix displays clear diagonal dominance, which is characteristic of accurate classification. Minimal misclassifications occurred, and when they did, they were typically between adjacent motion levels (e.g., medium mistaken as high), suggesting that the errors are contextually reasonable.
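For completeness, the per-class metrics and the confusion matrix can be reproduced with a few lines of scikit-learn, assuming `y_true` and `y_pred` hold the pooled cross-validation labels and predictions encoded as 0 = low, 1 = medium, 2 = high.

```python
# Per-class precision, recall, F1-score, and the confusion matrix.
from sklearn.metrics import classification_report, confusion_matrix

class_names = ["low", "medium", "high"]   # must match the 0/1/2 label encoding
print(classification_report(y_true, y_pred, target_names=class_names, digits=4))
print(confusion_matrix(y_true, y_pred))   # rows: true class, columns: predicted class
```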
In summary, the results demonstrate that the proposed KNN + denoised MFCC approach achieves not only superior classification accuracy but also robust class-specific performance, making it highly suitable for real-time and embedded systems where computational efficiency and reliability are critical.
Table 2. Class-wise Precision, Recall, and F1-score of the proposed motion classification model
| Motion class | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|
| High motion | 98.87 | 98.59 | 98.73 |
| Medium motion | 97.57 | 98.05 | 97.81 |
| Low motion | 97.80 | 97.80 | 97.80 |
| Average | 98.08 | 98.15 | 98.11 |
[See PDF for image]
Fig. 11
Confusion matrix for three-class motion intensity classification
Comparative evaluation with state-of-the-art deep learning methods
To validate the robustness, accuracy, and computational efficiency of the proposed lightweight video classification framework, a comparative study was conducted against two widely recognized deep learning baselines: MobileNet-LSTM and Tiny TimeSformer. The former represents a convolutional-recurrent hybrid architecture that captures spatiotemporal dynamics through lightweight CNNs followed by Long Short-Term Memory (LSTM) units, while the latter is a compact Transformer-based architecture specifically designed for video data with attention-based temporal modeling. These models were selected for their balance between performance and resource efficiency, making them suitable candidates for low-power video analysis.
All models were trained and evaluated under identical experimental conditions using the Xiph.org Video Test Media dataset. This ensured a fair and controlled comparison, where the only varying factor was the underlying model architecture and the corresponding feature extraction approach.
The results, presented in Table 3, clearly demonstrate the superior classification accuracy achieved by the proposed method. The combination of K-Nearest Neighbors (KNN) with denoised MFCC features reached an accuracy of 98.65%, substantially outperforming MobileNet-LSTM (94.20%) and Tiny TimeSformer (92.80%). Notably, the proposed method also demonstrated significant computational advantages, requiring approximately 5–6 times less training time and memory usage compared to its deep learning counterparts.
These findings underscore the practicality of the proposed framework, especially in scenarios where computational resources are constrained or where real-time response is critical. Unlike deep neural architectures that often require extensive parameter tuning and large-scale annotated datasets, the proposed system achieves high accuracy with minimal overhead. It also avoids reliance on semantic labels or complex neural architectures, making it particularly suitable for deployment in embedded systems, UAV-based surveillance, and edge computing platforms.
Table 3. Performance comparison between the proposed model and deep learning baselines
| Model | Accuracy (%) | Training time (normalized) | Memory usage (normalized) |
|---|---|---|---|
| KNN + denoised MFCC | 98.65 | 1.0× | 1.0× |
| MobileNet-LSTM | 94.20 | 5.3× | 4.8× |
| Tiny TimeSformer | 92.80 | 6.1× | 5.5× |
Real-world applications and deployment scenarios
The proven efficiency and accuracy of the proposed motion-intensity classification framework highlight its suitability for a broad range of real-world applications that demand lightweight, interpretable, and resource-conscious solutions. In contrast to traditional semantic-based video understanding approaches—which typically require deep learning models, large annotated datasets, and high processing power—motion-intensity-based classification offers a compelling alternative that maintains functionality while significantly reducing complexity.
One of the most prominent applications lies in smart traffic management, where real-time classification of motion intensity can be employed to monitor vehicle flow, detect traffic congestion, or identify anomalies without the need for object-level detection or tracking. This can significantly reduce computational overhead while maintaining operational efficiency.
In the field of healthcare and elderly care, motion classification can be used to detect abrupt movements such as falls or prolonged inactivity, enabling timely intervention without the ethical and privacy concerns associated with continuous video surveillance.
Similarly, low-power embedded systems, such as IoT edge devices or wildlife monitoring stations, can benefit from motion-level analysis by using motion intensity as a trigger mechanism. This allows the system to conserve storage and battery life by avoiding redundant video recording in static scenes.
In fitness and activity monitoring, motion-intensity signals can be leveraged to assess physical activity levels—such as walking, running, or resting—without requiring visual data or GPS tracking, thereby reducing energy consumption and preserving user privacy.
In the domain of sports analytics and video summarization, segments with high motion intensity can be automatically identified as potential highlights, streamlining content curation without necessitating semantic labeling or player tracking.
Overall, the proposed method demonstrates strong potential for deployment in real-world environments that require simplicity, scalability, and responsiveness. Its reliance on robust motion descriptors rather than semantically annotated features positions it as an effective solution for next-generation video analysis in practical settings.
Conclusion and future work
This study introduces a novel video classification system that combines MFCC and DWT for efficient feature extraction and classification. The denoised MFCC method, when paired with KNN, demonstrates superior performance with a classification accuracy of 98.65%. Future work will focus on incorporating advanced feature extraction methods, integrating deep learning architectures, and optimizing the system for real-time applications. By addressing these aspects, the system can evolve into a versatile and intelligent framework for various video analysis tasks.
Conclusion
This paper introduces a robust and innovative video classification method leveraging the combined power of Mel Frequency Cepstral Coefficients (MFCC) and Discrete Wavelet Transform (DWT) techniques. The proposed methodology converts complex 3D video signals into a simplified 1D representation, enabling efficient feature extraction and classification. The key findings from the study include:
Superiority of Denoised MFCC Features: Among the four feature extraction methods explored, the denoised MFCC approach demonstrated the highest accuracy across all classification techniques. This method proved particularly effective in enhancing the signal quality, reducing noise, and capturing essential features necessary for accurate motion classification.
Performance of Classification Techniques: The evaluation of seven classification algorithms revealed that the K-Nearest Neighbors (KNN) method, when combined with denoised MFCC features, achieved the highest accuracy of 98.65%. This highlights KNN’s effectiveness in handling noise and distinguishing motion classes in video signals.
Utility of Combined Features: The integration of MFCC and DWT features provided a rich and diverse feature set, enhancing classification accuracy by leveraging both time-domain and frequency-domain information.
Scalability and Efficiency: The proposed system achieves high accuracy with minimal computational overhead, making it suitable for real-time applications such as video surveillance, content moderation, and video-based autonomous systems.
The proposed methodology offers a novel application of MFCC, traditionally used in speech recognition, to video motion classification. This adaptation demonstrates the versatility of MFCC and its potential for broader applications.
Future work
While the proposed framework has shown high accuracy in motion-type classification, there are several avenues for extending this work. Future studies could investigate hybrid architectures that combine the efficiency of MFCC and DWT-based feature extraction with lightweight deep learning models to capture more subtle motion variations. Expanding the classification categories beyond the current high, medium, and low-motion types—such as differentiating between periodic and non-periodic motion—could further broaden the applicability of the system. Another promising direction is the development of adaptive feature extraction pipelines that adjust processing parameters dynamically based on video quality, resolution, or noise levels. Real-world deployment scenarios, such as drone-based surveillance, industrial process monitoring, and sports performance analysis, could be explored to assess the framework’s robustness under diverse conditions. Additionally, incorporating temporal segmentation could allow classification of mixed-motion videos, where different motion types occur within the same clip, thus improving granularity and usability in practical applications.
Acknowledgements
Not applicable.
Author contributions
All authors worked together throughout the research process, including study design, data collection, result analysis, and manuscript writing. All authors equally contributed and approved the final version of the manuscript.
Funding
This research received no funding from any governmental, private, or non-profit organizations.
Data availability
The data used in this study is publicly available from Xiph.org Video Test Media and can be accessed at: https://media.xiph.org/video/derf/.
Declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Iloga S, Romain O, Bendaouia L, Tchuente M (2014) Musical genres classification using Markov models. In: 2014 International Conference on Audio, Language and Image Processing (ICALIP), pp. 701–705. IEEE
2. Islam, MS; Sultana, S; Roy, UK; Al Mahmud, J. A review on video classification with methods, findings, performance, challenges, limitations and future work. Jurnal Ilmiah Teknik Elektro Komputer Dan Informatika; 2020; 6,
3. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1725–1732
4. Krizhevsky, A; Sutskever, I; Hinton, GE. ImageNet classification with deep convolutional neural networks. Commun ACM; 2017; 60,
5. Xia, W et al. Video visualization and visual analytics: A task-based and application-driven investigation. IEEE Trans Circuits Syst Video Technol; 2024; 34,
6. Huang, Z; Qin, Y; Lin, X; Liu, T; Feng, Z; Liu, Y. Motion-driven Spatial and Temporal adaptive high-resolution graph convolutional networks for skeleton-based action recognition. IEEE Trans Circuits Syst Video Technol; 2022; 33,
7. Oquab M, Bottou L, Laptev I, Sivic J (2014) Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1717–1724
8. Liang, H; Zhang, Z; Hu, C; Gong, Y; Cheng, D. A survey on spatio-temporal big data analytics ecosystem: resource management, processing platform, and applications. IEEE Trans Big Data; 2023; 10,
9. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788
10. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 27
11. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497
12. Jiao, L et al. Transformer Meets remote sensing video detection and tracking: A comprehensive survey. IEEE J Sel Top Appl Earth Observations Remote Sens; 2023; 16, pp. 1-45. [DOI: https://dx.doi.org/10.1109/JSTARS.2023.3289293]
13. Córdoba-Tlaxcalteco, ML; Benítez-Guerrero, EI. A systematic literature review on vision-based human event recognition in smart classrooms: identifying significant events and their applications. Trudy ISP RAN; 2024; 36,
14. Hara K, Kataoka H, Satoh Y (2018) Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6546–6555
15. Yang Z, Zhang Y, Ning J, Wang X, Wu Z (2025) Early diagnosis of autism: A review of video-based motion analysis and deep learning techniques. IEEE Access. 13: 2903-2928
16. Daoud, Z; Ben Hamida, A; Ben Amar, C; Miguet, S. A one stream three-dimensional convolutional neural network for fire recognition based on spatio-temporal fire analysis. Evol Syst; 2024; 15,
17. Wei, R; Yan, R; Qu, H; Li, X; Ye, Q; Fu, L. SVMFN-FSAR: Semantic-guided video multimodal fusion network for few-shot action recognition. Big Data Min Analytics; 2025; 8,
18. Gan, C; Xiao, J; Zhu, Q; Jain, DK; Štruc, V. Transfer-learning enabled micro-expression recognition using dense connections and mixed attention. Knowl Based Syst; 2024; 305, 112640. [DOI: https://dx.doi.org/10.1016/j.knosys.2024.112640]
19. Do, HN; Phan, HT; Nguyen, NT. Multimodal sentiment analysis using deep learning and fuzzy logic: A comprehensive survey. Appl Soft Comput; 2024; 167, 112279. [DOI: https://dx.doi.org/10.1016/j.asoc.2024.112279]
20. Alves, B; Brás, S; Sebastião, R. Decoding pain: prediction under different emotional contexts through physiological signals. Int J Data Sci Analytics; 2025; 19,
21. Ruhi SSF, Nahar F, Ashrafi AF (2024) A novel approach towards the classification of bone fracture from musculoskeletal radiography images using attention-based transfer learning. In: 2024 27th International Conference on Computer and Information Technology (ICCIT), vol. 1–6, pp. 517–522. IEEE
22. Tyagi V, Wellekens C (2005) On desensitizing the mel-cepstrum to spurious spectral components for robust speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. I–529
23. Murty, KSR; Yegnanarayana, B. Combining evidence from residual phase and MFCC features for speaker recognition. IEEE Signal Process Lett; 2006; 13,
24. Li B (2014) Noise Robust Speech Recognition Using Deep Neural Network. Ph.D. Thesis, National University of Singapore
© The Author(s) 2025. This article is published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).