Abstract
The increasing complexity of video tampering techniques poses a significant threat to the integrity and security of Internet of Multimedia Things (IoMT) ecosystems, particularly in resource-constrained edge-cloud infrastructures. This paper introduces Multiscale Gated Multihead Attention Depthwise Separable CNN (MGMA-DSCNN), an advanced deep learning framework specifically optimized for real-time tampered video detection in IoMT environments. By integrating lightweight convolutional neural networks (CNNs) with multihead attention mechanisms, MGMA-DSCNN significantly enhances feature extraction while maintaining computational efficiency. Unlike conventional methods, this approach employs a multiscale attention mechanism to refine feature representations, effectively identifying deepfake manipulations, frame insertions, splicing, and adversarial forgeries across diverse multimedia streams. Extensive experiments on multiple forensic video datasets—including the HTVD dataset—demonstrate that MGMA-DSCNN outperforms state-of-the-art architectures such as VGGNet-16, ResNet, and DenseNet, achieving an unprecedented detection accuracy of 98.1%. Furthermore, by leveraging edge-cloud synergy, our framework optimally distributes computational loads, effectively reducing latency and energy consumption, making it highly suitable for real-time security surveillance and forensic investigations. These advancements position MGMA-DSCNN as a scalable, high-performance solution for next-generation intelligent video authentication, offering robust, low-latency detection capabilities in dynamic and resource-constrained IoMT environments.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Ensuring the authenticity of multimedia content has become increasingly critical as advanced video manipulation techniques—such as deepfakes, splicing, and frame tampering—continue to evolve rapidly [1–4]. These techniques pose significant threats in domains such as surveillance, forensics, and national security, where video evidence must be trusted and verifiable [1]. Addressing this issue requires robust, scalable, and low-latency tampering detection systems, particularly within the context of the Internet of Multimedia Things (IoMT), where resource constraints are prevalent [5–7].
Traditional approaches to video forgery detection often rely on handcrafted features and heuristic-based models, which lack adaptability and struggle in real-time conditions with limited computational capacity. Moreover, the explosion in video data volume exacerbates the challenges of timely and accurate analysis, especially at the edge [8, 9]. Recent advances in deep learning have introduced powerful alternatives. Convolutional neural network (CNN)-based frameworks have shown strong capabilities in extracting complex spatial–temporal features. This makes them effective tools for tampered video detection across diverse environments [10–12].
Edge computing, in particular, has emerged as a promising paradigm for real-time video analysis, enabling the processing of multimedia data near its source [13]. Combined with cloud infrastructure, such systems can balance workloads, enhance energy efficiency, and achieve high-speed detection across distributed networks [14–16]. However, several technical challenges remain, including computational overhead, resource management, and maintaining detection accuracy in dynamic or noisy environments [17–21].
To address these challenges, we propose a novel deep learning architecture—Multiscale Gated Multihead Attention Depthwise Separable CNN (MGMA-DSCNN)—designed for efficient and accurate video tampering detection in edge-cloud ecosystems. The model integrates lightweight convolutional layers with multiscale and multihead attention mechanisms, allowing it to capture both fine-grained spatial and temporal inconsistencies while maintaining low computational cost. The system architecture is illustrated in Figure 1, which shows the interaction among surveillance cameras, edge devices, and cloud-based inference modules. Bidirectional data exchange enables parallel processing, improved scalability, and rapid decision-making.
[figure(s) omitted; refer to PDF]
We validated the proposed framework through extensive experiments on large-scale forensic datasets. The results show that it surpasses VGGNet-16, ResNet, and DenseNet in both accuracy and efficiency. The proposed approach also supports modular scaling, allowing for optimal deployment in environments with varying resource limitations. The key contributions of this work are as follows:
1. a lightweight, attention-enhanced CNN architecture tailored for tampered video detection;
2. an efficient edge-cloud system design enabling real-time processing under resource constraints;
3. comprehensive experimental validation demonstrating superior performance across standard datasets.
The remainder of the paper is organized as follows: Section 2 reviews related work in video tampering detection. Section 3 details the proposed method and system architecture. Section 4 presents experimental results and performance evaluation. Finally, Section 5 concludes the paper and outlines directions for future research.
2. Related Work
Recent studies on video forgery detection have primarily focused on enhancing performance in resource-constrained environments through lightweight architectures and optimization techniques. These works can be cohesively organized into three core categories: (1) deep learning-based detection models, (2) hybrid frameworks and resource-aware systems, and (3) datasets, benchmarking, and interpretability-enhancing methods.
2.1. Deep Learning–Based Detection Models
A foundational stream of research has leveraged CNNs for detecting subtle manipulations in video content. These models have proven particularly effective at extracting spatial features from video frames. For example, Johnston et al. [1] introduced a CNN-based approach capable of identifying subtle alterations through deep analysis of individual frames. Similarly, Zhang et al. [18] enhanced tampering detection by analyzing complex motion patterns using deep learning techniques.
Building upon this trend, Chen et al. [15] developed a CNN-based method focusing on motion pattern analysis for object-level forgery detection. Yao et al. [14] proposed a fast and accurate tampering detection model using deep neural networks, which is notably well suited for edge environments.
Further extending these contributions, Kumar et al. [22] introduced a two-stage system that integrates deep CNN feature extraction with classification algorithms to detect insertions and deletions in video frames.
Despite their effectiveness, most CNN-based models are limited in adaptability to real-world conditions. Specifically, their reliance on high-resolution and minimally compressed video frames can lead to degraded accuracy in bandwidth-constrained environments or when dealing with lossy compression artifacts. Additionally, many lack integration with temporal dynamics, reducing their sensitivity to interframe inconsistencies. Our proposed model addresses these limitations by incorporating temporal attention mechanisms and training on compression-degraded data, thereby improving both robustness and real-time applicability in edge-oriented scenarios.
2.2. Hybrid Architectures and Resource-Aware Approaches
Beyond purely CNN-based models, a second line of research explores hybrid architectures that integrate spatial, temporal, and attention mechanisms to boost detection accuracy and resilience under constrained conditions. Cheng et al. [23] proposed a miniXception-LSTM hybrid model incorporating both spatial and temporal features with attention mechanisms, achieving 99.05% accuracy on the FaceSwap dataset. In a similar direction, Zhao et al. [24] introduced a CNN-LSTM hybrid that capitalizes on high-frequency features to improve accuracy in resource-limited environments. To further enhance robustness, Raveendra and Nagireddy [25] combined DCT-based double compression, bilateral filtering, and adaptive segmentation with a deep neural network optimized by a gray-scale swarm algorithm (AGSO). Eltoukhy et al. [26] also contributed a CNN-based approach aimed at detecting copy-move forgeries, achieving high accuracy on datasets such as SULFA and GRIP with minimal overhead. Meanwhile, Bai et al. [27] developed PIM-Net, a novel localization model featuring an Inconsistency mining module, which enhances sensitivity to subtle manipulations while maintaining computational efficiency—suitable for real-time deployment.
Complementing architectural innovations, several works have focused specifically on optimizing video forgery detection for edge and resource-constrained environments. For instance, Dou et al. [11] minimized computational load by reducing the number of frames analyzed and employing lightweight networks.
Zhao et al. [28] tackled the issue of untrusted edge nodes by introducing RIETD, a reputation-based scheme that dynamically adjusts detection frequency, cutting costs by 60% while preserving a 90% detection rate.
Wu et al. [7] and Sahu et al. [29] further addressed system-level challenges such as metadata integrity and temporal consistency—key concerns for video authentication in surveillance and mission-critical domains.
While hybrid and resource-aware approaches show improved performance, they often come with increased architectural complexity and inference latency, which limits their scalability in real-time or low-power environments. Moreover, their reliance on handcrafted integration of modules such as bilateral filters or DCT preprocessing can hinder generalization. Our proposed model simplifies this pipeline by utilizing depthwise separable convolutions and multihead attention in a unified framework. This enables lower computational cost without compromising accuracy, facilitating smoother deployment across edge-cloud infrastructures.
2.3. Benchmarking, Datasets, and Interpretability-Focused Methods
A third major research stream has concentrated on enabling robust evaluation and interpretability in forgery detection systems through the development of dedicated datasets, surveys, and novel analytic methods.
Singla et al. [30] made a significant contribution by releasing the HEVC-based tampered video dataset (HTVD), which contains over 8600 forged videos with varied manipulation types—supporting evaluation under realistic encoding conditions. Complementing this, Deo et al. [31] conducted a comprehensive review of tampering methods (e.g., region forgery, multiple compression, interframe manipulation) and tested multiple AI/ML algorithms to benchmark effectiveness across scenarios such as social media content detection. Tyagi and Yadav [32] provided a broader survey covering both image and video manipulation techniques, highlighting their real-world impacts and surveying multiple detection algorithms across benchmark datasets. On the interpretability front, Lu et al. [33] proposed a graph-based deep learning framework for detecting spatial–temporal tampering. Their method effectively identifies small manipulated regions and visualizes pixel inconsistencies to enhance forensic transparency. In a complementary direction, Sahu et al. [29] presented a multitask CNN architecture for detecting metadata-level tampering, aimed particularly at large-scale surveillance footage and intellectual property protection scenarios.
Although these studies advance interpretability and benchmarking in video forgery detection, they often remain disconnected from practical deployment needs, particularly in time-sensitive or constrained environments. Furthermore, few of these methods are integrated directly into detection models to support transparency during inference. Our approach bridges this gap by embedding lightweight gating mechanisms that enhance explainability while maintaining computational efficiency. This allows the model to offer both forensic interpretability and real-time viability, aligning better with real-world multimedia security demands.
3. Proposed Method
As illustrated in Figure 2, the proposed framework comprises several key stages tailored for efficient video forgery detection and resource optimization in low-computation environments. These steps are as follows:
1. Deep feature extraction using a lightweight CNN: First, spatial and temporal features of the video are extracted using a lightweight CNN model to ensure suitability for low-resource environments.
2. Integration of attention mechanisms: Multiscale and multihead attention mechanisms are incorporated to enhance the accuracy of detecting manipulated videos. This step allows for better identification of long-term dependencies and critical features across the sequence of frames.
3. Edge processing and analysis in low-resource environments: The model is optimized for edge computing, enabling the efficient detection of forged videos in low-resource environments through automated multimedia analysis.
4. Model performance optimization via a green framework: Finally, an advanced green computing framework is employed, focusing on optimizing energy and resource consumption without sacrificing detection accuracy.
[figure(s) omitted; refer to PDF]
Thus, a novel CNN model, referred to as MGMA-DSCNN, is introduced specifically for improving video forgery detection accuracy while minimizing the model’s size. This model is inspired by the multiscale gated and multihead attention strategies, designed to enhance the precision of forgery detection in video data.
3.1. Preprocessing Phase
Preprocessing video content that may contain tampered elements plays an essential role in effectively identifying video forgery through deep learning techniques. This step is critical because it removes extraneous information, enhancing the precision of the subsequent detection process. In our approach, we start with a video Vi that might have been altered and break it down into a sequence of N individual one-second clips, denoted as Vi = {Ci,1, Ci,2, …, Ci,N}.
The jth frame is then extracted from each potentially tampered clip Ci,n, resized to the 256 × 256 network input resolution, and passed on for frame-level analysis.
As illustrated in Figure 3, the preprocessing pipeline segments the input video into annotated clips and extracts representative frames, enabling the detection of various forgery types such as insertion, splicing, and inpainting through frame-level analysis.
[figure(s) omitted; refer to PDF]
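To make the clip-and-frame segmentation concrete, the following minimal sketch (not the authors' code) shows one way to split a possibly tampered video into one-second clips and yield individual frames for frame-level analysis; it assumes OpenCV is available and uses the 256 × 256 input size described in Section 3.2.

```python
# Illustrative preprocessing sketch: segment a possibly tampered video into
# one-second clips and sample frames for the detector (assumes OpenCV).
import cv2
import numpy as np

def extract_clip_frames(video_path, frame_size=(256, 256)):
    """Yield (clip_index, frame_index, frame) for one-second clips of a video."""
    cap = cv2.VideoCapture(video_path)
    fps = int(round(cap.get(cv2.CAP_PROP_FPS))) or 25  # fall back if FPS metadata is missing
    clip_idx, frame_idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Resize to the network input resolution and scale to [0, 1]
        frame = cv2.resize(frame, frame_size).astype(np.float32) / 255.0
        yield clip_idx, frame_idx, frame
        frame_idx += 1
        if frame_idx == fps:          # a new one-second clip starts every `fps` frames
            clip_idx += 1
            frame_idx = 0
    cap.release()
```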
3.2. Lightweight Architecture
As summarized in Table 1, the network is organized into convolutional blocks: the deeper blocks each combine two depthwise separable convolution layers, one conventional convolution layer, and one maximum pooling layer, while the first two blocks contain only depthwise separable convolutions followed by pooling. The block labels used throughout the paper, including the attention branches Atn1, Atn2, and Atn3, are listed in the first column of the table.
Table 1
Structural configuration of the proposed MGMA-DSCNN model, detailing the composition of blocks, number of layers, output feature map sizes, and corresponding parameter counts at each stage of the architecture.
| Blocks | Layer name | Output configuration | Parameters |
| Input | Input | (NoBatch, 256, 256, 3) | 0 |
| | Sep_2DConv (1) | (NoBatch, 256, 256, 64) | 283 |
| | Sep_2DConv (2) | (NoBatch, 256, 256, 64) | 4736 |
| | Maxpooling2D (1) | (NoBatch, 128, 128, 64) | 0 |
| | Sep_2DConv (3) | (NoBatch, 128, 128, 128) | 8896 |
| | Sep_2DConv (4) | (NoBatch, 128, 128, 128) | 17,664 |
| | Maxpooling2D (2) | (NoBatch, 64, 64, 128) | 0 |
| | Sep_2DConv (5) | (NoBatch, 64, 64, 256) | 34,176 |
| | Sep_2DConv (6) | (NoBatch, 64, 64, 256) | 68,096 |
| | Conv2D (7) | (NoBatch, 64, 64, 256) | 590,080 |
| | Maxpooling2D (3) | (NoBatch, 32, 32, 256) | 0 |
| | Sep_2DConv (11) | (NoBatch, 32, 32, 512) | 133,888 |
| | Sep_2DConv (12) | (NoBatch, 32, 32, 512) | 267,264 |
| | Conv2D (13) | (NoBatch, 32, 32, 512) | 2,359,808 |
| | Maxpooling2D (4) | (NoBatch, 16, 16, 512) | 0 |
| Pooling | Global average pooling2D (1) | (NoBatch, 32) | 0 |
| | Sep_2DConv (16) | (NoBatch, 16, 16, 512) | 133,888 |
| | Sep_2DConv (17) | (NoBatch, 16, 16, 512) | 267,264 |
| | Conv2D (18) | (NoBatch, 16, 16, 512) | 2,359,808 |
| | Maxpooling2D (5) | (NoBatch, 8, 8, 512) | 0 |
| Atn1 | Poi_Conv2D (19) | (NoBatch, 7, 7, 192) | 393,408 |
| | MultiHead Attention (1) | (NoBatch, 49, 32) | 117,344 |
| Atn2 | Poi_Conv2D (14) | (NoBatch, 8, 8, 256) | 131,328 |
| | Poi_Conv2D (15) | (NoBatch, 7, 7, 192) | 196,800 |
| | MultiHead Attention (2) | (NoBatch, 49, 32) | 117,344 |
| Atn3 | Poi_Conv2D (8) | (NoBatch, 16, 16, 256) | 65,792 |
| | Poi_Conv2D (9) | (NoBatch, 8, 8, 256) | 65,792 |
| | Poi_Conv2D (10) | (NoBatch, 7, 7, 192) | 196,800 |
| | MultiHead Attention (3) | (NoBatch, 49, 32) | 117,344 |
| Concat | NormL (123) | (NoBatch, 7, 7, 32) | 64 |
| | Addition layer | (NoBatch, 7, 7, 32) | 0 |
| | Dense (13) | (NoBatch, 4) | 132 |
The MGMA-DSCNN can operate with a varying number of attention blocks owing to its multiscale design. The configurations in Table 1 that include only "Atn1" are labeled MGMA-1 (Model 1), those that include both "Atn1" and "Atn2" are labeled MGMA-2 (Model 2), and those that include "Atn1," "Atn2," and "Atn3" are labeled MGMA-3 (Model 3). The model uses global average pooling (GAP) in place of a fully connected layer; in addition to reducing the number of parameters and preventing overfitting during training, GAP also avoids the loss of shallow feature information. The input images are 256 × 256 pixels.
Each convolution stage increases the channel dimension so that image characteristics are preserved and information is not lost: the number of channels grows from 3 to 64 in the first block and reaches 512 in the deepest blocks. After each maximum pooling operation, the spatial dimensions are halved, since the pooling layers use a stride of 2; the feature maps therefore shrink gradually while their content is progressively compressed. The output shape of each layer is given in Table 1 in the form "(NoBatch, 256, 256, 3)," where "NoBatch" indicates that the batch size is left unspecified, the input feature map has size (256, 256), and the feature map has three channels. MGMA-DSCNN employs the categorical cross-entropy loss function to quantify the discrepancy between the predicted distribution and the true distribution; put simply, a lower cross-entropy value signifies higher similarity between the two probability distributions and quantifies the disparity between the observed and predicted outcomes in terms of probability. The categorical cross-entropy loss is given in equation (1):
L = −∑_{c=1}^{C} y_c log(ŷ_c). (1)
Here, C is the number of output classes, y_c is the ground-truth (one-hot) label for class c, and ŷ_c is the predicted probability for that class.
During training, the weights and biases are adjusted by a gradient-based optimizer, with the learning rate controlling the rate of adjustment for each bias and weight at every update step.
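As a concrete illustration of the backbone summarized in Table 1, the sketch below builds a comparable depthwise separable stack in tf.keras with GAP and a four-way softmax trained under the categorical cross-entropy loss of equation (1). It is a simplified approximation, omits the attention branches (covered in Section 3.3), and its exact parameter counts will differ from those reported in Table 1.

```python
# Minimal sketch of the depthwise separable backbone (attention branches omitted).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_backbone(input_shape=(256, 256, 3), num_classes=4):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    # Two early blocks: 2x separable conv + max pooling (stride 2 halves the size)
    for filters in (64, 128):
        x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    # Deeper blocks: 2x separable conv + 1 conventional conv + max pooling
    for filters in (256, 512, 512):
        x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    # Global average pooling instead of a fully connected layer (Section 3.2)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",   # equation (1)
                  metrics=["accuracy"])
    return model
```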
3.3. Attention-Convolution Model
The core of the attention mechanism involves redistributing weights and performing a weighted summation of "Value" entries based on "Key" and "Query." Formally, the attention procedure can be viewed as a mapping in which input queries are matched against keys to produce an output over the associated values. The weights represent the degree of similarity between "Key" and "Query," and the output is the sum of each "Value" multiplied by its corresponding weight [34]. Figure 4 illustrates the attention mechanism architecture, where the "Key" (of dimension dk) and "Value" (of dimension dv) vectors are used to compute similarity scores and weighted outputs. The similarity between "Key" and "Query" is determined by a dot product scaled by √dk, which stabilizes the magnitude of the scores before the softmax.
[figure(s) omitted; refer to PDF]
The "Value" dimension (dv) determines the size of the representation produced by the weighted summation and, together with dk, fixes the dimensionality of each attention head.
The attention function allows a neural network to focus on specific aspects of the input information. By consolidating "Value," "Key," and "Query" into matrices V, K, and Q, respectively, the attention function computes weighted connections between different data points, highlighting the most important information for accurate predictions or recommendations. For a given "Query," the similarity to every "Key" is computed, and the resulting weight coefficients are applied to the corresponding "Values." The fundamental formula for the attention function can be expressed as [35]
Attention(Q, K, V) = softmax(QKᵀ/√dk) V.
To improve the model's ability to focus on different features, multiple attention heads are used. The multihead attention function is defined as
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).
The parameters W_i^Q, W_i^K, W_i^V, and W^O are learned projection matrices that map the queries, keys, values, and concatenated head outputs into their respective subspaces, and h denotes the number of attention heads.
The final output of the attention-convolution model is obtained by fusing the outputs of the multiscale attention branches: the branch outputs are normalized, combined through the addition layer, and passed to the final dense layer for classification (see the Concat block in Table 1).
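The sketch below illustrates how one attention branch of Table 1 can be realized: a pointwise convolution compresses the channels, the 7 × 7 spatial grid is flattened into 49 tokens, and multihead scaled dot-product attention produces a (49, 32) output. It is written against modern tf.keras for readability; the head count and key dimension are illustrative assumptions rather than values taken from the paper.

```python
# Illustrative sketch of an attention branch (Atn blocks in Table 1).
import tensorflow as tf
from tensorflow.keras import layers

def attention_branch(feature_map, num_heads=4, key_dim=32, proj_channels=192):
    """feature_map: tensor of shape (batch, 7, 7, C) from the backbone."""
    # Pointwise convolution (Poi_Conv2D in Table 1) to compress channels
    x = layers.Conv2D(proj_channels, kernel_size=1, activation="relu")(feature_map)
    # Flatten the 7x7 grid into a sequence of 49 tokens
    tokens = layers.Reshape((7 * 7, proj_channels))(x)
    # Multihead self-attention: Q = K = V = tokens (scaled dot-product inside)
    attended = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim,
                                         output_shape=32)(tokens, tokens)
    return layers.LayerNormalization()(attended)   # (batch, 49, 32), as in Table 1
```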
Unlike generic attention modules that operate uniformly across all feature channels or frame sequences, our architecture introduces a task-aware refinement by embedding attention within a gated multiscale structure. This not only modulates the receptive field dynamically but also restricts the attention flow to regions exhibiting contextual or structural irregularities—traits that are often subtle yet semantically critical in tampered video frames. In particular, our design filters out irrelevant background activations and emphasizes interframe anomaly signatures such as motion gaps or inconsistent object trajectories. Compared to standard transformers or CNN-LSTM attention pipelines, this approach significantly reduces overfitting to nonforensic features and improves model robustness in unconstrained settings like surveillance feeds and compressed social media content. These design refinements yield both functional advantages (faster convergence, lower inference cost) and detection-specific improvements (higher sensitivity to localized manipulations).
4. Results
The following section presents the results of our tampered video detection model, highlighting its performance across various metrics. Both video clips and individual frames were analyzed to ensure the comprehensive detection of alterations. The evaluation metrics used, alongside detailed comparisons with other models, are discussed below to demonstrate the effectiveness of our approach.
4.1. Experimental Setup
For the validation, testing, and training of tampered video detection models, an Intel Core i5-9400F machine with a 64-bit operating system and an NVIDIA GeForce GTX 1660 GPU was used. The model was implemented in Python 3.6 using Keras 2.1.4 and TensorFlow-GPU 1.8.0 with CUDA 9.0. Key libraries such as Pandas, NumPy, Matplotlib, and scikit-learn were employed within the PyCharm 2017 integrated development environment (IDE). These tools enabled the efficient handling and analysis of video datasets for tampered video detection.
To enhance the training dataset and improve the robustness of the model in detecting video forgeries, several augmentation techniques were applied to the video frames. These techniques included rotation to simulate different camera angles, scaling to account for variations in resolution, and translation to mimic shifts in framing. Additionally, techniques such as shearing and adding noise were employed to create variability and improve the model’s ability to detect subtle alterations in tampered videos.
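A possible implementation of this augmentation pipeline, sketched with Keras' ImageDataGenerator, is shown below; the specific ranges are illustrative assumptions rather than the exact values used in training.

```python
# Frame-level augmentation sketch: rotation, scaling, translation, shearing, noise.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def add_gaussian_noise(frame, sigma=0.02):
    """Additive Gaussian noise for frames already scaled to [0, 1]."""
    noisy = frame + np.random.normal(0.0, sigma, frame.shape)
    return np.clip(noisy, 0.0, 1.0)

augmenter = ImageDataGenerator(
    rotation_range=15,          # simulate different camera angles
    zoom_range=0.1,             # scaling / resolution variation
    width_shift_range=0.1,      # horizontal framing shifts
    height_shift_range=0.1,     # vertical framing shifts
    shear_range=0.1,            # shearing distortion
    preprocessing_function=add_gaussian_noise,
)

# Usage: augmenter.flow(frames, labels, batch_size=32) yields augmented batches
# for model.fit(...), where `frames` has shape (N, 256, 256, 3).
```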
In addition to analyzing each video clip as a whole, we also classified each individual frame. This allowed us to detect tampering not only at the clip level but also for every single frame within the video. By examining the integrity of each frame, we were able to identify the tampered content with higher granularity, ensuring that even subtle alterations were detected throughout the entire video.
4.2. Dataset
In this research, the HEVC-based tampered video dataset (HTVD), specifically designed for forensic video analysis and tampered content detection, was utilized. This dataset features a wide array of both original and altered videos captured across different environments, including indoor settings, outdoor scenes, and surveillance footage. It comprises 60 original videos and 966 tampered ones, all accompanied by ground truth annotations and corresponding masks for accurate validation. The tampering techniques featured in the dataset range from frame insertion, deletion, and duplication, to object manipulations like copy-move, splicing, and inpainting.
The diversity of scenarios and varied video encoding configurations within this dataset allow researchers to rigorously evaluate their tampered video detection algorithms and benchmark them against other methodologies. As shown in Figure 5, the dataset includes a wide range of video samples recorded in diverse environments and encoding configurations, enabling robust evaluation and cross-scenario benchmarking of tampered video detection algorithms [30].
[figure(s) omitted; refer to PDF]
4.3. Evaluation
Performance metrics of the proposed method for tampered video detection are evaluated across the different models, both with and without data augmentation. The metrics include accuracy, sensitivity, specificity, F-score, and precision, and they demonstrate that Model 3 achieves the highest overall performance, particularly with augmentation, reaching an accuracy of 98.10% (see Table 2). These results highlight the robustness and efficiency of the deep learning-based approach for detecting tampered content in video frames, especially in low-resource edge computing environments.
Table 2
Comparative performance metrics of different deep learning models for tampered video detection, evaluated under conditions with and without data augmentation techniques.
| Cross-validation (CV) | Method | Without augmentation | | | | | With augmentation | | | | |
| | | Acc (%) | Sen (%) | Spec (%) | F-score (%) | Prec (%) | Acc (%) | Sen (%) | Spec (%) | F-score (%) | Prec (%) |
| Average of 10-fold CV | Model 3 | 97.32 | 96.80 | 97.84 | 96.84 | 96.29 | 98.10 | 97.95 | 99.33 | 98.90 | 97.88 |
| | Model 2 | 96.21 | 95.67 | 97.77 | 96.43 | 96.80 | 97.64 | 97.55 | 98.93 | 97.54 | 97.35 |
| | ResNet101 | 95.60 | 95.37 | 96.92 | 95.82 | 95.41 | 96.42 | 97.24 | 98.13 | 96.76 | 96.14 |
| | Model 1 | 95.62 | 95.58 | 96.57 | 95.08 | 95.66 | 95.91 | 96.48 | 97.82 | 95.77 | 97.06 |
| | DenseNet201 | 94.75 | 95.55 | 96.57 | 95.44 | 94.97 | 95.58 | 95.76 | 96.46 | 95.95 | 96.11 |
| | VGG-19 | 97.32 | 96.80 | 97.84 | 96.84 | 96.29 | 98.10 | 97.95 | 99.33 | 98.90 | 97.88 |
Model 3 consistently outperforms the other models, achieving an accuracy of 98.10% with augmentation and 97.32% without it. Sensitivity, specificity, F-score, and precision metrics also reflect this high performance, with sensitivity reaching 97.95% and specificity as high as 99.33% when augmentation is applied. This indicates that Model 3 is particularly robust in identifying tampered videos, benefiting from augmentation techniques to increase detection accuracy and precision. Notably, the F-score for Model 3, a balance between precision and recall, is the highest at 98.90%, confirming the model’s balanced performance in correctly identifying tampered instances while minimizing false positives.
Model 2 and ResNet101 follow closely behind, with Model 2 reaching an accuracy of 97.64% with augmentation and 96.21% without it. Although both models exhibit competitive performance, ResNet101 slightly trails Model 2 in several metrics, such as sensitivity and F-score, suggesting that Model 2 may handle data variations slightly better. Meanwhile, DenseNet201 and Model 1 show relatively lower accuracy, although both still exceed 95% when augmentation is applied. VGG-19, while achieving performance comparable to Model 3, likewise benefits strongly from augmentation, underlining the importance of augmentation in boosting the overall performance of deep learning models in tampered video detection.
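For reference, the sketch below shows how the reported metrics can be derived from a multiclass confusion matrix with scikit-learn (one of the libraries listed in the experimental setup); it macro-averages over classes and mirrors, but is not necessarily identical to, the authors' evaluation script.

```python
# Derive accuracy, sensitivity, specificity, precision, and F-score
# from a multiclass confusion matrix (macro-averaged over classes).
import numpy as np
from sklearn.metrics import confusion_matrix

def classification_metrics(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - (tp + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)            # recall
    specificity = tn / (tn + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "sensitivity": sensitivity.mean(),
        "specificity": specificity.mean(),
        "precision": precision.mean(),
        "f_score": f_score.mean(),
    }
```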
4.4. Different Compression Settings
The results presented in Table 3 offer a detailed comparison of the algorithm’s performance under different compression settings for H.264 and HEVC codecs. For both codecs, we observe that the highest accuracy, sensitivity, specificity, and F-score are achieved under low compression settings (CRF 18 for H.264 and GOP 12 for HEVC). Specifically, the HEVC codec with low compression exhibits an impressive accuracy of 98.01%, indicating that the algorithm performs exceptionally well when there is minimal data loss due to compression.
Table 3
Evaluation of the proposed tampered video detection algorithm under different compression levels using H.264 and HEVC codecs, showing its performance across varying encoding conditions.
| Codec | Compression setting | Acc (%) | Sen (%) | Spec (%) | F-score (%) | Prec (%) |
| H.264 | Low compression (CRF 18) | 97.11 | 96.35 | 97.39 | 96.16 | 96.62 |
| H.264 | Medium compression (CRF 24) | 95.25 | 94.88 | 95.83 | 94.82 | 94.92 |
| H.264 | High compression (CRF 30) | 92.15 | 92.14 | 93.29 | 92.14 | 92.65 |
| HEVC | Low compression (GOP 12) | 98.01 | 96.71 | 98.04 | 97.22 | 97.37 |
| HEVC | Medium compression (GOP 24) | 96.43 | 95.23 | 96.81 | 95.45 | 95.82 |
The sensitivity and specificity metrics, which reach 96.71% and 98.04%, respectively, further underscore the algorithm's effectiveness in identifying tampered content while minimizing false positives. The F-score, combining precision and recall, also remains strong at 97.22%, reflecting the algorithm's balanced ability to detect tampering without sacrificing precision. As compression increases (i.e., CRF 30 for H.264 and larger GOP sizes for HEVC), there is a noticeable decline in performance across all metrics. Under high compression settings with significant data loss, the green framework maintains higher accuracy than traditional cloud-only and edge-only models by performing localized preprocessing at the edge and deep analysis in the cloud. These findings suggest that while the algorithm is robust under less compressed conditions, higher compression may introduce artifacts that interfere with detection accuracy. The slight edge observed for HEVC over H.264, particularly at low compression settings, suggests that HEVC may be better suited for maintaining video integrity, thus facilitating more accurate detection of tampered content. This analysis highlights the importance of considering video compression settings when deploying tampered video detection systems, as they can significantly impact the reliability of the results.
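For reproducibility, compressed test variants of the kind evaluated in Table 3 can be generated with FFmpeg, as in the sketch below (FFmpeg is assumed to be available on the system path; the encoder options shown are standard CRF and GOP controls, not settings taken from the paper's pipeline).

```python
# Generate H.264 (CRF) and HEVC (GOP) variants of a source video with FFmpeg.
import subprocess

def encode_h264(src, dst, crf):
    subprocess.run(["ffmpeg", "-y", "-i", src,
                    "-c:v", "libx264", "-crf", str(crf), dst], check=True)

def encode_hevc(src, dst, gop):
    subprocess.run(["ffmpeg", "-y", "-i", src,
                    "-c:v", "libx265", "-g", str(gop), dst], check=True)

# Example: the settings evaluated in Table 3
for crf in (18, 24, 30):
    encode_h264("input.mp4", f"h264_crf{crf}.mp4", crf)
for gop in (12, 24):
    encode_hevc("input.mp4", f"hevc_gop{gop}.mp4", gop)
```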
4.5. Tampering Types and Environmental Conditions
Table 4 provides an in-depth analysis of the algorithm’s performance in detecting various tampering types and its robustness under different environmental conditions. The algorithm exhibits strong overall performance in identifying interframe tampering, particularly frame insertion, where it achieves a high accuracy of 96.50% and an F-score of 96.10%. This indicates that the algorithm effectively handles scenarios where additional frames are inserted into a video. In contrast, the performance slightly drops in the detection of frame duplication, with an accuracy of 95.60%. The slight decline can be attributed to the similarity between duplicated frames and the adjacent ones, making it more challenging for the model to detect subtle repetitions. On the other hand, in the case of intraframe tampering, such as copy-move and splicing, the algorithm shows competitive results, though with marginally lower scores than interframe tampering types. For example, the detection of copy-move tampering has an accuracy of 93.40%, reflecting the difficulty in identifying manipulated objects, especially in highly complex scenes.
Table 4
Detection performance of the proposed algorithm across different tampering types (interframe and intraframe), evaluated under varying environmental conditions including lighting, object motion, and camera angle changes.
| Tampering type | Acc (%) | Sen (%) | Spec (%) | F-score (%) | Prec (%) | Environmental condition | Acc (%) | Sen (%) | Spec (%) | F-score (%) | Prec (%) |
| Frame insertion | 96.50 | 95.90 | 97.20 | 96.10 | 96.30 | Low light | 91.80 | 90.30 | 93.10 | 91.00 | 91.50 |
| Frame deletion | 94.80 | 94.20 | 95.50 | 94.50 | 94.60 | Normal lighting | 96.30 | 95.90 | 97.00 | 96.10 | 96.20 |
| Frame duplication | 95.60 | 95.00 | 96.30 | 95.20 | 95.40 | High light | 95.70 | 95.10 | 96.80 | 95.50 | 95.60 |
| Copy–move (object) | 93.40 | 93.00 | 94.80 | 93.50 | 93.80 | Fast movement | 89.20 | 88.70 | 90.90 | 89.40 | 89.60 |
| Splicing (object) | 94.90 | 94.50 | 95.70 | 94.70 | 94.80 | Slow movement | 94.80 | 94.20 | 95.50 | 94.50 | 94.70 |
| Inpainting (object) | 95.20 | 94.80 | 96.10 | 95.00 | 95.10 | No movement | 97.40 | 96.90 | 98.20 | 97.00 | 97.10 |
| Overall performance | 95.30 | 94.80 | 96.00 | 95.00 | 95.20 | Various camera angles | 94.10 | 93.60 | 95.20 | 94.20 | 94.40 |
Environmental factors also play a significant role in the algorithm’s detection capabilities. As seen in the second half of the table, the model faces challenges when confronted with low-light conditions, where the accuracy drops to 91.80%. This suggests that visibility issues impact the algorithm’s ability to detect fine-grained tampering details, especially in darker scenes. However, in normal lighting conditions, the accuracy recovers to 96.30%, indicating that the algorithm is well optimized for standard video settings. Additionally, fast movement in videos poses another challenge, with accuracy dropping to 89.20%. This is likely due to the blurring and motion artifacts that occur during fast motion, which can obscure tampering traces. Conversely, in no movement scenarios, the algorithm excels, achieving an accuracy of 97.40%, proving its effectiveness in detecting tampering when there is minimal dynamic interference. These results not only highlight the strengths of the algorithm in handling various tampering types but also underscore the need for further refinement in challenging environmental conditions, such as low light and fast motion.
5. Discussion
This study proposed a novel deep learning-based approach for detecting video tampering, leveraging advanced feature extraction techniques and hybrid architectures to enhance detection accuracy. The results demonstrated that the proposed model effectively identifies manipulated frames, outperforming conventional methods in both detection precision and computational efficiency. By integrating deep convolutional networks with adaptive learning techniques, the model exhibited superior robustness against various tampering techniques, including frame insertion, deletion, duplication, and object-based forgery. The experimental results across multiple benchmark datasets validated the model’s capability to localize forged regions with high precision while maintaining computational feasibility for real-world applications.
Notably, from a theoretical perspective, forged videos often contain subtle and localized inconsistencies that occur across multiple spatial and temporal resolutions—for example, unnatural object boundaries, texture mismatches, or flickering artifacts between frames. The multiscale gating mechanism in MGMA-DSCNN allows the model to dynamically adjust the contribution of features from different receptive field sizes, which is crucial for capturing both fine-grained forgeries (e.g., object insertions) and large-scale manipulations (e.g., background swaps).
Furthermore, the multihead attention module enhances the model’s ability to model long-range dependencies and contextual relationships within and across frames. This is particularly valuable in video forgery detection, where tampered regions may be visually consistent but semantically incoherent when viewed in broader spatial–temporal context. Unlike single-head attention or plain convolution, multihead attention can simultaneously focus on multiple cues—such as motion discontinuities, interframe inconsistency, and global context—making it more robust against complex manipulations and compression artifacts.
Indeed, it is important to highlight that, while multihead attention mechanisms are widely used across visual and sequential tasks, MGMA-DSCNN introduces a task-specific enhancement tailored to video forgery detection. Instead of applying a fixed attention structure, our model integrates multiscale gated attention blocks that selectively activate different spatial channels based on hierarchical visual features indicative of tampering. This design enables the model to better capture multiresolution artifacts—a common characteristic of forged videos—by dynamically weighting contextual clues across frames. Moreover, the attention heads are conditioned not only on frame-wise features but also on temporal cross-frame inconsistency metrics, allowing the network to focus on semantically implausible transitions. These optimizations distinguish our architecture from conventional attention modules by aligning the attention focus with tampering-aware feature priors, thus enhancing both interpretability and detection robustness.
A comparative analysis with existing state-of-the-art methods revealed that the proposed approach achieved notable improvements in terms of accuracy and generalizability. Traditional approaches often relied on handcrafted features and rule-based techniques, which limited their effectiveness when faced with complex or previously unseen tampering techniques. In contrast, the proposed deep learning model autonomously learned relevant tampering patterns, enabling it to detect subtle inconsistencies that conventional methods often overlooked. The model’s integration with transformer-based architectures further improved its ability to analyze temporal dependencies, thereby enhancing detection reliability in highly compressed or low-resolution videos.
Additionally, the study demonstrated the scalability of the proposed method by testing it under various encoding conditions and levels of compression. Unlike many existing approaches that degrade in performance when confronted with video compression artifacts, the proposed model maintained consistent accuracy. The findings highlight the potential of deep learning-driven video forensics in addressing real-world challenges, such as fake news detection, forensic investigations, and digital content authentication. However, while the model exhibited substantial improvements, further enhancements can be pursued to refine its adaptability to adversarial manipulations and emerging tampering techniques.
5.1. Edge Computing
The edge computing approach, which offloads part of the computation to devices near the user or the network edge, significantly reduces processing time to 60 units and energy consumption to 50 units (in the relative units compared in Figure 6). Finally, the green framework, leveraging deep learning and low-resource edge computing, offers the best performance, reducing processing time to 40 units and limiting energy consumption to 30 units. These optimizations demonstrate that the green framework is a highly efficient solution for tampered video detection, significantly lowering computational complexity and increasing energy efficiency in resource-constrained environments.
Figure 6 presents a comparative analysis of processing times across different stages of video analysis using three deployment strategies—traditional cloud, edge computing, and the green framework—highlighting the efficiency gains achieved through edge-level and energy-aware optimization. The traditional cloud approach exhibits the highest processing times in all stages, especially in the tampering detection phase, where it takes a substantial 60 units of time. This delay occurs because the cloud approach requires transferring large volumes of video data to a central server, which not only increases data transmission time but also leads to higher server-side computational loads. As a result, the latency in identifying tampered content is considerably high. Similarly, the feature analysis phase in the traditional cloud approach consumes 50 units of time, highlighting the inefficiencies of centralizing the entire computational process. Moreover, the green framework outperforms both the traditional cloud and edge computing approaches across all stages, particularly in frame processing and tampering detection. In these stages, the green framework reduces processing time to 25 and 30 units, respectively, underscoring its efficiency in leveraging edge devices for localized processing [30, 36]. This method minimizes the data transmitted to central servers, thus significantly lowering computational time and energy consumption. While edge computing also improves processing times by handling tasks closer to the source, it falls slightly short of the green framework’s optimization. The comprehensive comparison in Figure 6 shows that the green framework is the most efficient solution for video tampering detection, ensuring faster and more resource-efficient processing across all stages.
[figure(s) omitted; refer to PDF]
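As a conceptual illustration of this edge-cloud split, the sketch below uses hypothetical interfaces (edge_model and cloud_client are placeholders, not components defined in the paper) to show how lightweight screening at the edge can gate which clips are forwarded to the cloud for full analysis.

```python
# Conceptual sketch of an edge-cloud task split (hypothetical interfaces).
def process_clip(clip_frames, edge_model, cloud_client, threshold=0.5):
    # Stage 1 (edge): cheap per-frame screening with a lightweight model
    edge_scores = [edge_model.score_frame(f) for f in clip_frames]
    if max(edge_scores) < threshold:
        return {"tampered": False, "stage": "edge"}     # nothing suspicious, stop early
    # Stage 2 (cloud): forward only the flagged clip for full MGMA-DSCNN inference
    verdict = cloud_client.analyze(clip_frames)
    return {"tampered": verdict.is_tampered, "stage": "cloud"}
```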
5.2. Ablation Study
An ablation study is conducted to evaluate the contribution of different components of the proposed attention-convolution model. The study systematically removes or modifies key elements of the architecture to analyze their individual impact on overall performance. By isolating the effect of each component, we can determine their significance in improving accuracy and efficiency. The evaluation is performed using a benchmark dataset, and results are compared in terms of accuracy, computational cost, and robustness.
The ablation experiments focus on the following configurations:
1. Baseline CNN Model (No Attention): A standard CNN without the attention mechanism.
2. Single-Head Attention: The model with a single attention head to analyze the impact of limited attention representation.
3. Multihead Attention: Incorporating multiple attention heads to leverage richer feature interactions.
4. Attention-Convolution Fusion: The full model with attention integrated into convolutional layers.
The performance of these configurations is summarized in Table 5, which presents accuracy and computational efficiency metrics across different settings.
Table 5
Ablation study results showing the impact of individual components—such as attention mechanisms and convolutional blocks—on the overall performance of the proposed model.
| Model configuration | Accuracy (%) | Inference time (ms) | Parameter count (M) | FLOPs (G) | Memory usage (MB) | Training time (hrs) |
| Baseline CNN (no attention) | 85.2 | 12.4 | 5.2 | 3.4 | 256 | 5.2 |
| Single-head attention | 88.9 | 14.7 | 5.8 | 4.1 | 298 | 6.1 |
| Multihead attention | 91.5 | 16.2 | 6.3 | 4.9 | 335 | 7.3 |
| Attention-convolution fusion | 94.1 | 17.8 | 6.9 | 5.4 | 378 | 8.5 |
Note: The bold values represent the results obtained using the proposed method.
The results indicate that the Baseline CNN Model achieves an accuracy of 85.2%, serving as a fundamental benchmark. Introducing single-head attention improves performance to 88.9%, demonstrating the effectiveness of attention mechanisms in capturing feature dependencies. However, incorporating multihead attention further enhances accuracy to 91.5%, indicating that multiple attention heads help extract richer information from input data. Finally, the attention-convolution fusion model achieves the highest accuracy of 94.1%, showcasing the benefit of integrating attention mechanisms within convolutional layers.
In terms of inference time, adding attention mechanisms slightly increases computational cost, with the attention-convolution fusion model having the highest inference time at 17.8 ms. However, the trade-off is justified by the significant improvement in accuracy. The floating-point operations (FLOPs) increased from 3.4G in the baseline model to 5.4G in the final model, reflecting the added complexity due to attention-based feature refinements. Memory usage also increased, with the most complex model requiring 378 MB compared to 256 MB in the baseline.
Moreover, training time varied significantly, with the baseline model completing in 5.2 h, while the full attention-convolution model required 8.5 h due to the added depth and computational complexity. This suggests that while attention mechanisms provide significant performance gains, they also demand more computational resources, which should be considered in real-world deployment scenarios.
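The ablation variants can be expressed as a single parameterized builder, as in the simplified sketch below; layer sizes and the fusion strategy shown here are illustrative and do not reproduce the exact configurations behind Table 5.

```python
# Parameterized builder for the ablation variants: baseline CNN, single-head
# attention, multihead attention, and attention-convolution fusion.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_ablation_variant(variant="fusion", input_shape=(256, 256, 3), num_classes=4):
    inputs = layers.Input(shape=input_shape)
    x = layers.SeparableConv2D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D(4)(x)
    x = layers.SeparableConv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(4)(x)                      # feature map of shape (16, 16, 128)
    if variant != "baseline":
        heads = 1 if variant == "single_head" else 4
        tokens = layers.Reshape((16 * 16, 128))(x)
        attended = layers.MultiHeadAttention(num_heads=heads, key_dim=32)(tokens, tokens)
        attended = layers.Reshape((16, 16, 128))(attended)
        # "Fusion" keeps the convolutional path and adds the attention output to it
        x = layers.Add()([x, attended]) if variant == "fusion" else attended
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```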
5.3. Robustness
A critical challenge in video tampering detection lies in ensuring robustness against increasingly sophisticated manipulation techniques. Figure 7 provides a comparative evaluation of the model’s accuracy across different tampering techniques—including frame insertion/deletion, deepfakes, splicing, and adversarial attacks—tested under varying real-world conditions such as low light, high compression, and real-time constraints. The results indicate that frame insertion/deletion achieves the highest accuracy across all scenarios, with a detection rate above 85%, demonstrating the model’s strong ability to identify frame-based manipulations. However, its performance slightly declines under high compression (85%), suggesting that compression artifacts introduce subtle inconsistencies that may affect detection precision. Meanwhile, splicing detection maintains relatively stable accuracy across conditions, ranging from 78% to 82%, indicating that the model can still recognize localized modifications even in challenging environments.
[figure(s) omitted; refer to PDF]
On the other hand, deepfake and adversarial attacks exhibit the most significant accuracy degradation across scenarios. The model struggles the most in low light conditions, where deepfake detection drops to 72%, and adversarial attack detection plummets to 65%, implying that poor illumination masks fine-grained manipulation cues, making them harder to detect. Similarly, under high compression, both techniques suffer from noticeable accuracy reductions (deepfake: 75%, adversarial: 68%), highlighting the impact of compression-induced distortions on critical facial or object-based features. The real-time processing scenario shows a slight accuracy drop compared to offline processing. This is expected due to latency constraints and limited computational resources. However, performance remains acceptable for real-world use. These results suggest that enhancing feature extraction methods, incorporating hybrid CNN-Transformer models, and leveraging adversarial training could significantly improve the model’s resistance to these sophisticated attacks in diverse environmental conditions.
Although the model performs effectively in controlled benchmark datasets, its real-world applicability remains a concern due to computational and network constraints. Figure 8 illustrates the model’s performance under three operational conditions—offline processing, real-time processing, and unstable network—evaluated across key factors including latency impact, computational efficiency, and network stability. The results reveal that offline processing provides the highest accuracy across all factors, maintaining a peak performance of 95% in low-latency environments, demonstrating that the model functions optimally when computational resources are unrestricted. The accuracy slightly decreases to 90% under computational efficiency constraints and further drops to 88% when network stability is factored in, indicating that even in offline settings, model performance can be influenced by data availability and processing overhead.
[figure(s) omitted; refer to PDF]
In contrast, real-time processing conditions introduce notable accuracy reductions, particularly in the latency impact category (85%), as the model struggles to maintain peak performance under real-time constraints. Computational efficiency further limits accuracy to 78%, highlighting the trade-offs between detection speed and resource consumption. The biggest impact is observed in network stability (70%), where network-related fluctuations and data packet loss degrade performance significantly. The unstable network scenario is the most challenging. Network interruptions and bandwidth fluctuations delay cloud-based analysis and reduce responsiveness. Efficient task distribution between edge and cloud is therefore critical in such conditions.
These findings underscore the importance of optimizing the model for adaptive real-time inference, incorporating low-latency architectures, edge computing deployment, and bandwidth-aware deep learning models to mitigate performance degradation in practical applications. Future improvements should focus on lightweight neural networks and dynamic inference strategies to ensure more robust and reliable detection capabilities across diverse deployment environments.
5.4. Real-World Deployment Considerations
Although the proposed model demonstrates strong performance in controlled experimental settings, real-world deployment introduces several practical challenges. These include dealing with video compression artifacts, meeting real-time processing constraints on edge hardware, and maintaining generalization in uncontrolled environments. Tables 6, 7, 8 present detailed comparisons and observed limitations across these scenarios.
Table 6
Model performance under various H.264 compression settings.
| Compression level | PSNR (dB) | Detection accuracy (%) | Common issues observed | Mitigation strategies |
| High quality | > 40 | 95.2 | Minor blur | Not required |
| Medium quality | 30–40 | 89.5 | Blurred motion, texture smoothing | Adaptive preprocessing, fine-tuning |
| Low quality | < 30 | 76.8 | Block artifacts, loss of detail | Multiscale feature fusion, retraining |
Note: Lower PSNR corresponds to higher compression levels and more visual degradation.
Table 7
Inference speed and memory consumption across various hardware platforms for the proposed model.
| Device | CPU/GPU specs | FPS (frames/sec) | RAM usage (MB) | Deployment feasibility |
| Jetson Nano | Quad-core ARM + 128-CUDA | 6.5 | 710 | Feasible with frame skipping |
| Raspberry Pi 4 | Quad-core Cortex-A72 | 2.1 | 680 | Not suitable for real-time detection |
| Desktop (ref) | i7 + RTX 3060 | 22.3 | 790 | Real time with full resolution |
Table 8
Accuracy and failure analysis of the proposed model under uncontrolled environmental variations.
| Environment variation | Example condition | Accuracy (%) | Observed failure cases | Suggested enhancements |
| Low lighting | 30% brightness reduction | 82.1 | Missed boundary artifacts | Contrast normalization, attention mechanism |
| Fast motion | Synthetic object speedup | 80.4 | Motion blur led to temporal mismatch | Motion compensation layers |
| Camera shake | Random frame shifts (5px) | 78.9 | Spatial misalignment | Frame stabilization preprocessing |
| Background noise | Gaussian + salt-and-pepper noise | 84.3 | False positives in noisy regions | Noise-aware loss function |
Compressed or low-quality videos are common in real-world applications such as surveillance systems, social media, or mobile networks. Our experiments revealed that compression artifacts often obscure tampering cues, particularly at low bitrates, affecting both frame-level consistency and motion features.
Table 6 summarizes the model’s performance under varying levels of compression using H.264 encoding at different quality presets. The data in Table 6 reveal a clear inverse correlation between video quality (measured via PSNR) and detection accuracy. At high quality (PSNR > 40 dB), the model maintains excellent performance (95.2%), indicating that under ideal visual conditions, spatial–temporal cues used for forgery detection remain intact. However, under medium compression (PSNR 30–40), a notable drop to 89.5% is observed.
Here, mild blurring and texture loss begin to erode the model's ability to detect fine-grained tampering artifacts. At low quality (PSNR < 30), accuracy declines significantly to 76.8%, driven by block artifacts and loss of motion sharpness, which are typical results of aggressive H.264 compression. These findings emphasize the vulnerability of forgery detection systems to real-world video encoding. Therefore, implementing adaptive preprocessing (e.g., artifact reduction filters), multiresolution inputs, or compression-aware training could help mitigate these performance losses.
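For clarity, the PSNR values used to bin the compression levels in Table 6 follow the standard definition for 8-bit frames, as in this short sketch computed between an original frame and its re-encoded counterpart.

```python
# PSNR (in dB) between two 8-bit frames of identical shape.
import numpy as np

def psnr(original, compressed):
    mse = np.mean((original.astype(np.float64) - compressed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical frames
    return 10.0 * np.log10((255.0 ** 2) / mse)
```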
Real-time tampering detection is essential for deployment in edge-based surveillance systems, UAVs, or mobile platforms. We evaluated the model’s inference latency and memory consumption on two popular edge devices. Table 7 outlines the average frame processing time and memory usage under different hardware settings. Table 7 provides valuable insight into the trade-offs between detection speed and deployment hardware. On a high-performance desktop system, the model achieves real-time inference at 22.3 FPS, confirming its suitability for live monitoring in centralized settings. However, when ported to edge platforms, the picture changes dramatically. The Jetson Nano—while capable—manages only 6.5 FPS, suggesting real-time viability only if frame skipping or resolution downscaling is employed. In contrast, the Raspberry Pi 4 delivers a sub-real-time speed of 2.1 FPS, highlighting the model’s limitations on less capable hardware.
This suggests that while the core architecture is relatively lightweight, further optimizations such as model pruning, quantization, or conversion to TensorRT would be necessary for practical edge deployment. These results underscore the importance of tailoring deep learning architectures not only for accuracy but also for computational efficiency in edge-based use cases.
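One of the optimizations mentioned above, post-training quantization, can be applied with the TensorFlow Lite converter as in the sketch below; this is a generic recipe rather than the deployment path used in the paper, with the Keras model standing in for the trained detector.

```python
# Post-training (dynamic-range) quantization with TensorFlow Lite for edge deployment.
import tensorflow as tf

def quantize_for_edge(keras_model, output_path="detector_int8.tflite"):
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enable dynamic-range quantization
    tflite_model = converter.convert()
    with open(output_path, "wb") as f:
        f.write(tflite_model)
    return output_path
```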
In real-world deployments, tampering often occurs in unpredictable scenes involving varied lighting, object motion, and camera jitter. We tested the model’s robustness under synthetic noise and environmental variation using augmentation techniques. Thus, Table 8 presents accuracy degradation under these different test conditions.
The results in Table 8 expose the challenges faced when deploying forgery detection models in uncontrolled, noisy environments. Accuracy drops across all perturbation types—especially under camera shake (78.9%) and fast motion (80.4%)—reveal the model’s sensitivity to spatial–temporal inconsistencies. Such distortions disrupt feature alignment between frames, leading to missed detections or temporal misclassifications. Interestingly, low lighting and background noise result in slightly better performance (82%–84%), yet still indicate reduced robustness compared to clean conditions. These findings suggest that conventional CNN-based feature extractors may lack resilience to scene variability. Incorporating robust motion compensation modules, training with data augmentations simulating environmental fluctuations, or leveraging transformer-based attention to model long-range dependencies could improve generalization. Overall, the model shows promise but must be hardened against real-world video volatility for deployment in open environments such as street surveillance or bodycam footage.
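The synthetic perturbations listed in Table 8 can be reproduced with simple NumPy operations, as in the sketch below; the parameter values mirror the table's conditions, while the exact augmentation code used in the study is not specified.

```python
# Synthetic perturbations for robustness testing: brightness reduction,
# salt-and-pepper noise, and small random frame shifts (camera shake).
import numpy as np

def reduce_brightness(frame, factor=0.7):            # "30% brightness reduction"
    return np.clip(frame * factor, 0, 255).astype(np.uint8)

def add_salt_and_pepper(frame, amount=0.01):
    noisy = frame.copy()
    mask = np.random.rand(*frame.shape[:2])
    noisy[mask < amount / 2] = 0                      # pepper
    noisy[mask > 1 - amount / 2] = 255                # salt
    return noisy

def camera_shake(frame, max_shift=5):                 # "random frame shifts (5px)"
    dx, dy = np.random.randint(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(frame, dy, axis=0), dx, axis=1)
```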
Beyond technical performance, deploying forgery detection models in mission-critical environments—such as legal forensics, surveillance, or content moderation—requires trust in the system’s outputs. This includes not only consistent detection accuracy but also explainability and auditability of model decisions. Our model’s modular design and spatial–temporal feature tracking allow partial interpretability, but future work should integrate explainable AI modules to support legal admissibility. In terms of scalability, processing long-duration videos or multiple concurrent streams poses a computational burden. Batch processing, temporal sampling, and lightweight ensemble techniques can help address these challenges. Additionally, the security of model integrity—to prevent adversarial manipulation of the detection system itself—is a critical area for continued research.
To evaluate the cross-dataset generalization capability of MGMA-DSCNN, we tested the model on the FaceForensics++ benchmark using the c23 quality preset. As shown in Table 9, the model achieved an average accuracy of 91.5% across four different forgery types. Notably, detection accuracy was highest for DeepFakes and FaceSwap (above 92%), where spatial inconsistencies are more pronounced. Slight performance drops were observed in Face2Face and NeuralTextures, primarily due to subtle expression manipulation and high-frequency texture blending under compression. These results confirm that the model’s multiscale attention mechanism and spatial–temporal fusion contribute to robust performance on external datasets, even without retraining. Nonetheless, further enhancement using compression-aware loss functions and adversarial fine-tuning could improve generalization under extreme conditions.
Table 9
Cross-dataset generalization performance of MGMA-DSCNN on the FaceForensics++ benchmark (c23 compression level).
| Forgery method | Accuracy (%) | Precision (%) | Recall (%) | Observed challenges | Suggested improvements |
| DeepFakes | 94.8 | 93.6 | 95.7 | Minor lip-boundary artifacts missed | Higher-resolution input preprocessing |
| FaceSwap | 92.3 | 91.2 | 93.5 | Misalignment in lateral face poses | Pose-invariant feature fusion |
| Face2Face | 90.5 | 89.1 | 91.6 | Subtle expression mismatch under low light | Expression modeling refinement |
| NeuralTextures | 88.4 | 87.6 | 89.3 | High-frequency details lost in compression | Compression-aware loss and fine-tuning |
| Average | 91.5 | 90.4 | 92.5 | — | — |
Note: All values are results of the proposed MGMA-DSCNN evaluated on FaceForensics++ (c23) without retraining or fine-tuning.
Moreover, to rigorously evaluate the generalization capability of the proposed MGMA-DSCNN model beyond its training domain, we conducted an external validation using the FaceForensics++ dataset, which was exclusively used as a test set without any retraining or fine-tuning. FaceForensics++ is a widely recognized benchmark for face manipulation detection, containing over 1000 authentic videos and several forgery variants—including DeepFakes, FaceSwap, Face2Face, and NeuralTextures—generated using state-of-the-art synthesis pipelines. Each manipulated video is available in multiple compression levels, simulating real-world degradation typical in social media platforms. For this evaluation, we used the c23 preset, which applies medium H.264 compression and offers a challenging yet practical testbed for generalization.
As shown in Table 9, MGMA-DSCNN achieved an average detection accuracy of 91.5% across all four forgery categories, indicating strong cross-domain resilience. The model performed best on DeepFakes (94.8%) and FaceSwap (92.3%), where tampered facial structures introduced more detectable spatial inconsistencies. Detection rates were slightly lower for Face2Face (90.5%) and NeuralTextures (88.4%), largely due to subtle expression manipulation and loss of texture detail under compression—highlighting areas where more fine-grained or compression-aware features could enhance robustness. Notably, these results were obtained without any adaptation to the FaceForensics++ domain, underscoring the model’s strong intrinsic generalization capacity and its adaptability to diverse manipulation artifacts and encoding conditions. These findings reinforce MGMA-DSCNN’s suitability for real-world deployment, where manipulated content often differs significantly from training distributions.
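The protocol behind Table 9 is conceptually simple: the trained model is frozen and applied to FaceForensics++ (c23) clips without any gradient updates, while per-forgery-type metrics are accumulated. The sketch below outlines such a protocol; `load_model` and `ff_c23_loader` are hypothetical handles for the checkpoint and data pipeline, not actual artifacts of this work.

```python
# Sketch of the cross-dataset protocol: frozen model, no fine-tuning,
# per-forgery-type accuracy. `load_model` and `ff_c23_loader` are hypothetical.
import torch
from collections import defaultdict

@torch.no_grad()
def evaluate_cross_dataset(model, loader, device="cpu"):
    model.eval().to(device)
    correct, total = defaultdict(int), defaultdict(int)
    for clips, labels, forgery_types in loader:   # labels: 1 = tampered, 0 = real
        preds = model(clips.to(device)).argmax(dim=1).cpu()
        for pred, label, ftype in zip(preds, labels, forgery_types):
            total[ftype] += 1
            correct[ftype] += int(pred == label)
    return {ftype: correct[ftype] / total[ftype] for ftype in total}

# accuracies = evaluate_cross_dataset(load_model("mgma_dscnn.pt"), ff_c23_loader)
```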
5.5. Comparisons and Limitations
A comparative analysis of the proposed model against existing state-of-the-art video tampering detection methods highlights its notable improvements in accuracy and efficiency. The integration of attention-based convolutional mechanisms has enhanced feature extraction, allowing the model to outperform conventional CNN-based methods in detecting frame insertion, splicing, and deepfake manipulations. Compared to transformer-based architectures, the proposed approach demonstrates competitive accuracy while maintaining lower computational overhead, making it more suitable for deployment in resource-constrained environments like edge and cloud computing. However, unlike some hybrid transformer-CNN models, which achieve superior generalization across different forgery types, the current model exhibits slight performance degradation when applied to unseen tampering methods, emphasizing the need for improved adaptability.
The comparison in Table 10 illustrates the performance variations among different video tampering detection methods when applied to HTVD, a dataset containing a diverse range of manipulated video samples. The proposed MGMA-DSCNN method consistently outperforms all other approaches, maintaining 98.1% accuracy in both reported settings (its original training data and HTVD). This suggests that the model generalizes well across different types of video forgeries and is highly adaptable to real-world applications. Johnston et al. [1] and Zhao et al. [24] achieved strong performance on their respective datasets (94.5% and 95.6%), but when tested on HTVD, their accuracy dropped to 92.1% and 94.3%, respectively. This highlights potential dataset bias in their models, which may be optimized for specific tampering techniques but lack adaptability when exposed to new video manipulations. Similarly, Singla et al. [30], whose method was originally trained on HTVD, experienced only a minor accuracy drop from 92.4% to 90.8%, indicating stable performance but limited cross-dataset generalization.
Table 10
Comparative analysis of the existing video tampering detection methods, presenting their performance on original benchmark datasets versus their accuracy on the proposed HTVD dataset.
| Method | Dataset used | Accuracy on original dataset (%) | Accuracy on our dataset (HTVD) (%) | Advantages | Disadvantages |
| Johnston et al. [1] | Three datasets | 94.5 | 92.1 | High accuracy on multiple datasets; strong generalization | Accuracy drops on unseen datasets; computationally intensive |
| Zhao et al. [24] | High-frequency video dataset | 95.6 | 94.3 | Effective in detecting high-frequency feature manipulations | Struggles with adversarial attacks; requires high-quality input |
| Singla et al. [30] | HTVD | 92.4 | 90.8 | Optimized for HTVD dataset; stable performance | Not tested on external datasets; limited adaptability |
| Eltoukhy et al. [26] | SULFA, GRIP, VTD | 85.4 | 86.2 | Performs well on compressed videos; robust to some tampering types | Lower accuracy compared to deep learning-based models |
| Proposed method (MGMA-DSCNN) | HTVD | 98.1 | 98.1 | Highest accuracy; strong generalization; efficient in edge environments | Computationally demanding for real-time inference; potential latency issues |
Note: For each method, the two accuracy columns contrast performance on its original benchmark dataset with performance on the HTVD dataset employed in this research.
One of the most notable trends observed in Table 10 is the relatively lower accuracy of Eltoukhy et al. [26] (85.4% on their dataset and 86.2% on HTVD), despite its robustness in handling compressed video forgeries. This suggests that while their model is effective in detecting compression-related manipulations, it struggles with more complex forgery techniques such as deepfake-based manipulations and adversarial attacks. In contrast, MGMA-DSCNN demonstrates superior performance across all metrics, with strong generalization capabilities and adaptability to multiple tampering methods. However, its primary limitation lies in its computational cost, making it potentially less efficient in real-time, resource-constrained environments. These results underscore the importance of balancing detection accuracy with computational efficiency, and they suggest that future research should focus on optimizing high-performing models for edge deployment without sacrificing accuracy.
Despite its strengths, certain limitations remain that must be addressed to further enhance the model’s robustness and practicality. One major challenge is the sensitivity of the model to video compression and network conditions, which can lead to degraded accuracy when dealing with low-quality or highly compressed videos. Compression artifacts introduce distortions that obscure critical tampering cues, making it difficult for the model to distinguish between authentic and manipulated frames. Similarly, unstable network conditions introduce latency and data loss, affecting real-time performance and limiting its reliability in mission-critical applications such as forensic investigations and real-time surveillance. A potential solution lies in advanced preprocessing and adaptive feature learning strategies, such as multiscale detail enhancement and semantic-weighted feature fusion. These techniques have proven effective in maintaining detection accuracy under compression conditions and improving resilience to video quality degradation [37].
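As one possible form of the multiscale detail enhancement mentioned above, the sketch below boosts Gaussian-blur residuals at several scales before analysis; it is a generic variant with illustrative gains, not the specific technique of [37].

```python
# Sketch: multiscale detail enhancement via Gaussian-blur residuals.
# Scales and gains are illustrative; this is a generic variant, not [37].
import cv2
import numpy as np

def multiscale_enhance(frame_bgr: np.ndarray,
                       sigmas=(1.0, 2.0, 4.0),
                       gains=(0.5, 0.5, 0.25)) -> np.ndarray:
    """Boost fine-to-coarse detail layers to counteract compression blur."""
    img = frame_bgr.astype(np.float32)
    enhanced = img.copy()
    for sigma, gain in zip(sigmas, gains):
        blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=sigma)
        enhanced += gain * (img - blurred)   # add back detail lost at this scale
    return np.clip(enhanced, 0, 255).astype(np.uint8)
```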
Another key limitation is the lack of extensive robustness testing against emerging adversarial attack techniques, which exploit weaknesses in the model’s feature extraction and classification pipeline. Adversarial attacks can subtly alter video content in a way that remains undetectable to traditional models, posing a significant security risk. Additionally, the model, while optimized for real-time processing, still faces latency issues under resource-constrained settings, making deployment in high-speed, real-world applications a challenge. To mitigate these issues, future work should focus on refining adaptation mechanisms for unseen forgeries and implementing time-aware learning techniques [38, 39]. Incorporating similar time-sensitive neural architectures into video tampering detection could improve adaptability, allowing models to evolve with emerging forgery techniques. Ultimately, optimizing computational efficiency, integrating adversarial defense mechanisms, and developing adaptable learning frameworks are crucial for enhancing real-world deployment.
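A lightweight way to begin the robustness testing called for here is a fast gradient sign method (FGSM) probe that measures how much a small, bounded perturbation degrades detection accuracy; the sketch below is a generic diagnostic with an illustrative epsilon, not a defense mechanism and not part of the evaluated system.

```python
# Sketch: FGSM probe to measure how much a small, near-imperceptible
# perturbation degrades detector accuracy. Epsilon is illustrative.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, frames, labels, epsilon=2 / 255):
    """Return adversarially perturbed frames under an L-infinity budget."""
    frames = frames.clone().requires_grad_(True)
    loss = F.cross_entropy(model(frames), labels)
    loss.backward()
    with torch.no_grad():
        adversarial = frames + epsilon * frames.grad.sign()
    return adversarial.clamp(0, 1).detach()
```

Comparing accuracy on clean versus perturbed frames gives a first-order estimate of adversarial sensitivity before investing in stronger attack suites or dedicated defenses.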
6. Conclusion
This study has presented a lightweight and effective deep learning framework for video forgery detection, specifically optimized for deployment in resource-constrained environments. The proposed model demonstrated strong capabilities in capturing both spatial and temporal tampering cues while maintaining computational efficiency suitable for edge and real-time applications. Extensive evaluations across multiple benchmark datasets have validated the robustness of the approach, particularly under adverse conditions such as compression artifacts, environmental variability, and network limitations. Compared to conventional CNN-based baselines, the proposed framework consistently achieved higher detection accuracy and faster inference. Nevertheless, several limitations persist. The model exhibits reduced accuracy under extreme visual degradation and remains vulnerable to dynamic scene inconsistencies. To address these challenges, future work will explore the incorporation of transformer-based architectures and attention mechanisms, which may enhance feature representation across longer temporal windows. Additionally, optimization techniques such as quantization and pruning should be employed to further reduce inference latency on low-power edge devices. Looking forward, the integration of decentralized learning paradigms, such as federated learning, offers a promising direction for enabling privacy-preserving and scalable tampering detection across distributed networks. This line of advancement positions the proposed framework as a foundational step toward practical, trustworthy, and adaptive video forensic systems for real-world multimedia security.
Consent
No consent is required.
Disclosure
The funder had no impact on the results or outcomes of the study in any form.
Author Contributions
Yuwen Shao: formal analysis and writing – original draft.
Qiuling Wang: validation and writing – original draft.
Junsong Zhang: writing – review and editing, and supervision.
Haiying Tian: writing – review and editing, and supervision.
Yong Zhang: conceptualization and supervision.
Funding
The study was supported by China Tobacco Henan Industrial Co., Ltd. Postdoctoral Project, Project Number: BHW202405, Scheme No: Yu Yan Technology [2024] Number 4.
Acknowledgments
The authors have nothing to report.
[1] P. Johnston, E. Elyan, C. Jayne, "Video Tampering Localisation Using Features Learned from Authentic Content," Neural Computing & Applications, vol. 32 no. 16, pp. 12243-12257, DOI: 10.1007/s00521-019-04272-z, 2020.
[2] T. D. Nguyen, S. Fang, M. C. Stamm, "VideoFACT: Detecting Video Forgeries Using Attention, Scene Context, and Forensic Traces," Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
[3] N. A. Shelke, S. S. Kasana, "A Comprehensive Survey on Passive Techniques for Digital Video Forgery Detection," Multimedia Tools and Applications, vol. 80 no. 4, pp. 6247-6310, DOI: 10.1007/s11042-020-09974-4, 2021.
[4] S. Tan, B. Chen, J. Zeng, B. Li, J. Huang, "Hybrid Deep-learning Framework for Object-based Forgery Detection in Video," Signal Processing: Image Communication, vol. 105, DOI: 10.1016/j.image.2022.116695, 2022.
[5] P. Sambandam Raju, R. Arumugam Rajendran, M. Mahalingam, "Aka-Net: Anchor Free-based Object Detection Network for Surveillance Video Transmission in the IoT Edge Computing Environment," Pattern Analysis & Applications, vol. 27 no. 2, DOI: 10.1007/s10044-024-01272-1, 2024.
[6] X. Zhou, X. Xu, W. Liang, Z. Zeng, Z. Yan, "Deep-Learning-Enhanced Multitarget Detection for End–Edge–Cloud Surveillance in Smart IoT," IEEE Internet of Things Journal, vol. 8 no. 16, pp. 12588-12596, DOI: 10.1109/jiot.2021.3077449, 2021.
[7] Y. Wu, H. Guo, C. Chakraborty, M. R. Khosravi, S. Berretti, S. Wan, "Edge Computing Driven low-light Image Dynamic Enhancement for Object Detection," IEEE Transactions on Network Science and Engineering, vol. 10 no. 5, pp. 3086-3098, DOI: 10.1109/tnse.2022.3151502, 2023.
[8] U. Singh, S. Rathor, M. Kumar, "Hybrid Deep Learning and Machine Learning Approach for Detecting Spatial and Temporal Forgeries in Videos," Neural Computing & Applications, vol. 11, DOI: 10.1007/s00521-024-10558-8, 2024.
[9] J. Chen, K. Li, Q. Deng, K. Li, P. S. Yu, "Distributed Deep Learning Model for Intelligent Video Surveillance Systems with Edge Computing," IEEE Transactions on Industrial Informatics, vol. 1 no. 1, DOI: 10.1109/tii.2019.2909473, 2024.
[10] Y. Zhou, B. Li, Z. Wang, H. Li, "Integrating Temporal and Spatial Attention for Video Action Recognition," Security and Communication Networks, vol. 2022 no. 1, pp. 5094801-5094808, DOI: 10.1155/2022/5094801, 2022.
[11] W. Dou, X. Zhao, X. Yin, H. Wang, Y. Luo, L. Qi, "Edge Computing-enabled Deep Learning for Real-time Video Optimization in IIoT," IEEE Transactions on Industrial Informatics, vol. 17 no. 4, pp. 2842-2851, DOI: 10.1109/tii.2020.3020386, 2021.
[12] C. Fathy, S. N. Saleh, "Integrating Deep Learning-based IoT and Fog Computing with Software-defined Networking for Detecting Weapons in Video Surveillance Systems," Sensors, vol. 22 no. 14, DOI: 10.3390/s22145075, 2022.
[13] M. A. Humayun, A. Alsirhani, F. Alserhani, M. Shaheen, G. Alwakid, "Transformative Synergy: SSEHCET—Bridging Mobile Edge Computing and AI for Enhanced eHealth Security and Efficiency," Journal of Cloud Computing, vol. 13 no. 1, DOI: 10.1186/s13677-024-00602-2, 2024.
[14] Y. Yao, Y. Shi, S. Weng, B. Guan, "Deep Learning for Detection of Object-based Forgery in Advanced Video," Symmetry, vol. 10 no. 1, DOI: 10.3390/sym10010003, 2017.
[15] S. Chen, S. Tan, B. Li, J. Huang, "Automatic Detection of Object-based Forgery in Advanced Video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26 no. 11, pp. 2138-2151, DOI: 10.1109/tcsvt.2015.2473436, 2016.
[16] R. Kim, G. Kim, H. Kim, G. Yoon, H. Yoo, "A Method for Optimizing Deep Learning Object Detection in Edge Computing," 2020 International Conference on Information and Communication Technology Convergence (ICTC).
[17] K. Rezaee, S. M. Rezakhani, M. R. Khosravi, M. K. Moghimi, "A Survey on Deep Learning-based real-time Crowd Anomaly Detection for Secure Distributed Video Surveillance," Personal and Ubiquitous Computing, vol. 28 no. 1, pp. 135-151, DOI: 10.1007/s00779-021-01586-5, 2024.
[18] L. Zhang, T. Qiao, M. Xu, N. Zheng, S. Xie, "Unsupervised Learning-based Framework for Deepfake Video Detection," IEEE Transactions on Multimedia, vol. 25, pp. 4785-4799, DOI: 10.1109/tmm.2022.3182509, 2023.
[19] W. Guo, S. Z. Du, H. Deng, Z. Yu, L. Feng, "Towards Spatio-temporal Collaborative Learning: An End-to-end Deepfake Video Detection Framework," 2023 International Joint Conference on Neural Networks (IJCNN).
[20] X. Yang, J. A. Esquivel, "LSTM Network-based Adaptation Approach for Dynamic Integration in Intelligent End-edge-cloud Systems," Tsinghua Science and Technology, vol. 29 no. 4, pp. 1219-1231, DOI: 10.26599/tst.2023.9010086, 2024.
[21] K. Pawar, V. Attar, "Deep Learning Approaches for Video-based Anomalous Activity Detection," World Wide Web, vol. 22 no. 2, pp. 571-601, DOI: 10.1007/s11280-018-0582-1, 2019.
[22] V. Kumar, V. Kansal, M. Gaur, "Multiple Forgery Detection in Video Using Convolution Neural Network," Computers, Materials & Continua, vol. 73 no. 1, pp. 1347-1364, DOI: 10.32604/cmc.2022.023545, 2022.
[23] R. Cheng, X. Zhao, Z. Wang, T. Sun, Y. Liu, B. Shi, "Detection of Deepfake Technology in Images and Videos," International Journal of Ad Hoc and Ubiquitous Computing, vol. 45 no. 2, pp. 135-148, DOI: 10.1504/ijahuc.2024.10062154, 2024.
[24] Z. Zhao, Y. Zeng, J. Wang, H. Li, H. Zhu, L. Sun, "Detection and Incentive: A Tampering Detection Mechanism for Object Detection in Edge Computing," 2022 41st International Symposium on Reliable Distributed Systems (SRDS), pp. 166-177, DOI: 10.1109/srds55811.2022.00024.
[25] M. Raveendra, K. Nagireddy, "Tamper Video Detection and Localization Using an Adaptive Segmentation and Deep Network Technique," Journal of Visual Communication and Image Representation, vol. 82, DOI: 10.1016/j.jvcir.2021.103401, 2022.
[26] M. M. Eltoukhy, F. S. Alsubaei, A. M. Mortda, K. M. Hosny, "An Efficient Convolution Neural Network Method for Copy-move Video Forgery Detection," Alexandria Engineering Journal, vol. 110, pp. 429-437, DOI: 10.1016/j.aej.2024.10.030, 2025.
[27] N. Bai, X. Wang, R. Han, J. Hou, Y. Wang, S. Pang, "PIM-Net: Progressive Inconsistency Mining Network for Image Manipulation Localization," Pattern Recognition, vol. 159, DOI: 10.1016/j.patcog.2024.111136, 2025.
[28] Z. Zhao, S. Zhao, F. Lv, S. Si, H. Zhu, L. Sun, "RIETD: a Reputation Incentive Scheme Facilitates Personalized Edge Tampering Detection," IEEE Internet of Things Journal, vol. 10, 2024.
[29] A. K. Sahu, K. Umachandran, V. D. Biradar, "A Study on Content Tampering in Multimedia Watermarking," SN Computer Science, vol. 4 no. 3, DOI: 10.1007/s42979-022-01657-1, 2023.
[30] N. Singla, J. Singh, S. Nagpal, B. Tokas, "HEVC Based Tampered Video Database Development for Forensic Investigation," Multimedia Tools and Applications, vol. 82 no. 17, pp. 25493-25526, DOI: 10.1007/s11042-022-14303-y, 2023.
[31] S. Deo, S. Mehta, D. Jain, "Video Tampering Detection Using Machine Learning and Deep Learning," International Advanced Computing Conference, pp. 444-459, DOI: 10.1007/978-3-031-35644-5_36.
[32] S. Tyagi, D. Yadav, "A Detailed Analysis of Image and Video Forgery Detection Techniques," The Visual Computer, vol. 39 no. 3, pp. 813-833, DOI: 10.1007/s00371-021-02347-4, 2023.
[33] W. Lu, W. Xu, Z. Sheng, "An Interpretable Image Tampering Detection Approach Based on Cooperative Game," IEEE Transactions on Circuits and Systems for Video Technology, vol. 33 no. 2, pp. 952-962, DOI: 10.1109/tcsvt.2022.3204740, 2023.
[34] A. Vaswani, "Attention Is All You Need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[35] S. Tabatabaei, K. Rezaee, M. Zhu, "Attention Transformer Mechanism and Fusion-based Deep Learning Architecture for MRI Brain Tumor Classification System," Biomedical Signal Processing and Control, vol. 86, DOI: 10.1016/j.bspc.2023.105119, 2023.
[36] N. Wu, X. Jin, Q. Jiang, "Multisemantic Path Neural Network for Deepfake Detection," Security and Communication Networks, vol. 2022 no. 1, DOI: 10.1155/2022/4976848, 2022.
[37] D. Li, J. A. Esquivel, "Trust-Aware Hybrid Collaborative Recommendation with locality-sensitive Hashing," Tsinghua Science and Technology, vol. 14, 2023.
[38] P. Ma, S. Petridis, M. Pantic, "Detecting Adversarial Attacks on Audiovisual Speech Recognition," ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6403-6407, DOI: 10.1109/icassp39728.2021.9413661.
[39] H. Yan, X. Wei, "Efficient Sparse Attacks on Videos Using Reinforcement Learning," Proceedings of the 29th ACM International Conference on Multimedia, pp. 2326-2334, DOI: 10.1145/3474085.3475395.
Copyright © 2025 Yuwen Shao et al. International Journal of Intelligent Systems published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits use, distribution and reproduction in any medium, provided the original work is properly cited.