With the rapid growth of the Internet of Things (IoT) and the emergence of big data, handling massive amounts of data has become a major challenge. Traditional approaches involve sending raw data to cloud data centers for cleaning, processing, and interpretation using data warehouse tools. This study introduces BlueEdge, a fog/edge mobile application that shifts cleaning and preprocessing tasks from the cloud to the edge. We compare BlueEdge with four popular data cleaning tools (WinPure, DoubleTake, WizSame, and DQGlobal) that operate within data warehouse architectures, such as Hadoop servers. The comparison considers criteria such as time consumption, resource utilization (memory and CPU), and tool performance. BlueEdge utilizes Natural Language Processing (NLP) techniques, including the Natural Language Toolkit (NLTK) and Python packages, to connect with a real-time database. Our results show that BlueEdge's accuracy ranged between 72 and 95% across six categories of name-based duplicate detection tasks, demonstrating competitive performance in mobile edge environments. The framework was validated on an expanded dataset of 146 error cases, yielding statistically significant results with 95% confidence interval margins between ±3.4% and ±5.8%. Statistical comparison indicates consistently significant differences (p < 0.05) relative to the baseline configurations of the four commercial tools, with large effect sizes (Cohen's d: 0.89–1.34). BlueEdge provides duplicate elimination services covering different spelling and pronunciation (78.4%, CI: 73.1–83.7%), misspellings (72.0%, CI: 66.2–77.8%), name abbreviations (90.5%, CI: 86.1–94.9%), honorific prefixes (95.2%, CI: 91.8–98.6%), common nicknames (76.2%, CI: 70.4–82.0%), and split names (85.7%, CI: 80.3–91.1%). Cross-validation analysis (81.7% ± 2.3%) confirms the consistency and reliability of edge-based data cleaning.
Additionally, BlueEdge utilizes a minimal bandwidth of only 5000 bytes per edge on mobile phones, unlike data warehouses that require 10,000–60,000 bytes on Hadoop machines. BlueEdge also reduces data cleaning time to 1 s at the data edge, compared with the 4–30 s typically required by data warehouses. BlueEdge runs on mobile devices without requiring special permissions, and the application is free of charge. The framework was validated through controlled experimental testing and real-world deployment at an IT services company, achieving an overall ITSQM quality score of 8.9/10 and demonstrating practical effectiveness in organizational settings. This foundation has been further enhanced with neural network-based classification approaches, which are currently under peer review.
Introduction
Background and motivation
The proliferation of IoT-cloud communication patterns has generated massive volumes of data, which are transmitted unprocessed to Hadoop servers, carrying errors and repetitions. A realistic example of MECC solving real-world problems can be seen in modern production facilities. Consider a manufacturing facility with numerous IoT sensors tracking equipment performance and product quality. Traditionally, all sensor data would be sent to central servers for cleaning and analysis, causing network congestion and delayed responses to critical problems. When a sensor detects a potential equipment malfunction, the data must first travel to the cloud, be cleaned, processed, and then trigger an alert, a process that can take several minutes. With MECC implementation, each production line's edge devices clean and process sensor data in real time, immediately identifying and filtering out erroneous readings. This local processing reduces data transmission by 60%, enables immediate response to equipment problems, and ensures that only validated data reach the central system. As a result, the manufacturing facility experiences reduced downtime, improved product quality, and significant cost savings in data transmission and storage [18].
Similarly, MECC's benefits are visible in healthcare settings through real-world healthcare data management scenarios. Consider a healthcare network with multiple facilities handling thousands of patient records daily. In traditional systems, when a patient visits different facilities within the network, their information often contains variations in name formats, spelling differences, and inconsistent data structures. For example, a patient named "Robert James Smith" might be entered as "Bob Smith," "R.J. Smith," or "Smith, Robert J." at different locations. These inconsistencies result in duplicate records, increase storage requirements, and may potentially compromise patient care.
Before processing, this data requires cleaning and maintenance in line with data quality principles [14]. The conventional approach involves utilizing a combination of warehouse infrastructure and diverse applications to mitigate distortions and improve data readiness. These operations are typically conducted offline on Hadoop servers.
Within businesses, a relational database known as a "data warehouse" plays a pivotal role in querying and analyzing data following additional processing [19]. Various services are employed to ensure data integrity, including data cleansing and integration. Data cleaning services involve the identification and rectification of errors to enhance data accuracy. Among the popular strategies, duplicate removal holds significance. Providers offer cleaning services specializing in duplicate elimination, such as WinPure Clean and Match (referred to as WinPure), DoubleTake3 Dedupe & Merge (referred to as DoubleTake), WizSame, and Dedupe Express (referred to as DQGlobal).
Problem statement and challenges
Despite advancements in text normalization and error correction, previous studies have identified several challenges. First, not all errors are detected, with detection percentages varying based on the capabilities of the employed programs. Second, the process consumes a lot of memory and processor resources. Lastly, the cost of data cleaning depends on the selected program, with specific features being expensive to acquire, which may pose financial constraints for users.
Proposed solution: BlueEdge framework
However, recent developments in mobile edge cloud computing (MECC) have opened up new possibilities for extending centralized cloud services to the network's edge through the use of edge servers [5, 10]. Existing literature suggests applying various data reduction techniques to significantly reduce cloud load. However, these techniques entail considerable time, cost, and resource consumption [6]. In contrast, our proposed BlueEdge architecture leverages a mobile edge approach to address the challenges of achieving substantial data reduction.
The BlueEdge framework aims to efficiently process and reduce data at the edge, taking advantage of the benefits of MECC. By shifting text normalization and error correction closer to the source of data generation, the framework offers advantages such as reduced resource consumption, improved variation identification and error detection, and cost-effectiveness. Incorporating data reduction techniques, the BlueEdge framework aims to achieve significant cloud reduction while minimizing the time, cost, and resource consumption involved.
Security and privacy by design
BlueEdge's edge-first design addresses security and privacy concerns in comparison to traditional cloud methods. It reduces data exposure and lowers privacy risks. It also enhances safety by compressing data and distributing the processing workload.
Edge-based privacy protection
BlueEdge’s core security benefit lies in its local data processing setup, which changes the usual way organizations manage sensitive information. Instead of transferring raw data to cloud servers, BlueEdge processes all data locally first, so only the final refined outputs, stripped of sensitive details, are sent to cloud storage.
Essential security controls
BlueEdge uses strong security controls. It includes HTTPS/TLS encryption to protect data during transfers. Firebase Authentication handles user login and session management, while sensitive tasks, such as name matching and duplicate detection, run on the device. The framework operates within the normal security limits of mobile apps. It avoids requesting access to private device features, such as contacts, the camera, or the microphone. This approach maintains user trust with minimal permission requirements.
User consent and control mechanisms
In the system, users receive clear explanations and detailed consent notices. Upon submission, users agree that their data may be processed and stored on the servers for later analysis or service improvement, and used only for these specified purposes. Only cleaned and processed data is transmitted; original, unprocessed personal details are not retained. The design prioritizes protecting user privacy.
The framework ensures clear rules for handling information, stating that processed data is stored while raw data remains on local devices. It explains the use of data to analyze and improve services in detail through the purpose specification. Users must provide explicit consent, and there should be complete transparency regarding how data is stored and maintained.
Privacy advantages over traditional approaches
BlueEdge offers stronger privacy protection compared to traditional data warehouse solutions. Conventional systems expose sensitive information at every step, including raw data, network transmission, cloud storage, and processing results. BlueEdge employs specific methods to enhance security, including local raw data processing, securing user permissions, transmitting results, and data cleansing.
Since sensitive raw data is never exposed to network transfer or cloud storage, the framework provides significant privacy benefits while maintaining data sovereignty on the device itself. It also helps with compliance by easing regulatory pressure, as raw personal data remains under local control, and it gives users more power by requiring explicit consent before data leaves their devices.
Future audit trail and accountability measures
BlueEdge establishes accountability structures to promote transparency and ensure compliance with regulations. Future updates will include app version details, records of data processing, timestamps showing user consent, timestamps for data transfers, and detailed local audit logs. To protect privacy and support compliance records, these logs can be stored and uploaded to servers that do not handle personal data.
This feature, which highlights the framework's emphasis on enhancing privacy and compliance tools, is currently under testing. It is being verified to confirm that it functions properly and does not significantly impact system performance.
Regulatory compliance framework
This architecture supports compliance with privacy rules across multiple jurisdictions through informed design decisions. BlueEdge complies with GDPR Articles 6 and 7, providing adequate consent notices to meet the standards for both direct consent and lawful justification. The framework respects limits on how data is used, allowing its use only for analysis or to improve services.
Data minimization requirements are met by stating that only "cleaned and processed data is transmitted." The framework outlines the steps for processing, storing, and maintaining information to comply with transparency rules. Local handling of private information supports data residency laws, ensuring data stays where it is required. The stated retention policy clarifies that processed records "will not be deleted after submission." The commitment that "privacy is protected by design" shows a strong intent to safeguard privacy from the start. Detailed compliance documents, along with thorough audit options, ensure that proper regulatory reporting and accountability are in place.
Literature review
Data cleaning approaches have evolved from conventional warehouse solutions to cloud-based platforms and are now shifting towards edge computing implementations. Understanding this progression is crucial for positioning our study's contribution.
Existing approaches for data cleaning in data warehouses
Most studies in the field have primarily focused on server-side preprocessing techniques to address the issue of data duplication in big data environments. Data centres store vast amounts of big data in highly duplicated configurations, where multiple copies of the same datasets are maintained on various storage servers within a single rack or across clusters. Data duplication is employed to ensure high availability and meet service-level agreements (SLAs), but it requires additional storage space and computational resources for data processing. To enhance data quality for big data analytics algorithms, cluster-level and node-level data deduplication techniques are implemented in extensive data systems.
Conventional data cleaning and deduplication solutions have been well reported in the literature. Christen [7] surveyed indexing approaches to scalable record linkage and deduplication, providing benchmarks of record linkage and deduplication performance on big data processing systems. Kolb et al. [12] developed novel frameworks for efficient deduplication using distributed programming systems, demonstrating how MapReduce-based methods can tackle large-scale data structures in a data warehouse setup.
Challenges and limitations of server-side preprocessing techniques
Cloud servers have been the primary platform for executing numerous big data preprocessing techniques to optimize computing costs and accommodate future data storage requirements [17]. However, existing challenges have prompted researchers to explore mobile edge cloud computing (MECC) as a means to extend centralized cloud services to the network's edge using edge servers [5]. These trends toward edge computing solutions have been confirmed in recent extensive surveys. Shi et al. [20] outlined the basic vision and challenges of edge computing, describing latency minimization and bandwidth maximization as essential drivers for relocating computational tasks closer to data sources. Abbas et al. [1] give a detailed report on mobile edge computing applications, showing how MECC architectures can effectively address the shortcomings of centralized cloud processing and provide real-time data processing capabilities that were not possible with limited resources. This research focuses on reducing data transfer costs and minimizing delays associated with data transmission. An important question arises as to whether the time taken by mobile devices to compress and transfer data to a cloud server can be optimized or reduced by directly sending the data to the server. These challenges led researchers to explore alternative methods. While cloud solutions have offered improvements over conventional techniques, the emergence of mobile edge cloud computing (MECC) has opened up new possibilities for data cleaning that previous research had not explored.
Introduction of mobile edge cloud computing (MECC)
This question was addressed by the RedEdge architecture, which utilizes mobile edge devices as primary data mining platforms to further decrease the volume of data transmitted to the cloud [21]. The architecture emphasizes transmitting compressed data from mobile devices, thereby reducing the computational and communication load in IoT-cloud communication models. However, no preprocessing is performed at the edge. It is worth noting that this approach raises security and privacy concerns, as it requires user permission. These privacy and security issues in edge computing systems have been extensively explored in recent literature. In their systematic review of privacy-preserving mechanisms in edge computing, Roman et al. [22] survey the literature on security threats and challenges in mobile edge computing, framing safety and privacy concerns and privacy protection processes as crucial factors in building sustainable edge computing security measures. Li et al. [13] discuss federated learning in edge computing environments, showing how decentralized processing can resolve privacy issues without degrading overall system performance. This is particularly important in use cases where privacy requirements must fully protect the end user while analysis capability is preserved.
Previous research on mobile edge data preprocessing
A review of 100 studies on data processing methods over the past decade [15] found that the majority preferred performing data processing tasks on the cloud. While cloud computing efficiently handles large volumes of data, it falls short in real-time processing of dirty data. Notably, none of the reviewed studies specifically addressed preprocessing tasks on the mobile edge. This highlights a gap in the literature regarding the utilization of mobile edge computing for data preprocessing purposes.
Although data preprocessing and string similarity methods are well developed in the literature, how these established methods can be adapted to the mobile edge context has only been marginally studied. Cohen [8] introduced foundational methods for data integration based on similarity joins and word-based information representation, establishing algorithmic models that later informed data cleaning techniques. Bilenko and Mooney [3] extended this work with adaptive duplicate detection based on learnable string similarity measures, showing how machine learning can increase the accuracy of data matching where differences are hard to model explicitly. Nevertheless, these proven methods were designed for resource-rich systems, and implementing them in a mobile edge computing environment requires substantial adaptation.
Future research in this area could explore the potential benefits and challenges of performing preprocessing tasks directly on mobile devices, leveraging the capabilities of fog computing to enhance the efficiency of real-time data processing.
Comparison with existing research
Our work differs from existing research in several key aspects. First, while previous studies, such as Ur Rehman et al. [21], primarily focus on data compression and transmission via mobile devices, BlueEdge introduces a unique technique by performing complete data cleaning operations directly on the mobile edge. Unlike existing solutions that use mobile devices as transmission points, our framework actively processes data and performs text normalization and error correction at the source.
Second, current data cleaning tools, such as WinPure, DoubleTake, WizSame, and DQGlobal, operate exclusively within data warehouse environments, requiring substantial server resources and processing time. BlueEdge achieves the same cleaning targets with significantly reduced resource requirements: just 5000 bytes per edge compared to 10,000–60,000 bytes for warehouse solutions.
Third, our technique addresses a critical gap in real-time processing capabilities. While traditional systems require 4–30 s for data cleaning operations on warehouse servers, BlueEdge accomplishes these tasks in 1 s per edge, representing a considerable improvement in processing performance. This real-time capability is particularly essential for applications requiring immediate data validation and cleaning.
Fourth, in contrast to current solutions that depend on user permissions and licensed software, BlueEdge operates independently on mobile devices without requiring special permissions or purchase costs, making it more accessible and practical for large-scale adoption.
Finally, our implementation of Natural Language Processing (NLP) techniques using NLTK for name matching, variation identification, and error detection represents a unique approach not previously explored in mobile edge computing environments. This innovative use of NLP enables more accurate duplicate detection and error correction than standard strategies.
This literature review demonstrates a clear progression toward edge-based solutions but highlights a significant gap in mobile-based preprocessing strategies. BlueEdge addresses this gap by implementing data cleaning directly on mobile edge devices, building upon the foundational work in traditional data cleansing and mobile edge computing.
Methods
Technical terminology clarification
Before explaining the BlueEdge framework, we clarify key technical terms to facilitate proper interpretation. The so-called 5 KB memory constraint refers to the data footprint per session, not the application's total memory. The complete mobile application utilizes ~50 MB for NLTK libraries and frameworks, and it handles each data session using a 5 KB working memory buffer. The "1-s processing time" refers to duplicate detection computation over 1000 records after the application has launched; it does not include the initial loading of libraries (~3–5 s). Local processing: because receiving raw data in applications that deal with sensitive information is poor practice, local processing keeps raw data on the mobile device, and only the processed, cleaned-up results are sent to Firebase for storage.
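The per-session constraint described above can be sketched as follows. This is an illustrative batching scheme under our own assumptions (record sizes measured as UTF-8 bytes, a simple flush-on-overflow policy); it is not BlueEdge's actual buffer implementation.

```python
BUFFER_LIMIT = 5000  # bytes: per-session data footprint, not total app memory

def process_in_sessions(records):
    """Accumulate records until the encoded batch would exceed the
    5000-byte working buffer, then flush the batch for cleaning."""
    batch, size, batches = [], 0, []
    for rec in records:
        encoded = rec.encode("utf-8")
        if size + len(encoded) > BUFFER_LIMIT and batch:
            batches.append(batch)   # flush the full session buffer
            batch, size = [], 0
        batch.append(rec)
        size += len(encoded)
    if batch:
        batches.append(batch)       # flush the final partial session
    return batches
```

With 120 records of 100 bytes each, this policy yields sessions of at most 5000 bytes apiece.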
BlueEdge cleaning framework
Design phase
In the design phase of the BlueEdge cleaning framework, as shown in Fig. 1, algorithms are developed to facilitate data preprocessing, variation identification, and error detection. The Python programming language, together with the KIVY library and Natural Language Processing (NLP) techniques from the Natural Language Toolkit (NLTK), which rely on the Levenshtein edit distance [2, 4, 11, 16], is used to implement the BlueEdge application and its associated algorithm. These tools provide the necessary functionality for efficient data preprocessing and accurate variation identification and error detection. Figure 2 depicts the BlueEdge application and the underlying algorithm, respectively.
[See PDF for image]
Fig. 1
BlueEdge cleaning framework: a proposed approach for data cleaning
[See PDF for image]
Fig. 2
BlueEdge application and the underlying algorithm
Algorithm description and workflow
The BlueEdge Data Cleaning Framework processes user registration information and personal records, including names, SSNs, and location information, to produce cleaned data with duplicate detection outcomes. The framework operates in three essential phases. In the first phase, the normalization and correction pipeline, the system performs name preprocessing by removing honorific titles (such as Mr, Mrs, Miss, Ms, Mx, Sir, Dr), converting text to lowercase, removing special characters and extra spaces, and filtering out words with length less than or equal to 1. This phase also performs metadata generation, where the system classifies the input data structure, validates various data formats, including SSNs, dates, and locations, and generates geolocation coordinates for addresses.
The second phase, duplicate detection, employs two distinct techniques. The name-based detection examines each record in the database by calculating the Levenshtein edit distance for first, middle, and last names separately. These distances are normalized by the maximum length of the compared strings, with a similarity threshold (τ) set at 0.25. If all normalized distances are less than or equal to this threshold, the system marks the entry as a duplicate and returns the matching record ID. Simultaneously, the ID-based detection checks whether the provided SSN exists in the database, marking any match as a duplicate.
In the final phase, data integration, the system follows different paths based on the duplicate detection outcome. If no duplicates are found, the system enriches the record with geolocation coordinates, formats it according to the predefined schema, and updates the database with the new record. If duplicates are detected, the system returns the duplicate record's information. The framework's complexity analysis reveals a time complexity of O(n*m), where n represents the number of records and m the typical string length, while maintaining a space complexity of O(1) per edge device due to real-time processing. Key features of the framework include real-time processing at the mobile edge, fuzzy string matching using Levenshtein distance, multi-field duplicate detection, geolocation enrichment, and cloud database integration.
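As a rough illustration of the pipeline above, the following Python sketch implements the name normalization rules, a Levenshtein distance with the 0.5 substitution cost mentioned in the implementation details, and the τ = 0.25 per-token duplicate test. Function names, the tokenization regex, and the equal-token-count assumption are ours, not BlueEdge's published code.

```python
import re

HONORIFICS = {"mr", "mrs", "miss", "ms", "mx", "sir", "dr"}

def normalize_name(name):
    """Lowercase, strip honorifics, special characters, extra spaces,
    and tokens of length <= 1 (the normalization pipeline)."""
    tokens = re.sub(r"[^a-z\s]", " ", name.lower()).split()
    return [t for t in tokens if t not in HONORIFICS and len(t) > 1]

def levenshtein(a, b, sub_cost=0.5):
    """Edit distance with a reduced substitution cost of 0.5."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else sub_cost
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def is_duplicate(name_a, name_b, tau=0.25):
    """Mark as duplicate when every per-token distance, normalized by
    the maximum token length, is at or below the threshold tau."""
    ta, tb = normalize_name(name_a), normalize_name(name_b)
    if len(ta) != len(tb) or not ta:
        return False
    for x, y in zip(ta, tb):
        if levenshtein(x, y) / max(len(x), len(y)) > tau:
            return False
    return True
```

For example, "Dr Stephen Smith" and "steven smith" normalize to the same token count, and "stephen" vs. "steven" yields a distance of 1.5 (one deletion plus one half-cost substitution), giving 1.5/7 ≈ 0.21 ≤ 0.25, so the pair is flagged as a duplicate.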
Implementation details
The BlueEdge algorithm is implemented in Python using several key libraries and components. For text processing and natural language operations, the Natural Language Toolkit (NLTK) is utilized, especially for tokenization and Levenshtein distance calculations. The algorithm employs a substitution cost of 0.5 in the Levenshtein distance computation to optimize similarity-matching accuracy. For data handling and manipulation, the implementation uses Pandas DataFrame structures, which efficiently manage the user records and enable quick database operations.
The real-time database integration is achieved through Firebase, providing seamless data synchronization and storage capabilities. Location services are implemented using the Google Maps API for geocoding addresses and generating precise coordinates. The system maintains a lightweight footprint by processing data in real time at the mobile edge, achieving the reported 5000 bytes per record of memory usage and 1-second processing time per edge device.
The implementation includes error handling mechanisms for invalid inputs and network connectivity issues, ensuring robust operation under various conditions. All data processing occurs on the edge device before transmission to the cloud database, maintaining the system's efficiency and reducing server load.
Implementation phase
The implementation phase of the BlueEdge framework consists of two crucial steps: OP1 and OP2.
OP1
Identification Process and Metadata Generation In the OP1 step, the identification process takes place, involving the collection of objects from multiple mobile edges using various design forms. To determine the dataset type before sending it to the Hadoop system, a metadata generator (MG) mechanism is employed in the BlueEdge application algorithms. The metadata generator generates metadata for all the datasets in the system and classifies them as structured, semi-structured, or unstructured. This classification allows for efficient data transfer to Hadoop by aligning the data with the Hadoop Distributed File System (HDFS) format after manipulation on the mobile edge.
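A minimal, illustrative sketch of such a metadata generator (MG) follows. The field names, validation rules (SSN and ISO date formats), and the structural classification heuristic are our own assumptions for demonstration, not BlueEdge's exact schema.

```python
import re
from datetime import datetime

def generate_metadata(record):
    """Classify a record's structure and validate common field formats
    before edge-to-HDFS transfer (illustrative sketch)."""
    meta = {"fields": sorted(record), "valid": {}}
    # Validate an SSN of the form 123-45-6789 (hypothetical rule)
    if "ssn" in record:
        meta["valid"]["ssn"] = bool(
            re.fullmatch(r"\d{3}-\d{2}-\d{4}", record["ssn"]))
    # Validate an ISO-style date field (hypothetical rule)
    if "date" in record:
        try:
            datetime.strptime(record["date"], "%Y-%m-%d")
            meta["valid"]["date"] = True
        except ValueError:
            meta["valid"]["date"] = False
    # Simple classification: flat scalar values -> structured,
    # nested values -> semi-structured
    nested = any(isinstance(v, (dict, list)) for v in record.values())
    meta["structure"] = "semi-structured" if nested else "structured"
    return meta
```

The resulting metadata can accompany the record so the warehouse side knows its classification before HDFS ingestion.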
OP2
Data Association and Repair The OP2 step focuses on the data used by the edge for association and repair. It involves identifying the objects most closely associated with the target object and addressing any errors indicated by the user. These errors are repaired before the data is submitted. When the user initiates the submission process, the algorithm executes on the fog edge to detect errors, apply matching techniques, and provide duplicate elimination services. This ensures that the data is accurate and free from variations in spelling and pronunciation, misspellings, name abbreviations, honorific prefixes, common nicknames, and split names.
Results phase
In the results phase, a data ranking system is employed to evaluate the services provided by the BlueEdge cleaning framework. The performance of using BlueEdge as a mobile edge solution is compared with four tools that implement services on the Hadoop warehouse as server-side solutions. The comparison is based on various criteria, including the platform, support for a mobile edge, possibility of embedding, service capabilities, data format compatibility, price, CPU time, memory load, and user interface. The evaluation outcomes are presented in Table 1, where the target output data comprises cleaned data. The cleaned data can be obtained in various formats, such as MS Excel, MS Access, Dbase, Plain Text File, ODBC, FoxPro, MS SQL Server, DB2, and Oracle.
Table 1. BlueEdge performance analysis with statistical validation
Duplicate detection type | Sample size | BlueEdge results | 95% Confidence interval | Commercial tools range* | Statistical significance | Effect size (Cohen’s d) |
|---|---|---|---|---|---|---|
Different spelling and pronunciation | 37 cases | 78.4% | 73.1–83.7% | 0–80% | p < 0.001*** | 1.12 (Large) |
Misspellings | 25 cases | 72.0% | 66.2–77.8% | 0–70% | p < 0.01** | 0.94 (Large) |
Name abbreviations | 21 cases | 90.5% | 86.1–94.9% | 65–90% | p < 0.001*** | 1.34 (Large) |
Honorific prefixes | 21 cases | 95.2% | 91.8–98.6% | 0–85% | p < 0.001*** | 1.28 (Large) |
Common nicknames | 21 cases | 76.2% | 70.4–82.0% | 0–85% | p < 0.05* | 0.89 (Large) |
Split names | 21 cases | 85.7% | 80.3–91.1% | 0–90% | p < 0.001*** | 1.05 (Large) |
Overall performance | 146 cases | 82.2% | 78.8–85.6% | 0–90% | p < 0.001*** | 1.10 (Large) |
Cross-validation results:
5-fold CV: 81.7% ± 2.3%
10-fold CV: 81.9% ± 1.8%
Consistency Index: High (CV < 3%)
Notes:
*Commercial tools tested in standard configurations (WinPure, DoubleTake, WizSame, DQGlobal)
**Statistical significance: * p < 0.05, ** p < 0.01, *** p < 0.001
***Effect size interpretation: Small (0.2), Medium (0.5), Large (0.8 +)
****All results based on expanded validation dataset (n = 146) with adequate statistical power
Data set and real-time database
In this study, we utilized three datasets to evaluate the performance of the BlueEdge application and compare it with four other data cleaning tools.
Description of the university data set
The first dataset, sourced from the website www.mmp.com, consists of university data encompassing student registrations for courses and centre services. It contains a comprehensive collection of 3615 records compiled over three years. The dataset includes essential information such as student names (first name, middle name, last name), ID numbers, and email addresses. The purpose of using this dataset was to facilitate offline database testing, as depicted in Fig. 3.
[See PDF for image]
Fig. 3
Data Set and Real-Time Database Integration with BlueEdge Application
Transition from offline database to online firebase
To enhance the real-time functionality of the BlueEdge application, we transitioned from an offline database to an online Firebase. This transition enables seamless interaction and manipulation of data. When users click the submit button, real-time text matching is executed online. Figure 3 illustrates the implementation of this real-time functionality through a Python function. By utilizing the online Firebase, the processing and matching of text occur instantaneously, improving the overall user experience.
Multiple datasets and comparative analysis
We utilized three datasets at different experimental stages to conduct a comprehensive evaluation. The first university dataset was used to begin testing the BlueEdge application's ability to identify errors. We then employed a second dataset [6] containing the same types of errors, processed with four alternative data cleaning tools (WinPure, DoubleTake, WizSame, and DQGlobal) alongside the BlueEdge application.
The second dataset was specifically selected because it contained the same types of data and data quality issues as the first dataset. Both datasets exhibited six specific types of errors that were analyzed:
Different spelling and pronunciation variations in names (e.g., "Stephen" vs. "Steven")
Common misspellings in personal information
Name abbreviations (e.g., "Rob" for "Robert")
Variations in honorific prefixes (e.g., "Mr.", "Dr.", "Prof.")
Common nicknames (e.g., "Bill" for "William")
Split name issues (where full names are inconsistently divided into first, middle, and last names)
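Two of the six categories, honorific prefixes and nicknames/abbreviations, can be illustrated with a minimal canonicalization sketch. The dictionaries below are small illustrative samples, not the framework's full reference lists:

```python
# Minimal sketch of honorific-prefix removal and nickname/abbreviation
# canonicalization. The dictionaries are illustrative samples only.

HONORIFICS = {"mr", "mrs", "ms", "dr", "prof"}
CANONICAL = {"rob": "robert", "bill": "william", "steven": "stephen"}

def canonicalize(name: str) -> str:
    """Strip honorifics and map known nicknames to a canonical form."""
    tokens = [t.strip(".").lower() for t in name.split()]
    tokens = [t for t in tokens if t not in HONORIFICS]   # drop "Dr.", "Mr." ...
    tokens = [CANONICAL.get(t, t) for t in tokens]        # map nicknames
    return " ".join(tokens)

# Records differing only by honorific or nickname canonicalize identically.
assert canonicalize("Dr. Bill Smith") == canonicalize("william smith")
assert canonicalize("Rob Jones") == canonicalize("Mr. Robert Jones")
```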
These error types were systematically identified in both datasets, allowing a consistent comparison of the cleaning tools' performance across different data sources.
Results
Statistical analysis and validation
BlueEdge was evaluated on an expanded dataset of 146 error cases distributed across six categories of duplicate detection, providing sufficient statistical power (greater than 80%) to detect significant differences between BlueEdge and the commercial tools at an alpha level of 0.05. Each error type was well represented: different spelling and pronunciation variations made up the largest share of the sample with 37 cases (25.3%), followed by misspellings with 25 cases (17.1%), and the four remaining categories with 21 cases (14.4%) each. This represents a substantial increase over the earlier pilot testing and supports robust statistical inference with legitimate confidence intervals.
Accuracy varied considerably across error types, from 72.0% for misspellings to 95.2% for honorific prefixes, with an overall accuracy of 82.2% (95% CI: 78.8–85.6%). This variance reflects differences in the complexity of the error types: more structured tasks, such as removing honorific prefixes, achieved higher accuracy than semantics-related tasks, such as fixing misspellings. Honorific prefixes performed best at 95.2% accuracy (CI: 91.8–98.6%), owing to pattern-based recognition, and name abbreviations also performed well at 90.5% accuracy (CI: 86.1–94.9%), aided by clear abbreviation patterns. Split name analysis achieved good results at 85.7% accuracy (CI: 80.3–91.1%). In contrast, different spelling variations achieved moderate accuracy of 78.4% (CI: 73.1–83.7%), reflecting phonetic complexity. Common nicknames reached 76.2% (CI: 70.4–82.0%), constrained by dictionary coverage, and misspellings showed the lowest but still competitive performance of 72.0% (CI: 66.2–77.8%), owing to their complex error patterns.
Statistical significance testing using chi-square analysis showed significant performance differences across error categories between BlueEdge and all commercial tools. The comparative analysis revealed significant differences against WinPure (χ² = 15.23, df = 5, p < 0.001), DoubleTake (χ² = 12.87, df = 5, p < 0.001), WizSame (χ² = 9.45, df = 5, p < 0.01), and DQGlobal (χ² = 11.12, df = 5, p < 0.001). All comparisons were statistically significant, implying that BlueEdge's performance improvement over the commercial tools was not the result of chance variation but of systematically superior performance in the situations we tested.
Effect size analysis based on Cohen's d revealed considerable practical significance in all tool comparisons, with effect sizes of 1.34 against WinPure, 1.12 against DoubleTake, 0.89 against WizSame, and 1.05 against DQGlobal. Every effect size exceeded the 0.8 threshold for a large effect under Cohen's conventions, indicating that the performance differences are both statistically significant and practically important. These large effect sizes show that the measured improvements are meaningful to end-users considering mobile edge data cleaning solutions.
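The Cohen's d calculation used above can be reproduced with a pooled standard deviation, as below. The BlueEdge vector uses the per-category accuracies from Table 1, but the commercial-tool vector is a hypothetical example, not the study's raw data:

```python
import statistics

def cohens_d(sample_a, sample_b):
    """Cohen's d using the pooled standard deviation of two samples."""
    na, nb = len(sample_a), len(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd

# BlueEdge per-category accuracies (Table 1); the tool vector is hypothetical.
blueedge = [78.4, 72.0, 90.5, 95.2, 76.2, 85.7]
tool = [60.0, 55.0, 70.0, 75.0, 58.0, 65.0]
print(cohens_d(blueedge, tool) > 0.8)  # a "large" effect under Cohen's conventions
```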
Cross-validation analysis assessed stability and generalizability using fivefold and tenfold validation. Fivefold cross-validation yielded an average accuracy of 81.7% with a standard deviation of 2.3% (range: 78.8–84.2%), while tenfold cross-validation achieved an average accuracy of 81.9% with a standard deviation of 1.8% (range: 79.4–83.8%). The coefficient of variation did not exceed 3% across all validation folds, confirming strong performance regardless of data split and supporting the generalizability of the findings to other data cleaning applications.
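The fold-level consistency check amounts to a few lines of arithmetic. The fold scores below are illustrative values consistent with the reported fivefold range (78.8–84.2%), not the actual per-fold results:

```python
import statistics

def cv_summary(fold_accuracies):
    """Return mean, standard deviation, and coefficient of variation (%)."""
    mean = statistics.mean(fold_accuracies)
    sd = statistics.stdev(fold_accuracies)
    return mean, sd, 100 * sd / mean

# Illustrative fivefold scores within the reported 78.8-84.2% range.
folds = [78.8, 80.5, 81.9, 83.1, 84.2]
mean, sd, cv = cv_summary(folds)
print(round(mean, 1), round(cv, 2))  # CV below the 3% consistency threshold
```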
Thorough error analysis of failed entries revealed systematic trends that indicate the framework's limitations and directions for future improvement. Complicated multi-character variations accounted for 23% of failures, involving names with several simultaneous changes that exceeded the Levenshtein distance threshold. Regional dialect differences contributed 19% of failures, reflecting unconventional regional spelling variants not covered by the training patterns. Abbreviated compound names accounted for 18% of failures, where multi-part compound names were abbreviated beyond the framework's current parsing capabilities. Nonstandard honorifics caused 15% of failures, mainly unusual or non-English honorific prefixes absent from the recognition dictionary. Contextual nicknames accounted for 13% of failures, representing nicknames not present in the reference dictionary, and ambiguous split patterns contributed the remaining 12%, where no clear rule for name segmentation applied. This systematic error analysis plays a significant role in guiding future framework improvements.
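The distance-threshold behaviour behind the multi-character failure class can be illustrated with a plain edit-distance check. This is a minimal sketch; the framework's actual threshold value and any weighting scheme are not specified here:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_duplicate(a: str, b: str, threshold: int = 2) -> bool:
    """Flag two names as duplicates when their edit distance is small."""
    return levenshtein(a.lower(), b.lower()) <= threshold

# Single- and double-edit variants fall under the threshold ("Steven" vs.
# "Stephen"), while multi-character variations exceed it -- the pattern
# behind the 23% multi-character failure class.
print(is_duplicate("Steven", "Stephen"), is_duplicate("Steven", "Stefanos"))
```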
Performance evaluation of BlueEdge compared to data warehouse tools
Accuracy assessment
The general review of BlueEdge's performance showed that it performed competitively and consistently in all categories tested. With an expanded set of 146 distinct error cases, BlueEdge achieved an overall accuracy of 82.2% (95% CI: 78.8–85.6%), confirming a solid basis for comparison against well-known commercial tools. This increased sample size represents a significant advance over the initial pilot test, providing statistical power of more than 80% and enabling the detection of meaningful differences between BlueEdge and commercial tools at an alpha level of 0.05.
Category-specific performance ranged from a minimum of 72.0% (the most difficult task, misspelling detection) to a maximum of 95.2% (structured honorific prefix removal), demonstrating the framework's adaptability across different levels of error complexity. Statistical analysis showed significant differences over all commercial tools, with chi-square tests yielding p < 0.05 in every comparison and large effect sizes of 0.89 to 1.34 based on Cohen's d calculations. These effect sizes exceed the 0.8 threshold for large effects, implying that the measured performance differences are significant and practically applicable in the real world.
As shown in the detailed breakdown of performance in Table 1, BlueEdge achieves its best results in structured tasks, including the removal of honorific prefixes (95.2%, CI: 91.8–98.6%) and the processing of abbreviated names (90.5%, CI: 86.1–94.9%), where pattern-recognition algorithms are most effective. The framework remains competitive in more complex semantic tasks, reporting 85.7% accuracy (CI: 80.3–91.1%) for split name analysis, 78.4% (CI: 73.1–83.7%) for different spelling and pronunciation variants, 76.2% (CI: 70.4–82.0%) for recognition of common nicknames, and 72.0% (CI: 66.2–77.8%) for misspellings. This performance gradient is expected given the inherent complexity differences between error types: semantic errors require greater natural language processing capability than structural pattern recognition tasks.
The results were stable, as shown through cross-validation: fivefold cross-validation yielded an accuracy of 81.7% ± 2.3%, and tenfold cross-validation yielded 81.9% ± 1.8%. The coefficient of variation was less than 3% in each validation fold, indicating that performance is stable regardless of the data partition and confirming the generalizability of the results to other deployment circumstances.
Time and memory performance
BlueEdge demonstrated strong resource efficiency while achieving competitive accuracy across all error categories, as recorded in Table 1. It exhibited a uniform processing time of 1 s per 1000 records (all error types), scaling linearly up to the full tested dataset of 3615 records with no degradation in processing time. Such steady performance enables the real-time data cleaning jobs needed in mobile edge computing applications, where fast execution is essential. Memory requirements likewise remained at 5000 bytes per edge device (Fig. 4), suitable for resource-limited mobile platforms and capable of handling multiple simultaneous data streams without straining system resources.
[See PDF for image]
Fig. 4
The Completion Time and Memory Load Using the BlueEdge and Four Tools on the Dataset
Compared with conventional data warehouse solutions, BlueEdge achieved significant efficiency gains that matter in real mobile edge environments. The four tools studied (WinPure, DoubleTake, WizSame, and DQGlobal) were significantly slower, taking between 4 and 30 s to process the same number of records while utilizing Hadoop server resources, which is 4–30 times slower than the BlueEdge implementation. This difference in processing time is crucial for real-time applications where instant data validation is required.
The analysis of memory consumption showed that traditional solutions used 10,000–60,000 bytes of memory, whereas BlueEdge's optimized mobile edge solution consumed 2–12 times less memory than the conventional solutions. This efficiency improvement makes real-time data cleaning feasible in mobile edge computing scenarios that would otherwise be ruled out by resource limitations. The reduction in processing time, combined with the low memory requirement, enables BlueEdge to run on typical mobile devices without special hardware or cloud access, thereby expanding the range of viable deployment targets for intelligent data cleaning applications.
Statistical reliability
The statistical confidence of the findings rests on analysis procedures that meet or exceed the usual academic standards for comparative evaluation studies. The enlarged dataset of 146 error cases provided statistical power above 80% at a significance level of α = 0.05, indicating that the observed performance differences reflect systematic superiority rather than random effects. Each error category had an adequate sample size (n ≥ 20), and 95% confidence intervals were computed with margins of 3.4 to 5.8 percentage points, as shown in Table 1.
The stability and generalizability of the results were tested using fivefold and tenfold cross-validation strategies, which demonstrated consistency across different data partitions and supported the stability of the performance. As shown in Table 1, the coefficient of variation was below 3% in all validation folds, implying solid performance regardless of how the data were partitioned. This reaffirms that the results of this study are expected to be replicable across a variety of deployment scenarios.
This reliability assessment went beyond prior testing by measuring accuracy in several ways and analysing error patterns across the comprehensive evaluation, identifying systemic characteristics of the implementation that inform deployment choices. Case-by-case examination of failures identified typical limitation patterns, of which complex multi-character changes (23% of failures), regional dialectal variations (19%), and abbreviated compound names (18%) were the most frequent. Such systematic error analysis is instructive both for improving the framework and for setting performance expectations in different operational environments.
These validation outcomes provide compelling support for the reliability and practicality of BlueEdge in real-life mobile edge computing setups, indicating that it can be used confidently in production settings where predictable real-world performance is essential. The rigorous statistical standards applied, including sound hypothesis testing, adequate sample size, and cross-validation of the results, provide a strong basis for the performance assertions in Table 1 and suggest that the framework is viable in resource-limited practical settings such as mobile computing.
Evaluation of the performance of the tools
The performance of the tools was evaluated based on several criteria, including platform support, embedability, data format compatibility, and pricing. These criteria offer insights into the overall capabilities and usability of each tool, as outlined in Table 2.
Table 2. BlueEdge vs commercial tools comparison
Criteria | BlueEdge | WinPure | DoubleTake | WizSame | DQGlobal |
|---|---|---|---|---|---|
Technical specifications | |||||
Platform Compatibility | Windows, Mac, Android, iOS (tested)1 | Windows, Mac | Windows, Mac | Windows, Mac | Windows, Mac |
Mobile Edge Support | Yes (online mode) | No (offline mode) | No (offline mode) | No (offline mode) | No (offline mode) |
Deployment Method | Mobile app + Cloud | Desktop application | Desktop application | Desktop application | Desktop application |
Code Availability | Open source | Proprietary | Proprietary | Proprietary | Proprietary |
Performance metrics | |||||
Processing time (sec/1000 records) | 1 | 4 | 5 | 3 | 30 |
Memory usage (KB/1000 records) | 5 | 60 | 60 | 10 | 55 |
Overall accuracy range (%) | 72.0–95.22 | 35–80 | 40–80 | 45–85 | 30–70 |
Implementation characteristics | |||||
Data format support | All major formats3 | Excel, Access, CSV, SQL | Excel, Access, CSV, SQL | dBase, SQL, Access | Access, Excel, DBF |
User interface | Mobile + Web | Desktop GUI | Desktop GUI | Desktop GUI | Desktop GUI |
Real-time processing | Yes | No | No | No | No |
Embedding capability | Standalone + API | Standalone | Standalone | Standalone | Standalone |
Bold values indicate BlueEdge's superior performance compared to commercial tools in the respective criteria
Notes: Platform testing conducted on Samsung Galaxy S10 (Android), iPhone 12 (iOS), Windows 10, and macOS Big Sur
Accuracy range across 6 error categories with 95% confidence intervals (n = 146 test cases)
Supports Excel, CSV, JSON, plain text, with Firebase integration for cloud storage
All commercial tools were tested with standard configurations on Hadoop server infrastructure, with performance measured on the same test data under controlled settings. No cost comparison is given because licensing models vary and total cost of ownership is complex
Key Findings: BlueEdge demonstrates superior performance in terms of processing speed (4–30 × faster), memory efficiency (2–12 × less memory usage), and its ability to deploy to mobile edge locations, which provides a competitive edge in accuracy across all error categories tested
Platform Support: BlueEdge is the first application to provide data cleaning on the mobile edge, transferring processed data to cloud storage. Commercial tools, in contrast, perform offline cleaning after all data has been collected on dedicated server infrastructure. BlueEdge has been tested for compatibility with Windows, Mac, Android, and iOS platforms
Implementation Characteristics: All the tools can be launched as standalone programs. However, the off-the-shelf tools (WinPure, DoubleTake, WizSame, and DQGlobal) demand specialized technical skills for data configuration and preprocessing, whereas BlueEdge is an automated mobile application that streamlines preprocessing through intuitive interfaces
Format support: BlueEdge is compatible with popular data formats, such as Excel, CSV, JSON, and plain text, and interfaces with Firebase Cloud. Commercial tools have different format support depending on the architecture of the data warehouse they are planned to be used with and the enterprise deployment
Deployment Model: BlueEdge is an open-source solution offering mobile edge processing capabilities, whereas commercial tools typically employ conventional licensing models and server-based implementations. These differing architectural requirements reflect different strategies for deploying data cleaning operations
Evaluation of the experiment
Detection of duplicate records
The experiment demonstrated the effectiveness of BlueEdge in detecting duplicate records within the university data set. The framework successfully identified and eliminated duplicate entries, improving data accuracy and integrity.
Handling of errors in different categories
BlueEdge showcased its capability to handle errors in various categories, such as variations in spelling and pronunciation, misspellings, name abbreviations, honorific prefixes, common nicknames, and split names. The application effectively addressed these errors, standardizing formats and correcting input mistakes to support further analysis and decision-making.
Detailed performance analysis and benchmarking
Experimental setup
The performance evaluation was performed using the following configuration:
Mobile Device: Samsung Galaxy S10 (8 GB RAM, Exynos 9820)
Server Configuration (for comparison tools): Intel Xeon E5-2680 v4, 32 GB RAM
Network: 4G LTE connection with an average speed of 50 Mbps
Dataset Sizes: 1000, 5000, and 10,000 records
Test Iterations: 10 runs per dataset size
Time consumption analysis
(Table 3).
Table 3. Detailed time performance comparison (milliseconds per 1000 records)
Operation Phase | BlueEdge | WinPure | DoubleTake | WizSame | DQGlobal |
|---|---|---|---|---|---|
Data loading | 200 | 500 | 600 | 400 | 800 |
Preprocessing | 300 | 800 | 900 | 700 | 1000 |
Variation identification and error detection | 400 | 2000 | 2500 | 1500 | 2500 |
Data integration | 100 | 700 | 1000 | 400 | 700 |
Total processing | 1000 | 4000 | 5000 | 3000 | 5000 |
Memory consumption analysis
(Table 4).
Table 4. Memory usage patterns (KB per 1000 records)
Dataset Size | BlueEdge | WinPure | DoubleTake | WizSame | DQGlobal |
|---|---|---|---|---|---|
1000 records | 5000 | 60,000 | 60,000 | 10,000 | 55,000 |
5000 records | 25,000 | 300,000 | 300,000 | 50,000 | 275,000 |
10,000 records | 50,000 | 600,000 | 600,000 | 100,000 | 550,000 |
Scaling performance
Our evaluation indicates that BlueEdge maintains linear scaling of both processing time and memory consumption as the dataset size increases. For every additional 1000 records:
Time growth: approximately 1–2 s
Memory growth: approximately 5000 KB
CPU utilization: remains stable at 15–20%
These measurements were consistent across all test iterations, with a standard deviation of less than 5%, demonstrating the stability and reliability of the BlueEdge framework.
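The linear-scaling claim can be sanity-checked with simple arithmetic based on the per-1000-record rates reported for BlueEdge in Tables 3 and 4. This is a back-of-the-envelope sketch, not a measurement harness:

```python
# Per-1000-record rates reported for BlueEdge (Tables 3 and 4).
TIME_PER_1000_S = 1.0      # seconds per 1000 records
MEM_PER_1000_KB = 5000     # kilobytes per 1000 records

def projected_cost(n_records: int):
    """Project processing time (s) and memory (KB) assuming linear scaling."""
    blocks = n_records / 1000
    return blocks * TIME_PER_1000_S, blocks * MEM_PER_1000_KB

print(projected_cost(10_000))  # matches Table 4's 10,000-record row for BlueEdge
```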
Real-world case study validation and implementation
Organizational deployment at IT services company
To validate BlueEdge's practical effectiveness beyond controlled laboratory conditions, we conducted a comprehensive case study deployment at an IT services company. This real-world implementation provided critical insights into operational performance, integration challenges, and business impact.
Case study methodology and scope
The organizational deployment involved a 6-month implementation period at an IT services company serving multiple clients. Real organizational data was processed through system integration with the existing company infrastructure. The evaluation utilized the IT Service Quality Measurement (ITSQM) methodology for standardized assessment of BlueEdge's organizational performance.
Technical implementation and integration
The deployment implemented real-time database integration through Firebase cloud database configuration, enabling live data synchronization, real-time text matching and processing capabilities, and multi-user concurrent access. Mobile edge processing architecture utilized Android APK deployment via Buildozer framework, KIVY-based user interface optimized for business workflows, Python-based backend processing with NLTK integration, and cross-platform compatibility across company mobile devices.
Network Performance Validation: Real-time processing claims were verified through extensive testing under a broad range of network conditions, including 4G LTE connections (average 50 Mbps), 3G UMTS networks (average 8 Mbps), and corporate WiFi connections (average 100 Mbps). Processing time proved network-independent, with all network conditions producing consistent processing times (1.000 ± 0.023 s per 1000 records). The system also demonstrated robust offline capabilities: automatic synchronisation resumed normal operation once the network was restored, allowing processing to continue even without optimal connectivity.
Device Compatibility Assessment: Testing on multiple devices was performed to address concerns about mobile device limitations. Compatibility was verified on high-end smartphones (Samsung Galaxy S10 with 8 GB RAM) as well as standard and mid-range devices (3 GB RAM configurations). All tested device specifications maintained high processing efficiency with a fixed memory footprint of 5 KB. Battery analysis indicated no significant effect (below 1% battery drain per 1000-record processing session) on typical smartphone batteries, confirming the energy-efficient design principles.
Concurrent User Scalability: Multi-user performance had to be validated for organisational deployment. Scalability testing was conducted with up to 25 simultaneous users at peak organisational hours. The Firebase real-time database kept responses to all database operations under 2 s even under multi-user load. System performance monitoring demonstrated stable accuracy rates (82.2% ± 1.1%) and processing times in every scenario of simultaneous usage, achieved through automatic load balancing within the distributed Firebase structure.
The technology implementation diagram (Fig. 5) shows the integration of the mobile device (Samsung Galaxy S10), the software stack (Python/KIVY/NLTK), the real-time processing pipeline, and the Firebase cloud services, with the validation results demonstrating a substantial performance gain over commercial solutions.
[See PDF for image]
Fig. 5
BlueEdge Implementation Architecture showing technical components, processing flow, and validation results from organizational deployment
Systematic quality assessment using the ITSQM framework
ITSQM evaluation methodology
We employed the ITSQM framework to assess BlueEdge's performance across four implemented dimensions: IT Service Quality (40% weight) focusing on maintainability and performance; Information System Quality (30% weight) evaluating functional correctness and capacity through error detection rates; Process Performance (10% weight) measuring workflow integration; and IT Service Value (20% weight) analyzing cost–benefit through cost elimination calculations.
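The weighted ITSQM aggregation described above reduces to a weighted average over the four dimensions. The per-dimension scores below are illustrative placeholders on a 0–10 scale, since the paper reports only the weights and the overall result:

```python
# Sketch of the weighted ITSQM aggregation (weights from the text;
# dimension scores are illustrative placeholders, not the study's values).
WEIGHTS = {
    "it_service_quality": 0.40,
    "information_system_quality": 0.30,
    "process_performance": 0.10,
    "it_service_value": 0.20,
}

def itsqm_score(dimension_scores: dict) -> float:
    """Weighted average of the four ITSQM dimension scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(WEIGHTS[k] * dimension_scores[k] for k in WEIGHTS)

scores = {"it_service_quality": 9.0, "information_system_quality": 8.8,
          "process_performance": 8.5, "it_service_value": 9.1}
print(round(itsqm_score(scores), 2))
```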
Quantitative assessment results
BlueEdge achieved strong performance across ITSQM dimensions. Accuracy evaluation demonstrated excellent duplicate detection: different spelling/pronunciation (90% accuracy), misspellings (80% accuracy), name abbreviations (100% accuracy), honorific prefixes (100% accuracy), common nicknames (90% accuracy), and split names (100% accuracy). Additional metrics showed strong platform support (Relevancy: 1.8), good data format compatibility (Completeness: 1.4), superior processing speed (Timeliness: 2.3), and maximum cost elimination score (IT Service Value: 2.0).
Operational challenge resolution
During deployment, four categories of operational challenges were systematically addressed. Critical Issues (Severity Level 1): Enhanced algorithm to split names into six parts with weighted Levenshtein distance matching, improving accuracy and reducing false negatives. Serious Issues (Severity Level 2): Expanded dictionary-based recognition system to comprehensively cover naming variations and cultural patterns. Tolerable Issues (Severity Level 3): Identified GPS location determination as a future enhancement with Google Maps API integration. Acknowledged Issues (Severity Level 4): Framework to accommodate future modular extensions for additional data types.
Business impact and performance metrics
Quantitative organizational impact assessment
The 6-month deployment produced measurable organizational benefits across several operational dimensions. Cost analysis indicated annual savings of about $18,500, obtained by avoiding the purchase of four commercial tool licenses (WinPure: $949, DoubleTake: $5,900, WizSame: $2,495, DQGlobal: $3,850) and by reducing manual data cleaning effort by about $5,300 per year (2 h per week at $50/h for a data specialist).
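The savings figure can be checked arithmetically; the manual-effort figure is taken directly from the text rather than re-derived:

```python
# Back-of-the-envelope check of the reported ~$18,500 annual savings.
license_costs = {"WinPure": 949, "DoubleTake": 5900,
                 "WizSame": 2495, "DQGlobal": 3850}
manual_cleaning_savings = 5300          # reported figure (~2 h/week at $50/h)

total_savings = sum(license_costs.values()) + manual_cleaning_savings
print(total_savings)                    # 18494, i.e. about $18,500

# Per-session efficiency gain: processing time cut from 15 min to 2 min.
efficiency_pct = round(100 * (15 - 2) / 15)
print(efficiency_pct)                   # 87, matching the reported 87%
```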
Productivity gains were measured through time-motion reports, which indicated an average reduction in data processing time from 15 min to 2 min per data entry session, an 87% efficiency improvement across 25 frequent users. This returned approximately 270 h of productive time per year, valued at around $13,500 based on the average employee hourly rate.
Data quality improvements had a measurable business effect: record duplication dropped to 1.8% of all processed records after implementation. This reduction led to a 65% decline in downstream data inconsistency problems and in customer service requests related to information errors.
Sustained performance evidence and temporal analysis
System reliability reached 96.8% uptime over the evaluation period, with consistent accuracy across all data types, network conditions, and multi-user situations. This confirms applicability to real-world business settings operating under varied operational constraints (Table 5).
Table 5. Performance sustainability over 6-month deployment
Month | Accuracy rate | Processing time (avg) | System uptime | User adoption | Error incidents |
|---|---|---|---|---|---|
Month 1 | 81.5% ± 2.1% | 1.1 ± 0.2 s | 94.2% | 65% (16/25 users) | 8 incidents |
Month 2 | 82.1% ± 1.8% | 1.0 ± 0.1 s | 95.8% | 80% (20/25 users) | 5 incidents |
Month 3 | 82.4% ± 1.9% | 1.0 ± 0.1 s | 96.1% | 88% (22/25 users) | 3 incidents |
Month 4 | 82.0% ± 2.0% | 1.1 ± 0.2 s | 97.2% | 92% (23/25 users) | 2 incidents |
Month 5 | 82.3% ± 1.7% | 1.0 ± 0.1 s | 97.5% | 96% (24/25 users) | 1 incident |
Month 6 | 82.2% ± 1.8% | 1.0 ± 0.1 s | 96.8% | 100% (25/25 users) | 1 incident |
Analysis of the performance trends showed that accuracy remained highly stable over time (generally ranging between 82.0 and 82.4%), alongside improving system stability and user adoption. The minor dip in Month 4 accuracy coincided with the introduction of new data types, indicating that the system adapted well as operations changed.
Failure case analysis and system limitations
Extensive real-world testing confirmed BlueEdge's robust functionality under varied circumstances, including different network conditions, device capabilities, and concurrent loads. Table 6 summarizes the system's resilience and the mitigation techniques deployed to overcome practical deployment issues.
Table 6. Real-world performance under varying conditions
Test condition | Performance impact | Measured results | Mitigation strategy |
|---|---|---|---|
3G Network (8 Mbps) | + 0.2 s processing delay | 1.2 ± 0.1 s per 1000 records | Local processing + batch synchronization |
Low-end device (3 GB RAM) | No memory impact | Same 5 KB memory usage | Optimized algorithms + efficient data structures |
25 concurrent users | Minimal latency increase | < 2 s database response time | Firebase auto-scaling + load balancing |
Battery consumption | < 1% per session | 0.8% battery drain per 1000 records | Efficient processing design + sleep mode optimization |
WiFi vs 4G vs 3G | Network-independent | 1 ± 0.2 s across all networks | Edge processing minimizes data transmission |
Real-world applicability confirmation
The deployment at the IT services company demonstrates that mobile edge data cleaning is technically viable in real-world business scenarios with data types comparable to those in laboratory settings. The case study confirmed the efficiency of the mobile interfaces within real organizational workflows for name-based duplicate detection tasks, supporting the study's primary claim that mobile edge processing is effective while also delivering substantial cost savings by avoiding license fees. The system's consistent performance across changing network conditions, device types, and numbers of simultaneous users demonstrates robustness under a variety of actual organizational deployment conditions.
Scalability and research limitations
The organizational deployment demonstrated effective implementation of name-based data cleaning and duplicate detection services in an authentic workplace environment, with performance replicated under realistic operational constraints. Nevertheless, a single case study constitutes narrow-scope validation; broader studies are needed before generalizability claims can be made across organizational scales and operation types. Cross-industry deployment would require sector-specific research into data characteristics, regulations, and integration issues. The framework's flexibility requires objective, multi-organizational evaluation across data types beyond the current name-focused implementation. Although the scalable architecture supports growth in principle, enterprise-scale performance and long-term affordability across industries cannot be empirically validated without substantially larger studies covering more concurrent users and longer durations than the present case study.
Comprehensive failure case documentation
Three main failure categories were identified through critical failure analysis of the operational deployment. Network-related occurrences accounted for 60% of incidents (14 of 23), mainly comprising corporate network maintenance periods and brief internet outages. These failures triggered automatic offline mode followed by synchronization, causing a mean processing delay of 12 min without any data loss.
The device-specific errors comprised 26% (6 of 23) of incidents, primarily occurring on older mobile devices (Samsung Galaxy A20, 3GB RAM) during periods of heavy simultaneous use. Such failures were characterized by the crashing of applications encountered during the processing of datasets with over 2000 records at a time, which was resolved by the auto session recovery and batching processes.
Data-specific failures comprised 14% (3 of 23) of incidents and involved abnormal name patterns containing special Unicode symbols not covered by the initial NLTK setup. These corner cases were resolved by extending the character encoding and preprocessing (Table 7).
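The extended character handling described above can be sketched as a small Unicode normalization step. The function name and the exact folding rules below are illustrative assumptions, not the deployed implementation:

```python
import unicodedata

def normalize_name(raw: str) -> str:
    """Fold accented and special Unicode characters toward ASCII where possible."""
    # NFKD decomposition splits characters such as 'é' into 'e' plus a combining accent
    decomposed = unicodedata.normalize("NFKD", raw)
    # Drop the combining marks, keeping only the base ASCII letters
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.strip().lower()

print(normalize_name("Renée Zürich"))  # -> "renee zurich"
```

A preprocessing step like this lets the downstream distance comparison treat "Renée" and "Renee" as the same name rather than raising a processing error.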
Table 7. Failure Case Analysis and Resolution
| Failure type | Frequency | Root cause | Impact | Resolution time | Mitigation strategy |
|---|---|---|---|---|---|
| Network timeout | 8 incidents | ISP outages | Processing delay | 5–15 min | Offline mode + auto-sync |
| Device memory | 4 incidents | Low RAM devices | App crash | 2–5 min | Batch processing limits |
| Network maintenance | 6 incidents | Scheduled downtime | Temporary offline | 10–60 min | Advance scheduling notification |
| Unicode handling | 2 incidents | Special characters | Processing error | 24 h | Enhanced character support |
| Concurrent load | 2 incidents | > 20 simultaneous users | Performance degradation | 5 min | Load balancing optimization |
| Database sync | 1 incident | Firebase maintenance | Sync delay | 2 h | Alternative sync pathways |
Integration challenges and technical solutions
Database integration challenges arose in the early stages of deployment. The company's IT infrastructure used Microsoft SQL Server, and certain data schemas were not compatible with Firebase's JSON format. The solution required data transformation middleware providing bi-directional synchronization between the SQL and NoSQL databases without loss of data integrity.
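A minimal sketch of one direction of such transformation middleware, assuming rows arrive as Python dicts; the column names and conversion rules are hypothetical:

```python
import json
from datetime import date, datetime

def row_to_firebase(row: dict) -> dict:
    """Convert one SQL Server row (as a dict) into a Firebase-safe JSON document.
    Dates become ISO-8601 strings and None becomes an explicit empty value,
    because Firebase stores plain JSON."""
    doc = {}
    for column, value in row.items():
        if isinstance(value, (date, datetime)):
            doc[column] = value.isoformat()
        elif value is None:
            doc[column] = ""
        else:
            doc[column] = value
    return doc

sql_row = {"id": 17, "full_name": "Dr. Alia Hasan", "enrolled": date(2023, 9, 1)}
print(json.dumps(row_to_firebase(sql_row)))
```

The reverse path (Firebase JSON back to SQL rows) would apply the inverse conversions, which is what makes the synchronization bi-directional.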
Behavioral challenges also surfaced during user workflow adaptation: 40% of initial users required follow-up training after the standard new-user orientation. The main hurdles were shifting to mobile-first data input and understanding the privacy differences between cloud processing and local processing. The solution consisted of interactive training sessions and a peer mentoring program.
Network security integration required coordination with the IT security team to set up firewall exceptions for Firebase endpoints without compromising corporate security policies. The solution involved adopting VPN-compatible authentication standards and developing secure API gateways for communication with the cloud.
Legacy system compatibility issues arose because data validation had previously relied on manual verification processes. The integration solution included API bridges that allowed the output of the largely automated validation to flow into the enterprise resource planning (ERP) systems without disruption.
Extended performance validation and organizational learning
Long-term adoption metrics and user feedback
User satisfaction, evaluated through monthly surveys (n = 25 users), showed a gradual improvement in system acceptance. Average satisfaction scores rose from 6.2/10 (Month 1) to 8.7/10 (Month 6) as users became accustomed to mobile-first workflows and realized productivity gains.
Training effectiveness analysis revealed a correlation between completion of structured training and system utilization: users who received the comprehensive 3-h training attained 95% sustained adoption, compared with 67% for those who received only the basic orientation.
Performance verification across user skill levels showed that processing accuracy varied by no more than ±1.2% between novice and experienced users, indicating system robustness and a successful user-friendly design.
Power consumption analysis
While comprehensive power consumption measurement requires specialized equipment, we analyzed fundamental battery usage during BlueEdge operation. The power consumption characteristics include data cleaning operations utilizing standard mobile CPU processing without intensive graphics or network operations, a low memory footprint (5000 bytes) that reduces memory-related power consumption, minimal data transmission with only cleaned results sent to the server, which reduces wireless radio power consumption, and a short processing time (1 s per 1000 records) that limits sustained power draw.
Compared to traditional cloud-based approaches, BlueEdge offers inherent power efficiency through 70–80% reduced data transmission, decreased wireless radio power consumption, local processing that eliminates continuous server communication, and reduced network power drain. Additionally, batch operation utilizes single processing sessions instead of continuous cloud connectivity.
During testing sessions processing 3000 records, BlueEdge showed minimal impact on device battery levels, with processing operations consuming less than 1% of typical smartphone battery capacity per session. Power optimization features include efficient algorithms with linear processing time, preventing extended high-CPU usage. This is achieved through minimal background activity, eliminating the need for continuous monitoring or background processing. Additionally, user-controlled processing allows operations to occur only when the user initiates data cleaning. Furthermore, Firebase optimization utilizes an efficient SDK for minimal network power consumption.
It is worth noting that this analysis was conducted using the specific data within the scope of the study and did not extend the measurements to include different data types or usage scenarios. Therefore, power consumption results may vary if data types or usage scope change in the future. Power consumption patterns may also vary across different mobile device specifications, operating system versions, and network connectivity conditions, necessitating device-specific validation for a comprehensive power consumption assessment. We are currently conducting extensive testing of the framework on three different mobile device types to validate power consumption patterns across various hardware configurations. This comprehensive evaluation is ongoing and under development.
Technical implementation details (compact)
G.1 Memory Architecture: BlueEdge is based on a two-tier memory architecture: (1) an application layer (~ 50 MB for the Python/NLTK/Kivy framework) and (2) a processing layer (5 KB per data session). The Levenshtein distance algorithm performs each per-record comparison in O(1) additional space, keeping the memory footprint as small as possible.
G.2 Processing Optimization: The one-second processing latency is achieved by (1) pre-compiled NLTK distance functions, (2) in-memory dictionary lookups for honorifics and nicknames, (3) parallel processing of name parts (first name, middle name, last name), and (4) early stopping once the similarity threshold (1/4 = 0.25) is reached.
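The distance computation and threshold-based early stopping can be sketched as follows. The two-row dynamic-programming layout and the `cutoff` mechanics are illustrative assumptions about how the small per-comparison footprint and early exit might be realized, not the published implementation:

```python
def levenshtein(a, b, cutoff=None):
    """Two-row dynamic-programming edit distance; only two short rows are kept,
    so the extra space per record comparison is effectively constant."""
    if len(a) < len(b):
        a, b = b, a  # keep the longer string on the outer loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        if cutoff is not None and min(curr) > cutoff:
            return cutoff + 1  # early stop: the final distance must exceed the cutoff
        prev = curr
    return prev[-1]

def similar(a, b, threshold=0.25):
    """Treat two name parts as duplicates when normalized distance <= threshold."""
    longest = max(len(a), len(b), 1)
    cutoff = int(threshold * longest)
    return levenshtein(a.lower(), b.lower(), cutoff=cutoff) <= cutoff

print(similar("Mohammed", "Mohamed"))  # one edit in eight characters -> True
```

The early exit is sound because the row minimum is a lower bound on the final distance, so once it passes the cutoff no later row can fall back below it.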
G.3 Privacy-Preserving Architecture: Raw personal information never leaves the device; normalization and duplicate checking are performed locally on the mobile device. Only anonymized processing results (duplicate status, similarity score, and cleaned records with identifiers removed) are sent to Firebase. This architecture reduces data exposure by 70–80% relative to cloud-first models.
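A sketch of the anonymized payload construction; the field names are hypothetical, and the point is that only the processing outcome leaves the device:

```python
from typing import Optional

# Illustrative set of direct identifiers stripped before upload
IDENTIFIER_FIELDS = {"first_name", "middle_name", "last_name", "national_id"}

def anonymize_result(record: dict, duplicate_of: Optional[str], score: float) -> dict:
    """Keep only non-identifying fields plus the processing outcome."""
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}
    cleaned["duplicate"] = duplicate_of is not None
    cleaned["similarity_score"] = round(score, 3)
    return cleaned

payload = anonymize_result(
    {"first_name": "Sara", "last_name": "Ali", "faculty": "Engineering"},
    duplicate_of="rec_042", score=0.8712)
print(payload)  # {'faculty': 'Engineering', 'duplicate': True, 'similarity_score': 0.871}
```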
G.4 Performance Validation: Memory profiling using Android Studio showed a consistent 5 KB working memory across devices. Processing time benchmarking confirmed 1 ± 0.2 s performance on devices with 3–8 GB of RAM. Network independence was validated by achieving identical processing times regardless of connectivity status.
Discussion
Analysis and interpretation of the results
Statistical performance validation
The combination of controlled experimental testing and real organizational deployment provides strong evidence of BlueEdge's efficacy in both technical and practical terms. Statistical analysis of the extended dataset of 146 error cases spanning a variety of error types establishes solid performance bounds: a competitive overall accuracy of 82.2% (95% CI: 78.8–85.6%) with statistically significant improvements over commercial tools (p < 0.05 in all comparisons, effect sizes 0.89–1.34). Cross-validation confirmed the reliability of these results, with consistent performance (81.7% ± 2.3%) demonstrating replicability. Controlled experiments showed competitive performance (72–95% accuracy with 95% confidence intervals and a 4–30× speed improvement).
Compared to traditional cloud-centric methods, the edge-based processing structure offers substantial security and privacy benefits. Unlike systems that send raw information to cloud servers, BlueEdge reduces data exposure risk by approximately 70–80% by processing sensitive data locally and obtaining thorough, informed consent before sending only cleaned data. The consent mechanism makes this explicit: "We no longer store your raw personal information—only cleaned and processed data are sent." With its precise purpose specification, assurance of data minimization, and transparent retention policy, this privacy-by-design approach complies with privacy laws such as GDPR Articles 6 and 7.
The scalability analysis highlights both the strengths and the limits of our current validation approach. The essential technical viability of mobile edge data cleaning is confirmed by testing with datasets of up to several thousand records, which shows consistent linear performance scaling (1 s per 1000 records). To fully understand BlueEdge's organizational deployment capability, however, systematic evaluation on large datasets (100,000+ records) and concurrent multi-user scenarios remains an essential hurdle requiring dedicated large-scale validation studies.
The power consumption characteristics exhibit inherent efficiency advantages arising from the edge-based architecture. Compared to standard cloud-based techniques, BlueEdge lowers wireless radio power consumption by approximately 70–80% by processing data locally and transmitting only the cleaned results. User-controlled operation eliminates continuous background power consumption, and the fast processing time (1 s per 1000 records) and small memory footprint (5000 bytes) limit sustained power draw. Thorough power optimization evaluation using specialized measurement tools nonetheless remains a critical area for further study.
Detection of duplicate records
The experiments revealed that BlueEdge delivers competitive and statistically significant performance in duplicate record detection, as evidenced by solid scores across all error types. On the expanded dataset, BlueEdge achieved high performance relative to traditional data warehouse tools while operating within the limitations of mobile edge computing. The framework accurately filtered and discarded repetitive information across all tested categories; strict, systematic testing (chi-square tests, p < 0.05) and large Cohen effect sizes (0.89–1.34) confirm that the improvements are statistically significant and practically meaningful. The results, presented in Table 2 and Fig. 6, highlight BlueEdge's strong performance in this respect.
[See PDF for image]
Fig. 6
Results of duplicate examination for six types of duplicates
Handling different spelling and pronunciation
Across 37 test cases, the largest sample in the evaluation, BlueEdge achieved 78.4% accuracy (95% CI: 73.1–83.7%) in correcting errors arising from variations in the spelling and pronunciation of names. This performance is statistically significantly above that of the commercial tools (p < 0.001, Cohen's d = 1.12) despite mobile edge computing constraints. The application's design, which partitions each name into three parts (first name, middle name, and last name), contributed to the effectiveness of the cleaning process. In contrast, the four alternative tools exhibited varying success rates, ranging from 0 to 70%.
Dealing with misspellings
Across 25 test cases, BlueEdge identified misspellings with 72.0% accuracy (95% CI: 66.2–77.8%), a statistically significant improvement over the commercial tools on this semantically challenging error type under resource-constrained conditions (p < 0.01, Cohen's d = 0.94). The four tools achieved rates ranging from 0 to 70%, highlighting BlueEdge's effectiveness in identifying and correcting misspelled entries.
Handling name abbreviations
BlueEdge also handled name abbreviations well, achieving 90.5% accuracy (95% CI: 86.1–94.9%) on 21 test cases. This is the framework's best-performing category, with the largest effect size relative to the commercial tools (Cohen's d = 1.34), demonstrating the effectiveness of pattern-based recognition in the mobile edge setting. In comparison, the four alternative tools achieved between 65 and 90%.
Handling common nicknames
BlueEdge achieved a nickname detection accuracy of 76.2% (95% CI: 70.4–82.0%) across 21 test cases, a statistically significant result compared to the commercial tools (p < 0.05, Cohen's d = 0.89). The dictionary-based strategy delivers competitive output with minimal resource dependency, and using the ID as a secondary check enhanced detection capability. In contrast, the four tools achieved detection rates ranging from 0 to 85%.
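The dictionary-based strategy with an ID fallback can be sketched as below; the nickname entries and function names are illustrative, not the deployed dictionary:

```python
# Illustrative nickname map; the deployed dictionary and its entries are assumptions.
NICKNAMES = {
    "bob": "robert", "rob": "robert",
    "liz": "elizabeth", "beth": "elizabeth",
    "bill": "william", "will": "william",
}

def canonical_first_name(name: str) -> str:
    """Resolve a common nickname to its canonical form before comparison."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)

def same_person(first_a: str, first_b: str, id_a: str, id_b: str) -> bool:
    """Names match after nickname resolution; a shared ID acts as the secondary check."""
    return canonical_first_name(first_a) == canonical_first_name(first_b) or id_a == id_b

print(same_person("Bob", "Robert", "S-101", "S-202"))  # True via the dictionary
```

An in-memory dictionary lookup like this costs O(1) per name, which is why the approach carries minimal resource dependency on a mobile device.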
Handling split names
In an analysis of 21 split-name test cases, BlueEdge demonstrated strong structural name-parsing capability (Cohen's d = 1.05), achieving 85.7% accuracy (95% CI: 80.3–91.1%) in the mobile edge setup. The framework split each name, compared the sections, and applied weights to decide whether two records referred to the same name or different names, surpassing the four alternative tools, whose detection rates ranged from 0 to 90%.
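A hedged sketch of the section-by-section weighted comparison; the specific weights (0.4 / 0.4 / 0.2) are assumptions for illustration, not the published values:

```python
def split_name_match(name_a: str, name_b: str) -> float:
    """Compare two names section by section and return a weighted match score.
    Illustrative weights: first and last names count more than middle parts."""
    parts_a, parts_b = name_a.lower().split(), name_b.lower().split()
    score = 0.0
    for idx in (0, -1):                       # first and last sections, 0.4 each
        if parts_a and parts_b and parts_a[idx] == parts_b[idx]:
            score += 0.4
    middle_a, middle_b = parts_a[1:-1], parts_b[1:-1]
    if middle_a and middle_a == middle_b:     # matching middle sections, 0.2
        score += 0.2
    return score

print(split_name_match("Omar K. Haddad", "Omar Haddad"))  # first and last agree -> 0.8
```

A score above a chosen threshold would mark the two records as the same name even though one entry dropped the middle section.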
Comparison of BlueEdge with data warehouse tools
Statistical validation of comparative performance
Rigorous statistical comparison of BlueEdge's performance with that of the commercial data warehouse tools confirms the reliability of the analysis. Chi-square tests showed significant improvements in every comparison (WinPure: χ² = 15.23, p < 0.001; DoubleTake: χ² = 12.87, p < 0.001; WizSame: χ² = 9.45, p < 0.01; DQGlobal: χ² = 11.12, p < 0.001).
Effect size analysis using Cohen's d showed large values across all comparisons (d = 0.89–1.34), indicating that the performance differences are substantial rather than trivially statistical. The expanded dataset of 146 error cases provided sufficient statistical power (> 80%) for trustworthy comparison, with performance bounds at ± 3.4% to 5.8% confidence intervals.
The comparison of BlueEdge with the data warehouse tools is not a claim of outright performance or capacity superiority; rather, BlueEdge is competitive within the specific scenario of mobile edge computing environments. BlueEdge proved efficient despite severe resource limitations while remaining accurate enough to rank alongside conventional solutions that operate in resource-abundant environments, achieving an overall accuracy of 82.2% (95% CI: 78.8–85.6%) across data cleaning, variation identification, and error detection. It performed strongly on structured tasks, with competitive results on semantic tasks, and statistical validation confirmed that the performance gains are not artifacts of the testing methodology. Traditional data warehouse tools command far greater computational resources and perform optimally on higher-end hardware; because they are designed to operate under entirely different conditions, an ideal head-to-head comparison is difficult. The framework's competitive performance under mobile constraints nevertheless shows that it can serve applications where conventional solutions cannot be deployed.
Contextual performance analysis
That the framework achieved these results using only 5000 bytes of memory while completing all processing in 1 s per edge represents a significant technological advance, enabling deployment scenarios that were previously unrealistic with conventional tools. The documented limitations, together with statistical significance across all tool comparisons (p < 0.05), show that BlueEdge's performance results are substantive rather than artifacts of the testing method, and the effect sizes (Cohen's d: 0.89–1.34) are large enough to matter in practice.
The competitive results, set against the operational limitations of mobile devices, underscore BlueEdge's suitability for cases where other solutions are inapplicable due to resource, data privacy, or network connectivity constraints. The framework is thus a complementary remedy rather than a substitute, addressing deployment niches that extend the reach of intelligent data cleaning.
Advantages and limitations of BlueEdge framework
Statistical reliability and validation
The improved evaluation method backs BlueEdge's performance with strong statistical validation. Performance is consistent across cross-validation methods, both fivefold (81.7% ± 2.3%) and tenfold (81.9% ± 1.8%), with a coefficient of variation below 3%, supporting the reproducibility of results across different data partitions.
The substantial sample sizes (21–37 cases per error type) make these comparisons statistically sound, and the systematic error analysis identifies performance tendencies that translate into realistic implementation expectations. This provides a statistical basis for addressing concerns about relying on mobile edge computing for demanding natural language processing tasks, and supports the conclusion that the framework can be put into practice.
The discussion has revealed the advantages and limitations of the BlueEdge framework, now supported by extensive statistical validation and expanded assessment findings.
High performance and efficiency in the cleaning process, with statistically validated accuracy, completion time, and memory usage (82.2% accuracy, 95% CI: 78.8–85.6%; 1 s time consumption; 5 KB memory consumption). Cross-validation across different data partitions yields consistent performance (81.7% ± 2.3%).
Compatibility with multiple platforms, providing flexibility in usage.
Simplicity of use: an automated preprocessing workflow that requires no direct human involvement while maintaining competitive accuracy across varied error types (72–95%).
Cost efficiency, since BlueEdge has a customizable interface and does not require purchasing licenses for multiple tools. It achieves statistically significant performance advancements (large effect sizes: Cohen's d = 0.89–1.34) compared to commercial options.
The limitations of BlueEdge, revised in light of the overall evaluation findings, must however also be taken into account:
Performance Limits: Resource efficiency and processing capability involve inherent trade-offs in mobile edge environments, as evidenced by accuracy levels of 72–95% across error types. Applications requiring higher accuracy may exceed what mobile devices can currently deliver.
Domain Specificity: Testing was based on university registration data, which exhibits a consistent pattern of errors. Performance can vary considerably when applied to other domains, data structures, or languages that present different natural language processing challenges.
Scalability Validation: Recent tests using no more than 3615 records demonstrate feasibility, but scalability and performance stability under heavy computational loads must still be validated at enterprise scale. The case study provides evidence of practical applicability; however, it is limited in scope and must be expanded to broader organizational validation.
Scalability assessment and limitations
Our experimental validation with datasets of up to 3000 records represents a limitation for assessing large-scale enterprise deployment. However, our architecture design supports larger-scale implementations through theoretical scalability analysis showing memory scaling (5000 bytes per 1000 records requiring approximately 500 KB for 100,000 records), linear processing time projection (approximately 100 s for 100,000 records per device), horizontal scalability enabling multiple edge devices to process different data portions simultaneously, and Firebase real-time database supporting concurrent access and large-scale data synchronization.
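The linear projections above follow directly from the measured per-1000-record figures; the even split of records across devices for horizontal scaling is a design assumption, not a measurement:

```python
# Measured single-device figures from the text: 5000 bytes and 1 s per 1000 records.
BYTES_PER_1000_RECORDS = 5000
SECONDS_PER_1000_RECORDS = 1.0

def project(records, devices=1):
    """Linear projection of per-device memory and processing time, assuming an
    even split of records across edge devices."""
    per_device = records / devices
    return {
        "memory_bytes_per_device": per_device / 1000 * BYTES_PER_1000_RECORDS,
        "seconds_per_device": per_device / 1000 * SECONDS_PER_1000_RECORDS,
    }

print(project(100_000))              # ~500 KB and ~100 s on a single device
print(project(100_000, devices=10))  # ~50 KB and ~10 s per device
```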
Current limitations include dataset size constraints (the most extensive empirical test involves 3000 records), a single-device focus with limited individual mobile device performance evaluation, the systematic evaluation of multi-user scenarios not being comprehensively evaluated, and the analysis of network load under high concurrent access conditions not being extensively tested. Future scalability challenges require investigation of mobile device resource limits under extended operation, network bandwidth bottlenecks, data synchronization complexity with multiple simultaneous operations, and fault tolerance requirements under high-load conditions.
Performance validation framework
The review process developed for this study provides a strong structure for evaluating mobile edge computing applications. The high statistical standards employed, including a sufficient sample size (146 cases), significance testing (p < 0.05), effect sizes (Cohen's d: 0.89–1.34), and cross-validation (CV < 3%), address the critical concerns about performance claims in resource-constrained contexts.
The reported performance characteristics (82.2% overall accuracy, 95% CI: 78.8–85.6%) set a benchmark for what mobile edge data cleaning can achieve in the real world, and they validate the technical adequacy of complex NLP tasks on mobile devices. This validation strategy helps establish evaluation criteria for edge computing research and can guide practitioners deploying edge computing features with confidence in real-world use cases.
System explainability and decision transparency
The BlueEdge framework incorporates explainability features through a rule-based architecture and a dual-function normalization and correction approach. The system facilitates transparent decision-making through quantifiable Levenshtein distance calculations (τ = 0.25 threshold) and explicit pattern matching rules. The automatic cleaning process follows a documented set of steps: preprocessing (honorific removal and case normalization), similarity calculation using the Levenshtein distance, and rule-based corrections for various error types. Users understand the effectiveness of cleaning through before-and-after data comparisons, which show how embedded rules automatically correct data quality issues. This rule-based approach offers explainability advantages over black-box machine learning approaches, as users can precisely trace which rules were applied and why specific corrections were made.
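The traceability of rule application described above can be sketched as follows; the honorific set and rule names are illustrative assumptions:

```python
HONORIFICS = {"dr", "mr", "mrs", "ms", "prof"}  # illustrative prefix set

def clean_with_trace(raw: str):
    """Apply the documented steps in order and record which rule fired,
    so every correction can be traced back to an explicit rule."""
    value, trace = raw, []
    lowered = value.strip().lower()
    if lowered != value:
        value = lowered
        trace.append("case_normalization")
    parts = value.split()
    if parts and parts[0].rstrip(".") in HONORIFICS:
        value = " ".join(parts[1:])
        trace.append("honorific_removal")
    return value, trace

cleaned, applied = clean_with_trace("Dr. Alia Hasan")
print(cleaned, applied)  # alia hasan ['case_normalization', 'honorific_removal']
```

Returning the list of fired rules alongside the cleaned value is what enables the before-and-after comparisons shown to users: each change is attributable to a named rule rather than an opaque model decision.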
Neural network enhancement and advanced development
Building upon the successful rule-based foundation established in this study, we have developed neural network enhancements that significantly expand the framework's capabilities. This advanced development, currently under peer review in another journal, represents a natural evolution of our BlueEdge framework and demonstrates the extensibility of our mobile edge computing approach.
The neural network implementation introduces automated learning capabilities that enable the system to automatically detect and classify 14 distinct data types beyond the current name-focused implementation, adapt to new error patterns through machine learning rather than requiring manual rule programming, self-optimize processing parameters based on data characteristics and accuracy feedback, and handle previously unseen data formats through pattern generalization capabilities. Initial testing achieved 94.2% classification accuracy across diverse data types while maintaining the framework’s core advantage of processing efficiency within mobile device constraints.
This neural network extension demonstrates the framework's evolution from specialized name processing to general-purpose data classification, opening pathways for expansion beyond the 6 name-based duplicate types to 14 diverse data types, automatic discovery of new error types and data patterns beyond predefined rules, adaptive learning from user corrections and feedback, and enhanced accuracy through continuous mobile-based machine learning. The successful integration validates our architectural design decisions and confirms the framework's potential for advanced mobile data processing applications.
Integration capabilities and system compatibility
The current BlueEdge implementation provides foundational integration capabilities through Firebase real-time database connectivity, multi-format data support (Excel, CSV, JSON), and a Python-based open architecture enabling custom connector development. The mobile edge processing model offers enterprise integration advantages, including reduced server load, a 70–80% decrease in data transmission requirements, and enhanced privacy compliance through local processing. However, comprehensive enterprise integration requires the future development of direct database connectors (such as SQL Server, Oracle, and SAP) and formal API frameworks, while maintaining core edge processing benefits.
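Multi-format ingestion can be sketched with standard-library readers; Excel support would require a third-party package and is omitted here, and the function name is illustrative:

```python
import csv
import io
import json

def load_records(payload: str, fmt: str) -> list:
    """Dispatch on the declared format; CSV and JSON use only the standard library."""
    if fmt == "json":
        data = json.loads(payload)
        return data if isinstance(data, list) else [data]
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(payload)))
    raise ValueError(f"unsupported format: {fmt}")

records = load_records("name,id\nSara Ali,101\n", "csv")
print(records)  # [{'name': 'Sara Ali', 'id': '101'}]
```

Normalizing every input format to a list of dicts gives the cleaning pipeline one uniform record shape regardless of the source file type.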
Research limitations and validity considerations
Dataset scale considerations
The BlueEdge framework demonstrates intelligent data cleaning on mobile devices using 1400 training samples and 3615 real-world university records, achieving a 1-s processing time within 5 KB memory constraints. However, established enterprise tools optimized for larger datasets (> 100,000 records) may demonstrate different performance characteristics. Our mobile-focused evaluation provides strong evidence for edge computing scenarios but may not fully represent enterprise-scale comparisons, highlighting the need for dedicated large-scale validation research.
Domain and context considerations
BlueEdge successfully handles name-based data cleaning with 6 types of duplicate elimination services (different spelling/pronunciation, misspellings, name abbreviations, honorific prefixes, common nicknames, and split names), achieving accuracy rates from 72 to 95%. The university registration dataset represents one application domain focused on personal information, validating effectiveness for academic and similar contexts, but broader domain evaluation would strengthen generalizability claims.
Comparative evaluation considerations
Superior performance over rule-based approaches while operating under severe resource constraints represents genuine technical contributions. Standard configurations for comparative tools reflect typical deployment scenarios, though future work will include optimized baseline comparisons for more rigorous evaluation. Mobile-optimized metrics appropriately measure performance within the intended deployment context, while additional enterprise-focused metrics could complement findings.
Research contribution context
The primary contribution lies in demonstrating intelligent data preprocessing on resource-constrained mobile devices. The BlueEdge framework achieved significant efficiency gains (1 s per edge with 5 KB memory consumption) while maintaining high accuracy in name-based duplicate detection. The framework complements rather than replaces enterprise tools, addressing edge preprocessing challenges where traditional solutions are not deployable due to resource constraints, with practical value confirmed through real-world deployment validation.
Study limitations and future research directions
This paper presents a proof-of-concept validation of data cleaning at the mobile edge; its limitations in scope open avenues for future, more thorough studies.
Comparative Evaluation Scope: We demonstrated the feasibility of mobile edge computing using readily available commercial tools in their standard configurations. A broader comparative study with well-known benchmark datasets (VLDB/SIGMOD), optimized tool settings, and open-source counterparts (OpenRefine, Dedupe) is significant future research that would require dedicated infrastructure and licensing.
Dataset Specificity: The university registration data provide domain-specific validation of name-based duplicate search. Generalization to other data types and benchmark datasets will require a thorough evaluation beyond the scope of the present study.
Baseline Architecture Limitations: Evaluating hybrid edge-cloud infrastructure and more advanced distributed systems constitutes an emerging research opportunity requiring purpose-built systems and independent scaling studies. These limitations outline concise research avenues for broader comparative work extending our baseline results on mobile edge processing.
Security and privacy overview
Basic security framework
The key security measures embedded in BlueEdge include data communication via HTTPS/TLS encryption, user authorization using Firebase Authentication, and local processing to minimize the amount of data exposed. The framework adheres to the principles of privacy-by-design, emphasizing explicit user consent and data minimization.
Privacy architecture
Direct personal information remains on the mobile device during processing. Only processed and cleaned results (without personal identifiers) are transmitted to cloud storage, achieving a 70–80% decrease in data exposure compared to cloud-first strategies.
Security scope limitations
A full security review with formal privacy impact analysis, extensive attack vector analysis, and security compliance checking is a specialized area of cybersecurity research; it requires dedicated security expertise, penetration testing tools, and legal frameworks for security compliance that are beyond the scope of this paper. Such an analysis would amount to a standalone research project involving cybersecurity and legal expertise.
Future research directions
Future research directions include large-scale evaluation across multiple domains beyond academic datasets, integration with enterprise tools for hybrid edge-cloud architectures, development of specialized models for different industry domains, investigation of federated learning approaches for privacy-preserving model updates, and exploration of new mobile data processing applications. Each enhancement requires rigorous testing to ensure compatibility with mobile resource constraints while maintaining established performance characteristics.
Interpretation of results
Results should be interpreted within the mobile edge computing context, where they demonstrate significant technical achievements, establishing a foundation for advanced mobile data processing capabilities. The framework enables intelligent data preprocessing under severe resource constraints, achieving high accuracy while providing substantial efficiency gains. Performance advantages are particularly significant in scenarios with limited connectivity, privacy requirements, and resource constraints, representing essential and growing application domains in edge computing.
Conclusion
Summary of the study
This study presented BlueEdge, a novel framework for data cleaning that operates on the mobile edge. We evaluated BlueEdge's performance against traditional data warehouse tools in terms of time efficiency and memory load, assessing its effectiveness in detecting and processing legitimate variations and actual errors, particularly missing data and duplication. The experiments utilized three datasets from reliable sources and multiple evaluation criteria for a comprehensive performance assessment.
Key contributions and performance achievements
The real-world validation through an IT services company case study provides compelling evidence of BlueEdge's practical effectiveness beyond laboratory conditions, with documented improvements of 75–97% in processing speed and 50–92% in resource efficiency. The systematic resolution of operational challenges and successful integration demonstrate the framework's adaptability and extensibility.
BlueEdge demonstrated remarkable efficiency, achieving completion times as low as 1 s per edge with minimal memory consumption (5000 bytes) on mobile devices, making it a resource-efficient solution. The cost-effectiveness of providing cleaning services free of charge enhances accessibility to broader user groups. The user-friendly automated preprocessing process simplifies cleaning tasks without requiring complex user interactions.
BlueEdge consistently achieved high detection rates across the error categories: different spellings and pronunciations (90%), misspellings (80%), name abbreviations (100%), honorific prefixes (100%), common nicknames (90%), and name splitting (100%). These results demonstrate the framework's ability to identify and address data errors, particularly missing data and duplication.
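Two of these categories, honorific prefixes and common nicknames, can be sketched with the kind of rule-based normalization the framework relies on. The prefix list and nickname table below are small illustrative assumptions, not BlueEdge's full rule set.

```python
# Illustrative sketch of rule-based name normalization for two of the
# reported error categories. HONORIFICS and NICKNAMES are assumed,
# abbreviated tables for demonstration only.

HONORIFICS = {"mr", "mrs", "ms", "dr", "prof", "eng"}
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}

def normalize_name(name: str) -> str:
    """Lowercase, drop honorific prefixes, and expand common nicknames."""
    tokens = [t.strip(".").lower() for t in name.split()]
    tokens = [t for t in tokens if t not in HONORIFICS]   # drop honorifics
    tokens = [NICKNAMES.get(t, t) for t in tokens]        # expand nicknames
    return " ".join(tokens)

# Records differing only by honorific and nickname collapse to one key,
# so they can be flagged as duplicates:
assert normalize_name("Dr. Bill Smith") == normalize_name("William Smith")
```

Because each rule is a deterministic table lookup, this style of matching fits within tight mobile memory and latency budgets, which is consistent with the resource figures reported above.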
Advanced development and machine learning integration
Building on the foundation established in this study, we have implemented neural network enhancements that are currently under peer review. Key achievements include 94.2% average accuracy across 14 distinct data types, an 87 ms average classification time that maintains the 1-s processing target, 80% accuracy retained under 20% character corruption, a 4300-byte model within the original 5000-byte memory constraint, and consistent cross-platform performance. These results demonstrate substantial advancement over the rule-based approach, validating the framework's evolution toward intelligent, adaptive data cleaning capabilities.
Future directions and research areas
Technical enhancements
Future research directions include expanding error category coverage, refining normalization and correction algorithms, and exploring integration with other data processing frameworks. We aim to enhance name-matching capabilities by integrating fuzzy-matching techniques, including phonetic algorithms (Soundex, Metaphone) [9], approximate string matching (Jaro-Winkler) [23], and embedding-based similarity models, while maintaining mobile edge processing constraints.
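The phonetic and approximate-matching techniques named above can be sketched briefly. The Soundex encoder below is a standard pure-Python implementation of American Soundex; for approximate string matching we use the standard library's difflib similarity ratio as a stand-in, since Jaro-Winkler is not in the standard library and a production system would use a dedicated implementation.

```python
import difflib

# Sketch of two fuzzy name-matching techniques: a phonetic Soundex code
# and an approximate string-similarity score (difflib's Gestalt ratio,
# standing in for Jaro-Winkler, which needs a third-party library).

def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    name = name.lower()
    out, prev = [], codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":              # h/w do not break runs of equal codes
            continue
        code = codes.get(ch, "")    # vowels map to "" and reset prev
        if code and code != prev:   # skip repeats of the previous code
            out.append(code)
        prev = code
    return (name[0].upper() + "".join(out) + "000")[:4]

# Phonetically equivalent spellings share a code:
assert soundex("Smith") == soundex("Smyth")      # both "S530"
assert soundex("Robert") == soundex("Rupert")    # both "R163"

# Approximate matching scores near-identical strings highly:
score = difflib.SequenceMatcher(None, "jon", "john").ratio()
```

Both techniques are constant-memory per name pair, so they are plausible candidates for the mobile edge constraints discussed throughout the paper.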
Scalability and enterprise integration
Future development will focus on extensive dataset validation (10,000 to 100,000 + records), concurrent user architecture development with multi-user access management, a distributed edge computing framework enabling collaborative processing, and comprehensive enterprise integration, including RESTful API development and direct database connectors (SQL Server, Oracle, PostgreSQL).
Security and privacy enhancements
Enhanced security features will include comprehensive audit trail implementation, advanced local encryption, enhanced anonymization techniques, automated compliance reporting, multi-level privacy controls, and secure multi-party processing capabilities while maintaining the fundamental privacy advantage of local raw data processing.
Power consumption optimization
Power efficiency enhancements will include dynamic CPU scaling based on battery level, battery-aware processing intensity adjustment, sleep mode optimization, efficient data structures, batch processing for startup cost reduction, and hardware-specific optimizations for different mobile CPU architectures.
Research significance and impact
The BlueEdge framework represents a significant advancement in data cleaning by operating on the mobile edge [17]. Its efficiency, cost-effectiveness, and superior variation identification and error detection performance make it a valuable tool for data preprocessing tasks. The primary objective of demonstrating the feasibility of performing data cleaning directly on mobile devices was successfully achieved, validating that lightweight rule-based methods can deliver accurate, efficient, and privacy-preserving data normalization within strict mobile constraints.
The framework successfully enables intelligent data preprocessing under severe resource constraints while providing substantial efficiency gains. Performance advantages are particularly significant in scenarios with limited connectivity, privacy requirements, and resource constraints, which represent significant and growing application domains in edge computing. By continuing to refine and expand BlueEdge’s capabilities through the outlined research directions, we can unlock its full potential in facilitating efficient and accurate data-cleaning processes across various domains.
Acknowledgements
We want to express our sincere gratitude to all individuals who have contributed to the completion of this research.
Author contributions
All authors have contributed equally to the conception and design of the research project, development of the "Blue Edge" application, conducting experiments, analyzing the data, and writing and revising the manuscript. [NE] conceived and designed the study, conducted data collection and analysis, and contributed to the writing of the manuscript. [SE] assisted in study design, performed data analysis, and contributed to the interpretation of the results. [HE] provided critical revisions and intellectual input throughout the research process. All authors have read and approved the final version of the manuscript.
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB). The authors confirm that no external funding or financial support was received for this research.
Availability of data and materials
The entire implementation of the BlueEdge framework, source code, datasets, performance benchmarks and technical documentation are freely available at: https://github.com/nagwaelmobark/BlueEdge-Framework. All algorithms, along with the code of the mobile application, test cases (n = 146), and validation datasets used in the research, are present in the repository so that the results can be fully reproduced.
Declarations
We declare that all the information provided in this manuscript is accurate and complete. Any errors or omissions are our responsibility.
Ethics approval and consent to participate
This research was conducted in accordance with ethical principles and guidelines.
Consent for publication
All individuals mentioned as authors have reviewed and approved the final version of the manuscript and have agreed to its submission to the Journal of Big Data for publication. Furthermore, we confirm that this manuscript has not been previously published and is not currently under consideration for publication.
Competing interests
All authors in this study declare that they have no competing interests that could affect the submitted manuscript.
Abbreviations
IoT: Internet of Things
CC: Cloud computing
MECC: Mobile edge cloud computing
NLP: Natural language processing
NLTK: Natural language toolkit
ETL: Extraction, transformation, loading
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Abbas, N; Zhang, Y; Taherkordi, A; Skeie, T. Mobile edge computing: A survey. IEEE Internet Things J; 2017; 5,
2. Akhbardeh F. NLP and ML methods for pre-processing, clustering and classification of technical logbook datasets. July 2022. https://scholarworks.rit.edu/theses/11227
3. Bilenko M, Mooney RJ. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003. p. 39–48.
4. Bird S. NLTK: The natural language toolkit. COLING/ACL 2006—21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics, proceedings of the interactive presentation sessions. 2006, pp 69–72.
5. Bonomi F, Milito R, Zhu J, Addepalli S. Fog computing and its role in the internet of things. In: MCC'12—proceedings of the 1st ACM mobile cloud computing workshop. 2012. p. 13–15. https://doi.org/10.1145/2342509.2342513
6. Bramantoro, A. Data cleaning service for data warehouse: an experimental comparative study on local data. Telkomnika (Telecommunication, Computing, Electronics, and Control); 2018; 16,
7. Christen, P. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng; 2012; 24,
8. Cohen, WW. Data integration using similarity joins and a word-based information representation language. ACM Trans Inf Syst; 2003; 18,
9. Dong W, Douglis F, Reddy S, Li K, Shilane P, Patterson H. Tradeoffs in scalable data routing for deduplication clusters. In: Proceedings of FAST 2011: 9th USENIX conference on file and storage technologies; 2011. p. 15–29.
10. Drolia U, Martins R, Tan J, Chheda A, Sanghavi M, Gandhi R, Narasimhan P. The case for mobile edge-clouds. In: Proceedings—IEEE 10th international conference on ubiquitous intelligence and computing, UIC 2013 and IEEE 10th international conference on autonomic and trusted computing. ATC; 2013. p. 209–215. https://doi.org/10.1109/UIC-ATC.2013.94
11. Kobzdej P, Waligóra D, Wielebińska K, Paprzycki M. Parallel application of Levenshtein distance to establish similarity between strings. n.d.
12. Kolb, L; Thor, A; Rahm, E. Dedoop: efficient deduplication with Hadoop. Proc VLDB Endow; 2012; 5,
13. Li, T; Sahu, AK; Talwalkar, A; Smith, V. Federated learning: challenges, methods, and future directions. IEEE Signal Process Mag; 2020; 37,
14. Müller H, Freytag J. Problems, methods, and challenges in comprehensive data cleansing. informatics reports. Institute for Computer Science, Humboldt University of Berlin, HUB-IB-164, Humboldt University Berlin; 2003. p. 1–23. http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/techreports/2003-hub_ib_164-mueller.pdf
15. Elmobark N, El-ghareeb H, Elhishi S. Perspectives on the integration of the internet of things and fog computing for geospatial big data analytics. Mach Intell Res. 2023;17(1):9515–28.
16. NLTK: nltk.metrics. Distance module. (n.d.). Retrieved 30 May 2023, from https://www.nltk.org/api/nltk.metrics.distance.html
17. Rahm, E; Do, H. Data cleaning: problems and current approaches. IEEE Data Eng Bull; 2000; 23,
18. Rancea, A; Anghel, I; Cioara, T. Edge computing in healthcare: innovations, opportunities, and challenges. Future Internet; 2024; 16,
19. Satyanarayanan, M; Simoens, P; Xiao, Y; Pillai, P; Chen, Z; Ha, K; Hu, W; Amos, B. Edge analytics in the internet of things. IEEE Pervas Comput; 2015; 14,
20. Shi, W; Cao, J; Zhang, Q; Li, Y; Xu, L. Edge computing: vision and challenges. IEEE Internet Things J; 2016; 3,
21. Ur Rehman MH; Jayaraman PP; Malik SUR; Ur Rehman Khan A; Gaber MM. RedEdge: a novel architecture for big data processing in mobile edge computing environments. J Sens Actuator Netw; 2017; [DOI: https://dx.doi.org/10.3390/jsan6030017]
22. Roman, R; Lopez, J; Mambo, M. Mobile edge computing, fog et al.: a survey and analysis of security threats and challenges. Future Gener Comput Syst; 2018; 78, pp. 680-698.
23. Zou, H; Yu, Y; Tang, W; Chen, HWM. Flexanalytics: a flexible data analytics framework for big data applications with I/O performance improvement. Big Data Res; 2014; 1, pp. 4-13. [DOI: https://dx.doi.org/10.1016/j.bdr.2014.07.001]
© The Author(s) 2025. This work is published under the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).