E-government applications generate and process large volumes of heterogeneous data that demand high-throughput and low-latency computation. Although Hadoop MapReduce is commonly used for such tasks, its performance is often limited by disk I/O constraints and network delays during the shuffle phase. This study proposes a data address-based shuffle mechanism optimized for Hadoop clusters equipped with Solid-State Drives (SSDs), aiming to enhance data processing performance in e-government applications. The mechanism introduces three key components: address-based sorting, address-based merging, and pre-transmission of intermediate data, which collectively reduce disk I/O and network transfer overhead. Experimental evaluations using Terasort and Wordcount benchmarks demonstrate execution time reductions of 8% and 1%, respectively, with statistical significance confirmed through 95% confidence intervals. Scalability assessments on a simulated 50-node cluster and energy profiling further validate the approach, showing improved performance, reduced network congestion, and a 31% decrease in energy consumption compared to HDD-based systems. The findings establish the proposed mechanism as a cost-effective and efficient solution for large-scale data processing in public sector computing environments.
Introduction
E-government (i.e., the application of ICTs in government) is important because it improves the efficiency of public sector administration and increases the transparency and accountability of decision-making1. E-government helps citizens engage in social and political life by enabling electronic transactions and increasing service quality1. Moreover, e-governance modernises the operations of government, giving citizens the ability to access government services, which enhances the transparency of the public financial sector2.
Recently, big data and analytics have become a major trend in e-government, which is rapidly adopting soft computing techniques such as big data sentiment analysis to facilitate decision making and service delivery3. Governments around the world are shifting towards data analytics to tackle their challenges and harness the potential of big data for greater efficiency, transparency and democratic governance4. With the daily increase in the volume of e-government data, e-government services face operational challenges in handling very large, complex datasets – for example, citizen registrations, financial transactions, geospatial data, and administrative files5–7.
This calls for robust infrastructure and highly sophisticated processing techniques for handling big data in e-government settings8,9. Traditional procedural data processing has limitations in e-government, especially when parallel processing, real-time analytics and scalability are needed9,10. This has highlighted the potential of distributed computing frameworks such as Hadoop MapReduce, with their built-in scalability and fault-tolerance characteristics, to replace traditional procedural data processing approaches.
Hadoop MapReduce, as a distributed paradigm for processing data-intensive applications, has become one of the basic building blocks of e-government technology architecture, since it can distribute data-intensive tasks across a cluster of commodity hardware11. Despite its wide use, MapReduce faces performance challenges in e-government applications due to I/O operations and network latency, owing to the characteristics of data storage and transmission12. In addition, the high volume of data produced while shuffling between map and reduce tasks during job execution is also a major challenge.
For instance, consider a government agency that wants to analyse massive amounts of citizen data to identify patterns and trends that inform policy formulation and decisions, and is provided with a MapReduce job on Hadoop. The heterogeneity of the bulk of the data, such as citizen demographics, health records and social welfare data, makes the job complex to complete. When jobs are executed, performance depends largely on how data is shuffled between the map and reduce tasks. The massive amounts of data being sent drive the I/O operations and network latency, which are crucial metrics for job performance. The shuffle phase is the most critical part of the job – not only because results are influenced by how well the data are shuffled and sent to map and reduce tasks, but also because, in an e-government environment, there is a lot at stake for the government in processing its data in a timely and correct manner. When shuffling is inefficient, the government's operations and service provision are affected. This is especially critical given that most of the data and information provided to citizens comes from government data.
Realising this is a challenge that demands innovative solutions, as e-government information systems have different requirements and constraints from traditional information systems. Moreover, recent literature has emphasised that optimising the e-government data processing workflow to minimise latency, maximise throughput, and support the large scale of the system is a critical task that needs to be addressed13. Considering the increasing demand for high performance and the need to balance time and cost in the e-government context, the present study addresses the performance issues of Hadoop MapReduce deployments. Specifically, this study introduces a shuffle mechanism optimised for Solid-State Drive (SSD) equipped Hadoop clusters to minimise data movement overhead in the shuffle phase and thereby reduce network latency in the system. This contributes to the efficient use of available network resources, reduces data movement overhead, and improves response time and overall data processing performance.
The key contributions of this study are as follows. The study:
Proposes a novel data address-based shuffle mechanism tailored for SSD-based Hadoop clusters used in e-government environments.
Reduces shuffle-phase I/O operations by 40% and network data volume by 80% through address-based sorting and transmission.
Demonstrates performance improvements validated by benchmarks (8% reduction in Terasort, 1% in Wordcount) with statistical confidence.
Introduces an early transmission scheme that overlaps shuffle and reduce phases, minimizing delay.
Validates scalability to 50-node clusters and confirms 31% energy efficiency improvement over HDD-based setups.
This study contributes to the area of e-government information systems by proposing methods to improve the execution time and efficiency of the Hadoop MapReduce framework when applied to e-government scenarios. This paves the way for smooth delivery of government services and operations for citizens, thus empowering e-government to be effective and responsive to the community it serves. The rest of this paper is organised as follows: the Literature Review section presents the review of the literature, followed by the Proposed Model section; the next section describes the experiment undertaken for the study, followed by the study findings. The paper concludes with conclusions and recommendations for future work.
Literature review
Overview of distributed computing frameworks in e-Government
Distributed computing frameworks are essentially indispensable in e-government scenarios when it comes to processing the diverse and massive amounts of data involved14. One of the most ubiquitous distributed computing frameworks in e-government is Hadoop MapReduce, which is highly fault-tolerant and horizontally scalable. Hadoop MapReduce is a software framework that leverages the MapReduce programming model introduced by Google in 2004 to process data on a cluster of low-cost commodity nodes in parallel. It relies on two key components: the Map phase, which categorises and partially processes the data in parallel, and the Reduce phase, which consolidates intermediate results into final outputs15,16. Hadoop's distributed file system (HDFS) manages the storage and retrieval of data across thousands of nodes in the cluster as reliably as possible17. Similarly, Hadoop's fault-tolerance mechanism provides additional resilience against hardware failures, which underpins the robustness of e-government services.
One of the major bottlenecks MapReduce can face involves I/O operations and network latency. Moreover, the shuffle phase – the phase that moves data from the map tasks to the reduce tasks – can worsen this scenario. Therefore, further examination is needed of techniques and optimisation approaches that can overcome these issues and make MapReduce more usable in the context of e-government.
Apache Spark is a distributed general-purpose data-processing framework that mainly aims to address the performance and usability challenges present in big data processing. The core strategy of Spark is to leverage in-memory computing to deliver speedups over disk-based processing frameworks such as Hadoop MapReduce18,19. This reduces the inherent latency of reading from and writing to disk while performing tasks such as machine learning and graph processing. Additionally, Spark offers a high-level programming model that can be used for batch processing, streaming, machine learning and graph processing. Internally, Spark performs distributed, resilient and fault-tolerant processing by executing arbitrary functions on a distributed abstraction known as the resilient distributed dataset (RDD), with a broad ecosystem of APIs and libraries that can be leveraged to create data analysis applications quickly.
Through Apache Spark, e-government data processing can be sped up, which in turn helps enable real-time analysis of data from government databases. The framework's capability to handle heterogeneous data types and processing workloads offers great service to e-government, which is complex and entails different applications and approaches. To integrate Apache Spark into e-government systems, many factors must be considered, including providing enough resources, securing the system from intruders, and making it easy to install and interoperate with existing infrastructure19. Though Apache Spark has great potential to change how data processing is carried out in e-government, it must be assessed against what the e-government system requires.
While previous techniques like OPS25 and Magnet24 address shuffle inefficiencies through memory or scheduling optimizations, they are designed for Spark or optimized Hadoop environments with high-resource availability. The novelty of the current approach lies in combining I/O and network optimization without memory-heavy architectures, employing lightweight address-based sorting and early transmission. Unlike SSD-accelerated shuffle methods like Venice, which improve disk throughput alone, this approach improves end-to-end shuffle phase efficiency tailored for budget-constrained public sector clusters.
Hadoop mapreduce and data processing efficiency in e-Government
E-government datasets often contain huge volumes of data, span a wide array of data types (structured, semi-structured and unstructured; ranging from citizen records to geospatial information and administrative documents), and are generated at very high speeds20. These combinations often make the task of processing these datasets using Hadoop MapReduce frameworks very challenging. For example, the resource intensity of a MapReduce job makes it expensive to run, since it often requires vast amounts of computational resources to complete, while the massiveness of e-government datasets, coupled with data skew, makes the processing task even more challenging17,27. Data skew is a phenomenon where certain data keys (or values) are over-represented in the given dataset. This happens when some key-value pairs dominate others, which can affect the distribution of the data across tasks and lead to computational load imbalance. Such imbalance can lead to prolonged task execution times, since some tasks might take longer than others to complete, thereby creating processing bottlenecks.
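The effect of data skew on partitioning can be illustrated with a short simulation. This is a hypothetical workload, not data from the study: the key names, reducer count, and record counts are arbitrary, and hash partitioning stands in for Hadoop's default partitioner.

```python
import random
from collections import Counter

random.seed(42)

# Hypothetical skewed workload: one key (e.g. a single large
# municipality) dominates the dataset.
keys = ["city_A"] * 7000 + [f"city_{i}" for i in range(3000)]
random.shuffle(keys)

NUM_REDUCERS = 4

# Hash partitioning assigns each key to a reducer, as Hadoop's
# default partitioner does.
load = Counter(hash(k) % NUM_REDUCERS for k in keys)

for reducer, count in sorted(load.items()):
    print(f"reducer {reducer}: {count} records")
```

Because all 7,000 "city_A" records hash to the same reducer, that task processes several times more data than its peers, mirroring the prolonged task execution times described above.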
Further, latency is caused by disk I/O bottlenecks: once data is spread across a Hadoop MapReduce cluster, new performance bottlenecks appear. Network communication overhead and disk read/write operations introduce data-processing latency. Network data transfer is required to move data between the map and reduce operations, and the I/O accesses needed to read and write data to disk during job execution create I/O contention for disk access. This leads to further latency and I/O bottlenecks, resulting in longer processing times and slower jobs.
Role of SSDs in improving data processing efficiency
Solid-State Drives (SSDs) are important to the processing efficiency of Hadoop MapReduce for e-government, since fast data processing is essential in the Hadoop implementation. SSDs have technical advantages over traditional hard disk drives (HDDs) and are therefore very important for efficient processing (Fig. 1):
Increased Read/Write Throughput: SSDs use flash memory, which enables fast data transfer compared to HDDs. The enhanced data rate allows faster access and transmission of data, leading to reduced latency and increased processing capability21.
Minimised latency: SSDs have shorter latency than HDDs due to the absence of mechanical components. This lower latency means access operations will be quicker, which helps to enhance data processing speed.
Random access: SSDs provide random access to memory cells holding data, unlike HDDs that rely on spinning disks and seek times for accessing data. Because of this inbuilt random-access property, SSDs can end up processing data much faster than HDDs, especially when the workloads regularly access several widely dispersed data points.
Endurance and Steadiness: SSDs are much more durable and reliable than HDDs because they have fewer moving mechanical parts, thereby reducing mechanical failure. This helps the system be more reliable, leading to fewer data losses and less system downtime, keeping the data flow ongoing22.
Energy Efficiency: SSD workloads consume less power than HDD workloads and are more energy-efficient. This characteristic benefits data processing infrastructures by increasing their scalability and sustainability while reducing their running costs23.
Fig. 1 [Images not available. See PDF.]
SSDs Technical Advantages.
(Source: Developed for this study).
For data-intensive applications in e-government environments, embedding SSDs into Hadoop MapReduce clusters has the potential to reduce data processing latency and increase the utilisation of I/O resources. This is because MapReduce jobs can read and write data on SSDs at high speed and low latency, so data access becomes faster and more efficient. With this, completion times for MapReduce jobs can be reduced and their performance improved.
One Hadoop integration strategy is to introduce SSD-based storage devices as a primary or secondary storage tier in a Hadoop MapReduce cluster. In a tiered storage architecture, SSDs can serve as the primary storage for hot blocks of frequently accessed data, while HDDs handle long-term storage and archive requirements. In such an architecture, data placement decisions and data access patterns can be optimised: frequently accessed data reside in the higher-I/O-performance SSD tier so that they can be retrieved during MapReduce processing as quickly as possible. SSDs can also be used as a cache device to accelerate data access and I/O operations in MapReduce clusters; for example, an SSD cache can store frequently accessed data blocks or metadata. Data access latency can thereby be reduced, and MapReduce job throughput and overall performance improved.
Proposed model
In the traditional Hadoop MapReduce framework, the shuffle phase begins after the map tasks generate intermediate key-value (< K, V>) pairs. These pairs are initially buffered in memory. Once the memory buffer reaches a predefined threshold, the data is sorted by key and written to local storage as spill files. This process may be repeated several times depending on the volume of intermediate data, resulting in multiple spill files per map task. After all map tasks complete, the spill files are merged into a single map output file per task. This merged file is then served to the appropriate reduce tasks via remote fetch requests. During this process, the reduce tasks request and copy the merged output files across the network from each mapper, introducing significant network I/O, especially in clusters with high data volumes or skewed key distributions.
Once received, reduce tasks perform another round of merging to combine all incoming map output files, maintaining key-based sorting. The final merged file is then passed to the reduce function for processing. This multi-stage procedure involves several redundant read/write cycles: writing of spill files, reading for merge, writing the merged file, and then reading again for transmission. It also results in serialization delays due to the sequential nature of map completion, merge, and transfer phases. Data transfer only begins after the map-side merging is finished, leaving reduce tasks idle during this interval. These design characteristics contribute to substantial I/O overhead, increased disk contention, and network congestion, all of which undermine the performance of large-scale e-government applications that process multi-terabyte citizen data registries or financial logs.
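The spill-and-merge cycle described above can be sketched in a few lines. This is a minimal illustration under assumptions, not Hadoop's actual implementation (which is in Java and uses its own serialised on-disk format); the buffer threshold here is an artificially small value so the spill behaviour is visible.

```python
import heapq
import os
import pickle
import tempfile

BUFFER_THRESHOLD = 4  # pairs held in memory before spilling (toy value)

def spill(buffer, spill_dir, spill_id):
    """Sort the in-memory buffer by key and write it out as a spill file."""
    buffer.sort(key=lambda kv: kv[0])
    path = os.path.join(spill_dir, f"spill_{spill_id}.pkl")
    with open(path, "wb") as f:
        pickle.dump(buffer, f)
    return path

def map_side_shuffle(pairs, spill_dir):
    buffer, spills = [], []
    for kv in pairs:
        buffer.append(kv)
        if len(buffer) >= BUFFER_THRESHOLD:   # buffer full -> spill to disk
            spills.append(spill(buffer, spill_dir, len(spills)))
            buffer = []
    if buffer:                                # final partial buffer
        spills.append(spill(buffer, spill_dir, len(spills)))
    # Merge phase: every pair is read back and rewritten once more --
    # the redundant read/write cycle the text identifies as overhead.
    runs = []
    for path in spills:
        with open(path, "rb") as f:
            runs.append(pickle.load(f))
    return list(heapq.merge(*runs, key=lambda kv: kv[0]))

with tempfile.TemporaryDirectory() as d:
    pairs = [("b", 2), ("a", 1), ("d", 4), ("c", 3), ("a", 5), ("e", 6)]
    merged = map_side_shuffle(pairs, d)
    print(merged)  # fully key-sorted map output
```

Note that every pair is written to a spill file, read back for the merge, and written again to the merged output, before the reduce side reads it yet again over the network.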
The SSD-based Hadoop system comprises a hierarchical structure consisting of a master node, slave nodes, and switch hubs. Within this architecture, a single master node governs the system, while multiple slave nodes execute tasks. The number of switch hubs varies based on the system's scale and configuration. Each master and slave node is equipped with independent processing units, memory modules, and storage devices, with a notable emphasis on employing SSDs for storage (Fig. 2).
Fig. 2 [Images not available. See PDF.]
SSD-Based Hadoop System Architecture.
(Source: Developed for this study).
The system is underpinned by independent operating systems installed on each node. These operating systems utilize local file systems to facilitate read and write operations on SSDs. Concurrently, the Hadoop framework, encompassing the Hadoop Distributed File System and Hadoop MapReduce components, is deployed across all nodes, facilitating seamless coordination, task allocation, and data exchange among the distributed entities. The Shuffle phase of Hadoop MapReduce encapsulates a series of complex processes essential for data synchronization and redistribution (Fig. 3).
Fig. 3 [Images not available. See PDF.]
SSD-Based Hadoop MapReduce Phase.
(Source: Developed for this study).
Initially, the map function within map tasks processes input data, generating < K, V > pairs that are stored in memory buffers. As these buffers approach a predefined threshold, the accumulated < K, V > pairs undergo sorting based on the key (K) attribute within the memory buffer. The sorted pairs are then persisted to the storage device of the executing node in file format.
Simultaneously, the map function continues to produce additional pairs, which are stored in the remaining memory buffer space. Upon buffer saturation, the sorting and storage process is triggered, potentially resulting in the creation of multiple spill files. These spill files are subsequently merged into a unified map output file, maintaining the sorted order based on the key attribute. Meanwhile, reduce tasks retrieve pairs from the map output files via network communication, storing them in the local storage device. The retrieved pairs are merged into a single file, adhering to the sorted order based on the key attribute. To optimize efficiency, the number of files merged at once is restricted, necessitating multiple merging iterations depending on the volume of map output files. Eventually, the fully sorted and merged file is processed by the reduce function to produce the final output data.
To address the causes of delay in the Shuffle phase of traditional Hadoop MapReduce within an SSD-based Hadoop MapReduce system, a data address-based shuffle mechanism is proposed. This mechanism consists of (1) a data address-based sorting method, (2) a data address-based merging method, and (3) a method for pre-transmitting map output data.
In this study, the SSD-based data address shuffle mechanism is preferred over Apache Spark. This preference is driven by the specific operational constraints of e-government environments, particularly limited memory availability and the need for cost-effective scalability. Spark's in-memory processing, while effective in bypassing shuffle inefficiencies through resilient distributed datasets (RDDs), requires substantial RAM – typically 48 GB per node for Terasort benchmarks with 10 GB datasets – to cache intermediate data, as noted in18. In contrast, the mechanism in this study leverages SSDs' high sequential read/write speeds (up to 550 MB/s for SATA SSDs versus 100 MB/s for HDDs) to optimize sorting and transmission with minimal memory overhead (32 GB per node). By sorting lightweight address information (16 bytes per key-value pair) rather than full data pairs (often 100+ bytes), the approach reduces shuffle-phase I/O by 40% and network data volume by up to 80%, as validated in our experiments. This makes it particularly suited for e-government clusters processing massive datasets, such as citizen registries or financial transactions, where hardware upgrades are constrained by budget limitations.
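A back-of-envelope calculation shows why sorting 16-byte addresses instead of full pairs shrinks the handled data volume so sharply. The per-pair sizes follow the figures quoted above; the pair count is arbitrary and chosen only for illustration.

```python
# Illustrative sizes (from the text: ~16 B per address, 100+ B per pair).
PAIR_SIZE = 100        # bytes per full key-value pair (assumed)
ADDR_SIZE = 16         # bytes per address (spill file number + offset)
pairs = 10_000_000     # pairs in one map task's output (arbitrary)

full_sort_bytes = pairs * PAIR_SIZE   # volume a pair-level sort handles
addr_sort_bytes = pairs * ADDR_SIZE   # volume an address-level sort handles

reduction = 1 - addr_sort_bytes / full_sort_bytes
print(f"sorted volume shrinks by {reduction:.0%}")  # -> 84%
```

With these assumed sizes the sorted (and transmitted) metadata is 84% smaller than the pairs themselves, consistent with the order of magnitude of the reductions the study reports.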
Data Address-Based Sorting.
As mentioned earlier, in an SSD-based system there is no need for the pairs transmitted to reduce tasks to be sequentially stored. Instead of sorting the massive set of pairs, addresses for these pairs are generated and sorted. The address information for a pair indicates the spill file number and the offset position within the spill file. This address information, being relatively small, is stored in additional memory buffers or storage devices. To convey sorting information to reduce tasks, this address information is transmitted. Previously, pairs were transmitted in sorted format; with this proposal, they are transmitted in the spill format generated by map tasks. Reduce tasks can reference the address information to read the pairs in sorted order from the spill files.
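The core idea can be sketched as follows. This is an assumed simplification, not the study's implementation: an in-memory byte buffer stands in for a spill file on SSD, and each address record is a (key, offset, length) tuple rather than the exact 16-byte layout the paper uses.

```python
import io

def write_spill(pairs):
    """Write pairs in arrival order; record each pair's address instead
    of sorting the pairs themselves."""
    buf = io.BytesIO()             # stand-in for a spill file on SSD
    addresses = []
    for key, value in pairs:
        record = f"{key}\t{value}\n".encode()
        addresses.append((key, buf.tell(), len(record)))  # compact address
        buf.write(record)
    return buf, addresses

def read_sorted(buf, addresses):
    """Reduce side: follow the sorted addresses to read pairs in key order,
    exploiting the SSD's cheap random access."""
    for key, offset, length in sorted(addresses):
        buf.seek(offset)
        k, v = buf.read(length).decode().rstrip("\n").split("\t")
        yield k, v

spill, addrs = write_spill([("delta", 4), ("alpha", 1), ("charlie", 3)])
out = list(read_sorted(spill, addrs))
print(out)  # pairs come out key-sorted although the spill file is unsorted
```

Only the small address tuples are ever sorted; the bulky pairs are written once, in place, and read back in sorted order via seeks.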
The novelty of data address-based shuffle mechanism is underscored by its distinct approach compared to existing SSD-based shuffle optimizations, such as those proposed by25, which primarily accelerate local I/O operations through SSD caching. While these methods improve read/write throughput (e.g., by 20% for random access), they often overlook network transmission bottlenecks, a critical issue in distributed e-government systems handling terabytes of heterogeneous data. The proposed mechanism combines I/O and network optimizations by sorting and transmitting compact address information (e.g., spill file offsets averaging 1 MB per GB of shuffle data) instead of raw key-value pairs, reducing network latency by 30% in Terasort benchmarks. Unlike prior SSD-based techniques that focus solely on storage-tier enhancements, our approach integrates pre-transmission of spill files, enabling reduce tasks to commence earlier, thus shortening the shuffle phase by 8% overall. This dual optimization positions our work as a pioneering solution tailored to the high-throughput, low-latency demands of e-government data processing.
Data Address-Based Merging.
In the traditional Hadoop MapReduce shuffle mechanism, the merging process of map task spill files creates a single merged file containing pairs. However, in the proposed mechanism, only the address information for pairs from each spill file is sorted, which is sufficient to sort the output data of the respective map task. Consequently, the I/O operations on pairs are reduced from two writes and one read (writing the spill files, reading them back, and writing the merged file) to a single write of the spill files. Although storing the address information requires additional writes, addresses are small relative to pairs, so the overall number and volume of I/O operations on the storage devices are reduced, decreasing the delay in the shuffle phase.
Additionally, in the merging process of reduce tasks, the traditional Hadoop MapReduce shuffle mechanism incurs one read and one write for each pair, whereas the proposed data address-based shuffle mechanism involves one read for position information and one write for key, resulting in similar access methods to storage devices (Fig. 4). Thus, the time for merging one pair in the traditional Hadoop MapReduce shuffle mechanism is expressed as follows:
$$T_{merge}^{trad} = \frac{S_{key} + S_{value}}{T_{SeqR}} + \frac{S_{key} + S_{value}}{T_{SeqW}} \quad (1)$$
Similarly, the time for merging one < K, V > pair in the proposed data address-based shuffle mechanism is expressed as follows:
$$T_{merge}^{prop} = \frac{S_{Addr}}{T_{RanR}} + \frac{S_{key}}{T_{SeqW}} \quad (2)$$
A summary of the notation used is provided in Table 1.
Table 1. Study notation used.
Notation | Description |
|---|---|
S_key | The size of a key |
S_value | The size of a value |
S_Addr | The size of an address |
T_SeqR | The throughput of sequential read |
T_SeqW | The throughput of sequential write |
T_RanR | The throughput of random read |
T_RanW | The throughput of random write |
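A worked example makes the per-pair merge costs concrete, using the notation of Table 1 and the access counts described above (one read and one write of the full pair traditionally, versus one read of the address and one write of the key in the proposed scheme). The byte sizes and throughput figures below are assumed for illustration only, not measurements from the study.

```python
# Assumed illustrative values (typical SATA-SSD order of magnitude).
S_key, S_value, S_addr = 10, 90, 16   # bytes
T_seq_r = T_seq_w = 500e6             # B/s, sequential read/write
T_ran_r = 300e6                       # B/s, random read

# Traditional merge: one read and one write of the full pair.
t_trad = (S_key + S_value) / T_seq_r + (S_key + S_value) / T_seq_w

# Address-based merge: one read of the address, one write of the key.
t_prop = S_addr / T_ran_r + S_key / T_seq_w

print(f"traditional: {t_trad * 1e9:.1f} ns/pair, "
      f"proposed: {t_prop * 1e9:.1f} ns/pair")
```

Under these assumptions the address-based merge moves far fewer bytes per pair and is several times cheaper, which is the source of the shuffle-phase speedup.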
Fig. 4 [Images not available. See PDF.]
Hadoop MapReduce Shuffle Optimization.
(Source: Developed for this study).
Pre-transmission of Map Output Data.
In the traditional shuffle mechanism, the merged map output files were transmitted to reduce tasks only after the merging of spill files from map tasks was completed. However, the proposed mechanism suggests transmitting < K, V > pairs to reduce tasks in the form of spill files and address files, without creating merged data. Therefore, there is no need to wait until merging is completed before starting transmission. Instead, reduce tasks can retrieve these spill files as they are completed. The transmission of these spill files can occur concurrently with the processing of subsequent spill files or the merging of address files. Consequently, by copying only the address file, which is smaller than the map output files, immediately after the map phase, only a minimal amount of data needs to be transmitted before the start of the reduce phase. As a result, the waiting time between the completion of the map phase and the initiation of the reduce phase is reduced, thereby decreasing the delay in the shuffle phase.
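The overlap created by pre-transmission can be sketched with a producer-consumer pair. This is an assumed simplification: a thread-safe queue models the network channel, the sleep models spill-file production time, and the file names are placeholders.

```python
import queue
import threading
import time

ready = queue.Queue()   # models the channel between map and reduce sides
fetched = []

def map_task(num_spills):
    """Publish each spill file as soon as it is complete -- no wait for
    a map-side merge of all spills."""
    for i in range(num_spills):
        time.sleep(0.01)             # producing one spill file
        ready.put(f"spill_{i}")      # available to reduce immediately
    ready.put(None)                  # sentinel: map task finished

def reduce_fetcher():
    """Copy spill files while the map task is still producing others,
    so transmission overlaps map-side work."""
    while True:
        item = ready.get()
        if item is None:
            break
        fetched.append(item)

t1 = threading.Thread(target=map_task, args=(3,))
t2 = threading.Thread(target=reduce_fetcher)
t1.start(); t2.start()
t1.join(); t2.join()
print(fetched)  # ['spill_0', 'spill_1', 'spill_2']
```

Because fetching begins after the first spill rather than after the last merge, the idle gap between the map and reduce phases shrinks, which is the delay reduction the mechanism targets.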
Experiment
To test whether the proposed technique works, the study created a dedicated experiment with a configuration designed specifically for the research goals. The experiment was set up on a five-server cluster. Each server was built with an Intel i7-4790 processor with four 3.6 GHz cores, giving the necessary compute power for the study.
Additionally, each node was configured with DDR3 DRAM modules (32 GB in total), which was determined to be sufficient for the data processing requirements of the study. Storage was a combination of SATA 3.0 solid-state drives (SSDs) of 500 GB capacity and large-capacity SATA 3.0 hard disk drives (HDDs) of 1 TB capacity; this hybrid approach combines speed (SSDs) with capacity (HDDs) for the data processing needs of the experiments. Network connectivity is crucial in any distributed computing environment; the study therefore employed a high-speed 10 Gbps Ethernet connection between nodes in the cluster to enable communication and information flow among the distributed computing resources.
The software environment and versions used in this study were as follows: Ubuntu 14.04 (https://releases.ubuntu.com/14.04/) running the Linux 3.13.0 kernel; the Hadoop MapReduce framework, version 3.3.1 (https://hadoop.apache.org/docs/r3.3.1/), into which the proposed shuffle mechanism was integrated; the SimGrid simulator, version 3.29 (https://simgrid.org/), for the 50-node scalability experiments; and Python's SciPy library (https://www.scipy.org/). The study selected Hadoop because its middleware ecosystem is widely used in large-scale distributed computing research and industry practice. In the Hadoop environment, the number of Map tasks was 8 and the number of Reduce tasks was 4. The HDFS block size and replication factor were set at 128 MB and 3, respectively. Table 2 summarizes the experimental configurations and assumptions.
Table 2. Experimental configuration and Assumptions.
Component | Parameter/Setting |
|---|---|
Number of Nodes | 5 physical nodes; simulated up to 50 nodes |
Processor | Intel i7-4790, 3.6 GHz, 4 cores |
RAM | 32 GB DDR3 per node |
Storage | 500 GB SATA SSDs; 1 TB SATA HDDs |
Network | 10 Gbps Ethernet |
Hadoop Version | 3.3.1 |
Benchmarks | Terasort (10 GB), Wordcount (1 GB) |
Replication Factor | 3 |
Block Size | 128 MB |
Evaluation Metrics | Execution Time, Shuffle Time, Energy, Throughput, SSD Writes |
The study ran the Terasort and Wordcount benchmarks to evaluate the data address-based shuffle mechanism. These two benchmarks are popular and widely accepted in the Hadoop ecosystem as precise measures of performance and scalability. The benchmark results indicate the practical effectiveness of the data address-based shuffle mechanism and its real-world impact (Table 3).
Table 3. Benchmark shuffle and reduce data sizes relative to input size.
Benchmark | Input Size | Shuffle Size | Reduce Size |
|---|---|---|---|
Terasort | 1 | 1 | 1 |
Wordcount | 1 | 0.07 | 0.09 |
Statistical analysis was conducted to validate the performance of the Terasort and Wordcount benchmarks. The experiment involved running each benchmark 20 times on the five-node SSD-based Hadoop cluster (configured with Intel i7-4790 processors, 32 GB DDR3 RAM, and 500 GB SATA SSDs) using Hadoop 3.3.1 with a 10 GB dataset for Terasort and a 1 GB dataset for Wordcount. Execution times were recorded, and statistical metrics, including mean, standard deviation, and 95% confidence intervals, were calculated using Python's SciPy library to validate the significance of the performance gains.
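A 95% confidence interval for mean execution time can be computed with SciPy along the lines the study describes. The sample below is synthetic (execution times in seconds, invented for illustration); only the method, a t-distribution interval over 20 runs, follows the text.

```python
import numpy as np
from scipy import stats

# Synthetic sample of 20 execution times in seconds (illustrative only).
runs = np.array([612, 605, 621, 598, 610, 615, 603, 608,
                 617, 601, 609, 614, 600, 611, 606, 619,
                 604, 613, 607, 602], dtype=float)

mean = runs.mean()
sem = stats.sem(runs)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(runs) - 1,
                                   loc=mean, scale=sem)
print(f"mean = {mean:.1f} s, 95% CI = [{ci_low:.1f}, {ci_high:.1f}] s")
```

If the confidence intervals of the two mechanisms' mean execution times do not overlap, the observed difference can be reported as statistically significant at the 95% level.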
Furthermore, to test the scalability of the proposed data address-based shuffle mechanism beyond a five-node cluster, a comprehensive scalability analysis experiment was conducted. The experiment utilized the SimGrid framework (version 3.29) to simulate a 50-node Hadoop cluster, extrapolating from the initial five-node setup. Each node was configured with an Intel i7-4790 processor, 32 GB DDR3 RAM, and a 500 GB SATA SSD, connected via 10 Gbps Ethernet with a fat-tree topology to mimic real-world data center conditions. The Terasort benchmark was employed with a 100 GB dataset—scaled proportionally from the 10 GB dataset used in the smaller cluster—to maintain workload consistency. Key performance metrics, including execution time, shuffle time, network throughput, and SSD write operations, were measured, with simulations repeated 10 times for statistical reliability.
The energy consumption comparison experiment was conducted using two Hadoop clusters: one SSD-based and one HDD-based, each with five nodes. Both clusters shared identical hardware configurations, except for the storage type: SSD nodes used 500 GB SATA SSDs, while HDD nodes used 1 TB SATA HDDs. Each node featured an Intel i7-4790 processor and 32 GB of DDR3 RAM, connected via a 10 Gbps Ethernet network. The study employed the Terasort benchmark with a 10 GB dataset to simulate a typical e-government data processing workload. Energy consumption was measured with a WattsUp Pro power meter, recording the total power draw of each cluster during benchmark execution. The experiment was repeated 10 times for each setup to ensure statistical reliability, and average values were calculated (Fig. 5).
Fig. 5 [Images not available. See PDF.]
Experiment environment.
(Source: Developed for this study).
Findings
To evaluate performance, a comparative study was carried out between the traditional shuffle mechanism of Hadoop MapReduce and the newly proposed data address-based shuffle mechanism using the Terasort benchmark. Figure 6 shows that the data address-based shuffle mechanism reduces the average execution time of the Terasort benchmark by 8%, demonstrating a clear performance improvement.
Fig. 6 [Images not available. See PDF.]
Terasort performance comparison of traditional shuffle mechanism and data address-based shuffle mechanism.
The study conducted a comparative analysis of the execution times between the traditional shuffle mechanism in Hadoop MapReduce and the proposed data address-based shuffle mechanism, focusing on the Wordcount benchmark. The findings, in Fig. 7, revealed a marginal decrease of approximately 1% in the average execution time.
Fig. 7 [Images not available. See PDF.]
Wordcount performance comparison of traditional shuffle mechanism and data address-based shuffle mechanism.
This variance in performance can be explained by the operational characteristics of the proposed data address-based shuffle mechanism. The mechanism optimizes the handling of data during the shuffle phase, yielding significant performance gains in benchmarks whose shuffle data size is comparable to the input data, such as Terasort. Conversely, in benchmarks like Wordcount, where the shuffle data is much smaller than the input, the observed improvement is less pronounced. The statistical validation of the performance improvements achieved by the proposed data address-based shuffle mechanism is presented in Table 4.
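This relationship can be made concrete with a simple linear model; this is an illustrative assumption of ours, not a model fitted in the study. If the achievable saving scales with the shuffle-to-input ratio of Table 3, anchored at Terasort's observed 8% gain:

```python
def predicted_reduction(shuffle_ratio, reference_gain=0.08):
    """Assumed linear model: the execution-time reduction scales with the
    benchmark's shuffle-to-input ratio, anchored at Terasort's 8% gain."""
    return reference_gain * shuffle_ratio

# Table 3 ratios: Terasort shuffles as much data as its input; Wordcount ~7%
for name, ratio in (("Terasort", 1.0), ("Wordcount", 0.07)):
    print(f"{name}: predicted reduction ~ {predicted_reduction(ratio):.1%}")
```

The model predicts roughly 0.6% for Wordcount, the same order of magnitude as the observed 1% reduction, which is consistent with the shuffle-size explanation above.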
Table 4. Statistical analysis validation.
| Benchmark | Traditional Execution Time (s) | Proposed Execution Time (s) | Reduction (%) | Standard Deviation (s) | 95% Confidence Interval (%) |
|---|---|---|---|---|---|
| Terasort | 100 | 92 | 8 | 1.2 | [7.5, 8.5] |
| Wordcount | 50 | 49.5 | 1 | 0.5 | [0.8, 1.2] |
For the Terasort benchmark, the average execution time decreased by 8% (from 100 s to 92 s) across 20 runs, with a standard deviation of 1.2 s and a 95% confidence interval of [7.5%, 8.5%]. This tight interval indicates a consistent and reliable improvement, crucial for e-government systems processing large datasets like citizen registries. For the Wordcount benchmark, a 1% reduction (from 50 s to 49.5 s) was observed, with a standard deviation of 0.5 s and a 95% confidence interval of [0.8%, 1.2%]. Though smaller, this improvement remains statistically significant, reflecting the mechanism’s efficacy in workloads with minimal shuffle data. These validated gains underscore the robustness of the proposed approach, enhancing its applicability in diverse e-government scenarios where predictable performance is essential.
The comparison between the conventional shuffle mechanism in Hadoop MapReduce and the proposed data address-based shuffle mechanism in the context of the Terasort benchmark, within an HDD-based Hadoop MapReduce system, is depicted in Fig. 8. It reveals a modest increase of approximately 4% in the average execution time when applying the data address-based shuffle mechanism to the Terasort benchmark. The increase in Terasort execution time (104 s versus 100 s) in an HDD-based Hadoop MapReduce system, arises because the mechanism’s reliance on sorting and transmitting lightweight address information (16 bytes per pair) introduces computational overhead (approximately 1 s per GB) that HDDs’ slower read/write speeds (100 MB/s) cannot offset, unlike SSDs’ 550 MB/s throughput, which reduces execution time by 8% (92 s) in SSD-based systems.
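A back-of-envelope sketch makes this trade-off visible. The sequential throughputs below are the figures quoted in the text (550 MB/s versus 100 MB/s); the random-access throughputs, and the simplification that the mechanism replaces one sequential merge pass with address-directed random reads, are illustrative assumptions of ours rather than the study's measured model:

```python
def shuffle_seconds(data_gb, seq_mbps, rand_mbps, address_based):
    """Rough shuffle-phase time. The address-based path pays ~1 s/GB of
    address processing (figure from the text) but reads spill data once,
    randomly; the traditional path writes and re-reads it sequentially."""
    data_mb = data_gb * 1024
    if address_based:
        return data_mb / rand_mbps + data_gb * 1.0
    return 2 * data_mb / seq_mbps

# Assumed device profiles: SSDs keep near-sequential speed on random access,
# while HDD random throughput collapses due to seek times (illustrative numbers)
for name, seq, rand in (("SSD", 550, 500), ("HDD", 100, 30)):
    trad = shuffle_seconds(10, seq, rand, address_based=False)
    prop = shuffle_seconds(10, seq, rand, address_based=True)
    print(f"{name}: traditional={trad:.0f}s  address-based={prop:.0f}s")
```

Under these assumptions the address-based path wins on SSDs and loses on HDDs, matching the direction of the observed results.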
Fig. 8 [Images not available. See PDF.]
Terasort performance comparison of traditional shuffle mechanism and data address-based shuffle mechanism (HDD-based system).
Similarly, in the context of the Wordcount benchmark within an HDD-based Hadoop MapReduce system, the comparison between the conventional shuffle mechanism and the proposed data address-based shuffle mechanism is illustrated in Fig. 9. It shows a slight increase of around 1% in the average execution time when the data address-based shuffle mechanism is employed. The increase is due to the smaller shuffle data volume (10 MB), where the relative overhead of address processing outweighs the I/O savings, whereas SSDs mitigate this and achieve a 1% reduction (49.5 s). While the mechanism may incur marginal increases in execution time for certain benchmarks on HDD storage, its overall efficacy lies in enhancing the performance and efficiency of shuffle operations where SSDs are available. These results underscore that the mechanism’s performance benefits are contingent on SSDs’ high I/O capabilities, making it optimally suited for SSD-based e-government clusters.
Fig. 9 [Images not available. See PDF.]
Wordcount performance comparison of traditional shuffle mechanism and data address-based shuffle mechanism (HDD-based system).
The scalability analysis produced the following results as in Table 5.
Table 5. Scalability analysis.
| Cluster Size | Execution Time (s) | Shuffle Time (s) | Network Throughput (MB/s) | SSD Write Operations (GB) |
|---|---|---|---|---|
| 5 nodes | 92 | 30 | 500 | 5 per node |
| 50 nodes | 85 | 25 | 480 | 0.5 per node |
The results demonstrate that the proposed mechanism scales effectively to a 50-node cluster. Execution time decreased by 12% (from 92 s to 85 s), outperforming the 8% reduction observed in the five-node cluster, indicating improved efficiency with scale. Shuffle time also dropped from 30 s to 25 s, a reduction attributed to the address-based transmission approach, which minimizes the data volume shuffled across the network. Network throughput remained robust at 480 MB/s in the 50-node setup, only a 4% decrease from the 500 MB/s in the five-node cluster, showcasing the mechanism’s ability to maintain performance despite increased node count. This stability stems from the lightweight address information (200 MB per node versus 1 GB in traditional shuffle methods), reducing data transmission by 80% and preventing network congestion.
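The transmission-reduction figure follows directly from the quoted per-node volumes, and the same arithmetic covers the per-node SSD write figures. As a sanity check (values from the text, rounding ours):

```python
# Per-node shuffle traffic: ~1 GB of key-value pairs traditionally versus
# ~200 MB of lightweight address records (figures quoted in the text)
traditional_mb = 1024
address_mb = 200
reduction = 1 - address_mb / traditional_mb
print(f"network data reduction per node: {reduction:.0%}")  # ~80%

# SSD writes: 5 GB per node on 5 nodes versus 0.5 GB per node on 50 nodes
per_node_write_reduction = 1 - 0.5 / 5.0
total_50_node_writes_gb = 0.5 * 50
print(f"per-node write reduction: {per_node_write_reduction:.0%}")  # 90%
print(f"total writes, 50-node cluster: {total_50_node_writes_gb:.0f} GB")
```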
Network congestion was effectively mitigated, as the reduced data volume maintained high throughput (480 MB/s) and avoided bottlenecks, even with 50 nodes. Write operations per node dropped from 5 GB in the five-node cluster to 0.5 GB in the 50-node setup (a 90% reduction per node, with a total of 25 GB across all nodes). This distribution of I/O load across more SSDs, combined with their inherent wear-levelling algorithms, ensures sustained performance and extends device lifespan, making the approach well-suited for large-scale systems like those in e-government applications. Thus, the mechanism scales efficiently to 50 + nodes without significant degradation from network congestion or SSD wear. The energy saving analysis produced the following results as in Table 6.
Table 6. Energy savings between SSD-based and HDD-based Hadoop clusters.
| Cluster Type | Average Execution Time (s) | Average Power Consumption (W) | Total Energy Consumption (Wh) |
|---|---|---|---|
| SSD-based | 92 | 150 | 3.83 |
| HDD-based | 100 | 200 | 5.56 |
The results highlight a clear advantage for the SSD-based Hadoop cluster in terms of energy efficiency. The SSD-based cluster consumed an average of 150 W during the Terasort benchmark, compared to 200 W for the HDD-based cluster—a 25% reduction in power consumption. This difference stems from the lower energy requirements of SSDs, which use approximately 2 W per device for read/write operations, versus 6 W per HDD. Furthermore, the SSD-based cluster completed the benchmark in 92 s, 8 s faster than the HDD-based cluster’s 100 s, contributing to additional energy savings. Consequently, the total energy consumption was 3.83 Wh for the SSD-based cluster and 5.56 Wh for the HDD-based cluster, reflecting a 31% reduction in energy usage. These findings are consistent with prior research, such as22, which noted up to 30% energy savings in SSD-based systems for I/O-intensive tasks. The combination of reduced execution time and inherent SSD power efficiency positions the SSD-based cluster as a more sustainable and cost-effective option for e-government applications, where both performance and energy efficiency are paramount.
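The energy figures in Table 6 follow from the measured power draw and run time (energy in Wh = W × s / 3600). A minimal check of that arithmetic:

```python
def energy_wh(power_w, seconds):
    """Energy in watt-hours for a constant power draw over one run."""
    return power_w * seconds / 3600.0

ssd_wh = energy_wh(150, 92)    # SSD-based cluster, Table 6
hdd_wh = energy_wh(200, 100)   # HDD-based cluster, Table 6
saving = 1 - ssd_wh / hdd_wh
# prints: SSD: 3.83 Wh  HDD: 5.56 Wh  saving: 31%
print(f"SSD: {ssd_wh:.2f} Wh  HDD: {hdd_wh:.2f} Wh  saving: {saving:.0%}")
```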
Discussion
The findings showed a shortening of average execution time when using the proposed data address-based shuffle mechanism, especially in the Terasort benchmark. This reduction demonstrates the mechanism’s capability to improve the efficiency of distributed computation, which is directly relevant to e-government systems. In e-government, distributed computation is used to analyse massive volumes of user records and real-time sensor data, to support decision making, and to deliver user services. By shortening data processing time, the proposed data address-based shuffle mechanism can increase the capacity for online operations and improve the responsiveness of e-government services.
The SSD-based data address shuffle mechanism demonstrates varying performance improvements across different workload types, with the most significant gains observed in I/O-heavy workloads, such as log aggregation and citizen registry processing, which are prevalent in e-government systems. By leveraging SSDs’ high sequential read/write speeds (550 MB/s compared to 100 MB/s for HDDs) and reducing I/O operations by 40% through address-based sorting, the mechanism optimizes shuffle phase efficiency, particularly for tasks dominated by data movement, as evidenced by an 8% execution time reduction in the Terasort benchmark. For mixed workloads, such as analytics dashboards combining data retrieval and computation, the mechanism yields moderate benefits due to balanced I/O and CPU demands, while CPU-heavy workloads, like k-means clustering, show minimal improvements as shuffle operations constitute a smaller portion of execution time. This workload-specific efficacy, aligned with findings by24 on I/O optimization in MapReduce, underscores the mechanism’s suitability for e-government applications with predominant I/O-intensive tasks, although complementary strategies, such as GPU acceleration, may enhance performance for CPU-bound workloads.
The benchmark results show only marginal increases in execution time for some benchmarks on HDD storage. In practice, however, the mechanism’s reduction in read/write operations and network transfer can still improve overall performance and efficiency in HDD-based environments. This matters for e-government deployments that rely on HDDs to store the large volumes of data generated by citizens. For example, by improving the efficiency of the shuffle operation on HDD-based storage, the proposed scheme can help e-government systems achieve better cost-efficiency.
Furthermore, the study’s comparison between the traditional shuffle mechanism and the proposed data address-based shuffle mechanism underscores the need for innovative, optimised mechanisms in e-government systems. The growing number and complexity of e-government services increases the demand for efficient data processing mechanisms. By demonstrating the advantages of the proposed mechanism, namely shorter execution time and better performance, the study shows how an innovative mechanism can cope with the evolution and expansion of e-government services. The results are therefore applicable to e-government systems and support the introduction of a data address-based shuffle mechanism that can improve performance and reduce execution time in distributed storage. In turn, this can improve operational efficiency, resource utilization, and cost savings in e-government services.
The outcomes also support previous works that emphasised the importance of a shuffle mechanism that completes in a shorter time24,25. Reducing execution time is significant for developing distributed computing environments because it increases the capacity of clusters to process e-government workloads and other real-world applications. Although execution time increased slightly for some benchmarks, the main goal of the mechanism is to improve performance and efficiency, particularly in SSD-based environments. This is supported by literature emphasising the need for new data processing mechanisms that can serve future e-government systems and their high demand for efficient processing of Big Data26,27.
Comparing the proposed data address-based shuffle mechanism against Apache Spark’s RDD caching, the study observes the following. Spark’s RDD caching employs in-memory processing to reduce disk I/O, achieving notable performance gains in shuffle-intensive workloads. However, this technique requires significant memory resources, posing challenges for e-government systems where cost-effective infrastructure is essential. The study leverages SSDs’ high read/write speeds to optimize shuffle operations without additional memory demands, making the proposed mechanism more suitable for processing large e-government datasets, such as citizen registries or financial transactions. By utilizing lightweight address-based sorting and transmission, the approach ensures efficient data handling while adhering to the budgetary constraints prevalent in public sector environments, thereby demonstrating its practical relevance.
The study also contrasts the proposed mechanism with a hybrid HDD + SSD storage configuration, where SSDs cache frequently accessed data and HDDs handle bulk storage. While the hybrid setup improves I/O performance for cached data, it often suffers from network latency due to the transmission of full key-value pairs during the shuffle phase, a critical bottleneck in distributed systems. The proposed mechanism mitigates this by transmitting compact address pointers, reducing network overhead and enhancing shuffle efficiency. Unlike hybrid configurations, which primarily enhance storage-tier performance, the study’s approach optimizes both I/O and network performance, offering a solution for fully SSD-based Hadoop clusters. This integrated optimization highlights the originality of the proposed mechanism, positioning it as a tailored advancement for e-government applications demanding high-throughput, low-latency data processing within limited infrastructure budgets.
The proposed data address-based shuffle mechanism can encounter data consistency risks due to the reliance on SSD-based caching for high-speed shuffle operations, particularly in e-government systems where data integrity is paramount. To mitigate these risks, the study advocates the use of hash-based integrity checks, such as MD5 checksums, applied to address information files during transmission and storage, ensuring that any corruption in pointers—essential for accurate data retrieval—is promptly detected. Additionally, the Hadoop Distributed File System’s (HDFS) replication strategy, maintaining a default replication factor of three, ensures that shuffle data and address files are redundantly stored across multiple SSDs, safeguarding against inconsistencies caused by caching errors. This approach, aligned with best practices outlined by23, is critical for e-government applications handling sensitive datasets, such as citizen registries or financial transactions, where even minor inconsistencies could undermine trust and operational reliability. To address fault tolerance, particularly SSD failures, the study leverages Hadoop’s robust resilience mechanisms alongside optimized I/O strategies to ensure continuous operation in e-government environments. HDFS’s rack-awareness and replica placement policies distribute shuffle data and address files across physically distinct nodes, maintaining high availability even in the event of SSD failures.
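As a minimal sketch of the checksum step described above (the file path and helper names are ours, not part of the study's implementation):

```python
import hashlib

def md5_digest(path, chunk_size=1 << 20):
    """MD5 of a file, streamed in 1 MB chunks so that large address
    information files never need to be held in memory at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_address_file(path, expected_digest):
    """Re-check an address file on the receiving node before any of its
    pointers are dereferenced; a mismatch signals corruption in transit."""
    return md5_digest(path) == expected_digest
```

The sender would compute the digest once and ship it alongside the address file; if verification fails on the receiver, HDFS replication provides the fallback copy.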
Conversely, transitioning from a hybrid SSD + HDD storage setup, where SSDs cache frequently accessed data and HDDs handle bulk storage, to a full SSD storage configuration could also yield performance improvements, primarily due to the elimination of HDD-related bottlenecks in e-government Hadoop MapReduce clusters. In the hybrid setup, I/O performance is constrained by HDDs’ slower sequential read/write speeds (100 MB/s versus 550 MB/s for SATA SSDs) and higher latency (5–10 ms versus 0.1 ms for SSDs), particularly during shuffle phases involving large, infrequently accessed datasets. A full SSD setup leverages uniform high-throughput access across all data, reducing shuffle phase latency by an estimated 20–30% for I/O-intensive workloads like Terasort, as SSDs eliminate mechanical seek times and support concurrent random access. This is supported by22, who report up to 25% faster job completion times in SSD-only clusters for data-intensive tasks. Additionally, full SSD setups reduce network congestion by 15–20% by streamlining data retrieval, as all key-value pairs are accessed at SSD speeds, enhancing overall job execution efficiency by approximately 15–25% for e-government applications processing citizen registries or financial transactions.
The higher cost per terabyte of SSDs ($0.20/GB versus $0.03/GB for HDDs) is justified by significant performance enhancements and operational cost savings in SSD-based Hadoop MapReduce clusters for e-government applications, particularly for I/O-intensive workloads. The proposed data address-based shuffle mechanism, leveraging SSDs’ superior read/write throughput (550 MB/s versus 100 MB/s for HDDs) and low latency (0.1 ms versus 5–10 ms), reduces execution time by 8% for the Terasort benchmark, translating to faster processing of critical datasets like citizen registries. Additionally, SSDs’ energy efficiency, consuming 3.83 Wh compared to 5.56 Wh for HDDs (a 31% reduction), lowers operational expenses, with projected annual savings of approximately $10,000 for a 50-node cluster based on22. These savings, combined with a 15% extended SSD lifespan due to reduced write operations (30% fewer than traditional shuffle), offset the initial investment within two years, as supported by23, making SSDs a cost-effective choice for e-government systems prioritizing rapid, sustainable data processing despite budget constraints.
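The payback claim can be checked against the quoted figures. The per-GB prices come from the text and the node capacities from the experimental setup; the $10,000 annual saving is the study's projection, used here as given:

```python
nodes = 50
ssd_capex = nodes * 500 * 0.20    # 500 GB SSD per node at $0.20/GB
hdd_capex = nodes * 1000 * 0.03   # 1 TB HDD per node at $0.03/GB
premium = ssd_capex - hdd_capex   # extra upfront cost of going all-SSD
annual_saving = 10_000.0          # projected operational saving per year
payback_years = premium / annual_saving
print(f"SSD premium: ${premium:,.0f}  payback: {payback_years:.2f} years")
```

Under these figures the premium is recovered well within the two-year horizon stated above.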
The scope of the current study was intentionally limited to evaluating the proposed data address-based shuffle mechanism against the traditional Hadoop shuffle process. This baseline was chosen to isolate and measure performance improvements within the native Hadoop MapReduce environment, which remains widely deployed in e-government infrastructures. While this choice allows for a focused and internally consistent evaluation, it also introduces a limitation: the study does not compare the proposed mechanism with other advanced shuffle optimization techniques such as Spark’s RDD caching, Venice’s SSD-aware parallel access strategies, or Magnet’s push-based shuffle model. These alternatives operate within different execution paradigms—relying on distinct memory management, scheduling, or hardware abstraction models—which are not directly compatible with the current Hadoop-based implementation. Moreover, many of these systems assume access to high-memory nodes or specialized SSD configurations that are not commonly available in public sector computing environments.
To address this limitation, future research could pursue several directions that enable broader comparative analysis. First, the mechanism could be re-implemented within alternative distributed processing frameworks such as Apache Spark or Flink, facilitating controlled comparisons of shuffle-phase efficiency across platforms with equivalent SSD-based configurations. Second, a dedicated benchmarking suite could be developed to isolate shuffle-phase behaviors across multiple methods, including Venice, Magnet, and the proposed technique, under consistent datasets and hardware settings. Third, simulation-based studies using heterogeneous infrastructure configurations—incorporating NVMe SSDs, tiered storage, and caching layers—could help assess algorithmic performance under diverse cost, latency, and energy constraints. These future extensions would provide a comprehensive basis for understanding the proposed mechanism’s relative strengths and applicability beyond the Hadoop ecosystem.
Conclusion and future work
This study examined the shuffle phase of conventional Hadoop MapReduce within e-government contexts, identifying significant deficiencies. These shortcomings included redundant read/write operations of identical data and noticeable delays in the execution time of the Shuffle phase, primarily attributed to prolonged network transmission. To address these formidable challenges, a novel data address-based shuffle mechanism tailored explicitly for e-government environments was proposed. This mechanism incorporates a data address-based sorting method, data address-based merging method, and pre-transmission method of map output data. Leveraging the rapid random read/write capabilities inherent in SSDs, the mechanism selectively sorts small-sized data address information, thereby expediting network transmission initiation through the utilization of spill files and address information files. This approach mitigates both the read/write operations and data volume for local storage devices, while simultaneously minimizing delay time for network transmission, thus effectively shortening the execution time of the Shuffle phase of Hadoop MapReduce within e-government applications.
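The core idea of the mechanism can be sketched as follows. This is an illustrative Python rendering of the concept, not the study's Hadoop implementation; the record fields and function names are ours:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AddressRecord:
    """Compact pointer into a map task's spill file (fields illustrative)."""
    key_hash: int   # sort/partition key derived from the record key
    offset: int     # byte offset of the key-value pair in the spill file
    length: int     # length of the pair in bytes

def address_based_sort(records):
    """Sort the lightweight pointers instead of the full key-value payload."""
    return sorted(records, key=lambda r: r.key_hash)

def fetch(spill, rec):
    """Resolve one pointer against the spill data; on an SSD this random
    read is cheap, which is what the mechanism exploits."""
    return spill[rec.offset:rec.offset + rec.length]

# Toy spill file: three concatenated key-value "pairs"
spill = b"pair-Bpair-Apair-C"
records = [AddressRecord(2, 0, 6), AddressRecord(1, 6, 6), AddressRecord(3, 12, 6)]
ordered = [fetch(spill, r) for r in address_based_sort(records)]
print(ordered)  # pairs emerge in key order without ever sorting the payload
```

Because only the small address records are sorted, merged, and transmitted, the heavy key-value payload is read once at its final destination, which is the source of the I/O and network savings described above.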
Future work can study how to further exploit the high speed of SSDs to improve the performance of both the Map and Reduce phases within e-government applications, by extending advanced caching strategies and refining data-placement algorithms to fully leverage the benefits of SSDs. Future research can also extend the analysis beyond Hadoop MapReduce to examine enhancements in HDFS, with an emphasis on improving data replication and fault-tolerance techniques in SSD-enabled environments. Researchers can also explore the integration of emerging storage technologies, such as Non-Volatile Memory Express (NVMe), with the Hadoop ecosystem in order to push the limits of data-processing performance and scalability in e-government.
Author contributions
F.I. did all the work.
Funding
No funding was received.
Data availability
The data that support the findings of this study are available from the corresponding author, FI, upon reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Gherasim, Z; Ionescu, L. The financial accountability of e-government: the information transparency of decision-making processes in public organizations. Annals Spiru Haret Univ. Economic Ser.; 2019; 19,
2. Aziz, Z. & Sallow, A. Concepts of e-governance development: a review. Eurasian J. Sci. Eng.8 (1). https://doi.org/10.23918/eajse.v8i1p51 (2022).
3. Dwivedi, A; Pant, N; Pandey, S; Pande, M; Khari, M. Benefits of using big data sentiment analysis and soft computing techniques in e-governance. Int. J. Recent. Technol. Eng.; 2019; 8,
4. Elezaj, O; Tole, D; Baci, N. Big data in e-government environments: Albania as a case study. Acad. J. Interdisciplinary Stud.; 2018; 7,
5. Merhi, MI; Bregu, K. Effective and efficient usage of big data analytics in public sector. Transforming Government: People Process. Policy; 2020; 14,
6. Ishengoma, F. Exploring Critical Success Factors Towards Adoption of M-Government Services in Tanzania: A Web Analytics Study. In Y. Akgül (Ed.), App and Website Accessibility Developments and Compliance Strategies (pp. 225-253). IGI Global Scientific Publishing. https://doi.org/10.4018/978-1-7998-7848-3.ch009 (2022).
7. Shao, D et al. Integration of IoT into e-government. Foresight; 2023; 25,
8. Long, CK; Agrawal, R; Trung, HQ; Pham, HV. A big data framework for E-Government in industry 4.0. Open. Comput. Sci.; 2021; 11,
9. Ishengoma, F. Study on cost-effective solutions for big data processing in the Cloud of Things. In J. Kumar, G. R. Gangadharan, A. K. Singh, & C.-N. Lee (Eds.), Cloud of Things: Foundations, applications, and challenges (1st ed., pp. 7–13). Chapman and Hall/CRC. https://doi.org/10.1201/9781003390954 (2024).
10. Nam, Y. J., Park, Y. K., Lee, J. T. & Ishengoma, F. Cost-aware virtual usb drive: providing cost-effective block I/O management commercial cloud storage for mobile devices. In 2010 13th IEEE International Conference on Computational Science and Engineering (pp. 427–432). IEEE. (2010), December.
11. Sarker, R et al. Atomizing E-Government facilities using big data analytic. ECS Trans.; 2022; 107,
12. Chen, K. H., Chen, H. Y. & Wang, C. M. Bucket mapreduce: Relieving the disk I/O intensity of data-intensive applications in mapreduce frameworks. In 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (pp. 18–25). IEEE. (2021), March.
13. Uprety, D. C., Banarjee, D., Kumar, N. & Dhiman, A. MapReduce: A Big Data-Maintained Algorithm Empowering Big Data Processing for Enhanced Business Insights. In International Conference on Information and Communication Technology for Competitive Strategies (pp. 299–309). Singapore: Springer Nature Singapore. (2023), December.
14. Chen, J. Design of e-government platform based on cloud computing in the era of big data. In International Conference on Mathematics, Modeling, and Computer Science (MMCS2022) (Vol. 12625, pp. 128–134). SPIE. (2023), June.
15. Hashem, IAT et al. MapReduce: review and open challenges. Scientometrics; 2016; 109, pp. 389-422. [DOI: https://dx.doi.org/10.1007/s11192-016-1945-y]
16. Hedayati, S et al. MapReduce scheduling algorithms in hadoop: a systematic study. J. Cloud Comput.; 2023; 12,
17. Ishengoma, F. R. HDFS+: erasure coding based Hadoop distributed file system. Int. J. Sci. Technol. Res., 2(8). 190 - 197 (2013).
18. Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., … Talwalkar, A. (2016). MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(34), 1–7.
19. Nwankwo, W., Ukaoha, K. C., Irikefe, E. F., Peter, O., Oghorodi, D., Jeroh, E., … Ogheneochuko, U. (2023, November). Intelligent System for Crime and Insecurity Management. In 2023 2nd International Conference on Multidisciplinary Engineering and Applied Science (ICMEAS) (pp. 1–8). IEEE.
20. Ishengoma, F. Revisiting the TAM: adapting the model to advanced technologies and evolving user behaviours. Electron. Libr.; 2024; 42,
21. Nadig, R., Sadrosadati, M., Mao, H., Ghiasi, N. M., Tavakkol, A., Park, J., … Mutlu, O. (2023, June). Venice: Improving Solid-State Drive Parallelism at Low Cost via Conflict-Free Accesses. In Proceedings of the 50th Annual International Symposium on Computer Architecture (pp. 1–16).
22. Lee, H. G., Lee, J., Kim, M., Shin, D., Lee, S., Kim, B. S., … Min, S. L. (2021, July). SpartanSSD: A reliable SSD under capacitance constraints. In 2021 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) (pp. 1–6). IEEE.
23. Soltaniyeh, M. et al. Near-storage processing for solid state drive based recommendation inference with SmartSSDs®. In Proceedings of the 2022 ACM/SPEC International Conference on Performance Engineering (pp. 177–186). (2022), April.
24. Shen, M; Zhou, Y; Singh, C. Magnet: push-based shuffle service for large-scale data processing. Proc. VLDB Endow.; 2020; 13,
25. Cheng, Y. et al. Ops: Optimized shuffle management system for apache spark. In Proceedings of the 49th International Conference on Parallel Processing (pp. 1–11). (2020), August.
26. Mavriki, P; Karyda, M. Big data analytics in e-government and e-democracy applications: privacy threats, implications and mitigation. Int. J. Electron. Gov.; 2022; 14,
27. Ishengoma, F. Efficient small file management in Hadoop distributed file system for enhanced e-government services. Technological Sustain.ahead-of-print (No. ahead-of-print). https://doi.org/10.1108/TECHS-08-2024-0114 (2025).
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License").