Abstract

Data changes constantly, and existing datasets often receive incremental updates. As the data evolves, the results of big data applications gradually become stale, so they must be refreshed efficiently after every update.

Apache Spark is used to process multiple petabytes of data on clusters of thousands of nodes. Its core abstraction is the RDD (Resilient Distributed Dataset), an immutable collection of elements. Because RDDs are immutable, Spark can process data in parallel, reuse data, and handle failures and stragglers efficiently. However, Spark lacks flexibility and efficiency when processing small incremental updates.

In this thesis, the IncRDD framework is proposed for incremental processing of updates to existing data. IncRDD retains the powerful features of Spark, including parallel processing, data reusability, and fault tolerance. New RDD operations are implemented to add new records and to update or delete existing ones.
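
To illustrate how add, update, and delete operations can coexist with immutability, here is a minimal sketch in plain Python. It is not the IncRDD API and does not use Spark: the class name `VersionedRecords` and its methods are hypothetical, chosen only for this illustration. Each operation returns a new version and leaves the earlier one untouched. For simplicity the sketch copies the whole record set on every change, which a real implementation would want to avoid.

```python
from types import MappingProxyType


class VersionedRecords:
    """Hypothetical immutable keyed collection (illustration only, not IncRDD)."""

    def __init__(self, records=None):
        # Store a read-only view of a private copy so a version cannot change in place.
        self._records = MappingProxyType(dict(records or {}))

    def _derive(self):
        # Full copy for simplicity; this is the cost that sharing schemes try to avoid.
        return dict(self._records)

    def add(self, key, value):
        if key in self._records:
            raise KeyError(f"{key!r} already exists")
        new = self._derive()
        new[key] = value
        return VersionedRecords(new)

    def update(self, key, value):
        if key not in self._records:
            raise KeyError(f"{key!r} not found")
        new = self._derive()
        new[key] = value
        return VersionedRecords(new)

    def delete(self, key):
        if key not in self._records:
            raise KeyError(f"{key!r} not found")
        new = self._derive()
        del new[key]
        return VersionedRecords(new)

    def get(self, key, default=None):
        return self._records.get(key, default)

    def __len__(self):
        return len(self._records)
```

Because every version stays readable after later changes, older results can still be consulted or recomputed from, which is the property that immutability gives Spark for reuse and fault tolerance.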

We introduce a new variant of Cuckoo hashing, Dual-CH Fast-Simple. Dual Cuckoo hashing uses two cuckoo hash tables. The first table stores records in every partition of a node. The second table implements structural sharing, which adds persistence, reuses previous versions, and avoids expensive re-computation. We evaluate IncRDD using incremental algorithms and provide experimental results that show a significant improvement in the performance of Incremental RDD.
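
For readers unfamiliar with the underlying data structure, the following is a minimal sketch of textbook cuckoo hashing in plain Python. It is not the Dual-CH Fast-Simple variant described above, and it does not implement structural sharing; the class name `CuckooTable`, the salted use of Python's built-in `hash`, and the `max_kicks` limit are all assumptions made for this illustration. Each key has one candidate slot in each of two arrays, so a lookup needs at most two probes, and an insert evicts an occupant to its alternate slot when both are taken.

```python
class CuckooTable:
    """Textbook cuckoo hash table with two slot arrays (illustration only)."""

    def __init__(self, capacity=8, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks  # assumed eviction limit before growing
        self.slots = [[None] * capacity, [None] * capacity]

    def _index(self, which, key):
        # Two hash functions, derived here by salting Python's built-in hash.
        return hash((which, key)) % self.capacity

    def get(self, key, default=None):
        # A key can only live in one of two places, so lookup is two probes.
        for which in (0, 1):
            entry = self.slots[which][self._index(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        # Overwrite in place if the key is already stored.
        for which in (0, 1):
            i = self._index(which, key)
            entry = self.slots[which][i]
            if entry is not None and entry[0] == key:
                self.slots[which][i] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._index(which, entry[0])
            # Place the entry and pick up whatever occupied the slot.
            entry, self.slots[which][i] = self.slots[which][i], entry
            if entry is None:
                return
            which = 1 - which  # the evicted entry tries its other array
        # Too many evictions: grow both arrays, then place the leftover entry.
        self._grow()
        self.put(entry[0], entry[1])

    def delete(self, key):
        for which in (0, 1):
            i = self._index(which, key)
            entry = self.slots[which][i]
            if entry is not None and entry[0] == key:
                self.slots[which][i] = None
                return True
        return False

    def _grow(self):
        # Collect stored entries, double the capacity, and reinsert them.
        old = [e for table in self.slots for e in table if e is not None]
        self.capacity *= 2
        self.slots = [[None] * self.capacity, [None] * self.capacity]
        for k, v in old:
            self.put(k, v)
```

The constant-probe lookup is what makes cuckoo hashing attractive for per-partition record stores; how the thesis pairs it with a second table for version sharing is not shown here.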

Details

1010268
Classification
Title
IncRDD: Incremental Updates for RDD in Apache Spark
Number of pages
58
Degree date
2017
School code
0382
Source
MAI 57/01M(E), Masters Abstracts International
ISBN
978-0-355-39219-7
Advisor
University/institution
The University of Texas at Dallas
Department
Computer Science
University location
United States -- Texas
Degree
M.S.C.S.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
10675195
ProQuest document ID
1973156184
Document URL
https://www.proquest.com/dissertations-theses/incrdd-incremental-updates-rdd-apache-spark/docview/1973156184/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic