Content area
With the increasing complexity of computer-based applications, efficient access to high volumes of data became a necessity. Databases with tens or hundreds of Gigabytes of items are common, and applications reaching Terabyte volumes will not be unusual. Parallel Database Systems are expected to provide the required support to data intensive applications. There are, however, limits on the achievable scalability of current parallel database algorithms and topologies.
This work addresses such limitations, with emphasis to architectural features and the problem of data skew on parallel database systems. We propose an architecture which extends the features of the shared-nothing architecture, widely adopted for current parallel database applications. We also propose a new characterization of data skew which captures distinct types of imbalance, and present two data partitioning strategies to deal with this problem in a parallel system. These strategies are employed as components of parallel algorithms to implement external sorting and relational database operations. The execution time of such algorithms, obtained from analytical models derived from their implementations, is used as a measure to evaluate the proposed architecture.