Size reduction of inverted files using data compression and data structure reorganization
Abstract (summary)
Advances in the speed and power of computers has increased the usage of free-text retrieval systems. Free-text retrieval allows users to search the text of each document in a database for arbitrary combinations of words and often the relative location of those words.
Inverting files is one of the most popular techniques for storing and retrieving data in free-text systems. However, the storage requirements for an inverted file system can double the size of the database, since the data in the inverted file almost duplicate the original database. A reduction in size of the inverted file saves computer storage and improves retrieval response.
This study examined four techniques for reducing the size of inverted files and measured their effectiveness in improving retrieval response.
A free-text information retrieval system was constructed on a computer in a controlled environment. The retrieval performance of 120 sample queries, including the detection of any errors, was measured and recorded. The four size reduction techniques were applied in turn to the information retrieval system. The techniques included a non-linear compression code, an ambiguous hash code, a Zipfian data structure, and a modified-Zipfian data structure. The resultant inverted file storage size and the retrieval performance of the same 120 sample queries were measured for each technique.
Each size reduction technique improved retrieval performance. The improvement for the compression codes was small. Greater improvement was obtained by the Zipfian reorganization methods that took advantage of the Zipfian distribution of words in textual documents and stored data in blocks on computer disk drives. The modified-Zipfian data structure method provided a space savings of 38.9% and a retrieval time performance of 27.5%. Each of the reduction techniques agreed with a theoretical storage model but disagreed with a theoretical access time model.
The study concluded that significant response time improvements can be obtained for information retrieval systems, without introducing any errors, by reducing the size of inverted files. It also concluded that the most effective data reduction techniques take advantage of the nature of the data.