Content area

Abstract

This thesis considers the problem of lossless data compression of a finite-length sample from a finite-alphabet random source with unknown symbol probabilities. Under conditions of stationarity, it is demonstrated that there exists an optimal symbol-probability estimate given only the counts of the previously observed symbols and the symbol-alphabet size. It is demonstrated that the traditional measure of storage requirements using Shannon's entropy measure, computed using the relative frequencies of the possible characters, is insufficient in those cases where the distribution of characters is not known in advance and will only asymptotically approach the true per-symbol storage requirements. The absolute error is O(log(N)) where N is the number of characters stored in the particular context. This implies that a variable-order Markov compression model is in general more space efficient than a fixed-length Markov process. The problem of removing memory from a source through a variable-order Markov model is also addressed. An impractical, but optimal, fully dynamic Markov coder is described but not implemented due to the computational complexity of the algorithm for even simple sources. The dynamic symbol probability estimate is combined with a semi-adaptive model in a proposed compression scheme. A noiseless data compression algorithm is derived that utilizes adaptive probability estimates for the outcome characters within a conditioning class but uses a semi-adaptive construct for the model structure. A mechanism is also created by which a locally optimal context order may be derived, which will benefit data with fixed-length records or that are more than one-dimensional in nature. The resulting family of compression schemes, called SAMC is compared to PPMC and other earlier schemes, and it is found to provide the best compression rates for all of the test files chosen.

Details

Title
Optimized data compression in the absence of a priori probabilities
Author
Hatton, Edward Douglas
Year
1995
Publisher
ProQuest Dissertations Publishing
ISBN
978-0-315-99808-7
Source type
Dissertation or Thesis
Language of publication
English
ProQuest document ID
304299385
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.