Efficiently Finding Near Duplicate Figures in Archives of Historical Documents

Abstract: The increasing interest in archiving all of humankind's cultural artifacts has resulted in the digitization of millions of books, and soon a significant fraction of the world's books will be online. Most of the data in historical manuscripts is text, but there is also a significant fraction devoted to images. This fact has driven much of the recent increase in interest in query-by-content systems for images. While querying/indexing systems can undoubtedly be useful, we believe that the historical manuscript domain is finally ripe for true unsupervised discovery of patterns and regularities. To this end, we introduce an efficient and scalable system that can detect approximately repeated occurrences of shape patterns both within and between historical texts. We show that this ability to find repeated shapes allows automatic annotation of manuscripts, and allows users to trace the evolution of ideas. We demonstrate our ideas on datasets of scientific and cultural manuscripts dating back to the fourteenth century.

Keywords: cultural artifacts, duplication detection, repeated patterns

I. INTRODUCTION

The world's books and manuscripts are being digitized at an increasing rate, and within a few years, the majority of the world's books will be online. Much of the data will be text, most of which is more or less amenable to optical character recognition. However, in addition, there will be perhaps hundreds of millions of pages that contain one or more images. It is clear that these images will be very difficult to process. Indeed, data mining of modern photographic images is challenging, and in the case of images from historical manuscripts the challenges are compounded by the problems of fading, staining, wear, insect damage, abrasions, foxing, pencil annotations, and distortion artifacts from the digitization process [24][25].

In spite of these challenges, it is clear that the wealth of figures from historical manuscripts offers unique possibilities for data mining of important cultural artifacts. While the completely automated extraction of data from these texts will remain a significant challenge for some time to come, in this work we introduce a specialized subroutine that is both achievable and useful: the automatic discovery of approximately duplicated figures, both within and between texts.

Our ideas can best be explained with a simple motivating example....
