Int J Digit Libr (2015) 16:247–265 DOI 10.1007/s00799-015-0153-3
Lost but not forgotten: finding pages on the unarchived web
Hugo C. Huurdeman1 · Jaap Kamps1 · Thaer Samar2 · Arjen P. de Vries2 · Anat Ben-David3 · Richard A. Rogers1
Received: 2 December 2014 / Revised: 8 May 2015 / Accepted: 10 May 2015 / Published online: 3 June 2015
© The Author(s) 2015. This article is published with open access at Springerlink.com
Abstract Web archives attempt to preserve the fast changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page and host-level representations of unarchived content. Our main findings are the following: First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a Web archive. Second, the link and anchor text have a highly skewed distribution: popular pages such as home pages have
A. Ben-David: work done while at the University of Amsterdam.
✉ Hugo C. Huurdeman
Jaap Kamps [email protected]
Thaer Samar [email protected]
Arjen P. de Vries [email protected]
Anat Ben-David [email protected]
Richard A. Rogers [email protected]
1 University of Amsterdam, Amsterdam, The Netherlands
2 Centrum Wiskunde en Informatica, Amsterdam, The Netherlands
3 The Open University, Raanana, Israel
more links pointing to them and more terms in the anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host-level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on...
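To make the core idea of the abstract concrete, the following is a minimal sketch, not the authors' implementation: it shows how anchor-text representations of unarchived target URLs could be aggregated at the page and host level from (source URL, target URL, anchor text) triples extracted from crawled pages. The function name, the input format, and the toy data are assumptions for illustration only.

```python
# Minimal sketch (not the paper's implementation): derive page- and host-level
# representations of unarchived URLs from anchor text found in crawled pages.
# Assumes `links` is an iterable of (source_url, target_url, anchor_text)
# triples extracted from the archived crawl, and `archived` is the set of
# URLs actually present in the archive.

from collections import defaultdict
from urllib.parse import urlsplit


def build_representations(links, archived):
    """Aggregate anchor-text terms for link targets that are NOT in the archive."""
    page_repr = defaultdict(list)   # target URL  -> anchor-text terms
    host_repr = defaultdict(list)   # target host -> anchor-text terms
    in_links = defaultdict(set)     # target URL  -> distinct source hosts

    for source_url, target_url, anchor_text in links:
        if target_url in archived:
            continue  # only the unarchived part of the web is of interest here
        terms = anchor_text.lower().split()
        page_repr[target_url].extend(terms)
        host_repr[urlsplit(target_url).netloc].extend(terms)
        in_links[target_url].add(urlsplit(source_url).netloc)

    return page_repr, host_repr, in_links


# Toy usage: two crawled pages link to pages on an unarchived host.
links = [
    ("http://a.example.nl/index.html", "http://lost.example.nl/", "lost homepage"),
    ("http://b.example.nl/news.html", "http://lost.example.nl/about", "about the lost site"),
]
pages, hosts, inlinks = build_representations(links, archived=set())
print(hosts["lost.example.nl"])  # aggregated host-level anchor terms
```

In such a setup, the host-level dictionary pools evidence from all pages on a site, which is one way to obtain the richer, though still skewed, representations the abstract refers to.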