About Web Archiving and Preserving Content
Web archiving is the process of capturing and storing documents and objects on the World Wide Web. A number of methods have been used to accomplish this, but the end result is archived Web content (a Web site, a page, or part of a Web site) that is preserved for future researchers, historians and the general public. Since the early days of the Web, much attention has been paid to the processes of capturing Web resources as a necessary first step for preservation. In that time, members of the International Internet Preservation Consortium (IIPC) involved in archiving culturally important Web content have created Web archives holding several petabytes of data. However, relative to many of the approaches that have been proposed or implemented for other types of digital collections, efforts to sustain long-term access to those resources are comparatively immature.
Preservation involves maintaining the ability to present meaningful access to information over time. In the context of Web archives, the intention of preservation is to retain access to archived Web resources, so that they can continue to be used and understood despite changes in access technologies and without unacceptable loss of integrity or meaning.
There is also divergence in the understanding of what preserving Web content actually means. Defining the basic unit to be preserved is necessary for deciding how we count and evaluate our assets, and it has implications for the level of treatment we apply, including how the data is packaged and which metadata is chosen.
* Are we preserving each harvested Web file or Web page? If so, each individual Web file should be processed as a separate document, with treatments such as identification, characterization and description applied at that level of granularity.
* Are we preserving Web sites? Web sites are generally viewed as the logical units of the World Wide Web and institutions generally catalogue at this level.¹ One complexity is defining a Web site. For example, hillaryclinton.com was captured by the Library of Congress during the 2008 Presidential Election. Is the Web site hillaryclinton.com or is it also her archived Facebook, Twitter, YouTube, and video content? If an institution wanted to record the original structure of the Web site (notably recording the hypertext relationships...