Within two years, LibraryLand, a Web site created in 1996 to organize unannotated bookmarks of other library science-related sites, had expanded to unmanageable proportions. The site's authors cast about for solutions to carry forth the original goals of the site while containing the labor involved in its maintenance. The solution settled upon might be a new evolutionary trend of large-scale link sites: transformation into specialized indexing services spanning many remote servers.
At the Ramapo Catskill Library System (RCLS), we recently faced a dilemma involving one of the Web sites we maintain. The site is named LibraryLand, and it aims to be a comprehensive resource guide listing Web links of interest to librarians. It was begun with the dual goals of organizing and sharing bookmark files within our library system and promoting the utility of the Web to hesitant member library boards and directors. When the site was launched in early 1996, the scope of the project appeared ambitious but manageable; by 1998, however, we had to concede that our reach exceeded our grasp. The maintenance of the LibraryLand site was straying further and further beyond our mission as an organization and was consuming a disproportionate amount of time and labor. We were therefore faced with three options: leave LibraryLand to wither on the vine, limit its scope, or streamline its maintenance so that it became much less time-consuming.
Large Resource Guide Sites Are Black Holes
To get a better idea of the scale of the maintenance problem, let's describe LibraryLand (circa 1999) in more detail. LibraryLand organized URLs into traditional library service disciplines, such as Technical Services, Serials, Automation, Children's Services, and so on. Within each of those broad categories, another layer of specific categories was created as needed to break URLs into groups that could be scanned easily by eye. LibraryLand was therefore a Yahoo-like subject hierarchy of two tiers, with each category and subcategory formatted as a separate HTML page. By mid-1998, there were eighteen broad categories and about 150 more specific categories within them, each a separate HTML document. Within these pages, LibraryLand included approximately 4,300 URLs. The listings did not include annotations, ratings, or evaluations, but did include a small icon signifying who was responsible for the listed site: academic institution, nonprofit organization, public library or school, government agency, commercial enterprise, or self-publication.
From the start, the scope of LibraryLand was loosely defined as including any links that would be of interest to the public libraries we serve. Soon after launching the site in 1996, we observed traffic from, and received e-mail from, other types of users, such as school libraries and academic libraries. Responding to these users, we expanded the scope of LibraryLand between 1996 and 1998 to include aspects of library service in which our institution had little experience, such as music collections, art collections, reserve room operation, serial back issues, government documents, conservation, archives, and so on. Thus, we were expanding the scope of LibraryLand while also coping with the burgeoning number of sites to catalog in our existing categories.
We started automated link checks of the LibraryLand site in late 1997, using the commercial LinkBot product (http://www.tetranetsoftware.com/). However, although link checking products do alleviate much of the task of identifying good and bad links, manual review of reported Moved and Bad (404 Error) links is still required to identify relocated sites. Our experience suggests that half the sites reported as 404 errors could actually be found elsewhere on reorganized Web servers or under new domains. Unfortunately, we had to conclude that automated link checking in and of itself was not sufficient to ease the time being sunk into the LibraryLand site.
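As a rough illustration of why manual review remains necessary, the short sketch below (in Python; LinkBot itself is a commercial product and works differently, and the URLs shown are hypothetical) flags broken and redirected links but cannot, on its own, say where a relocated page now lives:

```python
# Minimal link-check sketch: flag moved and dead links in a bookmark list.
# An automated pass like this identifies the problem URLs, but a person
# still has to hunt down where a page reported as 404 has actually gone.
import urllib.request
import urllib.error

# Hypothetical sample of listed links; LibraryLand's real lists held ~4,300.
urls = [
    "http://www.example.org/techserv/serials.html",
    "http://www.example.org/childrens/storyhour.html",
]

for url in urls:
    try:
        response = urllib.request.urlopen(
            urllib.request.Request(url, method="HEAD"), timeout=10)
        final = response.geturl()
        if final != url:
            print(f"MOVED  {url} -> {final}")   # redirect was followed
        else:
            print(f"OK     {url}")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            print(f"404    {url}  (may exist elsewhere; needs manual review)")
        else:
            print(f"ERROR  {url}  (HTTP {err.code})")
    except urllib.error.URLError as err:
        print(f"DOWN   {url}  ({err.reason})")
```

A Moved report at least carries the new location; a 404 leaves the maintainer to hunt for the page by hand.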
Narrowing the Scope (What's Wrong with Web Rings?)
During the months of collecting URLs for LibraryLand, we noted there were other Web resource guides that specialized in topics overlapping some of the major categories on our site. Not only did these guides overlap the coverage of LibraryLand sections, they were often better! They were updated more often, contained more site links, included descriptive and/or critical annotations, and organized their material more systematically. Moreover, they were maintained by subject experts.
What would be wrong with pulling down our overlapping sections of LibraryLand and instead providing direct links to these other excellent resource sites? We even considered (but never actually tried to coordinate) the creation of a "Web ring" that would link LibraryLand and other library-related resource guides. However, it took little reflection to realize that Web rings are best suited to encourage surfing of related sites and are not appropriate for quickly finding information located on any one individual site. Hence, the popularity of Web rings for hobby and fan sites, as opposed to more serious topical sites. (It would be nice to see commercial Web ring sponsors offer indexing of all sites included in a ring as an added service.)
There were other reasons why the Web ring concept was unsatisfactory. Dropping selected categories from LibraryLand would hurt the thematic integrity of the remaining sections. Moreover, we had already added a site search capability to LibraryLand in 1997, and that search software did not yet support indexing of remote sites. Thinking about those site search limitations did, however, lead us toward the idea of transforming LibraryLand into an index of remote sites.
Swish!
Surprisingly (given the plethora of site search software packages available to Webmasters), after a hasty look it seemed that few options existed for targeted indexing of disparate servers. Our own site search software, the freeware SWISH-E package (http://sunsite.berkeley.edu/SWISH-E/), did not document support for HTTP indexing of remote servers (until recent versions, but by then we had been steered in a slightly different direction). Indeed, many of the self-professed subject search engines we found on the Web were not actually indexing other servers; they were merely indexing a database of cataloged sites residing on their own servers.
The RCLS was certainly familiar with the indexed database concept: it's the mechanism used in the KidsClick! (http://sunsite.berkeley.edu/KidsClick!/) search guide for kids that we launched in early 1998 under a Library Services and Technology Act grant. To digress just a little, RCLS was inspired to create KidsClick! through the example of the Librarians' Index to the Internet (LII) (http://sunsite.berkeley.edu/InternetIndex/), a project initiated by Carole Leita at the Berkeley Public Library and supported by the Berkeley Digital Library SunSITE (http://sunsite.berkeley.edu/), a cooperative effort between the University of California at Berkeley and Sun Microsystems. The technology that drove the LII searching, coincidentally, was the same SWISH-E engine we were already using as our LibraryLand site search.
So when considering indexing remote sites, without wasting time doing a thorough literature search, we used Occam's razor and just sent an e-mail to Roy Tennant, the manager of the Berkeley SunSITE. Roy was familiar with SWISH-E's capabilities, since the enhancement work on the original SWISH engine had been done at Berkeley. Roy was also responsible for adapting SWISH-E to the database maintenance routines used by both LII and KidsClick! So our question to Roy was: Is there any way to index remote sites using SWISH-E? Roy wasn't too encouraged by what he knew of the remote server indexing capabilities that were just being built into SWISH-E. To his mind, those features were being developed mainly to support spider indexing of servers on local or wide-area networks, and SWISH-E alone would not really be adaptable enough to target specific directories via its HTTP indexing. Roy told us he would mull over the idea of how remote indexing could be done.
Remote Indexing? Focused Searching? Specialized Search Engines? Harvesting? Limited Search Engines? What's This Thing Called?
As mentioned above, our hasty initial search revealed few options for targeted HTTP indexing of remote servers. Why are so few of these utilities being marketed to a broader public when the technology that allows targeted indexing of submitted URLs has long been used by the major search engines (Alta Vista, Lycos, Excite, InfoSeek)? Why isn't this a more common technology, given recent reports (http://www.metrics.com/) that paint a pessimistic view of the low coverage of the entire Web by these major search services? Wouldn't it make sense that specialized search engines should proliferate as the all-inclusive engines struggle?
The main reason this technology isn't more common at this point is due to the behavior of "bots," the automated programs that crawl from one server to another, compiling URLs and indexing documents. Historically, there have been problems with what these agents do as they are working. The Botspot (http://www.botspot.com/) site explains:
At times, Webmasters look on some forms of robots as a nuisance. A spider robot may uncover information the Webmaster would prefer to remain secret; occasionally, a bot will misbehave as it crawls through a Web site, looking for URLs over and over and slowing down the server's performance. As a result, search engine developers have formed standards on how robots should behave and how they can be excluded from Web sites.

Moreover, in our opinion it appears that there has been much more interest in incorporating tricky artificial intelligence routines into data mining agents than in creating agents that simply take a given list of servers and directories to target and visit. To us, it appears as though a market is being missed, and much of that potential market could be librarians.
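For reference, the robots exclusion standard mentioned above is simply a plain-text robots.txt file placed at a server's root; the paths in this hypothetical example are our own, but the User-agent and Disallow directives are the standard's own syntax:

```
# Hypothetical robots.txt at the root of a Web server
# The wildcard applies the rules to all robots.
User-agent: *
# Keep crawlers out of scripts and out of pages not meant for indexing.
Disallow: /cgi-bin/
Disallow: /private/
```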
However, there are several projects in the library and academic communities that have used slightly different technologies to create indexes of documents on targeted remote servers. There does not yet appear to be a standard term for this type of project: "remote indexing," "focused search engines," "specialized search engines," "harvested indexing," and "limited search engine" have all been applied to such projects (as well as to other services that are really just searching internal databases).
Remote Retrieval, Local Indexing
After thinking over the problem we had presented to him, Roy Tennant suggested using the free UNIX utility wget (http://sunsite.auc.dk/wget/) to retrieve files from other servers, then locally index them with SWISH-E. Roy offered to develop a straightforward script that would translate the display of the indexed file paths back to the original remote URLs. Wget would give us a high degree of control over the remote paths we wanted to retrieve and the types of files that could be included or excluded. Performing the actual SWISH-E indexing after the files had been retrieved would be quick and would avoid the drag on remote servers that is the main criticism of bots. Moreover, the LibraryLand Index project was viewed as falling within the scope of the Berkeley Digital Library SunSITE's support of digital library research and development projects. Roy could develop the needed routines on the SunSITE server, giving RCLS access for file maintenance and script running.
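For illustration, here is a minimal sketch of such a retrieve-then-index pipeline in Python, assuming wget and SWISH-E are installed. The directory names, index filename, depth limit, and target URLs are our own placeholder assumptions, the SWISH-E invocation is the simple command-line form rather than a full configuration file, and the actual scripts developed for the SunSITE are not reproduced here.

```python
# Sketch of the retrieve-then-index approach: mirror selected remote
# directories with wget, index the local copies with SWISH-E, and map
# local file paths back to the URLs they came from for display.
import subprocess

MIRROR_DIR = "/data/libraryland/mirror"        # hypothetical local storage
INDEX_FILE = "/data/libraryland/index.swish"   # hypothetical SWISH-E index

# Targeted starting points: specific directories, not whole servers.
targets = [
    "http://www.example.edu/library/acqweb/",   # hypothetical
    "http://www.example.org/conservation/",     # hypothetical
]

for url in targets:
    # -r recurse, -np never ascend above the starting directory,
    # -l 5 limit depth, -A fetch only HTML, -P local directory prefix.
    subprocess.run(
        ["wget", "-r", "-np", "-l", "5", "-A", "*.html,*.htm",
         "-P", MIRROR_DIR, "-q", url],
        check=False,  # one unreachable site should not abort the whole run
    )

# Index the mirrored files locally; the remote servers see no further load.
subprocess.run(["swish-e", "-i", MIRROR_DIR, "-f", INDEX_FILE], check=True)

def path_to_url(local_path: str) -> str:
    """Translate an indexed local path back to the original remote URL.
    wget stores mirrored files under <prefix>/<host>/<path>, so stripping
    the prefix and prepending http:// recovers the address."""
    relative = local_path[len(MIRROR_DIR):].lstrip("/")
    return "http://" + relative
```

At search time, the same translation is applied to every hit so that results display as working links rather than local file paths.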
In March 1999, Roy ran a sample of the wget/SWISH-E tandem on two of the URLs that we planned to target with the LibraryLand Index. With a few more tests, the routine appeared to be working so well that it was announced on the Web4lib listserv as a working service at its new home, http://sunsite.berkeley.edu/LibraryLand/ on March 29, 1999.
The first indexing run included the remaining LibraryLand link lists as well as ten other sites:
AcqWeb
Libstats
Library Support Staff Resource Center
Young Adult Librarian's Help/Home Page
Electronic Reserves Clearinghouse
Resources of Use to Government Documents Librarians
Health Sciences Internet Librarianship Resource Page
Book Arts Web
Conservation Online
FIDDO Project
Prior to this announcement, RCLS attempted to contact the Webmasters of all the above sites to inform them of our indexing plans and to reassure them that the process would not be a drag on their servers.
Issues in Indexing
Almost immediately, several issues arose concerning how the indexing worked and the usability of the search results. What types of documents should be indexed: HTML, txt, doc, pdf, wpd, xls? Should mailing list archives be included in the indexing? Should archived newsletters and journals be indexed? Should any document be indexed that was not originally intended for Web publication? How can search results be meaningfully presented when several indexed sites contained less than 100,000 bytes each in a few files, while other sites indexed over 67 million bytes in thousands of files? How can wget be adjusted to exclude files we don't want when they are within the same directory structure as files we do want? How often should reindexing take place? How much space could we afford to use to store the retrieved sites? How many different sites did we want to expand our indexing to? And is anybody using LibraryLand?
As these issues arose, we (Jerry Kuntz at RCLS and Roy Tennant at SunSITE) made some quick decisions. First, RCLS adjusted the wget command to exclude retrieval of anything except HTML documents, on the judgment that only documents specifically created for Web publication should be indexed. At the same time, we refined the wget commands to exclude discussion list archives and archives of serials translated from print. Here, our reasoning was that storage considerations, the alternate indexing that already existed for these data, and the diminished timeliness of the data all argued against inclusion.
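Concretely, these refinements amount to a couple of extra wget options; the directory and site names below are hypothetical stand-ins for the archive areas we excluded, continuing the sketch given earlier:

```python
# Hypothetical refinement of the retrieval step: fetch only HTML, and skip
# directories that hold discussion list archives and serial back issues.
import subprocess

subprocess.run([
    "wget", "-r", "-np", "-q",
    "-A", "*.html,*.htm",                     # only documents written for the Web
    "-X", "/listserv-archives,/backissues",   # excluded directories (hypothetical)
    "-P", "/data/libraryland/mirror",
    "http://www.example.org/conservation/",   # hypothetical target
], check=False)
```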
To present meaningful search results from sites of such different scale, Roy, at RCLS's suggestion, developed a new initial search result page that was unveiled at the beginning of June 1999. This initial screen lists, for each indexed site, the number of documents matching the input search terms (see Figure 2).
The idea behind this is that users will refine their search from this screen to target the hits from the site most likely to answer their query. This intervening step seemed the best way to address the differing scale of the sites we were indexing without requiring the user to compose overly complicated query statements or qualifiers.
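A per-site summary of this kind can be produced simply by grouping hits by the site each indexed document came from; the sketch below is our own illustration (with hypothetical hit URLs), not the actual SunSITE result-page script.

```python
# Sketch of the initial result screen: count matching documents per site
# so the user can narrow the search to the most promising source.
from collections import Counter
from urllib.parse import urlparse

# Hypothetical hits, already translated from local paths back to URLs.
hits = [
    "http://www.example.edu/library/acqweb/serials.html",
    "http://www.example.edu/library/acqweb/vendors.html",
    "http://www.example.org/conservation/mold.html",
]

counts = Counter(urlparse(url).netloc for url in hits)
for site, count in counts.most_common():
    print(f"{count:4d} matching documents at {site}")
```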
Recent upgrades to the Berkeley Digital Library SunSITE led Roy to reassure us that the drive space existed if we wanted to index additional large-scale sites; in fact, he encouraged us to do so. In late July 1999, the number of indexed sites grew to eighteen, with new additions including the following:
Internet School Library Media Center
American Library Association
Librarians Serving Genealogists
Ready, 'Net, Go! Archival Internet Resources
Art Libraries Society of North America
LLRX Links
Serials in Cyberspace
Music Library Association Clearinghouse
It remains to be seen whether LibraryLand will gain popularity in its new role as a quick-finding tool for library-related Web documents. Until that is clear, RCLS has determined that reindexing will take place approximately every two months and that three to four new sites will be added each time, replacing as many remaining sections of the original LibraryLand link lists as possible. We are already tracking SunSITE's snapshot Web server statistics to gauge the LibraryLand Index's popularity, and we will look at those figures in conjunction with the number of sites linking to the new LibraryLand and with anecdotal reaction from the library community.
About the Author
Jerry Kuntz is currently the electronic resources consultant for the Ramapo Catskill (NY) Library System; formerly the automation manager for the Finger Lakes (NY) Library System and systems librarian for the Morris County (NJ) Library. Graduate, Rutgers University (M.L.S.); Earlham College (B.A.). Recipient, 1992 UMI/Data Courier Library Technology Award. Editor, Library Technology Consortia (Mecklermedia, 1994). Project manager for KidsClick! (http://sunsite.berkeley.edu/KidsClick!/), a Web guide/search engine for children by librarians. Web manager of LibraryLand (http://sunsite.berkeley.edu/LibraryLand/), an index of Web resources for librarians, and DeskRef (http://www.rcls.org/deskref/), Ramapo Catskill's virtual reference desk. Communications to the author should be addressed to his e-mail: jkuntz@ansernet.rcls.org.
Copyright Sage Publications, Inc. 1999
