Purpose - The purpose of this paper is to highlight the main features of the components of ProQuest's Graduate Education Program (GEP), a giga package of databases and software services for LIS faculty and students. These allow complimentary access to hundreds of indexing/abstracting, directory and full-text databases, to RefWorks, the most sophisticated reference management program, and to SUMMON, a powerful digital resource discovery program.
Design/methodology/approach - This phase of the research focused on evaluating the largest module of GEP, which offers 41 databases with more than 200 million records (half of them full-text documents) on the new ProQuest software platform. The paper presents the major content and software features of this module.
Findings - The single module, GEP-41, is an important contribution to LIS education, giving LIS faculty and librarians free access to a large number of databases covering the LIS and LIS-related fields, including the new ProQuest Library and Information Science database with more than 1.2 million items. The other modules of GEP extend the coverage to databases appropriate to LIS faculty and students interested in various tracks of librarianship. This project certainly will benefit ProQuest itself in the long run. From the perspective of the primary beneficiaries, the LIS professors and students, the rich infrastructure for this project offers unprecedented opportunities for a digital renaissance in every aspect of LIS education and research.
Originality/value - This service, highly relevant for LIS education worldwide, was released in late 2012, and research papers have not been published about it yet. The paper focuses on the measurable, quantitative traits of the largest component of the service.
Introduction
LIS schools and programmes, faculty and students have often been limited - for cost reasons - to using only databases and software tools which are licensed by their hosting university to serve the needs of the departments with the largest student body and faculty. The LIS units almost never belong to that league, as most of them have a faculty below 15, and a student body below 150 even in the most developed countries.
One of the key objectives of the efficient education in the Library and Information Sciences (LIS) programmes is to teach the students to become savvy searchers, and empower them to be efficient finders of information pertinent for the end-users. GEP officially stands for Graduate Education Program, but being free for LIS educators and students, it could also be the acronym for Generous Endowment Package. It is an outstanding tool set for LIS educators to teach and demonstrate to future and practicing librarians state-of-the-art online digital resource discovery, information retrieval, and post-processing services from hundreds of indexing/abstracting (I&A), full-text (FT), and directory databases, widely used in public, school, university, and research libraries.
The large database collection of GEP includes two sophisticated software tools (Summon and RefWorks) and many novel training and self-study resources. These provide a perfect infrastructure to help LIS instructors and students in doing research for their course work, preparing and presenting digital course materials, irrespective of what resources are available at their university which hosts the LIS school/programme. GEP is made available free of charge worldwide, i.e. not only to ALA and NCATE accredited LIS schools in the USA and Canada, but to all qualifying LIS schools (except for the Dialog Professional module available only to ALA accredited library schools).
There are rules for qualifying to access the free GEP service, with details available at [email protected], but Karen Hinton, Manager of Customer Education and Training for the Northwest American region, summarised the conditions well, and quoting her summary also gives the opportunity to acknowledge her help in this research:
GEP and its sister project ERP (Educational Resources Program) are available to faculty and students of graduate degree programs in library science and related disciplines worldwide. ProQuest is also available to instructors at schools of education recognized by the National Council for Accreditation of Teacher Education (NCATE). ProQuest's community site, the Discover More Corps, provides a portal to the free resources offered through the GEP and ERP, as well as a wealth of instructional materials, and a venue for educators and students to network with colleagues worldwide.
The need for constant access to a variety of databases and bibliographic software tools and their characteristics have been discussed in LIS journals and conference proceedings for a long time. Many of the most current ones focus on resource discovery, realising that users are not willing to jump from one database to another, or they are simply uninformed about the existence of all the databases licensed by their institutions ([1] Asher et al., 2013; [7] Keene, 2011; [9] Rowe, 2011; [13] Wang and Mi, 2012; [14] Way, 2010). The options for managing bibliographic records keep growing and improving as well as the metadata elements describing resources ([2] Hensley and Kern, 2011; [8] Register et al., 2008). These in turn require more web-based materials to enhance the information literacy of students ([11] Su and Kuo, 2010). Distance education adds new challenges to provide state-of-the-art skills in educating librarians ([10] Shaffer, 2011).
Not all LIS faculty members and students are aware of which databases have FT coverage of certain journals relevant for their courses, let alone the time span of FT coverage. The GEP service helps LIS educators and students in finding such information directly through UlrichsWeb, the Global Serials Directory, which is part of the free GEP package, although without the excellent Serials Analysis System module ([5] Jacso, 2012), or indirectly, by restricting the search by topic and/or by journal title to FT documents. Clicking on the automatically generated clusters by databases and publication titles immediately provides the details of all the databases and journals as well as the number of hits they provided for the query in the GEP-41. In this paper, the GEP-41 label will be used for this module which - on the surface - has 41 databases, including the three databases of the Community of Science (COS) service. There is much more than meets the eye in GEP-41 because several of the databases which appear on the menu are database families combining several other databases.
Sometimes the name of a database does not suggest at all the FT coverage of LIS journals. In the experience of this author it is little known by LIS faculty and students how comprehensive the I&A and the FT coverage of journals is in various databases. GEP identifies the databases, their volume of coverage and takes the user to the one(s) selected, as shown in Figure 1 [Figure omitted. See Article Image.] for Online Information Review (which started to appear under this title in 2000).
Several databases can be selected to be searched in one fell swoop, which is a very good idea in case the preferred database(s) miss a volume, an issue, or one or more article(s). Very smartly, the software eliminates the duplicates automatically, and even more smartly still allows the users to choose to display them.
The context
Support for high-tech LIS education by corporations in the information industry is not new. The Dialog Corporation (which is now part of ProQuest and was founded by one of the most far-sighted pioneers of the information industry, Roger K. Summit) has had a service by the same name, GEP, and similar functionality. Dialog's GEP also included materials for curriculum development and handbooks for instructors. Its free Online Training and Practice programme has offered since the early 1980s free access to subsets of many databases hosted on the Dialog platform. This service (in terms of its longevity and choice of databases) was not matched by its competitors until ProQuest launched the GEP service, having acquired the Dialog company earlier. It should be added that many representatives of the dozen other online information service providers have been exceptionally accommodating to requests for temporary access to certain databases for a special course or LIS-related research projects in the past 25+ years.
EBSCO supports LIS programmes (as well as practicing librarians and any interested users) but not on a scale comparable to ProQuest. It offers free access for anyone to the Library, Information Science & Technology Abstracts (LISTA) database, which by 2013 had grown to nearly 1.5 million records. Its Teacher Reference Center database has almost a million I&A records for articles from 280 journals, and GreenFILE offers more than 710,000 I&A records (including 11,000 FT documents, 3,000 of them with illustrations) related to environmental sciences.
ProQuest's recent launch of GEP is a landmark initiative for generous support for LIS education around the world. Its huge and free database collections, and two excellent programmes for federated searching and managing references, wrapped in a social networking context, are of extraordinary importance especially for the developing countries which have very limited budgets for licensing databases and software packages ([6] Jacso, 2013). GEP provides a good visualisation of its components (Figure 2 [Figure omitted. See Article Image.]).
Databases hosted on the ProQuest platform
This paper discusses the findings of the author about the largest module of the whole GEP service which operates on the new ProQuest platform. This set represents a combination of databases primarily produced by Cambridge Scientific Abstracts Inc (CSA) and ProQuest (under many of its former names). Some of the databases were created by other companies which were acquired by ProQuest. This component (GEP-41) does not have - yet - a distinctive qualifier such as the K-12, Dialog Professional, and Government Collection modules have which share the ProQuest logo but not the new software platform (except for the last mentioned module).
The GEP-41 module currently offers LIS faculty and students unlimited free access to a very good mix of databases, which in my estimate have about 205 million records, including about 115 million documents in one or more FT formats (ASCII plain text, HTML, and/or PDF). The size of many of the daily updated databases kept growing during the test phase, and data reported here represent the status as of 10 January 2013.
GEP-41 has 12 FT, two directory, and three newspaper databases/database families, and ProQuest's collection of three million dissertations and theses (with a puny 2.2 per cent of FT documents). The rest of the resources are I&A databases. As mentioned above, many of the databases in GEP-41 are database families, and include separate databases which can be licensed by libraries as single databases on their own. For example ABI/INFORM Complete consists of ABI/INFORM Dateline, ABI/INFORM Global, and ABI/INFORM Trade & Industry databases. The Technology Database is the parent database of 24 databases, Biological Sciences is the "pater familias" database of 29 databases specialising in sub-disciplinary areas. There are actually 92 databases at the micro level aggregated into 41 databases on the macro level. One micro-level database may be assigned to several of the macro-level databases.
ProQuest seems to deliver more than it promises (in terms of the number of databases), which is an appreciated but atypical practice. Because the same articles have I&A and/or FT records in several databases, such duplicates are not reported in the hit counts, so the numbers of records (items) in Table I [Figure omitted. See Article Image.] refer to the net values. This attitude in hit counting and reporting is fair, but the casual users looking at the list of 41 databases may not realise that their favourite databases are also covered, such as Animal Behavior Abstracts, Toxicology Abstracts, METADEX, Computer and Information Systems Abstracts, Civil Engineering Abstracts, Earthquake Engineering Abstracts, and dozens of others. There is a tiny expand symbol on the database menu to mark database families, but a highly visible link to an elaborate display of a family tree of the databases would be very informative.
The numbers of records retrieved by the test searches from all the 41 databases/database families seem to be realistic - with one exception. There are only 30,906 records for the Public Affairs Information Service (PAIS) database, which in reality has nearly half a million records as of early 2013.
Databases by discipline, genre, and FT availability
The databases are grouped in seven disciplinary clusters (arts, business, health and medicine, history, literature and language, science and technology, and social sciences) and two genre clusters (newspapers and dissertations and theses). Some of the databases are assigned to multiple disciplinary areas so the hit counts in Table I [Figure omitted. See Article Image.] should not be aggregated. It is worth repeating that the ProQuest software automatically eliminates the duplicates in reporting the hit counts. There are some glitches in the de-duplication process, and ProQuest was notified about them. It is not possible to do perfect de-duping even in a single database because of the inconsistencies and inaccuracies in the bibliographic data, and it is much more difficult to de-dupe across several databases with different metadata structures and content.
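The difficulty of de-duplication described above can be illustrated with a minimal sketch. It is not ProQuest's algorithm, which is not documented in this paper; the matching key (normalised title, year, first author surname), the field names and the database labels are all assumptions made for illustration.

```python
import re
import unicodedata
from typing import Iterable


def dedup_key(record: dict) -> tuple:
    """Build a crude matching key: normalised title, year, first author surname."""
    def norm(text: str) -> str:
        # Strip accents, lowercase, and keep only letters and digits.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        return re.sub(r"[^a-z0-9]+", "", text.lower())

    surname = record.get("author", "").split(",")[0]
    return (norm(record.get("title", "")), str(record.get("year", "")), norm(surname))


def deduplicate(records: Iterable[dict]) -> list:
    """Keep the first record seen per key and note every database that held it."""
    merged = {}
    for rec in records:
        key = dedup_key(rec)
        if key in merged:
            merged[key]["databases"].append(rec["database"])
        else:
            merged[key] = {**rec, "databases": [rec["database"]]}
    return list(merged.values())


# Hypothetical hits for the same article from three invented databases.
hits = [
    {"title": "CSA Illustrata", "author": "Jacso, P.", "year": 2007, "database": "DB-A"},
    {"title": "CSA illustrata.", "author": "Jacsó, Péter", "year": 2007, "database": "DB-B"},
    {"title": "CSA Illustrata", "author": "Jacso, P.", "year": 2070, "database": "DB-C"},
]
unique = deduplicate(hits)
```

The first two hits merge despite differences in case, punctuation and diacritics, while the third survives as a false "unique" record because of a mistyped year. Inaccuracies of this kind are what make perfect de-duplication unattainable, and the problem multiplies when the databases use different metadata structures.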
The business disciplinary area has the largest number of records (nearly 81.5 million) followed by science and technology (73.7 million), health and medicine (50.3 million), the social sciences (42.3 million), history (34.3 million), literature and language (19.8 million), and arts (18.7 million). In the FT domain the business disciplinary area remains the leading component (60.7 million documents), while history is the very distant second largest component with nearly 24 million documents (Figure 3 [Figure omitted. See Article Image.]).
The number of FT databases is 30 per cent of all the databases in the GEP-41 segment at the macro database level, if the FT records of 67,120 items in the dissertations and theses database and the 5,235 items of the Technology Research database are considered as partially FT databases. However, at the FT document level, the average FT availability in GEP-41 goes up to 50 per cent. Ignoring these two databases mentioned above brings the average for FT content to 63.2 per cent and the median to 75.5 per cent for the ten FT databases. This is a very impressive number, which reflects the culturally priceless digitised collections of newspapers and news magazines, including the legendary US newspapers. Table II [Figure omitted. See Article Image.] shows the FT availability of some of these in the National Newspapers Core database family alone.
Overall the GEP-41 databases cover in FT format more than 12,000 journals to various extents. The ratio of scholarly journals is 55.5 per cent in GEP-41, but only 4.7 per cent in its FT segment. Of course it must be realised that 1 per cent in this case means about one million FT documents from scholarly journals. This figure is an estimate because there are records for tables of contents pages, announcements, news items, and other types of front matter and back matter in each issue of the journals or serial publications in GEP-41.
Deep-indexed records
The best asset for ProQuest in the acquisition of CSA was its Illustrata database family, which represents "not merely an evolutionary step or an experimental system of a grant project, but it is a giant leap in scholarly information retrieval" ([3] Jacso, 2007, p. 53). These records have indexing terms assigned to the visual content of documents by the types of illustrations, the captions and other textual information, and include enlargeable thumbnails ([3] Jacso, 2007). Further details about the power of deep-indexed records are described by [12] Tenopir et al. (2006).
Based on test searches it is estimated that there are index entries for about 26.1 million figures and tables in about 14.7 million records in GEP-41. These deep-indexed records are far superior to the traditional I&A records by virtue of adding well-searchable metadata about the illustrations (tables, charts, graphs, photos, and other pictures) which often prove the adage and indeed may be worth more than 1,000 words. Figure 4a [Figure omitted. See Article Image.] shows a sample of these specific index terms, while Figure 4b [Figure omitted. See Article Image.] illustrates a record with the enlargeable thumbnails of the figures and tables in the paper. This is an exceptionally useful metadata element both for focusing topical searches, and guiding the users in their selection of items from a result list.
Time span of database coverage
The list of the databases on the main menu indicates - at the macro level - the time span of 31 databases, and some of them are inaccurate, one of them grossly so. There is no consensus on how to inform the users when the coverage of a source begins and ends. There are many database producers and publishers who use the year when the oldest document was published and has a record in the database, representing highly inflated values ([4] Jacso, 2009). The problem with this is that if there is only a smattering of records, for example from 1914 to 1970 in a database, it is unethical to claim 1914 as its start year of coverage which would be frustrating for users interested in the news events of the two world wars.
ProQuest does not play games with claiming the time span of coverage in GEP-41. On the contrary, it often indicates a start year of coverage several years later than the year when substantial coverage actually started. For example, the GEP-41 database menu indicates that the coverage of the National Newspapers Core database family starts in 1985. Actually the coverage of the top newspapers, such as New York Times, Christian Science Monitor, and the Wall Street Journal goes back to 1980. If the user needs information for a research project about events in the first term of Ronald Reagan, i.e. the first half of the 1980s, the inaccurately claimed start year may make the user overlook this excellent database family.
For ten databases there is no time span specified. For COS Scholar Universe this is understandable as it is a directory of scientists. It would be in the interest of both ProQuest and the users to proudly and truly show that Canadian Newsstand Complete goes back to 1977, the Canadian Business and Current Affairs (CBCA) database to 1933, ProQuest Health and Medical Complete to 1950, ProQuest Illustrata: Technology Collection to 1930, and the Research Library to 1902. For PAIS International there is no indication of the time span, and in this case it may have to do with the loading of this truly international database of nearly 500,000 records, which in GEP-41 has fewer than 31,000 records.
Many of the GEP databases were found to have significant and very significant volumes of records when limiting the search to earlier years than the claimed start year of coverage as shown in Table I [Figure omitted. See Article Image.] in the column labelled Pre-start records. This labelling refers to the fact that the publication year of these items is prior to the start year reported in the database menu of ProQuest. The absolute number of records in the pre-start period is nearly 3.4 million records for Biological Sciences, including 2.3 million MEDLINE records, over half a million for National Newspapers Core, 467,500 for ABI/INFORM Complete, 46,500 for Illustrata Natural Science, 19,000 for NCJRS, close to 18,000 each for ARTbibliographies Modern and ASFA, above 12,000 for British Humanities Index, and nearly 11,000 for ERIC. These underestimations of the time span are not good for PR. A compromise that avoids both inflating and deflating the time span is to use the first year that is followed by at least five consecutive years with records available ([4] Jacso, 2009).
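The compromise rule for the start year can be expressed as a short sketch. It reflects one reading of the rule (the earliest year that has records and is followed by five consecutive years that also have records); the function name, the `min_records` threshold and the yearly counts are hypothetical and do not come from GEP-41.

```python
from typing import Dict, Optional


def substantive_start_year(counts: Dict[int, int], run: int = 5,
                           min_records: int = 1) -> Optional[int]:
    """Return the earliest year that has records and is followed by `run`
    consecutive years that also have records; None if no such year exists."""
    covered = {year for year, n in counts.items() if n >= min_records}
    for year in sorted(covered):
        if all(year + k in covered for k in range(1, run + 1)):
            return year
    return None


# Invented counts: a lone 1914 record, a short 1950-1952 run,
# and continuous coverage from 1980 to 1990.
counts = {1914: 1, 1950: 2, 1951: 1, 1952: 3}
counts.update({year: 100 for year in range(1980, 1991)})
start = substantive_start_year(counts)
```

With these invented counts the rule skips the isolated 1914 record and the three-year run of the early 1950s and settles on 1980. Raising `min_records` would make the rule stricter for databases in which a handful of stray records per year should not count as coverage.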
There is one database with a very incorrect start date in the GEP-41 database list: MEDLINE is reported in the database menu to be covered from 1993. There are various subsets of MEDLINE. The National Library of Medicine refers to a subset with "very selective coverage" of the sources from 1809 to 1864, and another with "selective coverage" from 1865 to 1966. Many online services split MEDLINE because of its large size (23 million records in mid-January 2013) but GEP-41 has a single MEDLINE database. Actually it does have the most comprehensive coverage, going back to 1809. Ironically GEP-41 has one of the best implementations of MEDLINE, by virtue of featuring more than 3.6 million records with figures and tables, but users may dismiss it, believing that it has records only for publications from 1993 onwards.
Source and document types
The GEP-41 databases cover ten major source types substantially. Newspapers (including current and historical ones) represent the largest component (36.2 per cent), followed by scholarly journals (26.9 per cent), trade journals (12.3 per cent), wire services' feeds (10.4 per cent), magazines (3.4 per cent), government documents (2.8 per cent), conference papers (2.8 per cent), dissertations (1.6 per cent), reports (1.5 per cent), books (0.7 per cent), and other sources (1.3 per cent). In the FT segment of GEP-41 the newspaper sources and wire services' feeds make up the bulk, while scholarly journals' share plummets from 27 to 4 per cent. Trade journals have a somewhat larger share in the FT segment than in the complete GEP-41 module (Figure 5 [Figure omitted. See Article Image.]).
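As a quick sanity check, the source-type shares quoted above can be added up. The snippet below only restates the percentages from this section; the variable names are arbitrary.

```python
# Percentage shares of source types in GEP-41, as quoted in the text above.
shares = {
    "newspapers": 36.2,
    "scholarly journals": 26.9,
    "trade journals": 12.3,
    "wire services' feeds": 10.4,
    "magazines": 3.4,
    "government documents": 2.8,
    "conference papers": 2.8,
    "dissertations": 1.6,
    "reports": 1.5,
    "books": 0.7,
    "other sources": 1.3,
}

# Round to one decimal place to avoid floating-point noise in the total.
total = round(sum(shares.values()), 1)

# Combined share of the two news-oriented source types.
news_share = round(shares["newspapers"] + shares["wire services' feeds"], 1)
```

The shares add up to 99.9 per cent, and the 0.1 point gap is presumably a rounding effect. Newspapers and wire services' feeds together account for 46.6 per cent of all items in the module.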
There are 70 document types used by the component databases in GEP-41. Regular and feature articles constitute 48 per cent of the entire GEP-41 set and news items make up 25.5 per cent of the content. In the FT segment the share of the article items goes down to 22 per cent, while the news items make up 45.5 per cent of the items by document type. This is understandable because newspapers constitute the largest set, with Canadian Newsstand Complete, National Newspapers Core, and the Historical Newspapers family's retrospective edition of The Washington Post. ABI/INFORM Complete and CBCA Complete also have a significant subset of news items. This document genre is also enhanced by a large number of news items for scientists and professionals in academic and trade journals, magazines, and scientific newsletters of academics and practitioners covered by ABI/INFORM Complete, CBCA Complete, and the ProQuest Research Library databases. It is to be noted that there are nearly 20 million records which do not have any document type metadata.
Conclusions
The single module, GEP-41, is an important contribution in itself to LIS education, providing free access to LIS faculty and librarians to so many databases covering the LIS and LIS-related fields (computer science, communications science, education), including the brand new ProQuest Library and Information Science database with more than 1.2 million items of which 67 per cent are available in FT.
The other modules of GEP extend the coverage to databases appropriate to LIS faculty and students interested in various tracks of librarianship ([6] Jacso, 2013). This project certainly will benefit ProQuest itself in the long run. Future librarians who have been educated about, and are experienced in searching, the variety of databases showing the ProQuest logo will have a very favourable memory of the brand when they become involved in making decisions about which databases to license on which platforms.
The generosity of this endowment is clear from the fact that all the components are equivalent to the licensed editions, not just subsets of databases or limited functionality versions of software. From the perspective of the primary beneficiaries, the LIS professors and students, this endowment package makes ProQuest in the digital age the equivalent of the House of Medici in the fifteenth and sixteenth centuries for artists. The very rich infrastructure for this project offers unprecedented opportunities for a digital renaissance in every aspect of LIS education and research for the haves and the have-nots.
1. Asher, A.D., Duke, L.M. and Wilson, S. (2013), "Paths of discovery: comparing the search effectiveness of EBSCO Discovery Service, Summon, Google Scholar, and Conventional Library Resources", College & Research Libraries, Vol. 74 No. 7 (forthcoming).
2. Hensley, M.K. and Kern, M.K. (2011), "Citation management software: features and futures", Reference and User Services Quarterly, Vol. 50 No. 3, pp. 204-208.
3. Jacso, P. (2007), "CSA Illustrata", Online, Vol. 31 No. 3, pp. 53-54.
4. Jacso, P. (2009), "Database source coverage: hypes, vital signs and reality checks", Online Information Review, Vol. 33 No. 5, pp. 997-1007.
5. Jacso, P. (2012), "Analysis of the Ulrich's Serials Analysis System from the perspective of journal coverage by academic databases", Online Information Review, Vol. 36 No. 2, pp. 307-319.
6. Jacso, P. (2013), "How the components of the ProQuest Graduate Education Program can be used in educating LIS students for various library types" (under review).
7. Keene, C. (2011), "Discovery services: next generation of searching scholarly information", Serials: The Journal for the Serials Community, Vol. 24 No. 2, pp. 193-196.
8. Register, R., Cohn, K., Hawkins, L., Henderson, H., Reynolds, R., Shadle, S.C., Hoffman, W., Rajan, S. and Yueb, P.W. (2008), "Metadata in a digital age: new models of creation, discovery, and use", The Serials Librarian, Vol. 56 Nos 1-4, pp. 7-24.
9. Rowe, R. (2011), "Encore Synergy, Primo Central", The Charleston Advisor, Vol. 12 No. 4, pp. 11-15.
10. Shaffer, B.A. (2011), "Graduate student library research skills: is online instruction effective?", Journal of Library & Information Services in Distance Learning, Vol. 5 Nos 1-2, pp. 35-55.
11. Su, S.F. and Kuo, J. (2010), "Design and development of web-based information literacy tutorials", The Journal of Academic Librarianship, Vol. 36 No. 4, pp. 320-328.
12. Tenopir, C., Sandusky, R. and Casado, M. (2006), "The value of CSA deep indexing for researchers (executive summary)", prepared for CSA, available at: http://works.bepress.com/carol_tenopir/1 (accessed 6 January 2013).
13. Wang, Y. and Mi, J. (2012), "Searchability and discoverability of library resources: federated search and beyond", College & Undergraduate Libraries, Vol. 19 Nos 2-4, pp. 229-245.
14. Way, D. (2010), "The impact of web-scale discovery on the use of a library collection", Serials Review, Vol. 36 No. 4, pp. 214-220.
Peter Jacso, Department of Information and Computer Sciences, University of Hawaii, Honolulu, Hawaii, USA
Figure 1: I&A and full-text coverage of Online Information Review in many unexpected
Figure 2: The content and software components of the global GEP service
Figure 3: Relative distribution of all items and FT items by major disciplinary areas
Figure 4: (a) Excerpts of the types and volumes of figures and tables in GEP-41. (b) Part of the result list showing thumbnails of the figures and tables
Figure 5: All items and FT items by source types in GEP-41
Table I: The profile of the GEP-41 databases/database families at the macro level
Table II: More than 9.5 million FT items from the world's best-known newspapers
Copyright Emerald Group Publishing Limited 2013
