How do you construct a taxonomy of medicine concise enough to mount on a single Web page and comprehensive enough to describe the contents of a set of leading journals whose coverage ranges from public health to molecular biology? This seemingly impossible task was recently presented to the publishing division of the American Medical Association (AMA) to improve subject access to its journals' Websites.
Topical indexing of medical Websites is most familiar from the broad lists of diseases and conditions designed to provide easy access to consumer health information. Indexing of the professional literature, on the other hand, has largely remained in the domain of MEDLINE. With AMA's indexing already based on MEDLINE's Medical Subject Headings (MeSH) vocabulary, creation of a Web taxonomy became a matter of adapting the multiple hierarchies of the more-than-1,000-page MeSH thesaurus to a concise listing of topics that would meet Web publication's requirements for an easy-to-display index while retaining MeSH's inclusivity and scientific validity. AMA took as its precedent the simplified access to MeSH built into the Catalogue and Index of Health-related Internet sites in French (CISMeF) at the University of Rouen [1].
Creating and implementing the taxonomy was a twofold process: first, selecting and arranging terms and, then, constructing a lookup table mapping each of the selected terms to corresponding terms in the MeSH vocabulary. In this way, the broader terms of the Web taxonomy could be automatically associated with the MeSH indexing already in use, avoiding the time-consuming chore of hand tagging the back file of articles. Using MeSH as the basis of the Web taxonomy also ensured that the electronic index shared not only the comprehensive coverage and consistency of a well-maintained controlled vocabulary but also the advantages of its hierarchical trees.
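The lookup-table step can be sketched in a few lines of Python. This is a minimal illustration, not AMA's actual implementation: the table contents, the function names (`invert`, `tag_article`), and the sample article are all hypothetical, and the real table would hold one entry for each topic in the taxonomy.

```python
from collections import defaultdict

# Hypothetical lookup table: Web taxonomy topic -> MeSH headings it covers.
TOPIC_TO_MESH = {
    "Dermatology": {"Skin Diseases", "Dermatitis"},
    "Infectious Diseases": {"Communicable Diseases", "Bacterial Infections"},
    "Quality of Life": {"Quality of Life"},
}


def invert(table):
    """Build the reverse index: MeSH heading -> set of taxonomy topics."""
    mesh_to_topics = defaultdict(set)
    for topic, headings in table.items():
        for heading in headings:
            mesh_to_topics[heading].add(topic)
    return mesh_to_topics


def tag_article(mesh_headings, mesh_to_topics):
    """Return the taxonomy topics implied by an article's existing MeSH indexing."""
    topics = set()
    for heading in mesh_headings:
        # .get() avoids creating empty entries for headings outside the table.
        topics |= mesh_to_topics.get(heading, set())
    return sorted(topics)


MESH_TO_TOPICS = invert(TOPIC_TO_MESH)

# A back-file article is tagged from its stored headings; no hand tagging needed.
article_headings = ["Dermatitis", "Quality of Life", "Adult"]
print(tag_article(article_headings, MESH_TO_TOPICS))
```

Because the table is applied to indexing that already exists, changing a topic's coverage means editing one table entry and rerunning the tagging, rather than revisiting individual articles.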
SELECTION AND DISPLAY OF SUBJECT TERMS
Selecting and arranging subject terms involved surveying the most frequently assigned MeSH terms and comparing them with a list of topics drawn up by editors that, in their judgment, best highlighted the content of their journals. Making a coherent whole of the disorganized assortment of topics and subject terms that resulted was no simple job: it was not big enough to justify consultants or stand-alone software, yet it still required staff time, forethought, and numerous iterations. For help, those involved in the project consulted the published sources on taxonomies and their construction.
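The survey-and-compare step might look something like the following Python sketch. The sample back file, the editors' list, and the frequency threshold of 2 are all assumptions made for illustration; the article does not say how the comparison was actually carried out.

```python
from collections import Counter

# Hypothetical back file: each article is the list of MeSH headings assigned to it.
articles = [
    ["Hypertension", "Aged", "Hydrochlorothiazide"],
    ["Hypertension", "Myocardial Infarction"],
    ["Colonoscopy", "Colonic Neoplasms"],
    ["Myocardial Infarction", "Aged"],
    ["Hypertension", "Quality of Life"],
]

# Hypothetical editors' list: proposed topic -> MeSH heading expected to represent it.
editor_topics = {
    "Hypertension": "Hypertension",
    "Colon Cancer": "Colonic Neoplasms",
    "Internet in Medicine": "Internet",
}

MIN_COUNT = 2  # arbitrary cutoff for "frequently assigned"

# Count how often each heading was assigned across the back file.
counts = Counter(h for headings in articles for h in headings)
frequent = {h for h, n in counts.items() if n >= MIN_COUNT}

# Headings the indexing uses often but the editors did not list.
frequent_but_unlisted = sorted(frequent - set(editor_topics.values()))

# Topics the editors want whose matching heading is rarely assigned.
listed_but_rare = sorted(t for t, h in editor_topics.items() if counts[h] < MIN_COUNT)

print(frequent_but_unlisted)
print(listed_but_rare)
```

The two resulting lists are the raw material for the editorial iterations: the first suggests topics to add or fold in, the second flags topics that may need a different MeSH mapping.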
Web postings offered much information on cutting-edge applications of existing taxonomies, often relating to such innovative interfaces as topic maps, but relatively little on the principles and mechanics of choosing and organizing appropriate subject terms (the ground-floor principles of building a taxonomy), as opposed to employing an established vocabulary in conjunction with the emerging standards and architectures of the networks. AMA indexers and editors, therefore, went further back to the library literature on such subjects as indexing, vocabulary control, and thesaurus construction. This literature predates but often still underlies today's dynamic applications of information technology. Lancaster's much-cited Vocabulary Control for Information Retrieval [2] was of particular help.
Taxonomies in general are best known as hierarchical arrangements of terms that describe a particular branch of science or field of knowledge. Ideally, terms are selected and arranged to be mutually exclusive, thus creating an ordered universe with a place for everything and everything in its place. Unfortunately, medicine does not lend itself well to such pure rationalism. Many diseases, not falling neatly into one category or another, require multiple postings. The term "Lymphomas," for example, appears in three MeSH hierarchies: under "Neoplasms," "Hemic and Lymphatic Diseases," and "Immunologic Diseases." "Multiple Sclerosis" is listed as an "Immunologic Disease" and then posted twice as a "Neurologic Disease," once as a narrower term under "Autoimmune Diseases of the Nervous System" and again under "Demyelinating Diseases." As a result, the sprawling MeSH schedules tend to resist the compression and simplification sought in indexes intended for the graphic interfaces of the Web. Lancaster's caveat on the trade-offs involved in reconciling the strict logic of a hierarchy with user-friendly display is especially pertinent to the screen-by-screen environment of electronic publishing: "Extremely large hierarchies involving multiple relationships and levels, however, are difficult to display intelligibly in graphic form. Moreover they tend to waste space" [2].
FACETED ALTERNATIVE
Faceted indexing is often employed to create more concise, Web-friendly displays. It is widely used by Web retailers to index offerings such as apparel or appliances with each distinct product line posted once in alphabetical order or in simple classifications and then accessed through a series of secondary attributes, or facets, such as model number, size, price, or color [3]. In such an indexing system, "Multiple Sclerosis" would be posted only once, as opposed to its three postings in MeSH, and then qualified to allow searchers to address that aspect of the disease of greatest concern to them, for example, "Multiple Sclerosis/immunological aspects of," ". . . /neurologic aspects of," and so on. Faceted indexing is offered, in part, through MEDLINE's MeSH browser, where, after selecting a term from a hierarchy, searchers may qualify it by checking off a series of subheadings that allow access to that part of the literature most relevant to their special interests.
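As a rough illustration of the faceted approach, the Python sketch below posts each article once under a heading and narrows retrieval by a qualifier. The `Posting` class, the `search` function, the article identifiers, and the qualifier assignments are all hypothetical; this is not how the MeSH browser is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Posting:
    """One article posted once under a heading, with facet-like qualifiers."""
    article_id: str
    heading: str
    qualifiers: set = field(default_factory=set)


# Hypothetical postings; the qualifier names imitate MeSH-style subheadings.
postings = [
    Posting("a1", "Multiple Sclerosis", {"immunology"}),
    Posting("a2", "Multiple Sclerosis", {"physiopathology", "diagnosis"}),
    Posting("a3", "Multiple Sclerosis", {"immunology", "drug therapy"}),
    Posting("a4", "Parkinson Disease", {"diagnosis"}),
]


def search(postings, heading, qualifier=None):
    """Select by heading, then optionally narrow by a single qualifier."""
    return [
        p.article_id
        for p in postings
        if p.heading == heading and (qualifier is None or qualifier in p.qualifiers)
    ]


print(search(postings, "Multiple Sclerosis"))                # every posting for the disease
print(search(postings, "Multiple Sclerosis", "immunology"))  # the immunologic aspect only
```

The heading appears only once in the display; the distinction between the neurologist's and the immunologist's view is carried entirely by the qualifier, which is why the approach depends on how thoroughly qualifiers have been assigned.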
For the most part, however, while faceted indexing can be successfully applied to such objects as, say, refrigerators (which may be distinguished one from another by height and width, storage capacity, price, and color), it is not entirely well fitted to the complex, interrelated systems and concepts pertaining to organic life forms. Even the qualifiers that may be added to MeSH headings (diagnosis, etiology, treatment, etc.) operate only in relation to the in-depth indexing performed by the National Library of Medicine (NLM) on the millions of articles in its database. The sheer volume and range of data in MEDLINE exhibit a granularity, to use information technology (IT) terminology, that makes a partially faceted retrieval scheme possible. Less voluminous databases generally will not support distinctions such as that between the pathology and physiopathology of a disease as are made on MEDLINE.
IMPLEMENTATION
To meet the needs of both neurologists and immunologists without resorting to facets and without the means to construct elaborate custom-made hierarchies, AMA's simplified taxonomy retained the often arbitrary boundaries of medical specialties by mapping topics only to the most appropriate occurrences of matching or equivalent MeSH terms in the MeSH trees. The topic "Immunologic Diseases," therefore, was mapped to pick up "Multiple Sclerosis" in MeSH schedule C20, while "Neurologic Diseases" was mapped to pick up the same disease in schedule C10 "Diseases, Neurologic" (even though the articles retrieved would be the same). Mapping to the most appropriate occurrences in MeSH trees, however, allowed the capture of the pertinent narrower terms listed beneath the matching MeSH term. Clicking on "Neurological Diseases/Multiple Sclerosis," therefore, picks up articles indexed under the narrower MeSH terms listed beneath "Multiple Sclerosis," such as "Neuromyelitis Optica." Although it is only used selectively and not apparent to the user, the MeSH hierarchy remains the implicit authority behind most entries in the Web taxonomy.
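One way to picture mapping a topic to a single occurrence of a term is by tree-number prefix, as in the Python sketch below. The tree numbers are illustrative placeholders in the MeSH style and have not been checked against the actual MeSH files; the structure of `TOPIC_ANCHOR` and the function `headings_under` are likewise assumptions, not a description of AMA's table.

```python
# Illustrative placeholder tree numbers (MeSH style, NOT verified against MeSH).
MESH_TREE = {
    "C10.100": "Autoimmune Diseases of the Nervous System",
    "C10.100.500": "Multiple Sclerosis",
    "C10.100.500.600": "Neuromyelitis Optica",
    "C10.300": "Demyelinating Diseases",
    "C10.300.500": "Multiple Sclerosis",
    "C20.100.500": "Multiple Sclerosis",
}

# Hypothetical mapping: each Web topic is anchored at ONE chosen occurrence.
TOPIC_ANCHOR = {
    "Neurological Diseases/Multiple Sclerosis": "C10.100.500",
    "Immunologic Diseases/Multiple Sclerosis": "C20.100.500",
}


def headings_under(tree_number, tree=MESH_TREE):
    """Return the anchor heading plus every narrower heading beneath it."""
    prefix = tree_number + "."
    return sorted(
        {name for number, name in tree.items()
         if number == tree_number or number.startswith(prefix)}
    )


for topic, anchor in TOPIC_ANCHOR.items():
    print(topic, "->", headings_under(anchor))
```

Anchoring at a tree number rather than at the bare term name is what lets a topic inherit the narrower terms from one branch while ignoring the term's other postings.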
The final taxonomy was based on fifty-three general topics derived from established specialties such as dermatology and rheumatology, recognized diseases and disease groups such as cardiovascular diseases and infectious diseases, therapies and diagnostic techniques ("Drug Therapy" and "Radiologic Imaging"), and patient groups ("Men," "Women's Health," "Pediatrics," "Geriatrics"), as well as a mixture of miscellaneous topics such as "Internet in Medicine" and "Quality of Life." These topics were arranged in alphabetic order and subdivided, where necessary, along traditional lines or according to journal content. "Cardiovascular System," for example, was subdivided into "Arrhythmias," "Myocardial Infarction," "Congenital Heart Defects," "Congestive Heart Failure," and "Thromboembolism," as well as a series of more widely discussed interventions as follows:
* Cardiovascular System
  * Arrhythmias
  * Cardiovascular Disease/Myocardial Infarction
  * Congenital Heart Defects
  * Congestive Heart Failure/Cardiomyopathy
  * Cardiac Diagnostic Tests
  * Cardiovascular Interventions
    * Revascularization
    * Pacemakers/Defibrillators
    * Thrombolysis
    * Cardiovascular Interventions, Other
  * Venous Thrombosis
  * Cardiovascular System, Other
Subdivisions resulted in 374 topics and subtopics, few of which, in the end, were mutually exclusive; most articles fell into multiple categories. Articles whose principal topic is "Colonoscopy" would be accessed by clicking either on "Gastrointestinal Diseases" or on "Colon Cancer" in the topic index. An article on the effects of hydrochlorothiazide on hypertension and cardiovascular disease among the elderly would be found by clicking on "Hypertension," "Cardiovascular Disease," or "Aging/Geriatrics."
In most cases, mapping the topics to MeSH indexing terms was straightforward, frequently a one-to-one correspondence. However, in some instances, editors viewed their content quite differently than the editors of the MeSH schedules, and, for the sake of using terminology known to readers who work in the field, AMA did not insist on sticking to terms established for indexing purposes. Neurology, for example, was subdivided not only into topics representing major neurological disorders such as "Alzheimer's Disease" and "Parkinson's Disease," but also into interdisciplinary fields such as "Neuroendocrinology" and "Neuroophthalmology," for which mappings were not only to the diagnoses that editors associated with these topics, but also to combinations of the MeSH indexing terms (e.g., "Graves' Disease" AND "Optic Neuropathy").
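A mapping that allows both one-to-one matches and AND-combinations could be represented as below. This Python sketch is an assumption about structure only: the rule format (a list of alternatives, each a set of headings that must all be present) and the function `matches` are hypothetical, and the heading strings simply echo the examples in the text.

```python
# Each topic maps to a list of alternatives (OR); each alternative is a set of
# headings that must ALL be assigned to the article (AND). Hypothetical rules.
TOPIC_RULES = {
    "Parkinson's Disease": [{"Parkinson Disease"}],  # one-to-one mapping
    "Neuroophthalmology": [
        {"Graves' Disease", "Optic Neuropathy"},      # AND-combination
    ],
}


def matches(topic, article_headings, rules=TOPIC_RULES):
    """True if any alternative for the topic is fully present in the article."""
    assigned = set(article_headings)
    return any(required <= assigned for required in rules[topic])


print(matches("Neuroophthalmology", ["Graves' Disease", "Optic Neuropathy", "Adult"]))
print(matches("Neuroophthalmology", ["Graves' Disease"]))
```

Keeping the single-heading case as a one-element set means straightforward and combined mappings go through the same test, so editors' interdisciplinary topics need no special handling.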
CONCLUSION
In the wider world, the development of taxonomies has come to constitute a big ticket item in Web publishing and corporate intranets, with development costs estimated in a recent report at half a million dollars for collections of 500,000 pages or more [4]. In medicine, costs tend to be higher than for other disciplines, usually involving the development of complex rules for applications of the Unified Medical Language System (UMLS), a synthesis of more than 100 biomedical vocabularies that forms the backbone of most automated medical indexing systems [5].
AMA built its taxonomy, on the other hand, to meet immediate needs in the normal workflow of busy editorial and production departments. As a result, much of the fine-tuning required for interpretation of ambiguity in the indexing terms posted for the compilation of print indexes proved impractical. Therefore, not all relevant articles could automatically be tagged for topic collections, including content relating to such high interest topics as bioterrorism and severe acute respiratory syndrome (SARS). Such content, often related to current events or emerging research findings, required hand tagging. Overall, however, a sufficiently unambiguous correspondence existed between existing indexing terms (a subset of MeSH vocabulary) and taxonomy topics to achieve an acceptable degree of relevancy and recall in automatically tagging the more than 14,000 articles published in JAMA and the Archives over the past five years. (To judge relevancy and recall, collections were selectively cross-checked against sets retrieved from PubMed using corresponding search terms or combinations of search terms for AMA titles only.)
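The cross-check against PubMed can be expressed as a simple set comparison. The Python sketch below assumes that "relevancy" corresponds to what is usually called precision and that the PubMed result set serves as the reference; both are interpretations, and the article identifiers are invented for illustration.

```python
def precision_recall(collection, reference):
    """Compare an auto-tagged collection with a reference set of article IDs.

    precision: share of the collection that also appears in the reference set
    recall:    share of the reference set that the collection captured
    """
    collection, reference = set(collection), set(reference)
    hits = collection & reference
    precision = len(hits) / len(collection) if collection else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical IDs: articles tagged for one topic vs. a PubMed search
# restricted to the same journals with the corresponding search terms.
auto_tagged = {"a1", "a2", "a3", "a4"}
pubmed_set = {"a2", "a3", "a4", "a5", "a6"}

precision, recall = precision_recall(auto_tagged, pubmed_set)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Articles in the reference set but missing from the collection (here `a5` and `a6`) are the candidates for hand tagging or for a revised mapping.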
Would we recommend AMA's low-cost, high-yield approach to others? Yes and no. The success of the project largely depended on the controlled MeSH vocabulary already embedded in the Standard Generalized Markup Language (SGML) of AMA's journals. This advantage allowed editors, indexers, and programmers to get a handle on content without resorting to the relatively expensive and time-consuming use of tools such as UMLS to analyze keywords in titles and abstracts. The volume of data, the necessarily large number of topics in the all-medicine taxonomy, and the short time frame from conception to implementation, on the other hand, still made the task a challenging one. Many decision makers might prudently have opted out of undertaking the project in house (with its attendant, not always safe, assumption of a sufficient supply of time, imagination, and cooperation among staff in different departments) in favor of sending the work out for pricey contractual delivery or, more likely given the costs of handling a medical vocabulary, have squashed the initiative entirely. Finally, however, a balance of cost constraints and the needs of clinician readers to access content on their own terms argued in favor of the pragmatic approach taken. Determining whether or not the collections accessible through the taxonomy have met their goals will require both asking selected readers about the usefulness of this feature and assessing Web traffic to the page. The taxonomy may be viewed at http://pubs.ama-assn.org/collections/.
REFERENCES
1. THIRION B, DARMONI SJ. Simplified access to MeSH tree structures on CISMeF. Bull Med Libr Assoc 1999 Oct;87(4):480-1.
2. LANCASTER FW. Vocabulary control for information retrieval. 2nd ed. Arlington, VA: Information Resources Press, 1986.
3. MURRAY P. Faceted classification of information. [Web document]. The Knowledge Management Connection. [cited 27 Aug 2004]. <http://www.kmconnection.com/DOC100100.htm>.
4. SYKES J. The value of indexing. [Web document]. Factiva/Dow Jones, 2001 Feb. [cited 27 Aug 2004]. <http://www.factiva.com/collateral/files/whitepaper_valueofindexing_042001.pdf>.
5. BERRIOS DC, CUCINA RJ, FAGAN LM. Methods for semi-automated indexing for high precision information retrieval. J Am Med Inform Assoc 2002 Nov-Dec;9(6):637-52.
Received January 2004; accepted May 2004
By Bruce McGregor
Division of Publishing Operations
American Medical Association
515 North State
Chicago, Illinois 60610
Copyright Medical Library Association Jan 2005
Abstract
McGregor discusses how to construct a taxonomy of medicine that is concise enough to mount on a single Web page and comprehensive enough to describe the contents of a set of leading journals whose coverage ranges from public health to molecular biology. Creating and implementing the taxonomy was a twofold process: first, selecting and arranging terms and, then, constructing a lookup table mapping each of the selected terms to corresponding terms in the MeSH vocabulary.