Introduction
Peer-reviewed publications constitute the body of knowledge upon which biomedical research is based. This resource is essential for the generation of novel hypotheses and the design of studies, experiments, and trials and is thus key to the discovery process itself. However, the volume of literature on any given research topic has grown dramatically in recent years, making it increasingly challenging for a single person to manually survey the relevant literature in its entirety, and consequently to acquire the foundational knowledge needed for discovery research. Hence, developing a solid foundation of skills for literature retrieval and profiling is of critical importance for early career biomedical scientists. These competencies will, for instance, be needed: 1) to acquire a sound knowledge base by developing the ability to compile and summarize large volumes of literature; 2) to develop data interpretation skills and become able to assess the novelty and potential impact of a given finding; and 3) to develop scientific writing skills and become able to write background material on specific topics.
Publicly available omics data provides ideal material for training new generations of biomedical researchers. 1– 3 One of the “collective omics data” training modules that we have developed follows a reductionist analysis and interpretation workflow using publicly available transcriptome data as the source material. 4 The first activities that form part of this training module involve literature retrieval and profiling for a given candidate gene.
In this article, we describe a detailed stepwise approach that we have developed for running literature profiling training workshops. Literature profiling here means information extraction from titles and determination of keyword frequencies in titles and abstracts. The training program that we have devised takes advantage of publicly available omics data. 4 The training module presented here focuses on retrieval of gene-centric literature. Supporting material, such as sets of slides, templates, and a handout, is also provided along with an illustrative use case. Notably, although the training activity presented here focuses on gene-centric literature retrieval and profiling, the skills and approaches can be adapted to other applications (e.g., focusing on disease-centric, or pathway/process-centric literature).
Methods
The hands-on training exercise described here is suitable for implementation as part of a broader course on research methodology, or as a stand-alone workshop. The training is appropriate for undergraduate, graduate and post-doctoral trainees and no prior bioinformatics experience is required in order to participate. The time commitment depends on the level of experience of the attendees, the overall organization of the workflow and the volume of literature associated with the candidate gene(s) selected. For instance, for a gene with about 1,000 associated articles, five participants could work together on this same gene, each focusing on a different theme. Generally, the training could be covered in one introductory session (40 min), and three two-hour hands-on sessions. The format and content of these sessions are described in more detail below. These would cover:
• Introductory session
• Step 1 – Retrieving the relevant literature
• Step 2 – Extracting concepts
• Step 3 – Generating literature profiles
• Step 4 – Developing interactive data visualizations
• Step 5 – Writing a narrative
In addition to covering workshop pre-requisites and time commitment, announcements for such workshops can also list the sets of skills trainees are expected to develop including:
• Literature retrieval (development of advanced PubMed queries).
• Literature profiling (information extraction, determination of keyword frequencies).
• Data visualization (structuring and presenting information in an interactive format).
• Extraction of biomedical information (capturing information in a structured format).
Trainees can be asked to perform tasks between each of the sessions. Alternative formats are possible depending on the needs of the participants and a range of biomedical themes (e.g., diseases, pathways, or cell types) could be selected as the focus of the literature workshop. The endpoint for such workshops may include interactive literature profiling representations generated as part of the hands-on training activities and/or a peer-reviewed publication that, for instance, could build on such a resource.
Introductory session
The introductory session is designed to provide participants with an understanding of basic concepts and present an outline of the training curriculum. In addition, this session presents the overall rationale and teaching objectives of the training program. The introduction also defines the endpoint of the workshop activity as the development of a web resource that permits the visualization and exploration of structured literature profiles for a gene of interest (described in detail below, taking
Step 1: Retrieving the relevant literature
As a first step, all the literature that is relevant to a given gene of interest must be identified. This forms the body of literature that will be subjected to literature profiling in subsequent steps. Most researchers will already be familiar with PubMed, the search engine hosted by the US National Center for Biotechnology Information (NCBI) ( https://pubmed.ncbi.nlm.nih.gov/). While this tool is straightforward to use, developing queries that will permit the comprehensive retrieval of literature associated with a given theme can be more complicated.
It is important to design a PubMed query that will permit the retrieval of all the literature that is available for the gene of interest. Challenges when retrieving literature associated with a given gene include capturing all aliases and variations that may exist in addition to the official gene names and symbols. For instance, 14 aliases were used to develop a query for the gene officially known as “
The endpoint of this step is the development of an optimized PubMed query for a given gene. Any criterion can be used for the selection of a candidate gene to be assigned to a trainee or group of trainees. However, a significant body of literature should be associated with the gene in question. For the purposes of illustration in this article, we have selected
Practical activities for this step:
The official gene name, symbol and all aliases are retrieved from the GeneCard website (e.g., for
PubMed queries are built using the official gene name, official symbol, and all aliases for the gene of interest as search terms, combined with the appropriate Boolean operators (AND, OR, NOT), field restriction tags, and suitable syntax. For instance, in a query using multiple search terms, the Boolean operator OR must be capitalized and placed between each pair of keywords. The field restriction tag [tw] can be used after each term to search “text words”, which include, for instance, the title, abstract, MeSH terms and subheadings, publication types, and substance names. Alternatively, the more restrictive tag [tiab] can be used to limit the search to titles and abstracts only. A query using [tw] for multiple search terms could, for example, be written as follows: ISG15 [tw] OR IFI-15K [tw] OR IFI15 [tw]. In addition, the field restriction tag [pt] can be used to restrict the search to, or exclude from it, a particular publication type (e.g., NOT review [pt] would exclude review articles from the search results). Quotation marks are employed when compound words appear as an exact phrase in the search (e.g., “
https://pubmed.ncbi.nlm.nih.gov/help/
https://learn.nlm.nih.gov/documentation/training-packets/T0042010P/
The PubMed query is run, and quality checks are performed on the publications that are returned. The aim is to identify search terms that may be too permissive and return false-positive results. For instance, short acronyms tend to be more problematic, as they may coincide with words or abbreviations that are otherwise in common use (e.g., CAMEL, WARS). If necessary, the query is optimized by addressing problematic terms. Removing the ambiguous or problematic term altogether may be a solution, but this could also lead to false-negative results (missing literature that is actually relevant to the gene of interest). As a compromise, the search term could be retained but optimized by adding a keyword that would restrict the search (e.g., provided below for HUCRP and UCRP). One may also find in some instances that the list of aliases known for a given gene is incomplete and needs to be amended (e.g., below).
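The query construction described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' tooling: the function names are hypothetical, and the context keyword used to restrict the ambiguous alias (`interferon`) is an assumption chosen for illustration only, not the restriction used in the article.

```python
def format_term(term, tag="tw"):
    """Append a field tag; quote multi-word terms so they are searched as a phrase."""
    if " " in term:
        term = f'"{term}"'
    return f"{term} [{tag}]"


def build_gene_query(terms, tag="tw", restrictions=None):
    """Join gene name, symbol and aliases with a capitalized OR.

    `restrictions` maps an ambiguous alias to context keywords; the alias is
    then only matched when at least one context keyword is also present.
    """
    restrictions = restrictions or {}
    clauses = []
    for term in terms:
        clause = format_term(term, tag)
        if term in restrictions:
            context = " OR ".join(format_term(c, tag) for c in restrictions[term])
            clause = f"({clause} AND ({context}))"
        clauses.append(clause)
    return " OR ".join(clauses)


# Aliases taken from the example in the text.
print(build_gene_query(["ISG15", "IFI-15K", "IFI15"]))
# ISG15 [tw] OR IFI-15K [tw] OR IFI15 [tw]

# Hypothetical restriction of an ambiguous alias (keyword is an assumption).
print(build_gene_query(["ISG15", "UCRP"], restrictions={"UCRP": ["interferon"]}))
# ISG15 [tw] OR (UCRP [tw] AND (interferon [tw]))
```

In a workshop, trainees would still inspect the returned articles by hand; the sketch only removes the repetitive typing and the risk of a missing OR or tag.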
Step 2: Extracting concepts
A large body of literature can be associated with a given gene. It may be useful then to employ a systematic “cataloguing” or indexing approach. For this step, workshop participants may first define themes under which concepts, and keywords associated with these concepts, will be categorized (e.g., themes could be ‘Human diseases and pathogens’, ‘Tissues’, ‘Cell types’, or ‘Cellular processes’). Participants would in turn scan titles of articles associated with
Notably, this process could be repeated for other themes (or different themes could be assigned to each of the participants). If the literature is extensive and time is limited, it would also be possible to divide the literature into subsets (e.g., by batches of 100 articles, with participants assigned different batches to work on for the same theme). If the literature is sparse, all available articles for the gene may be used rather than just focusing on those articles including search terms in titles (i.e., using [tw], instead of [ti] in Step 2a). However, it may be generally preferable for a workshop to be based on selected genes with a relatively abundant literature (e.g., >100 articles returned when restricting the search to titles).
Practical activities for this step:
a) A subset of the literature is retrieved, restricting the search to titles (using the optimized query from Step 1, and substituting the field restricting argument [ti] for [tw]).
b) A given theme is assigned to, or selected by, participants (e.g., ‘Human diseases and pathogens’, ‘Tissues’, ‘Cell types’, ‘Biomolecules’, ‘Pathways’, ‘Biological processes’).
c) Concepts relevant to the theme in question are identified in the titles of articles retrieved by the query designed in a) (e.g., liver cancer, HIV).
d) The concepts and associated keywords (e.g., ‘hepatocellular carcinoma’, ‘liver carcinoma’, ‘hepatic cancer’, for the concept ‘liver cancer’ or ‘virus’, ‘viral’, ‘human immunodeficiency virus’ for the concept HIV) are recorded in a spreadsheet (example and templates are available in Extended data File 2). 6
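As an illustration of the activities for Step 2, the sketch below switches the field tag of an existing query from [tw] to [ti] and stores concepts and keywords in a structure that can be flattened into spreadsheet rows. The function name, the dictionary layout and the row format are assumptions made for this example; the themes, concepts and keywords are the ones quoted in the text.

```python
import re


def swap_field_tag(query, old="tw", new="ti"):
    """Replace every [old] field tag in a PubMed query with [new]."""
    return re.sub(rf"\[\s*{re.escape(old)}\s*\]", f"[{new}]", query)


gene_query = "ISG15 [tw] OR IFI-15K [tw] OR IFI15 [tw]"
title_query = swap_field_tag(gene_query)
print(title_query)
# ISG15 [ti] OR IFI-15K [ti] OR IFI15 [ti]

# theme -> concept -> keywords, as recorded by participants.
concepts = {
    "Human diseases and pathogens": {
        "liver cancer": ["hepatocellular carcinoma", "liver carcinoma", "hepatic cancer"],
        "HIV": ["virus", "viral", "human immunodeficiency virus"],
    }
}

# Flatten into (theme, concept, keyword) rows, one per spreadsheet line.
rows = [
    (theme, concept, keyword)
    for theme, by_concept in concepts.items()
    for concept, keywords in by_concept.items()
    for keyword in keywords
]
for row in rows:
    print(row)
```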
Step 3: Generating literature profiles
Determining the relative prevalence of concepts among the literature associated with a given gene can be useful. This is the case when assessing the novelty of a finding, for instance (e.g., change in transcript or protein abundance associated with pathogenesis). Moreover, such an exercise, and the information derived from it, would also be useful when writing a general background or summary about the gene for a report or a manuscript.
With the two previous steps completed, determining the prevalence of concepts in the literature associated with a given gene can be achieved quite simply. For this, the literature query developed in Step 1 is modified to narrow the search and retrieve literature associated with the concepts identified in Step 2. The endpoint for this activity is a table showing the frequency of articles for these concepts in the literature associated with a given gene (e.g., Extended data File 2 6 [Excel File: Literature Profiling Tab]).
Notes: Participants may also be encouraged to explore different types of visual representations for this type of data. Treemaps or word clouds can for instance show relative prevalence, while other types of graphs, such as 2D bubble graphs, may also show article frequencies (e.g., Figure 1). Another exercise could involve visualization of changes in the abundance of the literature associated with the selected concepts over the years. For this purpose, queries can be amended to add a range of publication dates with the field restriction tag [dp] (e.g., adding AND 2000:2010 [dp] to the query).
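The publication date restriction mentioned in the notes can be generated programmatically when many time windows are needed. The sketch below is one possible way to do this; the helper names and the choice of fixed-width windows are assumptions, while the `start:end [dp]` notation follows the example given in the text.

```python
def add_date_range(query, start_year, end_year):
    """Restrict a query to a range of publication years using the [dp] tag."""
    return f"({query}) AND {start_year}:{end_year} [dp]"


def year_windows(first_year, last_year, step):
    """Split a period into consecutive windows of `step` years (last one may be shorter)."""
    windows = []
    year = first_year
    while year <= last_year:
        windows.append((year, min(year + step - 1, last_year)))
        year += step
    return windows


gene_query = "ISG15 [tw] OR IFI-15K [tw] OR IFI15 [tw]"
print(add_date_range(gene_query, 2000, 2010))
# (ISG15 [tw] OR IFI-15K [tw] OR IFI15 [tw]) AND 2000:2010 [dp]

for start, end in year_windows(2000, 2010, 5):
    print(add_date_range(gene_query, start, end))
```

Wrapping the original query in parentheses keeps the date restriction from binding only to the last alias.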
Figure 1.
Visualizing
Treemap representation of the relative prevalence of concepts associated with the “human diseases and pathogens” theme among the ISG15 literature.
Practical activities for this step:
a) The literature query from Step 1 is employed (using the field search restriction [tw]).
b) The Boolean operator AND is added, followed by search terms corresponding to keywords related to one of the concepts identified in Step 2 (e.g., “Liver cancer”, “Liver carcinoma”, “Hepatic carcinoma”).
c) Quotation marks, field search restriction and the Boolean OR are added, so that the notation would read as follows: … AND (“Liver cancer” [tw] OR “Liver carcinoma” [tw] OR “Hepatic carcinoma” [tw]).
d) The query thus constructed is run, and the number of articles retrieved recorded (for instance in a spreadsheet: Extended data File 2). 6
e) Steps b through d are repeated for the rest of the concepts identified in Step 2.
f) Visual representations of the literature profiles are generated.
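For instructors who wish to show how these activities could be scripted, the sketch below assembles a concept-specific query and builds a request for the article count. It assumes the NCBI E-utilities ESearch endpoint, which is not part of the workshop described here (participants run queries in the PubMed web interface); the function names are hypothetical, and `fetch_count` requires network access and is therefore not called in the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: counts are obtained through NCBI E-utilities rather than the web page.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def concept_query(gene_query, keywords, tag="tw"):
    """Append AND (...) with quoted, tagged keywords joined by OR."""
    keyword_part = " OR ".join(f'"{kw}" [{tag}]' for kw in keywords)
    return f"({gene_query}) AND ({keyword_part})"


def esearch_count_url(query):
    """Build an ESearch request that returns only the number of matching records."""
    params = {"db": "pubmed", "term": query, "rettype": "count", "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def fetch_count(query):
    """Run the request and return the article count (needs network access)."""
    with urlopen(esearch_count_url(query)) as response:
        payload = json.load(response)
    return int(payload["esearchresult"]["count"])


gene_query = "ISG15 [tw] OR IFI-15K [tw] OR IFI15 [tw]"
keywords = ["Liver cancer", "Liver carcinoma", "Hepatic carcinoma"]
query = concept_query(gene_query, keywords)
print(query)
print(esearch_count_url(query))
```

Looping `fetch_count` over all concepts would yield the frequency table for the theme; NCBI's usage guidelines on request rates should be respected when doing so.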
Step 4: Developing interactive data visualizations
Valuable insights can be gained from visual representations of information, especially when the information underlying these representations can be accessed interactively by the end user.
To produce such interactive visual representations in the context of training workshops, we recommend using the Prezi web application ( https://prezi.com). This tool has been developed to create presentations in which it is possible to zoom in and out between levels of information. This gives users the opportunity to visualize the prevalence of concepts while at the same time allowing them to “drill down” into each individual concept in order to access relevant underlying information. The endpoint for this activity is the production of a “hierarchical circle packing chart” representing the relative abundance of concepts in the literature for a given gene and theme ( Figure 2A, https://prezi.com/view/zCedrcYaAEUAON1VeEUi). Such a resource gives users access to underlying reference information for each of the concepts ( Figure 2B) and can be made available publicly.
Figure 2.
At the highest level, the representation permits visualization of the relative abundance of concepts associated with the “human diseases and pathogens” theme for the
For the practical activity designed for this step, participants will need to register with Prezi and create an account to access and edit the interactive presentations. They can do so free of charge by selecting the basic account at: https://prezi.com/pricing/basic/. Ideally, this is completed ahead of the training session. If instructors have access to a paid subscription, they can create an unlimited number of presentations and invite individual participants to collaborate with editing rights. Otherwise, the participants can create their own presentation and invite the instructors as collaborators. They would also need to copy and paste material from a master template in their own Prezi account. Thus, this alternative solution would be workable, but probably not ideal.
Practical activities for this step:
Participants are given access to a Prezi presentation that contains a template and illustrative example (e.g. https://prezi.com/view/u65ZqHn9ZBJx1VKHqfZA/). As indicated above, one such presentation could be created and made available for each participant. It could serve as their own “Sandbox”, to familiarize themselves with the application and to use as a starting point to develop an interactive resource. Multiple users can also work simultaneously on the same shared presentation. It would thus also be possible to create one presentation for a group of participants to work on cooperatively.
Starting with the template and following the illustrative example, participants can build a circle packing chart for concepts identified in Step 3. The size and color of the circles are determined by the frequency of articles in the gene-associated literature related to a given concept (as shown in Figure 2A). The circles can be arranged manually to create a visually attractive representation.
Underlying information is then added to each of the circles and accessed by zooming in to different levels (Figure 2B). This information could include, for instance:
• The gene symbol and concept;
• The query link;
• Number of articles retrieved on a specified date;
• Result link to those articles;
• Screen capture and links pointing to articles relevant to a specific topic (see Step 5 below).
A screen recording demonstrating how to add underlying information to a Prezi circle packing chart can be accessed via: https://soapbox.wistia.com/videos/2tmC1VeQyr
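The information attached to each circle can be prepared ahead of the Prezi session. The sketch below builds a PubMed result link from a query and computes a relative circle radius from an article count. Both helpers are hypothetical, the article counts are placeholder values rather than real results, and scaling the circle area (rather than the radius) with the count is an assumption, since the text only states that size reflects article frequency.

```python
import math
from urllib.parse import urlencode

PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def pubmed_link(query):
    """Return a link that opens the PubMed results page for a query."""
    return f"{PUBMED_URL}?{urlencode({'term': query})}"


def relative_radius(count, max_count):
    """Radius relative to the largest circle, with area proportional to count."""
    return math.sqrt(count / max_count)


def circle_record(gene, concept, query, n_articles, retrieved_on):
    """Collect the items listed above for one circle of the chart."""
    return {
        "gene": gene,
        "concept": concept,
        "query": query,
        "n_articles": n_articles,
        "retrieved_on": retrieved_on,
        "result_link": pubmed_link(query),
    }


# Placeholder counts and date, for illustration only.
record = circle_record("ISG15", "liver cancer", 'ISG15 [tw] AND "Liver cancer" [tw]', 25, "YYYY-MM-DD")
print(record["result_link"])
print(relative_radius(25, 100))
# 0.5
```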
Step 5: Writing a narrative
Writing material for a report or manuscript can be one of the motivations for profiling the literature associated with a given topic. To become successful in their future academic endeavors early career scientists need to become proficient independent writers. The last hands-on activity of the workshop consists in capturing key information from published work and using it as a basis for writing up a narrative about the gene of interest.
A specific topic would be selected as the focus of the activity to limit the amount of literature that would need to be covered. One such topic could, for example, be the relevance of a gene (in this case
Practical activities for this step:
Concepts most prevalent in the gene-associated literature for the theme of interest are selected (e.g., concept = Hepatitis C infection, within the theme = “Human diseases and pathogens”). The query developed to retrieve literature for a given concept is modified to restrict the search in order to retrieve articles relevant to a given topic (e.g., relevance of the gene as a biomarker). The Boolean “AND” is appended at the end of the query along with relevant keywords in parentheses separated by the Boolean “OR” (e.g., “AND (biomarker OR biomarkers OR diagnostic OR diagnosis OR prognostic OR prognosis)”). Articles are reviewed and those deemed relevant to the topic are selected (e.g., those in which changes in abundance of the gene product in clinical specimens are reported). Relevant information is captured from the abstract and/or full text and recorded in a spreadsheet (e.g., analyte name, species, data generation methods, comparator groups, etc.). In some cases, the full text of articles may not be accessible (e.g., paywall), and the abstract contains only incomplete information. This can lead to the absence of important information in the “capture” spreadsheet, rendering the findings unusable. Participants can then be reminded of the importance of using best practices when reporting findings (e.g., mentioning the comparator group, the species, and the specific factor being measured, whether protein or RNA). It may also be an opportunity to discuss the merits of publication in open access journals. The information captured in the spreadsheet will serve as a synopsis that participants will be able to rely upon for developing a written narrative, which could be used in an introduction or review article.
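The topic restriction and the capture spreadsheet can be sketched as follows. The topic keywords and the capture fields are those listed in the text; the function names, the added `PMID` column and the CSV layout are assumptions made for illustration and do not reproduce the template provided as extended data.

```python
import csv
import io

TOPIC_KEYWORDS = ["biomarker", "biomarkers", "diagnostic", "diagnosis", "prognostic", "prognosis"]

# "PMID" is an assumed identifier column; the other fields come from the text.
CAPTURE_COLUMNS = ["PMID", "analyte name", "species", "data generation methods", "comparator groups"]


def restrict_to_topic(concept_query, keywords=TOPIC_KEYWORDS):
    """Append AND (...) with topic keywords joined by OR."""
    return f"({concept_query}) AND ({' OR '.join(keywords)})"


def capture_template(columns=CAPTURE_COLUMNS):
    """Return an empty capture sheet (header row only) as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(columns)
    return buffer.getvalue()


print(restrict_to_topic('ISG15 [tw] AND "Hepatitis C" [tw]'))
# (ISG15 [tw] AND "Hepatitis C" [tw]) AND (biomarker OR biomarkers OR diagnostic OR diagnosis OR prognostic OR prognosis)
print(capture_template())
```

Leaving a cell empty when neither the abstract nor the accessible full text reports a field makes the gaps discussed above visible in the sheet.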
Implementation
As proof of principle, a workshop was implemented using the step-by-step guide and supporting information provided with this paper. It was led by FAA and AKM, who participated in the development of the training curriculum, but had no prior experience running such training activities. The workshop took place on January 26, 2021. In total, 29 graduate students from Hamad Bin Khalifa University took part. It was offered as one three-hour class as part of an “Introduction to Data Science” course. Due to Covid-19 restrictions the workshop was run remotely using Webex. As no information was collected from workshop participants (i.e., no surveys, or questionnaires), the activity was not considered to constitute human research and therefore no ethical approval was required. The generic introductory presentation in Extended data File 1 5 was adapted to provide more specific context, and notably explain how,
Conclusions
Effectively harnessing biomedical literature is one of the most fundamental skills required by biomedical researchers. Thus, early career researchers need to develop the ability to read the scientific literature at different levels: from “scanning” a large number of articles to reading the full text for a more in-depth understanding. Given the current rate at which articles are published, developing more systematic approaches to literature profiling based on defined principles may prove especially beneficial to early-career biomedical researchers.
Here, we present a training workflow as well as supporting material that may be re-used/adapted for the organization of ‘Gene literature retrieval, profiling and visualization’ training workshops. Hands-on activities range from literature retrieval and optimization of PubMed queries, to the development of interactive resources and authoring original material focusing on a specific topic.
A number of the steps described could easily be automated, for instance the concatenation of official gene names and aliases for retrieving gene-specific literature using PubMed (Step 1). Indeed, such tools are under development by our group and will be made available in the future. These tools would not only save time, but also minimize user error. Nevertheless, some of the steps, such as optimization of search queries or extraction of information from abstracts, will always require some critical evaluation and decision making that cannot be automated. Refraining from the use of such automated tools in a training workshop may also be a deliberate choice on the part of the instructor if the emphasis is initially on the development of competency in completing the steps involved throughout the process.
Here, we focused on gene-associated literature, simply because the training curriculum that is currently under development is based on the reuse of publicly available gene expression data. A similar approach could be employed to profile the literature associated with any given disease, pathway, molecular process, or drug, for instance.
We aim to further develop an illustrative case that would lead to the publication of a review of the
Data availability
Underlying data
No newly generated data are associated with this article. Information was retrieved via the literature search engine PubMed ( https://pubmed.ncbi.nlm.nih.gov/).
Extended data
This project contains the following extended data:
Figshare: Literature profiling workshop, introduction session slides (S File 1), https://doi.org/10.6084/m9.figshare.13669070.v2. 5
Figshare: Literature profiling workshop: steps 1-3 (S File 2), https://doi.org/10.6084/m9.figshare.14160329.v3. 6
Figshare: Literature Profiling Workshop: Step 5 (S File 3), https://doi.org/10.6084/m9.figshare.14161484.v1. 7
Figshare: Literature profiling workshop: HBKU handout (S File 4), https://doi.org/10.6084/m9.figshare.14166395.v1. 22
Figshare: Literature profiling workshop: HBKU intro presentation (S File 5), https://doi.org/10.6084/m9.figshare.14166500.v1. 23
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC BY 4.0).
Copyright: © 2023 Al Ali F et al. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Early-career researchers must acquire the skills necessary to effectively search and extract information from biomedical literature. This ability is, for instance, crucial for evaluating the novelty of experimental results and assessing potential publishing opportunities. Given the rapidly growing volume of publications in the field of biomedical research, new systematic approaches need to be devised and adopted for the retrieval and curation of literature relevant to a specific theme. In this context, we present a hands-on training curriculum aimed at retrieval, profiling, and visualization of literature associated with a given topic. The curriculum was implemented in a workshop in January 2021. Here we provide supporting material and step-by-step implementation guidelines with the ISG15 gene literature serving as an illustrative use case. Workshop participants can learn several skills, including: 1) building and troubleshooting PubMed queries in order to retrieve the literature associated with a gene of interest; 2) identifying key concepts relevant to given themes (such as cell types, diseases, and biological processes); 3) measuring the prevalence of these concepts in the gene literature; 4) extracting key information from relevant articles; and 5) developing a background section or summary on the basis of this information. Finally, trainees can learn to consolidate the structured information captured through this process for presentation via an interactive web application.




