ABSTRACT
The explosion of scientific literature has made the efficient and accurate extraction of structured data a critical component for advancing scientific knowledge and supporting evidence‐based decision‐making. However, existing tools often struggle to extract and structure multimodal, varied, and inconsistent information across documents into standardized formats. We introduce SciDaSynth, a novel interactive system powered by large language models that automatically generates structured data tables according to users' queries by integrating information from diverse sources, including text, tables, and figures. Furthermore, SciDaSynth supports efficient table data validation and refinement, featuring multi‐faceted visual summaries and semantic grouping capabilities to resolve cross‐document data inconsistencies. A within‐subjects study with nutrition and NLP researchers demonstrates SciDaSynth's effectiveness in producing high‐quality structured data more efficiently than baseline methods. We discuss design implications for human–AI collaborative systems supporting data extraction tasks.
Introduction
The rapid advancement of scientific research has led to unprecedented growth in research literature across various disciplines. Extracting and synthesizing structured knowledge from this vast information landscape has become increasingly crucial for advancing scientific understanding and informing evidence-based decision-making. Within this process, data extraction—the identification and structuring of relevant information from scientific literature—is a critical stage where efficiency and precision are paramount (Taylor et al. 2021), particularly in time-sensitive domains. A relevant example is the early COVID-19 pandemic, where researchers urgently needed to determine the safety of breastfeeding for women with COVID-19 (World Health Organization 2020). This required the rapid and accurate extraction of data on experimental conditions (e.g., population demographics, study settings) and health outcomes from a rapidly expanding body of literature.
The structured data resulting from this process, often organized into tables, is essential for systematic comparison across studies, quantitative meta-analyses, and drawing comprehensive conclusions from diverse sources of evidence. Such data is crucial for organizations like the World Health Organization (WHO) in developing and disseminating timely, evidence-based guidelines (World Health Organization 2014).
Despite its importance, data extraction remains a cognitively demanding and time-consuming task. Researchers often need to manually distill relevant information from multiple papers, switching between different documents and data entry tools. This process is not only inefficient but also prone to inconsistencies and errors, highlighting a critical need to streamline the data extraction process. Addressing this need presents several challenges: (1) Multimodal information in literature. Scientific papers often contain diverse modalities of information, such as text, tables, and figures. The multimodality adds complexity to identifying the relevant information within each modality scattered throughout a paper and integrating it into a coherent and structured format. (2) Variation and inconsistencies across literature. The style, structure, and presentation of the papers can significantly vary from one to another. The variation and inconsistencies make it difficult to standardize the information. For example, the same concepts may be described using different terminologies or measurement units. (3) Flexibility and domain adaptation. Users may have varying research questions for a collection of papers, and these papers can span across different domains. Therefore, the system must be flexible enough to adapt to the diverse data needs of different users and domains.
Existing approaches to address these challenges have shown promise but face several key limitations. Tools and methods (Lo et al. 2023; Beltagy et al. 2019; Kang, Sun, et al. 2023; Nye et al. 2020; Lehman et al. 2019) focusing on extracting keywords, tables, and figures from documents help narrow the scope of relevant information but lack the flexibility to accommodate diverse extraction needs. Users still need to manually explore, select, and integrate relevant data. Question-answering systems (Open AI 2024; Anthropic 2024) have improved flexibility by allowing users to formulate information needs as queries about document content. However, these systems often produce unstructured text outputs, requiring significant effort to organize into desired structures. While some systems (Elicit 2023) present information in tabular formats, they fall short in standardizing the data and resolving cross-document inconsistencies.
To address these limitations, we present SciDaSynth, an interactive system designed to empower researchers to efficiently and reliably extract and structure data from the scientific literature, especially after completing a paper search and screening. By leveraging large language models (LLMs) within a retrieval-augmented generation framework (RAG) (Lewis, Perez, et al. 2020), the system interprets users' queries, extracts relevant information from diverse modalities in scientific documents, and generates structured tabular output. Unlike standard prompting, which relies solely on a model's pretrained knowledge (and can be outdated due to the LLM's training cutoff), RAG dynamically retrieves and integrates up-to-date, domain-specific information into prompts. By injecting the retrieved information into the generation process, RAG reduces hallucinations and improves factual accuracy (Ji et al. 2023). To further ensure data accuracy, the system incorporates a user interface that establishes and maintains connections between the extracted data and the original literature sources, enabling users to iteratively validate, correct, and refine the data. Additionally, SciDaSynth offers multi-faceted visual summaries of data dimensions and subsets, highlighting variations and inconsistencies across both qualitative and quantitative data. The system also supports flexible data grouping based on semantics and quantitative values, enabling users to standardize data by manipulating these groups and performing data coding or editing at the group level. In addition, follow-up query instructions can be applied to specific data groups for further refinement. We conduct a within-subject study with researchers from nutrition and natural language processing (NLP) domains to evaluate the efficiency and accuracy of SciDaSynth for data extraction from research literature. The quantitative analyses show that using SciDaSynth, participants could produce high-quality data in a much shorter time than the baseline methods. Moreover, we discuss user-perceived benefits and limitations.
In summary, our major contributions are:
SciDaSynth, an interactive system that integrates LLMs to assist researchers in extracting and structuring multimodal scientific data from the extensive literature. The system combines flexible data queries, multi-faceted visual summaries, and semantic grouping in a cohesive workflow, enabling efficient cross-document data validation, inconsistency resolution, and refinement.
The quantitative and qualitative results of our user study reveal the effectiveness and usability of SciDaSynth for data extraction from the scientific literature.
Implications for future system designs of human–AI interaction for data extraction and structuring.
Related Work
Structured Information Extraction From Literature
The exponential growth of scientific papers has generated large-scale data resources for building LLMs and applications for information extraction tasks, such as named entity recognition and relation extraction in scientific domains. These models fall into two broad categories: encoder-only (non-generative) LLMs and generative (autoregressive) LLMs. Encoder-only models (often called “autoencoding” models) are trained to produce compact vector representations of input text, which are useful for downstream classification tasks (Lewis, Ott, et al. 2020). For example, SciBERT (Beltagy et al. 2019) adapts the BERT architecture by pre-training on millions of scientific abstracts and full-text papers; it excels at classification, entity recognition, and retrieval but is not designed to generate new text. Variants like OpticalBERT and OpticalTable-SQA (Zhao, Huang, et al. 2023) fine-tune such models on domain-specific corpora (e.g., biomedical or materials science) to boost performance on specialized extraction tasks. Generative, or autoregressive, LLMs predict the next word in a sequence, enabling them to create fluent text and even structured outputs directly from user prompts. Models such as GPT-4 (OpenAI 2024) (and earlier InstructGPT) have hundreds of billions of parameters and have been further refined with techniques like instruction tuning and reinforcement learning from human feedback. This training paradigm enables zero-shot or few-shot prompting: users can describe an extraction task in natural language and receive structured results (e.g., JSON or CSV) without any additional fine-tuning. In materials science, Dagdelen et al. (2024) demonstrated how GPT-4 could extract entities and relations and output them as JSON records with high fidelity. In this paper, we chose GPT-4 series generative models over open-weight alternatives (e.g., Llama 3/4) for three key reasons: they are more reliable and accurate in following users' instructions (e.g., producing structured output); they handle complex, domain-specific queries better; and, unlike open-weight models that usually need data-intensive fine-tuning for domain adaptation, commercial LLMs like GPT-4 work well out-of-the-box in low-resource settings across diverse domains.
Data stored in scientific literature is another particular focus for extraction. Such data usually resides in the tables and figures of paper PDFs, and many toolkits are available to parse PDF documents, such as PaperMage (Lo et al. 2023), GROBID (GROBID 2008), Adobe Extract API (Adobe Inc. n.d.), CERMINE (Tkaczyk et al. 2015), GeoDeepShovel (Zhang et al. 2023), and PDFFigures 2.0 (Clark and Divvala 2016). Here, we leverage an off-the-shelf tool to parse PDF text, tables, and figures. Beyond these research tools, Elicit (Elicit 2023) is a commercial tool that facilitates systematic reviews: it lets users describe what data should be extracted and creates a data column to organize the results. However, it does not provide an overview of the extracted knowledge to help users handle variation and inconsistencies across different research literature. Here, we also formulate the knowledge as structured data tables. Moreover, we provide multi-faceted visual and text summaries of the data tables to help users understand the research landscape, inspect nuances between different papers, and verify and refine the data tables interactively.
Tools for Literature Reading and Comprehension
Research literature reading and comprehension is cognitively demanding, and many systems have been developed to facilitate this process (Head et al. 2021; August et al. 2023; Lee et al. 2016; Fok et al. 2023, 2024; Kang et al. 2022; Kang, Wu, et al. 2023; Kim et al. 2018; Chen et al. 2023; Peng et al. 2022; Jardim et al. 2022). One line of research aims to improve the comprehension and readability of individual research papers. To reduce barriers to domain knowledge, ScholarPhi (Head et al. 2021) provided in-situ support for definitions of technical terms and symbols within scientific papers. PaperPlain (August et al. 2023) helped healthcare consumers understand medical research papers through AI-generated questions and answers and in-situ text summaries of every section. EvidenceMap (Kang, Sun, et al. 2023) leverages three-level abstractions to support medical evidence comprehension. Other work (Ponsard et al. 2016; Chau et al. 2011) designed interactive visualizations to summarize and group different papers and guide exploration. Some systems support fast skimming of paper content. For example, Spotlight (Lee et al. 2016) extracted visually salient objects in a paper and overlaid them on top of the viewer during scrolling, and Scim (Fok et al. 2023) enabled faceted highlighting of salient paper content. To support scholarly synthesis, Threddy (Kang et al. 2022) and Synergi (Kang, Wu, et al. 2023) facilitated a personalized organization of research papers in threads; Synergi further synthesized research threads with hierarchical LLM-generated summaries to support sensemaking. To address personalized information needs for a paper, Qlarify (Fok et al. 2024) provided paper summaries by recursively expanding the abstract. Other studies (Jardim et al. 2022; Marshall et al. 2016) developed and evaluated machine learning methods for risk-of-bias assessment in clinical trial reports. Although these systems help users digest research papers and distill knowledge with guidance, we take a step further by converting unstructured knowledge and research findings scattered within research papers into structured data tables with a standardized format.
Document QA Systems for Information Seeking
People often express their information needs and interests in the documents using natural language questions (ter Hoeve et al. 2020). Many researchers have been working on building question-answering models and benchmarks (Dasigi et al. 2021; Krithara et al. 2023; Jin et al. 2019; Vilares and Gómez-Rodríguez 2019; Ruggeri et al. 2023) for scientific documents. With recent breakthroughs in LLMs, some LLM-fused chatbots, such as ChatDoc (ChatDoc n.d.), ChatPDF (ChatPDF n.d.), ChatGPT (Open AI 2024), Claude (Anthropic 2024), are becoming increasingly popular for people to turn to when they have analytic needs for very long documents. However, LLMs can produce unreliable answers, resulting in hallucinations (Ji et al. 2023; Khullar et al. 2024). It is important to attribute the generated results to the source (or context) of the knowledge (Wang et al. 2024). Then, automated algorithms or human raters can examine whether the reference source really supports the generated answers using different criteria (Gao et al. 2023; Yue et al. 2023; Rashkin et al. 2023; Bohnet et al. 2023; Menick et al. 2022). In our work, we utilize RAG techniques (Lewis, Perez, et al. 2020) to improve the reliability of LLM output by grounding it on the relevant supporting evidence in the source documents. Then, we use quantitative metrics, such as context relevance, to evaluate the answer quality and prioritize users' attention on checking and fixing low-quality answers.
Formative Study
We aim to develop an interactive system that helps researchers distill, synthesize, and organize structured data from scientific literature in a systematic, efficient, and scalable way. To better understand the current practice and challenges they face during the process, we conducted a formative interview study.
Participants and Procedures
Participants
12 researchers (P1–P12; five females, seven males; three aged 18–24, nine aged 25–34) were recruited from different disciplines, including medical and health sciences, computer science, social science, natural sciences, and mathematics. Nine had obtained PhD degrees, and three were PhD researchers. All of them had extracted data (e.g., interventions and outcomes) from literature, and 10 of them had further statistically analyzed or narratively synthesized data. Seven rated themselves as very experienced, having led or been involved in the extraction and synthesis of both quantitative and qualitative data across multiple types of reviews. Five reported expert-level understanding and usage of computer technology for research purposes, and seven rated themselves at moderate levels.
Procedures
Before the interviews, we asked the participants to complete a pre-task survey collecting their demographics, experience with literature data extraction and synthesis, and understanding and usage of computer technology. Then, we conducted 50-min interviews with individuals over Zoom. During the interviews, we asked participants about (1) their general workflow for data extraction from literature and their desired format for organizing the data; (2) the tools they used for data extraction and synthesis and those tools' limitations; and (3) their expectations and concerns about computer and AI support.
Findings and Discussions
Workflow and Tools
After getting the final pool of included papers, participants first created a data extraction form (e.g., fields) to capture relevant information related to their research questions, such as data, methods, interventions, and outcomes. Then, they went through individual papers, starting with a high-level review of the title and abstract. Afterward, participants manually distilled and synthesized the relevant information required on the form. The data synthesis process often involved iterative refinement, where participants might go back and forth between different papers to update the extraction form or refine previous extraction results.
Common tools used by participants included Excel (9/12) and Covidence or RevMan (4/12) for organizing forms and results of data extraction. Some participants also used additional tools like Typora, Notion, Python, or MATLAB for more specialized tasks or to enhance data organization. The final output of this process was structured data tables in CSV or XLSX format that provided a comprehensive representation of the knowledge extracted from the literature.
Challenges
Time-consuming to manually retrieve and summarize relevant data within the literature. Participants found it time-consuming to extract different types of data, including both qualitative and quantitative data, located in different parts of the papers, such as text snippets, figures, and tables. P1 commented, “Sometimes, numbers and their units are separated out at different places.” The time cost further increases when facing “many papers” (7/12) to be viewed, “long papers” (5/12), or papers targeting very specialized domains they are not so familiar with (5/12). P3 added, “When information is not explicit, such as limitations, I need to do reasoning myself.” P5 said, “It takes much time for me to understand, summarize, and categorize qualitative results and findings.”
Tedious and repetitive manual data entry from literature to data tables. After locating the facts and relevant information, participants needed to manually input them into the data tables, which is inefficient and tedious. P3 pointed out, "…the data is in a table (of a paper), I need to memorize the numbers, then switch to Excel and manually log it, which is not efficient and can cause errors." P4 echoed, "Switching between literature and tools to log data is tedious, especially when dealing with a large number of papers, which is exhausting."
Significant workload to resolve data inconsistencies and variations across the literature. Almost all participants mentioned the great challenges of handling inconsistencies and variations in data, such as terminologies, abbreviations, measurement units, and experiment conditions, across multiple papers. It was hard for them to standardize the language expressions and quantitative measurements. P7 stated, “Papers may not use the same terms, but they essentially describe the same things. And it takes me lots of time to figure out the groupings of papers.” P9 said, “I always struggle with choosing what words to categorize papers or how to consolidate the extracted information.”
Inconvenient to maintain connections between extracted data and the origins in literature. The process of data extraction and synthesis often requires iterative review and refinement, such as resolving uncertainties and addressing missing information by revisiting original sources. However, when dealing with numerous papers and various types of information, the links between the data and their sources can easily be lost. Participants commonly relied on memory to navigate specific parts of papers containing the data, which is inefficient, unscalable, and error-prone. P8 admitted, “I can easily forget where I extract the data from. Then, I need to do all over again.”
Expectations and Concerns About AI Support
Participants anticipated that AI systems could automatically extract relevant data from literature based on their requests (7/12) and organize it into tables (9/12). They desired quick data summaries and standardization (6/12) to facilitate synthesis. Additionally, they wanted support for categorizing papers based on user-defined criteria (4/12) and for efficient review and editing in batches (4/12). Participants also expected that computer support should be easy to learn and flexibly adapt to their data needs. Many stated that existing tools, like Covidence and RevMan, were somewhat complex, especially for new users who may struggle to understand their functionalities and interface interactions.
Due to the intricate nature of scientific research studies, participants shared concerns about the accuracy and reliability of AI-generated results. They worried that AI lacks sufficient domain knowledge and may generate results based on the wrong tables/text/figures. P12 demanded that AI systems should highlight uncertain and missing information. Many participants requested validation of AI results.
Design Goals
Based on the current practice and challenges identified in our formative study and the specific needs of researchers engaged in data extraction, we distilled the following design goals:
DG1. Support flexible and comprehensive data extraction and structuring. The system should enable users to customize data extraction queries for diverse data dimensions and measures. To reduce manual effort, it should automate the extraction of both qualitative and quantitative data from various modalities such as text, tables, and figures. The extracted data should be organized into structured tables, providing a solid foundation for further refinement and analysis.
DG2. Enable multi-faceted data summarization and standardization. To address inconsistencies and variations across the literature, the system should provide an overview of key patterns and discrepancies in the extracted data regarding different dimensions and measures. It should also assist in standardizing data derived across multiple documents, such as terminologies, measurements, and categorizations.
DG3. Support efficient data validation and refinement. The system should address the need to ensure the accuracy and reliability of extracted data:
- DG3.1. Provide a preliminary evaluation of automatically extracted data. This helps users identify data errors and prioritize their validation effort.
- DG3.2. Facilitate easy comparison of extracted data against original sources. This enables users to trace data origins to verify data accuracy.
- DG3.3. Enable efficient batch editing and refinement of data. The system should support data entry for data subsets (e.g., records sharing the same dimension values).
System
Here, we introduce the design and implementation of SciDaSynth. First, we provide an overview of the system workflow (Figure 1). Then, we describe the technical pipeline of data extraction and structuring. Finally, we elaborate on the user interface designs and interactions.
[IMAGE OMITTED. SEE PDF]
System Workflow
After uploading the PDF files of research literature, users can interact with SciDaSynth with natural language questions (e.g., “What are the task and accuracy of different LMs?”) or customized data extraction forms in the chat interface (Figure 1). The system then processes this question and presents the user with a text summary and a structured data table (DG2). This interaction directly addresses the data needs without requiring tedious interface drag and drop (DG1).
The data table provided by SciDaSynth includes specific dimensions related to the user's question, such as “Model,” “Task,” and “Accuracy,” along with corresponding values extracted from the literature. To guide users' attention to areas needing validation, the system highlights missing values (“Empty” cells) and records with low relevance scores (DGs 3.1, 3.2). To validate and refine data records, users can view the relevant contexts used by the LLM, with important text spans highlighted (Figure 2 – middle). They can also access the original PDF.
[IMAGE OMITTED. SEE PDF]
To handle data inconsistencies across papers, the system provides a multi-level and multi-faceted data standardization interface (DG2). First, users can gain an overview of data attributes and their consistency information (Figure 2 – right). Upon selecting specific attributes, the system performs semantic grouping of attribute values to help users identify contextual patterns and distributions of potential inconsistencies (e.g., full form vs. abbreviation). Next, based on the grouped attribute values and their visual summary, users can create, modify, rename, or merge the groups, effectively categorizing the data. Once satisfied with their groupings, users can apply standardization results to instantly update the main data table (DG3.3). Throughout this process, the system provides real-time updates to charts and statistics to show the impact of standardization efforts.
Once satisfied with the data quality, users can add the table to the database, where it is automatically merged with existing data. This process can be repeated with new queries to incrementally build a comprehensive database (Figure 2 – middle). Finally, users can export the entire database in CSV format for further analysis or reporting.
Data Extraction and Structuring
We leverage LLMs to extract and structure data from scientific literature based on user questions (DG1). To mitigate hallucination issues and facilitate user validation of LLM-generated answers, we adopt the RAG framework by grounding LLMs in relevant information from the papers (as shown in Figure 1).
The process begins with parsing the paper PDF collection into tables, text snippets (e.g., segmented by length and sections), and figures using a state-of-the-art toolkit for processing scientific papers (Lo et al. 2023). To enhance the process, we generate descriptions of the data insights embedded in the figures using large vision-language models (i.e., GPT-4o in this paper). Due to the complexity of modern table structures, we also use LLMs to infer, parse out, and standardize the table structure (to CSV format) from table strings in PDF texts via few-shot prompting (details in Supporting Information: S1A).
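To illustrate the table-standardization step, the sketch below shows how a few-shot prompt might ask an LLM to rewrite a PDF-parsed table string as CSV. It assumes the OpenAI Python client; the prompt wording, example rows, and model choice are illustrative and not the system's actual prompts (which appear in Supporting Information S1A).

```python
# Illustrative sketch of table standardization: a few-shot prompt asks an LLM to
# rewrite a raw table string (parsed from a PDF) as clean CSV. Prompt wording and
# model choice are assumptions.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_EXAMPLE = (
    "Raw table string:\n"
    "Crop Vitamin A (ug/g) Iron (mg/kg)\nMaize 1.2 18\nSweet potato 85.0 5\n"
    "CSV:\n"
    "crop,vitamin_a_ug_per_g,iron_mg_per_kg\nMaize,1.2,18\nSweet potato,85.0,5\n"
)

def table_string_to_csv(raw_table: str, model: str = "gpt-4o") -> str:
    """Infer the structure of a PDF-parsed table string and return it as CSV."""
    prompt = (
        "You convert raw table strings extracted from scientific PDFs into clean CSV.\n"
        "Preserve all cells; do not invent values.\n\n"
        f"{FEW_SHOT_EXAMPLE}\n"
        f"Raw table string:\n{raw_table}\nCSV:\n"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```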
Vector database construction. We construct a vector database for efficient retrieval and question-answering using the processed text, figures, and tables. For each component, we generate a concise text summary using LLMs to index the original, verbose content. This approach reduces noise and improves RAG quality. The text summary is then transformed into embeddings.
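A minimal sketch of this summarize-then-embed indexing is shown below, assuming the OpenAI Python client and a plain in-memory index; the summarization prompt, embedding model, and storage structure are assumptions for illustration.

```python
# Sketch of vector-database construction: each parsed element (text snippet, table,
# or figure description) is summarized by an LLM, and the summary is embedded to
# index the original, verbose content. Model names and the in-memory index are
# assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def summarize(content: str) -> str:
    """Produce a concise summary used to index the original content."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Summarize this document element in 2-3 sentences:\n{content}"}],
        temperature=0,
    )
    return response.choices[0].message.content

def embed(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(result.data[0].embedding)

def build_index(elements: list[str]) -> list[tuple[np.ndarray, str]]:
    """Return (summary_vector, original_content) pairs for retrieval."""
    return [(embed(summarize(element)), element) for element in elements]
```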
Question-based retrieval. When a user poses a question, we encode it as a vector and perform a similarity-based search to identify relevant original content by comparing it with the summary vectors of figures, tables, and text snippets in the database. The retrieved original content is later used for LLM's generation.
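Retrieval can then be sketched as a cosine-similarity search over the summary vectors that returns the original content of the top matches; this reuses the embed() helper from the indexing sketch above and is illustrative rather than the system's actual retriever.

```python
# Sketch of question-based retrieval: embed the question, rank stored summary
# vectors by cosine similarity, and return the original content of the top-k hits.
import numpy as np

def retrieve(question: str, index: list[tuple[np.ndarray, str]], k: int = 5) -> list[str]:
    query_vec = embed(question)  # embed() as defined in the indexing sketch

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[0]), reverse=True)
    return [original for _, original in ranked[:k]]
```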
Data table structure inference. Based on the user's question, we prompt LLMs to infer and design the structure of the data tables, including column names and value descriptions. This guides the LLM in formulating scoped, consistent, and standardized responses across different papers.
Data extraction and structuring. The user question, retrieved document snippets, and the inferred data structure are fed into LLMs to produce the final data table and an associated summary. We instruct the LLMs to answer questions solely based on the provided contexts and to output “Empty” for values that cannot be determined from the given information. The organization of structured data tables facilitates convenient human validation and refinement (DG3).
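The sketch below illustrates this generation step: the question, retrieved contexts, and inferred structure are combined into one prompt, and the model is instructed to answer only from the contexts and to emit "Empty" for undeterminable values. The prompt wording and JSON-mode usage are assumptions.

```python
# Sketch of the final extraction call: question + retrieved contexts + inferred
# table structure go into one prompt; the model must answer only from the contexts
# and output "Empty" for values it cannot determine. Prompt wording is assumed.
import json
from openai import OpenAI

client = OpenAI()

def extract_record(question: str, contexts: list[str], schema: dict,
                   model: str = "gpt-4o") -> dict:
    prompt = (
        f"Question: {question}\n\n"
        "Contexts from one paper:\n" + "\n---\n".join(contexts) + "\n\n"
        "Fill this JSON structure using ONLY the contexts above. "
        f'Use "Empty" for any value that cannot be determined:\n{json.dumps(schema)}'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```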
Data quality evaluation. To assess the quality of the RAG process, we implement several unsupervised metrics computed by LLMs to judge answer quality regarding retrieved contexts and original questions (Es et al. 2024; Yu et al. 2024): answer relevancy measures how well the answer addresses the user's question; context relevancy evaluates the pertinence of the retrieved context to the question; and faithfulness assesses the degree to which the answer can be justified by the retrieved context. Additionally, we compute data missingness by tracking the proportion of “Empty” values in the generated table. This metric alerts users to insufficient or missing information in the original source (DG3.1).
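As a rough illustration, data missingness can be computed directly from the generated table, while faithfulness (and, analogously, answer and context relevancy) can be scored by an LLM judge in the spirit of the cited RAG-evaluation work; the judge prompt below is hypothetical.

```python
# Sketch of the quality signals: missingness is the fraction of "Empty" cells in a
# generated table; faithfulness is scored by an LLM judge. The judge prompt is an
# assumption, not the paper's exact wording.
import pandas as pd
from openai import OpenAI

client = OpenAI()

def missingness(table: pd.DataFrame) -> float:
    """Proportion of 'Empty' values in the generated data table."""
    return float((table == "Empty").to_numpy().mean())

def faithfulness(answer: str, contexts: list[str], model: str = "gpt-4o-mini") -> float:
    """Ask an LLM judge how well the answer is supported by the contexts (0 to 1)."""
    prompt = (
        "On a scale from 0 (unsupported) to 1 (fully supported), how well is the "
        "answer justified by the contexts? Reply with a single number.\n\n"
        f"Answer: {answer}\n\nContexts:\n" + "\n---\n".join(contexts)
    )
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return float(response.choices[0].message.content.strip())
```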
Data Table Structure Inference Example
Q: What are the tasks and accuracy of different LMs?
Inferred structure:
{"language_model_name": "String: Name of the language model (e.g., GPT-3, BERT)",
 "tasks_supported": "String: List of tasks the language model can perform (e.g., text generation, summarization, translation)",
 "accuracy_metric": "String: Description of the accuracy metric used (e.g., F1, BLEU)",
 "accuracy_value": "Float: Numerical value for the accuracy (0-100)",
 "accuracy_source": "String: Source of the accuracy data (e.g., research paper, benchmark test)"}
User Interface
Building upon the RAG-based technical framework, the user interface streamlines data extraction, structuring, and refinement from scientific documents. Based on the LLM-generated data tables, users can perform iterative data validation and refinement by pinpointing and correcting error-prone data records and by resolving data inconsistencies via flexible data grouping over specific attributes. Finally, users can add the quality-checked data tables to the database. Below, we introduce the system designs and interactions following the user workflow.
Flexibly Specify Information Needs
Upon uploading the scientific documents into SciDaSynth, users can formulate their questions in the Query tab by typing natural language questions in the chat input (DG1, Figure 2). Alternatively, users can select and add specific data attributes, along with detailed explanations and requirements, in a form box, which is useful for querying complex terminology. This form includes suggested starting queries such as study summary, results, and limitations. The system then responds to users' questions with a text summary and presents a structured data table in the "Current Result" tab (DG2).
Guided Data Validation and Refinement
To make users aware of potentially problematic results and support validation (DG3.1), SciDaSynth highlights error-prone data records and values. Specifically, the system highlights "Empty" cells and flags error-prone table records, with a counter shown at the top, based on the unsupervised metrics (Section 4.2). Users can then right-click specific table records or cells to access a context menu, which provides the option to view the relevant contexts used by the LLMs for data generation. Within these contexts, important text spans that exactly match the generated data are highlighted for quick reference. If users find the contexts irrelevant or suspect LLM hallucination, they can easily access the original PDF content or the parsed figures and tables in the right panel (Figure 2) for verification (DG3.2). After identifying errors, users can double-click cells to edit values and clear the corresponding alerts in the rows.
Multi-Level and Multi-Faceted Data Summarization and Standardization
To resolve data inconsistencies across different literature, the system first presents an overview of data attributes, data types, and inconsistency information (Figure 2 – right).
Dimension-guided data exploration. (DG2) After selecting data attributes, the system performs visual semantic data grouping based on attribute values. Specifically, each record (row) of the selected attributes in the table is transformed into a text description ("column_name: value"), encoded as a vector, and projected onto the 2D plane as a dot whose size correlates with frequency. Users can hover over individual dots to see the column values and their group labels. They can also select a group of dots to examine the full data records in the "Current Result" table. The variations and similarities of dimension values across rows are reflected in the distribution of dot clusters, computed using KMeans clustering. To concretize the data variations, each cluster is labeled with a text summary generated by LLMs. For example, the scatter plot groups "crops" values into colored clusters, such as sweet potatoes and maize. Users can also select multiple attributes at once, such as "nutrient value" and "measurement units," to gain contextual insights into discrepancies in measurement units.
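A simplified sketch of this grouping step is shown below: each selected cell becomes a "column_name: value" string that is embedded, clustered with KMeans, and projected to 2D for the scatter plot. The projection method (PCA here) and fixed cluster count are assumptions; in the system, each cluster would additionally receive an LLM-generated text label.

```python
# Sketch of semantic grouping for the scatter plot: embed "column: value" strings,
# cluster them with KMeans, and project to 2D. PCA and the cluster count are
# assumptions; embed() is reused from the indexing sketch.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def group_attribute_values(records: list[dict], column: str, n_clusters: int = 5):
    texts = [f"{column}: {record[column]}" for record in records]
    vectors = np.vstack([embed(text) for text in texts])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    coords = PCA(n_components=2).fit_transform(vectors)
    return texts, labels, coords  # labels drive dot colors; coords place dots on the plane
```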
Group-based standardization. (DG2) After developing some high-level understanding of data variations, users can proceed to standardization of the selected data attributes (Figure 3). Users can start with an overview of total and unique values for each major group within the attribute, visualized through bar charts. Below the charts, the system displays individual group cards, each representing a cluster of similar values. These cards are color-coded based on the frequency of occurrences (high, medium, low), allowing users to quickly identify prevalent and rare entries. Within each card, the system lists all unique value variations, along with their frequency counts. This granular view enables users to easily spot inconsistencies, misspellings, or variations in terminology.
[IMAGE OMITTED. SEE PDF]
The interface supports the following interactive standardization:
Users can create new groups or rename existing ones to better categorize the data.
Users can drag and drop individual value entries between groups, facilitating the consolidation of similar terms.
Inline editing tools enable users to modify group names or individual value entries directly.
For each group card, users can apply its standardization to the data table with a single click, and the resulting changes can be tracked and reviewed afterward.
Iterative Table Editing and Database Construction
When satisfied with the quality of the current table, users can add it to the database, where the system automatically merges it with existing data using outer joins on document names. This process can be repeated with new queries, allowing for the incremental construction of a comprehensive database. Once the data extraction is complete, users can download the entire database in CSV format for further analysis.
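The merging behavior can be sketched with a pandas outer join keyed on the document name; the column name "document" is an assumption for illustration.

```python
# Sketch of incremental database construction: each quality-checked table is merged
# into the database with an outer join on the document name, so new queries add
# columns while preserving papers already present. Column name "document" is assumed.
from typing import Optional
import pandas as pd

def add_to_database(database: Optional[pd.DataFrame],
                    new_table: pd.DataFrame) -> pd.DataFrame:
    if database is None:
        return new_table
    return database.merge(new_table, on="document", how="outer")

# database.to_csv("extracted_data.csv", index=False)  # export for further analysis
```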
Evaluation Design
We employed a within-subjects design where participants used both SciDaSynth and a baseline system to extract data from two paper collections. The study aimed to answer the following research questions:
Effectiveness of data extraction:
- Data quality: How does SciDaSynth impact the quality of extracted data?
- Efficiency: How does SciDaSynth affect the speed of data extraction?
- User perceptions: What are the perceived benefits and limitations of system designs and workflows?
Experiment Settings
Participants
We recruited a diverse group of 24 researchers:
Group A: Nutrition Science (12 participants; P1–P12) This group (8 females, 4 males, aged 18–44) specialized in nutritional sciences, including food science and technology, human nutrition, medical and health sciences, and life sciences. There were five postdoctoral fellows and seven PhD students, all actively engaged in research. Their technical expertise varied: five were expert users who regularly coded and programmed, while seven were intermediate users who coded as needed.
Group B: NLP/ML Researchers (8 participants; P13–P20) This group (5 males, 3 females, aged 18–33) included researchers in NLP, machine learning, and artificial intelligence. There were two postdocs, five PhD students, and one MPhil student. All were expert computer programmers.
All participants were familiar with the target data dimensions through their own research or reading of the literature, and all had experience in data extraction and synthesis for research studies.
Datasets
We collected and processed two domain datasets based on the recent survey papers. The surveys cover diverse paper types and formats, such as randomized trials, peer-reviewed articles, meeting abstracts and posters, and doctoral theses:
Dataset I (Nutrition Science) is based on a systematic review in Nature Food (Huey et al. 2023), focusing on micronutrient retention in biofortified crops through various processing methods.
Dataset II (Large Language Models) is derived from a recent LLM survey (Zhao, Zhou, et al. 2023), covering various aspects of LLM development and applications.
Table 1 Statistics of papers in Datasets I and II; values are mean (SD).
| | Page # | Character # | Figure # | Table # |
| Dataset I | 9.40 (2.97) | 34,842 (14,569) | 3.60 (1.95) | 5.20 (3.35) |
| Dataset II | 24.10 (7.50) | 74,882 (22,077) | 7.00 (3.97) | 13.60 (4.30) |
Baseline Implementation
Baseline A (Human). A simplified version of SciDaSynth without automated data extraction or structuring, designed to replicate current manual practices. It includes: (1) a PDF viewer with annotation, highlighting, and searching capabilities; (2) automatic parsing of paper metadata, tables, and figures; (3) a data entry interface for manual table creation; and (4) side-by-side views of PDFs and data tables. This baseline allows us to assess the impact of SciDaSynth's automated features and standardization support on efficiency and data quality.
Baseline B (Automated GPT). We developed a fully automated system based on GPT-3.5/4 to generate data tables according to specified data dimensions. This baseline was intended to evaluate the accuracy of our technical framework for automatic data table generation. The implementation followed the data extraction and structuring approach of SciDaSynth (described in Section 4.2). To produce the data tables for comparison, we input the data attributes and their descriptions in JSON format as queries into the system and generated one data table for each of the two splits of each dataset (i.e., four data points in total).
Tasks
We designed tasks to simulate real-world data extraction scenarios while allowing for controlled evaluation across two distinct domains. Each participant was assigned to work with one complete dataset (either Dataset I or Dataset II), consisting of 20 papers in total.
Nutrition science researchers (P1–P12) worked with Dataset I, extracting “crops (types),” “micronutrients (being retained),” “absolute nutrient raw value,” and “raw value measurement units.”
NLP/ML researchers (P13–P20) worked with Dataset II, extracting “model name,” “model size,” “pretrained data scale,” “hardware specifications (GPU/TPU).”
These dimensions covered both qualitative and quantitative measurements, requiring engagement with various parts of the papers.
To ensure a robust comparison between the two interactive systems (SciDaSynth and the manual Baseline A), we split each dataset into two subsets of 10 papers each, ensuring a balanced distribution of paper characteristics such as paper length and the numbers of figures and tables. Each participant extracted all four data dimensions for every paper in their assigned dataset (I or II), using one system per subset and thus producing two data tables in total. Baseline B (the automated baseline) was not operated by participants; it was run separately to compare automated performance against the human-in-the-loop systems. To mitigate ordering effects (e.g., learning effects), we counterbalanced the order of system usage (SciDaSynth first or Baseline A first) and dataset splits (Split 1 first or Split 2 first), resulting in a 2 (system order) × 2 (data split order) within-subjects design. Participants organized the extracted data into tables and downloaded them from the systems. The scenario was framed as "working with colleagues to conduct a systematic review."
Following the structured tasks, participants engaged in an open-ended exploration of the whole dataset with SciDaSynth, allowing for insights into the system's full capacity in research use.
Procedure
We conducted the experiment remotely via Zoom, with both Baseline A and SciDaSynth deployed on a cloud server. The study followed a structured procedure: pre-study setup and briefing (10 min), system tutorials (10 min each), main tasks with both systems and datasets (in counterbalanced order), post-task surveys, free exploration of SciDaSynth (15 min), and a concluding interview.
Participants first provided consent and background information. They then received tutorials on each system before performing data extraction tasks. After completing structured tasks with both systems, participants freely explored SciDaSynth while thinking aloud. The study concluded with a semi-structured interview gathering feedback on system designs, workflow, and potential use cases.
Each session lasted approximately 2.5 h, with participants compensated $30 USD. This procedure enabled collection of comprehensive quantitative and qualitative data on SciDaSynth's performance and user experience across different research domains.
Measurements
We evaluated SciDaSynth using quantitative and qualitative measures focusing on effectiveness, efficiency, and user perceptions, with separate analyses for Datasets I and II.
Effectiveness of data extraction was assessed by evaluating data quality and task completion time. For data quality, we compared the data tables generated by participants using SciDaSynth and Baseline A, as well as those produced by the automated GPT baseline (Baseline B), against the original data tables from the source survey papers. For each dataset, two expert raters who were blind to the system conditions independently scored each extracted value on a 3-point scale: 0 (Not Correct), 1 (Partially Correct), and 2 (Correct), based on accuracy and completeness. Inter-rater agreement on individual dimensions was measured by Cohen's κ, and the two raters generally showed good agreement (as shown in Table S4). Disagreements were resolved through discussion to reach consensus scores. For SciDaSynth and Baseline A, we calculated participants' scores for the corresponding dataset (one per participant, each ranging from 0 to 20) and compared the average scores of the two systems using paired Student's t-tests. Baseline B yielded one score per dataset split (two per dataset), which were compared using Mann–Whitney U-tests (Kang, Wu, et al. 2023).
For task efficiency, we measured task completion time from the moment the PDFs were uploaded to the system to the moment the final data table was downloaded. The task completion times for SciDaSynth and Baseline A were compared using paired Student's t-tests.
User perceptions were evaluated through post-task questionnaires and interviews. We used the NASA Task Load Index (6 items) to assess perceived workload, and an adapted Technology Acceptance Model (5 items) to measure system compatibility and adaptability, both using 7-point scales (Kang, Wu, et al. 2023; Wu and Wang 2005). Custom items gauged perceived utility in areas such as paper overview, workflow simplification, data handling, and confidence. Questionnaire data were analyzed using Wilcoxon signed-rank tests, with separate analyses for each dataset. Qualitative feedback on system designs, workflows, and potential use cases was collected through post-study interviews and summarized to provide context and depth to our quantitative findings.
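For reference, the statistical tests described above can be sketched with standard Python libraries; the arrays below are illustrative placeholders, not study data.

```python
# Sketch of the analyses: Cohen's kappa for inter-rater agreement, a paired t-test
# for SciDaSynth vs. Baseline A, a Mann-Whitney U test against Baseline B, and a
# Wilcoxon signed-rank test for questionnaire ratings. All values are illustrative.
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

rater1 = np.array([2, 1, 2, 0, 2])               # illustrative expert ratings
rater2 = np.array([2, 1, 1, 0, 2])
kappa = cohen_kappa_score(rater1, rater2)

scidasynth_scores = np.array([18, 17, 19, 16])   # illustrative per-participant totals
baseline_a_scores = np.array([16, 15, 18, 15])
t_stat, p_paired = stats.ttest_rel(scidasynth_scores, baseline_a_scores)

baseline_b_scores = np.array([13, 14])           # one score per dataset split
u_stat, p_u = stats.mannwhitneyu(scidasynth_scores, baseline_b_scores)

ratings_scida = np.array([6, 5, 6, 7])           # illustrative 7-point questionnaire items
ratings_base = np.array([4, 4, 5, 5])
w_stat, p_w = stats.wilcoxon(ratings_scida, ratings_base)
```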
Results and Analyses
Effectiveness of Data Extraction
Data Quality
As illustrated in Figure 4, SciDaSynth consistently achieved the highest data quality scores across both datasets. For Dataset I (Nutrition Science), SciDaSynth (M = 17.58, SD = 1.44) significantly outperformed Baseline A (M = 16.08, SD = 2.02; t = 2.34) and Baseline B (M = 13.25, SD = 0.50; U = 48.0). Moreover, the manual extraction method (Baseline A) also significantly outperformed the automated GPT method (Baseline B) (U = 44.0), highlighting the challenges faced by fully automated systems in this domain. In Dataset II (Large Language Models), SciDaSynth (M = 17.79, SD = 1.05) also significantly outperformed Baseline A (M = 16.56, SD = 1.31; t = 2.73) and Baseline B (M = 14.50, SD = 0.76; U = 48.0). Similar to Dataset I, the manual extraction method (Baseline A) significantly outperformed the automated GPT method (Baseline B) (U = 43.0). In addition, SciDaSynth outperformed both baselines in all data dimensions (shown in Tables S2 and S3), while all three systems struggled with some error-prone fields such as raw nutrient values and units. These results demonstrate SciDaSynth's effectiveness in producing high-quality data extractions across diverse scientific domains.
[IMAGE OMITTED. SEE PDF]
Efficiency
Figure 5 illustrates the task completion time for SciDaSynth compared to Baseline A across both datasets. For Dataset I, participants using SciDaSynth completed the task in significantly less time (M = 44.74, SD = 9.37 min) than with Baseline A (M = 73.22, SD = 14.17 min; t = 5.19), a substantial 38.9% reduction in task completion time. The efficiency gain was even more pronounced for Dataset II, with SciDaSynth (M = 29.72, SD = 9.37 min) significantly outperforming Baseline A (M = 61.59, SD = 9.43 min; t = 7.38), amounting to a 51.7% reduction in task completion time.
[IMAGE OMITTED. SEE PDF]
These findings, combining superior data quality and significantly faster task completion times, demonstrate SciDaSynth's capability to facilitate more efficient and accurate data extraction across various scientific domains.
Case Analyses of Automated LLM Baseline
The automated Baseline B achieved an overall accuracy of 67.5% (13.50/20) for Dataset I (Nutrition Science) and 73.8% (14.75/20) for Dataset II (LLM Survey). We investigated the failure cases across both datasets and identified three major reasons for these shortcomings:
First, an incomplete understanding of the query in the specific paper context. This issue was the most prevalent and was more severe in Dataset I (Nutrition Science) due to its domain-specific terminology. When asked about raw nutrient values in crops, Baseline B failed to contextualize the meaning of "raw" in individual paper contexts. For example, some papers might use words like "unprocessed" and "unwashed," or imply it in tables with a processing start time equal to zero, which the system failed to recognize. There were also cases where one paper covered multiple crop types, but Baseline B extracted only one. In Dataset II, "pretrained data scale" was sometimes misinterpreted by the LLMs as fine-tuning dataset sizes.
Second, insufficient and incorrect extraction of table and figure information. Both datasets presented challenges in this area. In Dataset I, many failure cases stemmed from the retrieved tables and figures: some tables with very complex designs and structures (e.g., hierarchical data dimensions) were parsed incorrectly, and some information in the figures was overlooked. The quality of the retrieved information affected the LLMs' reasoning, resulting in "Empty" cell values for specific dimensions.
Third, missing associations between different parts of papers. In some instances, data in tables were incomplete and required interpretation with information from other sections. For example, when asking for what crops are in a paper, the system retrieved and reported all crop variety numbers from one table instead of crop names. However, the corresponding crop names were recorded in method sections, demonstrating the mappings between crop names and their variety numbers. Similarly, in Dataset II, when extracting “model name” and “model size,” LLMs sometimes reported only one model type (e.g., BERT) and their sizes without identifying and differentiating the model variants (e.g., T5-base and T5-large) mentioned in different pages or sections.
User Perceptions Towards SciDaSynth
Streamline the Data Extraction Workflow
Participants consistently reported that SciDaSynth significantly simplified and accelerated their data extraction process, addressing the key challenge of efficiently processing multimodal information within scientific literature. The system's ability to automatically generate structured data tables from diverse sources within papers was particularly praised for its time-saving benefits. P12 highlighted the efficiency gain, stating, “I was impressed by how the system could extract and combine information from method descriptions, results tables, and even figures into a coherent data table. This usually takes me hours to do manually.” This sentiment was echoed by P8, “The query helps me to find all of the key information, and I only need to verify them. That improves my efficiency a lot.” Quantitative results supported these observations, with participants rating the “effectiveness of simplifying data extraction workflow” significantly higher for SciDaSynth (M = 5.83, SD = 0.58) compared to Baseline A (M = 4.33, SD = 1.50, p = 0.012).
The system's ability to understand and accurately respond to user queries was highly rated (M = 6.00, SD = 0.74), as was the quality of the generated data tables (M = 5.50, SD = 0.80). This interaction was deemed as “natural” (P9), “user-friendly” (P4), and “comfortable” (P12). P7 provided a concrete example: “When I asked about ‘nutrient retention in crops after processing’, the system not only extracted the relevant data but also correctly identified and populated columns for crop types, nutrient names, retention percentages, and processing methods. This would have taken me significantly longer to compile manually.”
Multi-Level, Multi-Faceted Summary and Standardization
Participants reported that SciDaSynth significantly enhanced their ability to standardize data across the paper collection compared to Baseline A (M = 5.67, SD = 0.49 vs. M = 3.75, SD = 1.06, p = 0.005). The multi-level interactive visualizations were particularly effective in helping users identify and resolve data inconsistencies. P1 recalled, "By selecting 'crops', 'nutrients' and 'measurement units' dimensions in the scatter plot, I immediately spotted that some papers were using 'µg/g' while others used 'mg/100 g' for beta-carotene content in sweet potatoes. This prompted me to unify these units."
Building upon the insights from the semantic grouping visualization, participants found the group-based standardization feature crucial for efficiently resolving identified inconsistencies. SciDaSynth significantly facilitated this process compared with Baseline A (M = 5.67, SD = 0.83 vs. M = 4.25, SD = 1.36, p = 0.019). P7 described, “After noticing a cluster in the scatter plot representing various terms for orange-fleshed sweet potatoes, I used the group-based standardization interface. Here, I could see all variations like ‘Orange Sweet Potato’, ‘Orange-fleshed Sweet Potato’, and ‘OFSP’ listed together. I created a standardized term ‘OFSP’ and dragged all variations into this group. The system then automatically updated all relevant entries in the data table.”
Overall, participants found that the integration of visual summarization and standardization tools significantly improved the efficiency and accuracy of the data standardization process.
Enhanced Data Validation and Refinement
Participants found SciDaSynth highly effective for locating (M = 5.50, SD = 0.67), organizing (M = 5.92, SD = 0.79), validating (M = 5.17, SD = 0.83), and editing data (M = 5.75, SD = 0.45). The system's ability to quickly navigate to relevant parts of papers was particularly appreciated. P4 added, "I could easily access the PDF, which is great. The option to look at context is also helpful to verify the data. For example, I easily cross-checked the data by referring to context without skimming the whole paper." The highlighting of important text spans in retrieved contexts proved crucial for efficient validation. P18 shared, "When checking model sizes for different language models, the highlighted text made it easy to verify the data accuracy."
Besides, many participants praised the batch editing feature. P22 mentioned, “I find several clusters pointing at the same models. … after locating them in the table, it was super convenient for me to edit multiple rows of data in the table at once.”
Cognitive Workload and User Experience
Participants reported that SciDaSynth significantly reduced their workload and integrated well with their existing data extraction processes. They showed a stronger interest in using SciDaSynth (M = 5.75, SD = 0.97) compared to Baseline A (M = 3.92, SD = 1.31) in their future work (p = 0.002). The system demonstrably lowered participants' mental workload (M = 3.17, SD = 1.03 vs. M = 4.75, SD = 0.97, p = 0.015) and physical workload (M = 2.92, SD = 1.00 vs. M = 4.25, SD = 1.22, p = 0.034) compared to Baseline A. Participants found SciDaSynth highly compatible with their existing workflows. There were significant differences between SciDaSynth and Baseline A in terms of compatibility (p = 0.027) and fit with expected ways of working (p = 0.005).
Despite the addition of new visualizations and interactions, participants generally found SciDaSynth easy to learn. P15 remarked, “The system is intuitive to use, from the query interface to the results presentation.” While SciDaSynth received a slightly higher score on the learning difficulty scale compared to Baseline A, this difference was not significant (p = 0.21).
Some participants noted a brief learning curve for certain advanced features. P10 mentioned, “Operations on cluster results, like enlarging or clearing filters, and group standardization took some getting used to.” P18 added, “Each component is easy to learn individually, though some parts like scatter plot didn't immediately align with my habits.”
Participants also provided valuable suggestions for system improvement. P8 advised supporting different languages. P5 suggested, "I tend to make notes and comments throughout the extraction, and it may be helpful to have a field dedicated to it." Other suggestions mainly involved enriching table operations, such as changing column orders (P6) and tracking data provenance to revert changes (P1, P20).
Participants Remained Cautious of AI-Generated Results
Participants reported comparable levels of confidence in data tables built with SciDaSynth (M = 5.75, SD = 0.45) and Baseline A (i.e., manual extraction) (M = 5.08, SD = 0.90, p = 0.33). However, qualitative feedback revealed a more nuanced picture of user trust and usage patterns.
Researchers' trust in AI-generated results varied based on the nature of the extraction task. They expressed higher confidence in the system's ability to extract "straightforward" (P1, P3) and "exact" data (P8), particularly for qualitative analyses and standard metrics. P18 noted, "For standard hardware specifications, the system was consistently accurate. But for nuanced details about model sizes or training data scales, I felt compelled to verify manually." Participants developed strategies to leverage the system's efficiency while ensuring accuracy: they appreciated the data visualizations for getting a sense of what the data might look like, as well as the context verification feature. Some researchers, like P11, described an approach of gradual trust-building: "If I'm not familiar with the area, I will read by myself first. After I get familiar with the paper type and research paradigms, I will use this system more confidently." Interestingly, participants' trust in the system evolved through use. P20 observed, "Initially, I was skeptical of every extracted data point. But after verifying the system's accuracy on papers I knew well, I became more confident in its capabilities for similar extraction tasks." This evolution was also noted among nutrition researchers. P5 said, "As I used the system more, I learned which types of information it extracted reliably and which required more careful verification."
Discussion
Design Implications
Structured Data Organization Beyond Table
In this study, we developed a technical framework that enables the generation of structured data tables from a candidate pool of scientific papers in response to users' data extraction queries. The structured data table helped externalize and standardize the large scale of unstructured knowledge embedded in the paper collections. According to the user study, the structured data table provided a good basis for a global understanding of paper collections, and interactive visualizations of data improved awareness of data variations in different dimensions. In the future, systems can consider other data representations beyond table format for structuring and presenting knowledge. For example, the mind map is a useful diagram that can visually summarize the hierarchy within data, showing relationships between pieces of the whole. It can help users build a conceptual framework and taxonomy for paper collections, identify future research directions, and present research findings by branching out to sub-findings, implications, and recommendations. In addition, knowledge graphs could be useful for presenting and explaining the integration of data from multiple sources. They can also enrich data with semantic information by linking entities to concepts in an ontology, adding layers of meaning and context, and revealing hidden connections between entities.
Reduce Context Switching by In-Situ Information Highlighting
To assist users in locating, validating, and refining data, SciDaSynth establishes, highlights, and maintains the connections between data and relevant information in the literature. In the user study, participants favored the keyword highlighting in the pop-ups showing relevant data contexts for the corresponding rows, and they could easily access the original source PDFs for each data record; both designs helped them validate data quality. However, some participants pointed out that they still needed to switch between tabs to validate data tables against the source PDF content, and they desired text highlighting directly in the original PDFs. These benefits and challenges in data validation underscore the importance of designs that reduce context switching and highlight information in situ during knowledge extraction tasks.
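As one way to realize such highlighting, the following sketch assumes that matching is done by exact, case-insensitive comparison of extracted values against their source context; the function name and the `<mark>` markup are illustrative choices, not SciDaSynth's actual implementation.

```python
# A minimal sketch: wrap exact (case-insensitive) occurrences of extracted values
# in <mark> tags so they stand out in the rendered context snippet.
import html
import re

def highlight_in_context(context: str, extracted_values: list[str]) -> str:
    """Wrap every exact occurrence of an extracted value in <mark> tags."""
    marked = html.escape(context)
    # Longer values first, to reduce nested highlights when one value contains another
    for value in sorted(set(extracted_values), key=len, reverse=True):
        pattern = re.compile(re.escape(html.escape(value)), flags=re.IGNORECASE)
        marked = pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", marked)
    return marked

snippet = "Iron retention after boiling was 85% in biofortified beans."
print(highlight_in_context(snippet, ["85%", "boiling"]))
# "boiling" and "85%" are wrapped in <mark>...</mark>
```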
Provide Analytical Guidance During Information Extraction
During system exploration in the user study, some participants mentioned that they were unsure what questions to ask, and how to phrase them, when facing paper collections they were not very familiar with. Future systems should provide adaptive support and guidance for navigating the complex information space by suggesting data questions or interactions for getting started, following up, and clarifying (August et al. 2023; Wang et al. 2022). These question and interaction suggestions could also be learned from users' feedback and dynamic interactions as the question-answering process progresses.
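One possible form of such guidance is to let an LLM propose starter questions for an unfamiliar paper collection. The sketch below is an assumption on our part, not the system's actual prompt; the prompt wording and model name are placeholders, and it uses the standard OpenAI Python client.

```python
# A minimal sketch of LLM-suggested starter questions; prompt and model are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def suggest_questions(paper_titles: list[str], n: int = 5) -> str:
    prompt = (
        "You are helping a researcher extract structured data from the papers below.\n"
        f"Suggest {n} concise data extraction questions (one per line) that would "
        "yield comparable fields across all papers.\n\n" + "\n".join(paper_titles)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: suggest_questions(["Zinc retention in biofortified maize", "Iron retention in beans"])
```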
Promote Collaborative Effort for Knowledge Extraction
In this study, we designed and built an interactive system, SciDaSynth, that helps users extract structured data from scientific literature based on LLM-generated results. The user study showed that SciDaSynth improved the efficiency of data extraction while achieving accuracy comparable to the manual baseline. However, the accuracy achieved by individual researchers was below 90% in both conditions, despite the relatively simple questions, leaving substantial room for improvement in the quality of individually extracted data. This confirms that data extraction from literature is a demanding and challenging task. Future system designs and workflows could therefore promote collaborative effort among individuals to extract and synthesize higher-quality, more reliable data, for example by reconciling multiple researchers' extractions of the same field, as sketched below.
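A lightweight reconciliation step might merge the same field extracted by several researchers, accepting majority-agreed values and flagging disagreements for joint adjudication. The sketch below is hypothetical and not a feature of SciDaSynth.

```python
# A minimal, hypothetical sketch of majority-vote reconciliation across researchers.
from collections import Counter

def reconcile(field: str, extractions: dict[str, str]) -> dict:
    """extractions maps researcher id -> extracted value for one field of one paper."""
    counts = Counter(v.strip().lower() for v in extractions.values())
    value, votes = counts.most_common(1)[0]
    if votes > len(extractions) / 2:
        return {"field": field, "value": value, "status": "accepted"}
    # No majority: surface all candidate values for joint adjudication
    return {"field": field, "value": None, "status": "needs_adjudication", "candidates": dict(counts)}

print(reconcile("sample_size", {"P1": "120", "P2": "120", "P3": "12O"}))
# -> {'field': 'sample_size', 'value': '120', 'status': 'accepted'}
```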
Limitations and Future Work
We discuss the limitations and future work based on our design and evaluation of SciDaSynth.
The technical limitations for future work include:
Improving domain context understanding. Currently, we use vanilla GPT-3.5/GPT-4 to build the technical pipeline for data extraction from domain-specific literature. As reflected in the user study, the LLMs may still lack a deep understanding of specialized domains, which can affect users' usage of and trust in the results. Future work could therefore enhance the domain knowledge and reasoning of LLMs through approaches such as fine-tuning on domain-related articles and iterative human-in-the-loop feedback.
Incorporating more methods to measure the quality of auto-generated results. We only considered data relevance and missingness metrics to direct users' attention to potentially low-quality data for cross-checking. Moreover, we rely on exact string matching to highlight answer-context relationships for easier eyeballing. However, errors not captured by these metrics could still occur and negatively impact the final data quality. In the future, we can develop and integrate more quantitative metrics and fuzzy matching methods (a minimal matching sketch follows this list) to give users a more comprehensive understanding of LLM performance.
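For example, a fuzzy-matching fallback could locate approximate occurrences of an extracted value in its source context when exact matching fails. The sketch below uses only the Python standard library; the function name, sliding-window heuristic, and threshold are illustrative assumptions.

```python
# A minimal sketch of fuzzy matching an extracted value against its source context.
from difflib import SequenceMatcher

def best_fuzzy_span(value: str, context: str, window: int | None = None, threshold: float = 0.8):
    """Slide a window over the context and return the most similar span, if any."""
    window = window or len(value)
    best_span, best_score = None, 0.0
    for i in range(max(1, len(context) - window + 1)):
        span = context[i : i + window]
        score = SequenceMatcher(None, value.lower(), span.lower()).ratio()
        if score > best_score:
            best_span, best_score = span, score
    return (best_span, best_score) if best_score >= threshold else (None, best_score)

context = "Vitamin A retention was 72 per cent after sun drying."
print(best_fuzzy_span("72 percent", context))  # -> ('72 per cen', 0.9)
```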
The user study evaluation points to several directions for future work:
Enhance evaluation with diverse and larger user groups. In this study, we evaluated our system with 24 researchers who came from nutritional science and NLP-related backgrounds. Inviting more researchers from different disciplines would further enhance the evaluation of SciDaSynth.
Conduct a longitudinal study in real research scenarios. The user study was conducted with a set of predefined data extraction tasks and paper collections. In real research settings, however, researchers may be interested in different data dimensions and paper topics. A longitudinal study of how researchers use SciDaSynth in their own projects would help validate its benefits and more comprehensively identify its limitations.
Evaluate different prompting strategies and open-weight LLMs. In this paper, we utilize GPT-4 series models to process documents and user queries. The system could be further extended by integrating open-weight models, such as Llama and Qwen (a minimal backend sketch follows this list). Moreover, evaluating prompt variations and their impact on system performance would help develop and optimize prompting strategies for better LLM performance.
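One way to support such an extension is to route all generation calls through a small interface so that GPT-4 series models and open-weight models are interchangeable. The sketch below is illustrative; the class names, prompt, and model identifiers are assumptions rather than SciDaSynth's actual code.

```python
# A minimal sketch of an LLM backend abstraction; names and model ids are illustrative.
from typing import Protocol

class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...

class OpenAIGenerator:
    def __init__(self, model: str = "gpt-4o"):  # illustrative model name
        from openai import OpenAI
        self.client, self.model = OpenAI(), model

    def generate(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content

class HFGenerator:
    def __init__(self, model: str = "meta-llama/Llama-3.1-8B-Instruct"):  # illustrative
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model=model)

    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=256)[0]["generated_text"]

def extract(generator: Generator, question: str, chunk: str) -> str:
    # The same extraction call works regardless of which backend is plugged in.
    return generator.generate(f"Answer from the excerpt only.\nQuestion: {question}\nExcerpt: {chunk}")
```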
Conclusion
In this paper, we developed SciDaSynth, an interactive system that helps researchers efficiently and systematically extract and structure data from scientific literature that has already been screened for eligibility for a given research question. Particularly, we built an LLM-based RAG framework that automatically generates structured data tables according to users' data questions. The system then provides a suite of visualizations and interactions to guide data validation and refinement, featuring multi-level and multi-faceted data summarization and standardization support. Through a within-subjects study with 24 researchers from the nutrition and NLP domains, we demonstrated the system's effectiveness in data extraction via quantitative metrics and qualitative feedback. We further discussed design implications and limitations based on the system designs and evaluation.
Author Contributions
X.W. co-led ideation, technical implementation, and manuscript writing. He also co-designed the user study and contributed to data curation and results analysis. S.L.H. led data curation, co-designed the user study, and co-led ideation and manuscript writing. R.S. contributed to the technical implementation, user study, and results analysis. S.M. and F.W. co-led ideation, system design, and manuscript writing.
Acknowledgments
S.L.H. was supported by the NIH under award 5T32HD087137. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) or the National Institutes of Health.
Data Availability Statement
The authors have nothing to report.
References
Adobe Inc. n.d. Adobe PDF Services API. Accessed March 2024. https://developer.adobe.com/document-services/apis/pdf-extract/.
Anthropic. 2024. Claude. Accessed March 2024. https://claude.ai/chats.
August, T., L. L. Wang, J. Bragg, M. A. Hearst, A. Head, and K. Lo. 2023. “Paper Plain: Making Medical Research Papers Approachable to Healthcare Consumers With Natural Language Processing.” ACM Transactions on Computer‐Human Interaction 30, no. 5: 74. https://doi.org/10.1145/3589955.
Beltagy, I., K. Lo, and A. Cohan. 2019. “SciBERT: A Pretrained Language Model for Scientific Text.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 3615–3620. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1371.
Bohnet, B., V. Q. Tran, P. Verga, et al. 2023. Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models.
ChatDoc. n.d. Chatdoc. Accessed March 2024. https://chatdoc.com/.
ChatPDF. n.d. Chatpdf. Accessed March 2024. https://www.chatpdf.com/.
Chau, D. H., A. Kittur, J. I. Hong, and C. Faloutsos. 2011. “Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning.” In Proceedings of the 2011 CHI Conference on Human Factors in Computing Systems, 167–176. ACM.
Chen, X., C.‐S. Wu, L. Murakhovs'ka, et al. 2023. “Marvista: Exploring the Design of a Human‐AI Collaborative News Reading Tool.” ACM Transactions on Computer‐Human Interaction 30, no. 6: 92. https://doi.org/10.1145/3609331.
Clark, C., and S. Divvala. 2016. “Pdffigures 2.0: Mining Figures From Research Papers.” In Proceedings of the 16th ACM/IEEE‐CS on Joint Conference on Digital Libraries, 143–152. ACM.
Dagdelen, J., A. Dunn, S. Lee, et al. 2024. “Structured Information Extraction From Scientific Text With Large Language Models.” Nature Communications 15, no. 1: 1418. https://doi.org/10.1038/s41467-024-45563-x.
Dasigi, P., K. Lo, I. Beltagy, A. Cohan, N. A. Smith, and M. Gardner. 2021. “A Dataset of Information‐Seeking Questions and Answers Anchored in Research Papers.” In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4599–4610. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.365.
Elicit. 2023. Elicit: The AI Research Assistant.
Es, S., J. James, L. Espinosa Anke, and S. Schockaert. 2024. “RAGAs: Automated Evaluation of Retrieval Augmented Generation.” In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, edited by N. Aletras and O. De Clercq, 150–158. Association for Computational Linguistics.
Fok, R., J. C. Chang, T. August, A. X. Zhang, and D. S. Weld. 2024. Qlarify: Bridging Scholarly Abstracts and Papers With Recursively Expandable Summaries.
Fok, R., H. Kambhamettu, L. Soldaini, et al. 2023. “Scim: Intelligent Skimming Support for Scientific Papers.” In Proceedings of the 28th International Conference on Intelligent User Interfaces, 476–490. ACM. https://doi.org/10.1145/3581641.3584034.
Gao, L., Z. Dai, P. Pasupat, et al. 2023. “RARR: Researching and Revising What Language Models Say, Using Language Models.” In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 16477–16508. ACL. https://doi.org/10.18653/v1/2023.acl-long.910.
GROBID. 2008–2024. Grobid. https://github.com/kermitt2/grobid.
Head, A., K. Lo, D. Kang, et al. 2021. “Augmenting Scientific Papers With Just‐in‐Time, Position‐Sensitive Definitions of Terms and Symbols.” In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 18. ACM. https://doi.org/10.1145/3411764.3445648.
Huey, S. L., E. M. Konieczynski, N. H. Mehta, et al. 2023. “A Systematic Review of the Impacts of Post‐Harvest Handling on Provitamin A, Iron and Zinc Retention in Seven Biofortified Crops.” Nature Food 4, no. 11: 978–985. https://doi.org/10.1038/s43016-023-00874-y.
Jardim, P. S. J., C. J. Rose, H. M. Ames, J. F. M. Echavez, S. Van de Velde, and A. E. Muller. 2022. “Automating Risk of Bias Assessment in Systematic Reviews: A Real‐Time Mixed Methods Comparison of Human Researchers to a Machine Learning System.” BMC Medical Research Methodology 22, no. 1: 167.
Ji, Z., N. Lee, R. Frieske, et al. 2023. “Survey of Hallucination in Natural Language Generation.” ACM Computing Surveys 55, no. 12: 1–38.
Jin, Q., B. Dhingra, Z. Liu, W. Cohen, and X. Lu. 2019. “PubMedQA: A Dataset for Biomedical Research Question Answering.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2567–2577. ACL. https://doi.org/10.18653/v1/D19-1259.
Kang, H., J. C. Chang, Y. Kim, and A. Kittur. 2022. “Threddy: An Interactive System for Personalized Thread‐Based Exploration and Organization of Scientific Literature.” In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, 1–15. ACM.
Kang, H. B., T. Wu, J. C. Chang, and A. Kittur. 2023. “Synergi: A Mixed‐Initiative System for Scholarly Synthesis and Sensemaking.” In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 1–19. ACM.
Kang, T., Y. Sun, J. H. Kim, et al. 2023. “EvidenceMap: A Three‐Level Knowledge Representation for Medical Evidence Computation and Comprehension.” Journal of the American Medical Informatics Association 30, no. 6: 1022–1031.
Khullar, D., X. Wang, and F. Wang. 2024. “Large Language Models in Health Care: Charting a Path Toward Accurate, Explainable, and Secure AI.” Journal of General Internal Medicine 39: 1239–1241. https://doi.org/10.1007/s11606-024-08657-2.
Kim, D. H., E. Hoque, J. Kim, and M. Agrawala. 2018. “Facilitating Document Reading by Linking Text and Tables.” In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, UIST '18, 423–434. ACM.
Krithara, A., A. Nentidis, K. Bougiatiotis, and G. Paliouras. 2023. “Bioasq‐qa: A Manually Curated Corpus for Biomedical Question Answering.” Scientific Data 10, no. 1: 170. https://doi.org/10.1038/s41597-023-02068-4.
Lee, B., O. Savisaari, and A. Oulasvirta. 2016. “Spotlights: Attention‐Optimized Highlights for Skim Reading.” In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 5203–5214. ACM. https://doi.org/10.1145/2858036.2858299.
Lehman, E., J. DeYoung, R. Barzilay, and B. C. Wallace. 2019. “Inferring Which Medical Treatments Work From Reports of Clinical Trials.” arXiv preprint arXiv:1904.01606.
Lewis, P., M. Ott, J. Du, and V. Stoyanov. 2020. “Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State‐of‐the‐Art.” In Proceedings of the 3rd Clinical Natural Language Processing Workshop, 146–157. ACL. https://doi.org/10.18653/v1/2020.clinicalnlp-1.17.
Lewis, P., E. Perez, A. Piktus, et al. 2020. “Retrieval‐Augmented Generation for Knowledge‐Intensive Nlp Tasks.” Advances in Neural Information Processing Systems 33: 9459–9474.
Lo, K., Z. Shen, B. Newman, et al. 2023. “PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually‐Rich Scientific Documents.” In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 495–507. ACL. https://doi.org/10.18653/v1/2023.emnlp-demo.45.
Marshall, I. J., J. Kuiper, and B. C. Wallace. 2016. “Robotreviewer: Evaluation of a System for Automatically Assessing Bias in Clinical Trials.” Journal of the American Medical Informatics Association 23, no. 1: 193–201.
Menick, J., M. Trebacz, V. Mikulik, et al. 2022. Teaching Language Models to Support Answers With Verified Quotes.
Nye, B. E., A. Nenkova, I. J. Marshall, and B. C. Wallace. 2020. “Trialstreamer: Mapping and Browsing Medical Evidence in Real‐Time.” In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, Vol. 2020, 63.
OpenAI. 2024. ChatGPT. Accessed March 2024. https://chat.openai.com/.
OpenAI. 2024. GPT‐4 Technical Report.
Peng, Z., Y. Liu, H. Zhou, Z. Xu, and X. Ma. 2022. “Crebot: Exploring Interactive Question Prompts for Critical Paper Reading.” International Journal of Human‐Computer Studies 167: 102898. https://doi.org/10.1016/j.ijhcs.2022.102898.
Ponsard, A., F. Escalona, and T. Munzner. 2016. “Paperquest: A Visualization Tool to Support Literature Review.” In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, 2264–2271. ACM. https://doi.org/10.1145/2851581.2892334.
Rashkin, H., V. Nikolaev, M. Lamm, et al. 2023. “Measuring Attribution in Natural Language Generation Models.” Computational Linguistics 49, no. 4: 777–840. https://doi.org/10.1162/coli_a_00486.
Ruggeri, F., M. Mesgar, and I. Gurevych. 2023. “A Dataset of Argumentative Dialogues on Scientific Papers.” In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 7684–7699. ACL. https://doi.org/10.18653/v1/2023.acl-long.425.
Taylor, K. S., K. R. Mahtani, and J. K. Aronson. 2021. “Summarising Good Practice Guidelines for Data Extraction for Systematic Reviews and Meta‐Analysis.” BMJ Evidence‐Based Medicine 26, no. 3: 88–90. https://doi.org/10.1136/bmjebm-2020-111651.
ter Hoeve, M., R. Sim, E. Nouri, A. Fourney, M. de Rijke, and R. W. White. 2020. “Conversations With Documents: An Exploration of Document‐Centered Assistance.” In Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, 43–52. ACM.
Tkaczyk, D., P. Szostek, M. Fedoryszak, P. J. Dendek, and Ł. Bolikowski. 2015. “Cermine: Automatic Extraction of Structured Metadata From Scientific Literature.” International Journal on Document Analysis and Recognition 18: 317–335.
Vilares, D., and C. Gómez‐Rodríguez. 2019. “HEAD‐QA: A Healthcare Dataset for Complex Reasoning.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 960–966. ACL. https://doi.org/10.18653/v1/P19-1092.
Wang, X., F. Cheng, Y. Wang, et al. 2022. Interactive Data Analysis With Next‐Step Natural Language Query Recommendation.
Wang, X., R. Huang, Z. Jin, T. Fang, and H. Qu. 2024. “Commonsensevis: Visualizing and Understanding Commonsense Reasoning Capabilities of Natural Language Models.” IEEE Transactions on Visualization and Computer Graphics 30, no. 1: 273–283. https://doi.org/10.1109/TVCG.2023.3327153.
World Health Organization. 2014. WHO Handbook for Guideline Development. 2nd ed. World Health Organization.
World Health Organization. 2020. Breastfeeding and COVID‐19: Scientific Brief. Accessed May 30, 2025.
Wu, J.‐H., and S.‐C. Wang. 2005. “What Drives Mobile Commerce?: An Empirical Evaluation of the Revised Technology Acceptance Model.” Information & Management 42, no. 5: 719–729.
Yu, H., A. Gan, K. Zhang, S. Tong, Q. Liu, and Z. Liu. 2024. Evaluation of Retrieval‐Augmented Generation: A Survey.
Yue, X., B. Wang, Z. Chen, K. Zhang, Y. Su, and H. Sun. 2023. “Automatic Evaluation of Attribution by Large Language Models.” In Findings of the Association for Computational Linguistics: EMNLP 2023, 4615–4635. ACL. https://doi.org/10.18653/v1/2023.findings-emnlp.307.
Zhang, S., H. Xu, Y. Jia, et al. 2023. “Geodeepshovel: A Platform for Building Scientific Database From Geoscience Literature With AI Assistance.” Geoscience Data Journal 10, no. 4: 519–537. https://doi.org/10.1002/gdj3.186.
Zhao, J., S. Huang, and J. M. Cole. 2023. “Opticalbert and Opticaltable‐Sqa: Text‐ and Table‐Based Language Models for the Optical‐Materials Domain.” Journal of Chemical Information and Modeling 63, no. 7: 1961–1981.
Zhao, W. X., K. Zhou, J. Li, et al. 2023. A Survey of Large Language Models.