Abstract

The explosion of scientific literature has made the efficient and accurate extraction of structured data a critical component for advancing scientific knowledge and supporting evidence‐based decision‐making. However, existing tools often struggle to extract and structure multimodal, varied, and inconsistent information across documents into standardized formats. We introduce SciDaSynth, a novel interactive system powered by large language models that automatically generates structured data tables according to users' queries by integrating information from diverse sources, including text, tables, and figures. Furthermore, SciDaSynth supports efficient table data validation and refinement, featuring multi‐faceted visual summaries and semantic grouping capabilities to resolve cross‐document data inconsistencies. A within‐subjects study with nutrition and NLP researchers demonstrates SciDaSynth's effectiveness in producing high‐quality structured data more efficiently than baseline methods. We discuss design implications for human–AI collaborative systems supporting data extraction tasks.

Details

Title
SciDaSynth: Interactive Structured Data Extraction From Scientific Literature With Large Language Model
Author
Wang, Xingbo 1; Huey, Samantha L. 2; Sheng, Rui 3; Mehta, Saurabh 4; Wang, Fei 4

1 Present Address: Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI), Sunnyvale, California, USA; Weill Cornell Medicine, Cornell University, New York, New York, USA
2 Cornell Joan Klein Jacobs Center for Precision Nutrition and Health, Cornell University, Ithaca, New York, USA
3 Hong Kong University of Science and Technology, Hong Kong, Hong Kong
4 Weill Cornell Medicine, Cornell University, New York, New York, USA; Cornell Joan Klein Jacobs Center for Precision Nutrition and Health, Cornell University, Ithaca, New York, USA
Publication title
Volume
21
Issue
4
Number of pages
17
Publication year
2025
Publication date
Dec 1, 2025
Section
METHODS RESEARCH PAPERS
Publisher
John Wiley & Sons, Inc.
Place of publication
Oslo
Country of publication
United States
e-ISSN
18911803
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-11-03
Milestone dates
2024-12-20 (manuscript received); 2025-06-03 (manuscript revised); 2025-09-04 (manuscript accepted); 2025-11-03 (published online, final form)
First posting date
03 Nov 2025
ProQuest document ID
3268125478
Document URL
https://www.proquest.com/scholarly-journals/scidasynth-interactive-structured-data-extraction/docview/3268125478/se-2?accountid=208611
Copyright
© 2025. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2026-01-07
Database
3 databases
  • Coronavirus Research Database
  • Education Research Index
  • ProQuest One Academic