© 2025. This work is licensed under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Background: The increasing use of social media to share lived and living experiences of substance use presents a unique opportunity to obtain information on side effects, use patterns, and opinions on novel psychoactive substances. However, due to the large volume of data, obtaining useful insights through natural language processing technologies such as large language models is challenging.

Objective: This paper aims to develop a retrieval-augmented generation (RAG) architecture for medical question answering that addresses clinicians’ queries on emerging health-related issues, using user-generated medical information from social media.

Methods: We proposed a two-layer RAG framework for query-focused answer generation and evaluated a proof of concept for the framework in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. Our modular framework generates individual summaries followed by an aggregated summary to answer medical queries from large amounts of user-generated social media data in an efficient manner. We compared the performance of a quantized large language model (Nous-Hermes-2-7B-DPO), deployable in low-resource settings, with GPT-4. For this proof-of-concept study, we used user-generated data from Reddit to answer clinicians’ questions on the use of xylazine and ketamine.
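The two-layer flow described above (individual summaries first, then one aggregated summary) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word-overlap retriever, the prompt wording, and the names `retrieve`, `two_layer_answer`, and `Generate` are all hypothetical, and `generate` stands in for whichever model is used (for example, a quantized local model or GPT-4).

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt to generated text,
# e.g. a wrapper around a local quantized model or a hosted API.
Generate = Callable[[str], str]


def retrieve(query: str, posts: List[str], top_k: int = 3) -> List[str]:
    """Rank posts by naive word overlap with the query (a stand-in retriever)."""
    query_terms = set(query.lower().split())
    scored = [(len(query_terms & set(p.lower().split())), p) for p in posts]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Keep at most top_k posts, and only those sharing at least one query term.
    return [post for score, post in ranked[:top_k] if score > 0]


def two_layer_answer(
    query: str, posts: List[str], generate: Generate, top_k: int = 3
) -> str:
    """Layer 1: one query-focused summary per post. Layer 2: aggregate them."""
    retrieved = retrieve(query, posts, top_k)
    if not retrieved:
        return "No relevant posts found."

    # Layer 1: summarize each retrieved post on its own, so each prompt is short.
    individual = [
        generate(
            f"Question: {query}\nPost: {post}\n"
            "Summarize only what is relevant to the question."
        )
        for post in retrieved
    ]

    # Layer 2: combine the individual summaries into a single answer.
    joined = "\n".join(f"- {summary}" for summary in individual)
    return generate(
        f"Question: {query}\nSummaries:\n{joined}\n"
        "Combine these summaries into one answer."
    )
```

Under these assumptions, each layer-1 prompt contains a single post, which keeps the context short for a small model; only the layer-2 prompt sees all of the summaries together.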

Results: Across 20 queries with 76 samples, our framework achieved comparable median scores for relevance, length, hallucination, coverage, and coherence with GPT-4 and Nous-Hermes-2-7B-DPO. There was no statistically significant difference between GPT-4 and Nous-Hermes-2-7B-DPO for coverage (Mann-Whitney U=733.0; n1=37; n2=39; P=.89 two-tailed), coherence (U=670.0; n1=37; n2=39; P=.49 two-tailed), relevance (U=662.0; n1=37; n2=39; P=.15 two-tailed), length (U=672.0; n1=37; n2=39; P=.55 two-tailed), and hallucination (U=859.0; n1=37; n2=39; P=.01 two-tailed). A statistically significant difference was noted for the Coleman-Liau Index (U=307.5; n1=20; n2=16; P<.001 two-tailed).
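For readers unfamiliar with the Mann-Whitney U statistic reported above, the sketch below computes it from two groups of scores using average ranks for ties. It is an illustration only: the function names are hypothetical, the example scores are made up, and it does not compute P values. The two U values always sum to n1 × n2 (1443 for n1=37 and n2=39), so a U near half of that product (721.5) indicates largely overlapping distributions. Which of the two values is reported depends on the software used.

```python
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(
    group_a: Sequence[float], group_b: Sequence[float]
) -> Tuple[float, float]:
    """Return (U_a, U_b); the two values sum to len(group_a) * len(group_b)."""
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n1])
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    u_b = n1 * n2 - u_a
    return u_a, u_b


# Made-up rating scores for two hypothetical models (not data from the study).
scores_model_a = [3, 4, 5]
scores_model_b = [4, 5, 6]
u_a, u_b = mann_whitney_u(scores_model_a, scores_model_b)
```

A statistics library would normally be used in practice to obtain the P value as well; this sketch only shows how the U statistic itself arises from ranks.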

Conclusions: Our RAG framework can effectively answer medical questions about targeted topics and can be deployed in resource-constrained settings.

Details

Title
Two-Layer Retrieval-Augmented Generation Framework for Low-Resource Medical Question Answering Using Reddit Data: Proof-of-Concept Study
Author
Das, Sudeshna; Yao Ge; Guo, Yuting; Rajwal, Swati; Hairston, JaMor; Powell, Jeanne; Walker, Drew; Peddireddy, Snigdha; Lakamana, Sahithi; Bozkurt, Selen; Reyna, Matthew; Sameni, Reza; Xiao, Yunyu; Kim, Sangmi; Chandler, Rasheeta; Hernandez, Natalie; Mowery, Danielle; Wightman, Rachel; Love, Jennifer; Spadaro, Anthony; Perrone, Jeanmarie; Sarker, Abeed
First page
e66220
Section
Artificial Intelligence
Publication year
2025
Publication date
2025
Publisher
Gunther Eysenbach MD MPH, Associate Professor
e-ISSN
1438-8871
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3222367931