© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Featured Application

This study presents an improved and easily implemented method for automatically classifying smoking status from unstructured bilingual electronic health records.

Abstract

Smoking is an important variable in clinical research, but few studies have addressed automatically obtaining smoking classification from unstructured bilingual electronic health records (EHRs). We aim to develop an algorithm that classifies smoking status from unstructured EHRs using natural language processing (NLP). Using acronym replacement and the Python package Soynlp, we normalize 4711 bilingual clinical notes. Each note is classified into one of four categories: current smoker, past smoker, never smoker, or unknown. Subsequently, SPPMI (Shifted Positive Pointwise Mutual Information) is used to vectorize the words in the notes. By calculating the cosine similarity between these word vectors, keywords denoting the same smoking status are identified. Compared to other keyword extraction methods (word co-occurrence-, PMI-, and NPMI-based methods), our proposed approach improves keyword extraction precision by as much as 20.0%. These extracted keywords are then used to classify the four smoking statuses in our bilingual EHRs. Given an identical SVM classifier, the F1 score improves by as much as 1.8% compared to unigram and bigram Bag of Words features. Our study shows the potential of SPPMI for classifying smoking status from bilingual, unstructured EHRs, and our findings show how smoking information can be easily acquired for clinical practice and research.
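The SPPMI vectorization and cosine-similarity step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`sppmi_vectors`, `cosine`), the document-level co-occurrence counting, the shift value `shift_k`, and the toy English tokens are all assumptions made for illustration, since the abstract does not specify the context window, the shift, or the tokenization.

```python
import math
from collections import Counter
from itertools import combinations


def sppmi_vectors(docs, shift_k=1.0):
    """Build sparse SPPMI word vectors from tokenized notes.

    Assumption: two words co-occur if they appear in the same note.
    SPPMI(w, c) = max(PMI(w, c) - log(shift_k), 0); zero entries are not stored.
    """
    pair_counts = Counter()
    for tokens in docs:
        # Count each unordered pair of distinct words once per note, in both directions.
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1

    # Marginal counts are row sums of the symmetric co-occurrence table.
    word_counts = Counter()
    for (a, _), count in pair_counts.items():
        word_counts[a] += count
    total = sum(pair_counts.values())

    vectors = {}
    for (a, b), count in pair_counts.items():
        pmi = math.log(count * total / (word_counts[a] * word_counts[b]))
        value = pmi - math.log(shift_k)
        if value > 0:
            vectors.setdefault(a, {})[b] = value
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy notes; real input would be the normalized bilingual notes.
toy_notes = [
    ["smoker", "pack", "daily"],
    ["smoking", "pack", "daily"],
    ["quit", "years", "ago"],
    ["ex-smoker", "years", "ago"],
]
vectors = sppmi_vectors(toy_notes)
similarity = cosine(vectors["smoker"], vectors["smoking"])
```

In this toy corpus, "smoker" and "smoking" never appear in the same note but share the same contexts, so their SPPMI vectors are similar. That is the property the keyword extraction relies on: words denoting the same smoking status can be found through their neighbours even when they do not co-occur directly.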

Details

Title
Keyword Extraction Algorithm for Classifying Smoking Status from Unstructured Bilingual Electronic Health Records Based on Natural Language Processing
Author
Bae, Ye Seul 1; Kim, Kyung Hwan 2; Kim, Han Kyul 1; Choi, Sae Won 1; Ko, Taehoon 3; Seo, Hee Hwa 1; Lee, Hae-Young 4; Jeon, Hyojin 1

1 Office of Hospital Information, Seoul National University Hospital, Seoul 03080, Korea; [email protected] (Y.S.B.); [email protected] (H.K.K.); [email protected] (S.W.C.); [email protected] (H.H.S.); [email protected] (H.J.)
2 Department of Thoracic & Cardiovascular Surgery, Seoul National University Hospital, Seoul 03080, Korea; Department of Thoracic & Cardiovascular Surgery, College of Medicine, Seoul National University, Seoul 03080, Korea
3 Department of Medical Informatics, The Catholic University of Korea, Seoul 06591, Korea; [email protected]
4 Department of Internal Medicine, Seoul National University Hospital, Seoul 03080, Korea; [email protected]
First page
8812
Publication year
2021
Publication date
2021
Publisher
MDPI AG
e-ISSN
20763417
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2580963887