© 2015. This work is licensed under http://creativecommons.org/licenses/by/2.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Background: Electronic cigarettes (e-cigarettes) continue to be a growing topic among social media users, especially on Twitter. The ability to analyze conversations about e-cigarettes in real-time can provide important insight into trends in the public’s knowledge, attitudes, and beliefs surrounding e-cigarettes, and subsequently guide public health interventions.

Objective: Our aim was to establish a supervised machine learning algorithm to build predictive classification models that assess Twitter data for a range of factors related to e-cigarettes.

Methods: Manual content analysis was conducted for 17,098 tweets. These tweets were coded for five categories: e-cigarette relevance, sentiment, user description, genre, and theme. Machine learning classification models were then built for each of these five categories, and word groupings (n-grams) were used to define the feature space for each classifier.
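
As an illustration of how word groupings (n-grams) can define a feature space, the following minimal Python sketch counts the word n-grams in a single tweet. It is not the authors' implementation: the tokenization (lowercasing and whitespace splitting), the function names `word_ngrams` and `ngram_features`, and the example tweet are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams (as tuples) of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=1):
    """Count every word n-gram of size 1..max_n.

    The resulting counts correspond to one row of a bag-of-n-grams
    feature matrix (assumed representation, not confirmed by the abstract).
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


# Hypothetical example tweet, not taken from the study's data.
tweet = "trying my new vape pen today"
unigrams = ngram_features(tweet, max_n=1)       # single words only
with_bigrams = ngram_features(tweet, max_n=2)   # single words plus word pairs
```

With `max_n=1` each distinct word is one feature; raising `max_n` adds multi-word features such as `("vape", "pen")` at the cost of a larger, sparser feature space.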

Results: Predictive performance scores for classification models indicated that the models correctly labeled the tweets with the appropriate variables between 68.40% and 99.34% of the time, and the percentage of maximum possible improvement over a random baseline that was achieved by the classification models ranged from 41.59% to 80.62%. Classifiers with the highest performance scores that also achieved the highest percentage of the maximum possible improvement over a random baseline were Policy/Government (performance: 0.94; % improvement: 80.62%), Relevance (performance: 0.94; % improvement: 75.26%), Ad or Promotion (performance: 0.89; % improvement: 72.69%), and Marketing (performance: 0.91; % improvement: 72.56%). The most appropriate word-grouping size (the n in n-gram) was 1, that is, single words, for the majority of classifiers. Performance continued to increase marginally with the size of the manually annotated training dataset, but eventually leveled off. Even at dataset sizes as low as 4000 observations, performance characteristics were fairly sound.
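
The abstract does not state how the "percentage of maximum possible improvement over a random baseline" was computed. One plausible reading, shown below as an assumption rather than the authors' formula, is the share of the gap between the baseline score and a perfect score of 1.0 that the classifier closes. The function name and the baseline value in the example are hypothetical.

```python
def pct_of_max_improvement(performance, baseline):
    """Percentage of the gap between a baseline and a perfect score (1.0) that is closed.

    Assumed formula: 100 * (performance - baseline) / (1 - baseline).
    Both arguments are proportions on a 0-1 scale.
    """
    if not 0.0 <= baseline < 1.0:
        raise ValueError("baseline must be in [0, 1)")
    return 100.0 * (performance - baseline) / (1.0 - baseline)


# Hypothetical baseline of 0.70; the abstract does not report baseline values.
example = pct_of_max_improvement(performance=0.94, baseline=0.70)
```

Under this reading, a classifier that only matches the baseline scores 0%, and a perfect classifier scores 100%, which makes classifiers with different class balances easier to compare than raw performance alone.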

Conclusions: Social media outlets like Twitter can uncover real-time snapshots of personal sentiment, knowledge, attitudes, and behavior that are not as accessible, at this scale, through any other offline platform. Using the vast data available through social media presents an opportunity for social science and public health methodologies to utilize computational methodologies to enhance and extend research and practice. This study was successful in automating a complex five-category manual content analysis of e-cigarette-related content on Twitter using machine learning techniques. The study details machine learning model specifications that provided the best accuracy for data related to e-cigarettes, as well as a replicable methodology to allow extension of these methods to additional topics.

Details

Title
Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning
Author
Cole-Lewis, Heather; Varghese, Arun; Sanders, Amy; Schwarz, Mary; Pugatch, Jillian; Augustson, Erik
Section
Infodemiology and Infoveillance
Publication year
2015
Publication date
Aug 2015
Publisher
Gunther Eysenbach MD MPH, Associate Professor
e-ISSN
1438-8871
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2512834018