Purpose
The role of social media platforms in disseminating scientific information to the public during the COVID-19 pandemic highlighted the importance of science communication. Content creators in the field, as well as researchers who study the impact of scientific information online, are interested in how people react to these information resources and how they judge them. This study aims to devise a framework for extracting large social media datasets and identifying specific feedback on content delivery, enabling scientific content creators to gain insights into how the public perceives scientific information.
Design/methodology/approach
To collect public reactions to scientific information, the study focused on Twitter users who are doctors, researchers, science communicators or representatives of research institutes, and processed the replies they received during the first two years of the pandemic. The study aimed at developing a solution powered by topic modeling, enhanced by manual validation and other machine learning techniques such as word embeddings, that is capable of filtering massive social media datasets in search of documents related to reactions to scientific communication. The architecture developed in this paper can be replicated to find documents related to any niche topic in social media data. As a final step of our framework, we also fine-tuned a large language model to perform the classification task with even higher accuracy, forgoing the need for further human validation after the first step.
Findings
We provide a framework that receives a large document dataset and, with a small degree of human validation at different stages, filters out the documents within the corpus that are relevant to a strongly underrepresented niche theme, with much higher precision than traditional state-of-the-art machine learning algorithms. Performance was improved even further by fine-tuning a large language model based on BERT, which allows such a model to classify even larger unseen datasets in search of reactions to scientific communication without further manual validation or topic modeling.
Research limitations/implications
The challenges of scientific communication are heightened by the rampant increase of misinformation on social media and the difficulty of competing in its saturated attention economy. Our study aimed at creating a solution that scientific content creators can use to locate and understand constructive feedback on their content and how it is received, which can be hidden as a minor subject among hundreds of thousands of comments. By leveraging an ensemble of techniques ranging from heuristics to state-of-the-art machine learning algorithms, we created a framework that detects texts related to very niche subjects in very large datasets, with only a small number of example texts on the subject given as input.
Practical implications
With this tool, scientific content creators can sift through their social media following and quickly understand how to adapt their content to their users' needs and standards of content consumption.
Originality/value
This study aimed to find reactions to scientific communication on social media. We applied three methods with human intervention and compared their performance. This study shows, for the first time, the topics of interest discussed in Brazil during the COVID-19 pandemic.
1. Introduction
The COVID-19 pandemic allowed the observation of information-seeking behavior and the effects of the availability of diverse information resources on the population and on individuals. The acceptance of information might even have had an impact on individual health, as a study on Brazil suggests (Burni et al., 2023). Social media platforms have emerged as crucial channels for diverse scientific communication content, addressing the unique informational demands that arise in times of crisis. Much misinformation was observed online during the COVID-19 crisis (Langguth et al., 2023), and responses applying AI have been developed (Nakov et al., 2022).
Recipients of medical and other crucial information face a considerable challenge in discerning its reliability and trustworthiness (Barnwal et al., 2019). The state of knowledge in a society depends on the available sources, and during the COVID-19 crisis much distorted information was available (Campolino et al., 2022). The design of scientific communication artifacts is decisive for human information-seeking behavior during crises (Soroya et al., 2021). Media creators, in particular, must learn to discern the different criteria by which their content and communication are consumed and judged by their followers (Jaki, 2021). The design of multimodal scientific information can take many forms and many options exist. Further exploration of optimal methods for disseminating scientific information is still necessary (Rodríguez Estrada and Davis, 2015).
This study zeroes in on dissecting online discussions surrounding the COVID-19 crisis, with a specific emphasis on science communication. The primary objective is to create a solution that aids creators of social media channels in navigating a large volume of feedback on their content, helping them filter out constructive feedback regarding their method of communication and the design of their products. Our focus is directed toward identifying and analyzing a distinct subset of comments posted in science communication channels in response to the presented content. To the best of our knowledge, such an approach has not yet been carried out for science communication. For the purposes of this study, 1.12 million tweets were collected, constituting comments from a network comprised of Brazilian scientists, governmental bodies, doctors and scientific communicators. It is noteworthy that the majority of these comments are part of the broader discourse on the COVID-19 crisis, incorporating political viewpoints and general crisis-related commentary; this means that they are not relevant for our study. This is due to the discursive character of social media platforms such as X (formerly Twitter).
Initial efforts to filter these data involved keyword search; however, the content is too diverse for a search based on a few words to succeed. As a next step, traditional topic modeling algorithms such as latent Dirichlet allocation (LDA) (Blei et al., 2003) were applied. However, these methods revealed inadequacies in handling niche topics within short documents and large data collections (Cerqueira de Lima et al., 2023; Mandl et al., 2023). Following manual validation of the topic model, an ensemble method for document filtering was devised. This method involved constructing a word dictionary composed of the most relevant words in topics related to scientific communication and their top-n closest neighbors based on cosine similarity in word embeddings. Implementing a filtering heuristic employing this dictionary markedly improved our ability to filter documents associated with scientific communication (see Figure 1). With the filtered documents, we were able to craft a dataset large enough to train a classifier model that effectively detects whether a given tweet is relevant feedback to the content presented. This solution is also theme-agnostic and can be applied to a plethora of niche topics.
The remainder of the article is structured as follows. The next section reviews relevant literature. Section 3 is dedicated to the description of our methodology, including data collection and pre-processing. The following section presents the results in terms of the discovered topics and the classification accuracy of three diverse approaches. The paper closes with the conclusions.
2. Literature review
The following review of relevant literature first emphasizes issues related to science communication. In the following subsection, the technologies for analyzing topics in social media discourse are briefly analyzed.
2.1 Science communication during the COVID-19 pandemic
The communication about science is of particular interest in societies in which there are polarized opinions about the pandemic. Brazil was highly divided on many issues regarding the COVID-19 crisis (Pontalti Monari et al., 2020; Sampaio et al., 2023; Peci et al., 2023; Di Giulio et al., 2023) and scientific traditions also differ from other countries (Felipe Barreto de Souza Martins and Domahidi, 2023).
The analysis of popular science communication channels for the Brazilian market (e.g. governmental bodies such as the Butantã Institute or communicators such as the microbiologist Átila Iamarino) shows that they use similar communicative strategies but with different emphasis and in different combinations. Aspects which are relevant for the analysis are, among others, the communication of scientific insecurity (Gustafson and Rice, 2019), the degree of factuality, the potential use of emotions, the use of terminology (Jaki, 2018), the use of humor (Martin Neira et al., 2023) or the (self-)presentation of experts (Jaki, 2020). Communication on social media is multimodal in nature. Visual aspects including body language of actors play a big role. The use of hashtags, emojis and visual information in the form of, e.g. infographics has been analyzed in a mixed-method approach for the Brazilian social media market (Marx et al., 2023).
To gain further insights into how epidemiological information is communicated to non-expert audiences and how it is perceived, we intend to extract relevant comments by the online audience, which can be of interest for the study of the public communication of science. There is a lack of research on reactions from audiences. General datasets for social media communication on the COVID-19 crisis (Raza et al., 2022; Arbane et al., 2023) and even for the Brazilian market (De Melo and Figueiredo, 2020) are available. However, it poses a considerable challenge to filter this large volume of information for very specific interests. Social media information during the pandemic mainly revolves around political attitudes and general issues. The spread of misinformation is another issue in such datasets (Nakov et al., 2022).
The attitudes toward vaccines and the emergence of the Omicron variant, as they appear on social media, have been intensively studied in an international comparison (Catalan-Matamoros et al., 2023), which revealed that the Brazilian discourse was the only one mentioning a politician explicitly. Furthermore, the attitude in Portuguese tweets toward the AstraZeneca vaccine was mainly negative when measured with sentiment analysis.
Another approach to science communication has been adopted by Francisco et al., who analyzed popular memes and showed how they deal with the crisis and the polarization of society through irony and sarcasm (Francisco Junior et al., 2023). However, their approach is based on qualitative analysis.
This context underscores the need for comprehensive studies on the quality of science communication, and machine learning techniques can be used to enhance the capabilities of such research. Several studies on the social media discourse during the COVID-19 crisis have already applied topic modeling (Boon-Itt and Skunkan, 2020; De Melo and Figueiredo, 2021; Yin et al., 2022) as well as classification (Samuel et al., 2020; Kwok et al., 2021; Satu et al., 2021) but did not specifically discuss reactions to science communication by the public.
2.2 NLP for topic analysis in social media
Existing studies often adopt specialized perspectives, such as leveraging machine learning for sentiment analysis around certain poignant themes such as vaccination (Küçük and Arıcı, 2022). Similarly, LDA has been used to detect topics in the discourse on vaccines, and the documents were further processed with deep learning methods to determine the sentiment in a large collection (Zulfiker et al., 2022).
When dealing with short textual data, topic modeling techniques prove helpful in clustering similar documents and identifying common themes. A prior attempt at sifting through the theme of reactions to scientific communication, focusing solely on tweets reacting to science communication and using LDA modeling, yielded only a limited number of comments (Cerqueira de Lima et al., 2023). While applying LDA (Kalepalli et al., 2020) is effective in generating diverse topics, modern solutions like BERTopic and Top2Vec, based on transformer architectures, outperform LDA in topic coherence and topic accuracy (Egger and Yu, 2022; Vianna et al., 2023).
Much of this work is conducted for the English language (Ng et al., 2022), highlighting the necessity for research on other languages such as Portuguese.
Reliably representing niche topics is a strength of Guided BERTopic's strategy, which pushes topics together based on word similarity, producing satisfying results when seeking a specific theme within a corpus (Grootendorst, 2022).
The evaluation of topic models and the human interpretability of models have been discussed widely. Commonly used metrics like coherence may not accurately reflect actual human interpretability (Ramírez et al., 2012), making a comparison to human judgments questionable (Chang et al., 2009). Metrics based on word embeddings, such as cosine similarity, tend to offer more accurate measurements of interpretability (Doogan and Buntine, 2021).
Topic models can be improved and incorporated as one strategy within an ensemble: constructing word dictionaries to aid machine learning techniques has proven effective (Reveilhac and Morselli, 2022) and is present in several applications of BERTopic strategies. Employing this technique becomes especially helpful when dealing with word embeddings, as demonstrated by the success of combining the Word2Vec model with word lists or lexicons. Such combinations enhance the capability of word embeddings within a model, particularly when there is a need to filter specific information from a very small class within a large dataset (Hu et al., 2017; Koufakou and Scott, 2020; Jin et al., 2018).
State-of-the-art machine learning models for natural language processing (NLP), such as large language models (LLMs), can be fine-tuned toward specific tasks, leveraging their large number of parameters and ample training (Azhar and Khodra, 2020; Modha et al., 2022). Fine-tuning a model for classification generally involves feeding it labeled data with examples of the desired classification task or training it via prompt repetition. These models excel in several text-related tasks, such as classification, sentiment analysis, copy-editing and many others (Arora et al., 2023).
3. Methodology
This study went through an iterative process between techniques: beginning with a topic model based on LDA, we repurposed its results into a validated dataset that was used to train a Word2Vec model (Mikolov et al., 2013), as well as to nudge BERTopic into better generalizing our niche topic. After validating each strategy with a curated dataset, a BERT model was fine-tuned on our data and adapted to a classification task: classifying tweets by whether or not they provided relevant feedback on scientific communication.
In the first stage of our methodology, relevant science communication channels on the platform X were identified manually. This is a manual step, and results could differ between experts. Data from these sources, containing content created during the COVID-19 pandemic, were collected and underwent a sequence of textual data processing techniques, resulting in a final corpus prepared for NLP. Lastly, a topic model was created to organize the massive amount of text data into themes which can be understood by humans. The objective of our method was the identification of topics consisting of terms related to reactions to scientific communication. The effectiveness of the topic model was evaluated using the normalized pointwise mutual information (NPMI) metric.
The results of these topic models were then manually analyzed by our team and classified into categories pertaining to science communication and by whether they were relevant or not. Documents were classified as relevant if they provided feedback on the content's mode of presentation or expressed praise or distaste for certain aspects of the content's presentation.
The most relevant words of the topics with the largest concentration of relevant documents were organized into lists by category. From these words, a Word2Vec model was trained on a lemma database formed from the text corpus. Embeddings were created for the sorted word lists, and their top-5 nearest neighbors were added to the lists. These word lists were then compared to the corpus with the trained Word2Vec model, measuring their proximity by cosine similarity and by overall word count. Using this mixture of techniques, many more relevant documents were found than with simple topic modeling by LDA.
The results of the three techniques were compared by their precision, and the resulting datasets were used to fine-tune a multilingual BERT model to identify tweets as relevant to the theme of feedback on scientific communication or not. Unless mentioned otherwise, we used all tools, such as BERTopic and LDA, with standard settings. Overall, we observed that the influence of the settings and hyper-parameters is lower than that of manual interventions (e.g. word selection, guided topic modeling).
3.1 Data collection and processing
3.1.1 Data collection
We manually selected 46 sources for the Brazilian market based on their relevance to the COVID-19 discussion, mostly comprising doctors and research institutes [1]. Their relevance was measured by their follower count and the number of reactions (comments and likes) to their posts, complemented by hand-picking official sources (such as governmental bodies and health professionals with ties to the ministry of health). Additionally, we included popular science communicators and news aggregators during the pandemic. To collect the data, we made requests to the X API to retrieve tweets, retweets and replies from these sources between March 1st, 2020 and March 1st, 2022, resulting in 1.3 million tweets. The collected tweets were stored in nested JSON files that reflected the website's structure. The collection was limited to tweets which X had flagged with the Portuguese language tag. Our dataset represents the most popular and relevant scientific communicators in Brazilian social media during the crisis.
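For illustration, such a collection step could be sketched with the tweepy client for the X API v2 as below. This is a minimal sketch, not our exact collection script: the bearer token, source handle and query are placeholders, and full-archive search requires the (since discontinued) academic access level.

```python
import tweepy

# Placeholder credentials and source handle, for illustration only.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN", wait_on_rate_limit=True)

# Full-archive search for Portuguese-tagged replies to one source account
# within the study's collection window (March 2020 to March 2022).
pages = tweepy.Paginator(
    client.search_all_tweets,
    query="to:some_source_account lang:pt is:reply",
    start_time="2020-03-01T00:00:00Z",
    end_time="2022-03-01T00:00:00Z",
    tweet_fields=["created_at", "conversation_id", "author_id"],
    max_results=500,
)

tweets = [tweet for page in pages for tweet in (page.data or [])]
```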
Due to the massive amount of data collected and the limitations of the X API and of computing power, we conducted this step on a Google Cloud cluster with three virtual machines. This decision allowed us to maintain continuous usage of computing resources during the collection and nesting process and ensured fault tolerance.
3.1.2 Text processing
The first steps were to amass every tweet into a single pool and remove any tweets made by the original sources, leaving only replies. The text content of every tweet in this dataset was then stripped of URLs, special characters, emojis and mentions, leaving only the actual text of each tweet. The next steps were taken with the objective of reducing noise in the text data, starting with the removal of stopwords. We used the union of four stopword lists: a handmade list crafted for the study, the stopword list of the spaCy Portuguese news model (Honnibal and Montani, 2017), and those of the NLP Python library Gensim (Rehurek and Sojka, 2011) and of Wordcloud (Oesper et al., 2011).
After the deletion of stopwords, the text was tokenized using the Python library spaCy. A further filter was applied so that only words longer than three characters were considered valid tokens. These tokens were then lemmatized with spaCy, which uses both rule-based and lookup-based methods to reduce words to their lemmas. To further reduce overfitting and noise, words that appeared in fewer than two documents within the collection, as well as those that appeared in more than 99% of the documents, were left out.
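A minimal sketch of this preprocessing pipeline is given below, assuming spaCy's pt_core_news_sm model and the Gensim and Wordcloud stopword sets; the handmade stopword list is a placeholder. The document-frequency cutoffs (fewer than two documents, more than 99%) are applied later at the corpus level.

```python
import re
import spacy
from gensim.parsing.preprocessing import STOPWORDS as GENSIM_STOPWORDS
from wordcloud import STOPWORDS as WORDCLOUD_STOPWORDS

# Portuguese news pipeline; download first with:
#   python -m spacy download pt_core_news_sm
nlp = spacy.load("pt_core_news_sm")

HANDMADE = {"rt", "pq", "vc"}  # placeholder for the study's handmade list
STOPWORDS = (
    set(nlp.Defaults.stop_words) | set(GENSIM_STOPWORDS)
    | set(WORDCLOUD_STOPWORDS) | HANDMADE
)

def clean(text: str) -> list[str]:
    # Strip URLs, mentions, emojis and special characters.
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    text = re.sub(r"[^a-zA-ZÀ-ÿ\s]", " ", text)
    # Tokenize and lemmatize; keep non-stopword tokens longer than 3 characters.
    doc = nlp(text.lower())
    return [t.lemma_ for t in doc if len(t.text) > 3 and t.text not in STOPWORDS]
```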
3.2 Topic modeling
Topic modeling is a technology for analyzing the content of a large collection of text documents (Blei et al., 2003). For a human, this can give a very good overview of the content without reading many articles. The topics are presented as collections of words which characterize them. Since topic modeling works unsupervised, it requires no training data or assumptions about content words and can be applied to explore content without bias. Nevertheless, like most other unsupervised methods, topic modeling requires setting hyper-parameters and often the selection of the number of topics. As a consequence, even the results of an unsupervised method are influenced by humans.
3.2.1 Latent Dirichlet allocation (LDA)
LDA is a statistical method used for topic modeling which aims at identifying recurring patterns in a collection of documents (Kalepalli et al., 2020). In this study, the objective was to develop a topic model that would uncover topics related to reactions to scientific communication. There are several existing algorithms for topic modeling (Egger and Yu, 2022), and the one chosen in this study was LDA, which, given a fixed number of topics, estimates how much of each document is comprised of each topic, based on the probability distribution of each word belonging to a certain topic (Blei et al., 2003):

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \quad (1)$$

Equation (1) represents the three levels in LDA, with the variables α and β representing the document-topic density and the topic-word density, respectively. These parameters are set for the entire corpus, with θ being the topic distribution of a document, z a set of N topics and w a set of N key words.
Topic modeling requires the pre-selection of a number of topics. For this analysis, we applied the NPMI metric to find an optimal number of topics. NPMI evaluates a model by representing the top-n words of each topic as vectors in a semantic space; it calculates their probability of co-occurrence with each other and weighs these vectors by the NPMI of each term. With the NPMI metric, it is best practice to start with higher numbers of topics in order not to lose specific and narrow topics while still producing interpretable results. In this study, we utilized the Python Gensim (Rehurek and Sojka, 2011) libraries for LDA and its performance metrics, such as coherence and NPMI.
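A sketch of this model selection loop with Gensim is shown below; the range of candidate topic counts and the random seed are illustrative assumptions, and `docs` stands for the lemmatized token lists produced by preprocessing.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

dictionary = Dictionary(docs)
# Corpus-level frequency filter: drop words in fewer than 2 documents
# or in more than 99% of them.
dictionary.filter_extremes(no_below=2, no_above=0.99)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Train one LDA model per candidate topic count and score it with NPMI
# (Gensim's "c_npmi" coherence).
scores = {}
for k in range(5, 55, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)
    cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_npmi")
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)  # topic count with the highest NPMI
```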
3.2.2 Normalized pointwise mutual information
Models that perform topic modeling include several hyper-parameters which need to be set heuristically. One of them is the number of topics. It is important to choose a number of topics at which the results are still interpretable, but granular enough to avoid very general topics. Because this study aims at identifying a very specific and narrow topic, it is necessary to experiment with higher numbers of topics.
The NPMI metric is a useful way to evaluate the quality of a topic model because it takes into account the co-occurrence of words, which can reveal underlying patterns and relationships between topics. By representing the top-n words of each topic as vectors in a semantic space, it is possible to calculate their probability of co-occurrence and weight them by the NPMI of each term (Aletras and Stevenson, 2013):

$$\mathrm{NPMI}(w_i, w_j) = \frac{\mathrm{PMI}(w_i, w_j)}{-\log p(w_i, w_j)} \quad (2)$$

The PMI metric is used to calculate the probability of two words occurring together, considering the probability of each word occurring individually:

$$\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)} \quad (3)$$

By normalizing PMI with $-\log p(w_i, w_j)$, the NPMI metric reduces the impact of rare co-occurrences and increases the weight of more common ones.
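As a concrete illustration of equations (2) and (3), NPMI can be computed directly from document-level probabilities; the probabilities below are invented for the example.

```python
import math

def npmi(p_ij: float, p_i: float, p_j: float) -> float:
    """NPMI of two words from their joint and marginal probabilities."""
    pmi = math.log(p_ij / (p_i * p_j))  # equation (3)
    return pmi / (-math.log(p_ij))      # equation (2)

# Two words each occurring in 10% of documents, co-occurring in 5%:
print(npmi(0.05, 0.1, 0.1))  # ≈ 0.54, a strong positive association
```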
By experimenting with different numbers of topics and evaluating the resulting topic models using the NPMI metric, it is possible to find the optimal number of topics for the specific dataset and research question at hand. This approach can help researchers identify meaningful and interpretable topics that capture the underlying themes in the data.
3.2.3 BERTopic
In the last few years, transformer models for language modeling have led to the best results in many tasks (Modha et al., 2022). Deep learning models like word embeddings and transformers project language into a latent space such that similar words or sentences obtain similar embeddings. These embeddings are used for further processing. The use of latent representations as input for further processing steps has been suggested earlier (Mandl, 2000; Louwerse et al., 2006), but due to new pre-training methods, these systems now outperform many traditional NLP systems (Satapara et al., 2023; Ding et al., 2024). BERTopic has also been applied to discover emerging topics in scientific publications (Kim et al., 2024).
LDA operates with a lexical approach. In contrast, BERTopic provides a topic model that utilizes clustering techniques and weights based on term frequency and inverse document frequency (TF-IDF) (Jones, 2004) in order to obtain topics which take the semantic relationships between words into account (Grootendorst, 2022).
BERTopic is based on the successful BERT transformer model (Devlin et al., 2018) and utilizes its capacity for generating vector representations of words as well as sentences which represent the semantic content very well. BERTopic works by leveraging a pre-trained language model to create document embeddings, which go through dimensionality reduction and clustering through hierarchical density-based spatial clustering for applications with noise (HDBSCAN) (Campello et al., 2013). The most relevant words of each cluster are identified through a class-based variation of TF-IDF (Jones, 2004):

$$W_{x,c} = \mathrm{tf}_{x,c} \cdot \log\!\left(1 + \frac{A}{\mathrm{tf}_x}\right) \quad (4)$$

In the equation above, $\mathrm{tf}_{x,c}$ represents the frequency of word x in class c, $\mathrm{tf}_x$ its frequency across all classes, and A the average number of words per class. The resulting value represents the importance of a word in a cluster. This enables the model to generate topic-word distributions for each cluster.
It has been observed that BERTopic maintains the semantic properties of documents better than LDA (Egger and Yu, 2022). Furthermore, it is fairly more robust in use, allows more options for fine-tuning and is less dependent on preprocessing.
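A minimal usage sketch of BERTopic follows; the multilingual sentence embedding model named here is an assumption (any model covering Portuguese could be substituted), and `raw_texts` stands for the list of cleaned tweet strings.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# A multilingual embedder, since the corpus is Portuguese.
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

topic_model = BERTopic(embedding_model=embedder, language="multilingual")
topics, probs = topic_model.fit_transform(raw_texts)

# Inspect the largest topics and their keyword representations.
print(topic_model.get_topic_info().head(10))
```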
3.3 Natural language processing (NLP)
3.3.1 Word2Vec word embeddings
Word2Vec is a model architecture that computes continuous vector representations of words, achieving impressive results in word similarity tasks on very large datasets at a low computational cost (Mikolov et al., 2013). Its word embeddings are widely used to represent words as vectors.
It utilizes two-layer neural networks trained to reconstruct the linguistic context of words, outputting a vector space that represents an entire corpus, with each word assigned a different vector in this space. Word2Vec has two architectures: CBoW, which uses the surrounding words to predict the word located in the center of an n-gram, and Skip-gram, which uses the central word of an n-gram to predict the surrounding words.
Word2Vec has proven to achieve strong results in both intrinsic and extrinsic tasks, beating several other word embedding techniques (Schnabel et al., 2015), and forms the theoretical base of state-of-the-art word embeddings (Wang et al., 2019). For the purposes of this study, we used the Word2Vec implementation made available in Python by the NLP library Gensim (Rehurek and Sojka, 2011).
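Training such a model with Gensim can be sketched as follows; the hyper-parameters shown are illustrative, not necessarily those used in the study, and `docs` is again the lemmatized corpus.

```python
from gensim.models import Word2Vec

w2v = Word2Vec(
    sentences=docs,
    vector_size=300,  # dimensionality of the embedding space
    window=5,         # context window around the center word
    min_count=2,      # ignore very rare words
    sg=1,             # 1 = Skip-gram, 0 = CBoW
    workers=4,
)

# Five nearest neighbors of a word by cosine similarity.
w2v.wv.most_similar("vacina", topn=5)
```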
3.3.2 List based filtering
In this study, we experimented with a mixture of techniques that combined manual validation of topics and the properties of their most relevant words.
Once we manually identified topics whose top-30 words, ranked by relevance (Sievert and Shirley, 2014) and salience (Chuang et al., 2012), contained words relevant to the topic of feedback on scientific communication, we set up a sample of 1,500 tweets drawn from these topics.
The research team, composed of three people, with the help of field experts, manually classified these tweets into six categories, in relation to scientific communication:
- Questions, comments, corrections or suggestions about the content or current theme, typically directed toward the author
- Discussion about scientific communication between users
- Praise or criticism toward the content
- Praise or criticism toward the author
- Political commentary in relation to scientific communication
- General questions about the epidemic
The process of manual classification into these categories followed a consensus-based approach: disagreements were discussed within the annotator group, and a document was classified once the group reached a unanimous decision. After a batch of documents was finished, a field expert manually verified each document against the team's classification. The final decision on classification was up to the expert's discretion.
Since the objective of our study was to identify reactions to scientific communication, we considered only the first four categories as relevant, as these might prove useful to scientific content distributors.
After identifying the tweets, we selected the words from the top-50 words of the respective topics that were present in the relevant tweets and added them to a word list. Then, with the trained Word2Vec model, we generated embedding vectors for each of these words and selected from our corpus the five nearest words as measured by cosine similarity. This approach has shown promising results for evaluating the semantic similarity of word embeddings (Lahitani et al., 2016).
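The expansion of the word list can be sketched as below; the seed words are hypothetical stand-ins for the validated topic words.

```python
seed_words = ["didático", "explicação", "conteúdo", "thread"]  # placeholders

# Add each seed word's top-5 neighbors by cosine similarity in the
# trained Word2Vec space.
expanded = set(seed_words)
for word in seed_words:
    if word in w2v.wv:
        expanded.update(w for w, _ in w2v.wv.most_similar(word, topn=5))
```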
3.3.3 Fine-tuning pretrained large language models
LLMs have been one of the most popular applications of machine learning, with their general capacity for generalization and their ease of use being a cornerstone for democratizing access to machine learning models. They excel in tasks such as text generation, classification, question answering, summarization and conversation (Guo et al., 2023).
One interesting possibility in the utilization of LLMs is the capacity to use transfer learning (Zhuang et al., 2020) to specialize a model trained on a more general tool set toward a specific task (Sun et al., 2019). We selected a BERT model pre-trained for text classification on a multilingual database. No human labeling was involved in its pre-training; instead, the model was trained to predict masked words in sentences and to predict the next sentence in a text. Using this method, BERT manages to generalize over the representations of the 104 languages present in the training set (Devlin et al., 2018).
To fine-tune this model, a dataset was created from the results of the text filtering in our three topic modeling experiments, and the dataset was split into training and testing sets at a ratio of 0.7/0.3. The entire dataset comprised 20,000 tweets, of which 1,500 were manually classified as relevant. The dataset was labeled, tokenized and encoded by the model's encoder, and a custom classification head was added on top of the pre-trained BERT model. The model was trained for five epochs, in batches of 128 documents.
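A sketch of this fine-tuning step with Hugging Face's transformers library is shown below, with BertForSequenceClassification standing in for the custom classification head; the learning rate and other unstated hyper-parameters are assumptions, and `texts`/`labels` represent the 20,000 labeled tweets.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)

# 0.7/0.3 train/test split, as in the study.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

enc = tokenizer(train_texts, truncation=True, padding=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=128, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed learning rate
model.train()
for epoch in range(5):  # five epochs, batches of 128, as stated above
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
        out.loss.backward()
        optimizer.step()
```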
4. Results
4.1 LDA topic modeling
At the start of the experiments, several LDA models were executed and scored by the NPMI metric, searching for the number of topics that allowed for maximum coherence and interpretability, as seen in Figure 2.
In our experiments, we observed that using 20 topics (represented by their inter-topic distances in Figure 3) resulted in the highest NPMI value. Beyond that number, the metric steadily decreased. While fewer topics also yielded decent NPMI results, they failed to adequately capture distinct niches within the dataset, which was necessary for our specific research goal. Given that the topic of interest was not highly represented in the tweets, employing a larger number of topics allowed for more specific and targeted topics. This approach facilitated the filtering of tweets unrelated to the theme of reactions to scientific communication.
After careful manual validation of the most relevant and salient terms of the topics (shown in Figure 3), we decided to explore two specific topics, manually validating 750 tweets of each, sorted by their contribution to the topic. Although many topics overlap at the choice of 20 topics, the overlap was even higher when exploring other values. The data from X reveal that it is hard for topic modeling to separate all topics well. However, we are only interested in certain topics for further analysis, and as long as these are extracted well, the method is adequate.
This approach using LDA resulted in very few actually relevant documents within a sample of 1,500. While this was expected, given the propensity of the platform X to be used for news publishing and discussion rather than educational content, we believed that more relevant avenues for filtering niche documents could be found, either by combining embedding models with the results of our manual validation or by exploring state-of-the-art topic modeling.
4.2 List based semantic filtering based on word embeddings
As discussed, this technique employed the creation of a dictionary of words relevant to the topic of scientific discussion, enhanced with the aid of Word2Vec word embeddings. With this list of relevant words at hand, we decided on two metrics for filtering documents: the count of relevant words in each document, and the document's cosine similarity in relation to the entire list. We decided on cutoff values that would leave us with a sample of similar size to our LDA experiment: each document had to contain at least two relevant words and have an average cosine similarity larger than 0.6.
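One plausible implementation of this double cutoff is sketched below; averaging the token-to-list cosine similarities is our reading of the similarity criterion, not necessarily the study's exact formula.

```python
import numpy as np

def is_candidate(tokens: list[str], word_list: set[str], model) -> bool:
    """At least two list words and an average cosine similarity above 0.6."""
    if sum(1 for t in tokens if t in word_list) < 2:
        return False
    known_list = [w for w in word_list if w in model.wv]
    known_tokens = [t for t in tokens if t in model.wv]
    if not known_list or not known_tokens:
        return False
    sims = [model.wv.similarity(t, w) for t in known_tokens for w in known_list]
    return float(np.mean(sims)) > 0.6
```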
This sample was also manually validated with the same criteria as the previous experiments, leading to markedly better results: many more documents were classified as relevant, and these documents were also much more in line with the content we wanted to find. Many more comments were in fact discussing the quality and characteristics of the content presented, as well as providing questions, corrections and additions.
4.3 Guided BERTopic
This experiment's Guided BERTopic model was trained and fine-tuned with a custom KMeans model, a practice that, in our experiments with BERTopic, led to more varied and coherent topics and also helped reduce the number of themes splintered into several small topics. The model was trained with the list created in the previous experiment as its seed topic list, which nudged BERTopic toward creating more topics related to reactions to scientific communication.
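This setup can be sketched as follows; the seed word lists shown are placeholders for the lists built in the previous experiment, and the number of clusters matches the 347 topics reported below.

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans

# Placeholder seed topics; the real lists come from the validated experiment.
seed_topic_list = [
    ["didático", "explicação", "conteúdo"],
    ["parabéns", "obrigado", "trabalho"],
]

# Replacing HDBSCAN with KMeans fixes the number of clusters and reduces
# the splintering of one theme across many tiny topics.
cluster_model = KMeans(n_clusters=347, random_state=42)

topic_model = BERTopic(
    seed_topic_list=seed_topic_list,
    hdbscan_model=cluster_model,
    language="multilingual",
)
topics, _ = topic_model.fit_transform(raw_texts)
```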
This resulted in the documents being split into 347 topics of roughly similar size. Going by the relevant tweets classified in the previous two experiments, we searched for the topics that contained the largest number of those tweets as well as the best ratios of relevant to irrelevant documents. This led us to 10 topics; refer to Figure 4 for a visualization of their keywords. A sample of the 1,500 most relevant documents across these topics was taken to create a third dataset of the same size as the previous two, to be manually validated by our team and our field experts.
4.4 Comparison between approaches
With three same-sized datasets, we can then compare our different techniques in their capability of filtering a large dataset in search of documents of a niche theme (see Figure 5).
We can see that while the guided topic modeling approach led to much improvement compared to LDA, the ensemble of word embeddings and list-based semantic filtering found the largest number of relevant documents. When judged by their general relevance to the theme, the documents found by the semantic filtering were also, in general, more related to direct feedback toward the content creators.
Following the methodology of classification algorithms, we can assume that each sample of 1,500 documents was entirely classified as relevant. As such, we can measure the precision of each technique, as seen in Table 1. Table 2 shows the performance of the fine-tuned model in more detail.
4.5 Fine-tuned classifier model
Although the comparison in the previous experiment leads to interesting results in filtering a dataset in search of a niche theme, the effectiveness of text classification was significantly higher with our fine-tuned classifier.
It can be seen that the fine-tuned model has a much higher precision than the previous techniques on both metrics, despite the fact that it was trained on this task with a much smaller dataset. However, the training was much more time- and cost-intensive, a considerable setback that would make the use of this model computationally extremely expensive when applied to our entire dataset. As such, each technique can be seen as useful in a data pipeline: filtering techniques such as our list-based semantic filter can be employed to reduce a dataset's size considerably while highlighting the desired theme, and the model can then be employed on this smaller sample for higher classification precision.
5. Conclusion
This study aimed to find reactions to posts on scientific communication in social media. Such posts intend to communicate scientific findings mostly to non-experts. We applied three methods which included human intervention and compared their performance. Although deep learning based classification models most often perform very well for many tasks, the inclusion of a semantic filter in combination with a word embedding model delivered the best results in our study. The same techniques can also be applied when searching for any topic with few posts of heterogeneous vocabulary in collections of documents similar to tweets as far as size is concerned. This is especially necessary when such posts contain completely different vocabulary and cannot be identified via search technology. The methodology could readily be implemented in a tool for tracking the reactions to a single channel. Such a tool would select and segment reactions which discuss the design of the communication products; these comments would not be part of the frequent political and personal communication.
In this study, we put into practice the idea that even state-of-the-art models, such as BERTopic, need a certain degree of heuristic validation in order to generalize niche topics. The application of BERTopic to reactions to science communication is a contribution of our work. The most successful framework for filtering our large dataset required a certain amount of manual validation and heuristics, straying from the exclusive use of classical machine learning applications. Although the most successful approach relied heavily on LLMs in the final step of the methodology, it could not have been applied without the preceding steps. We assume that the power of current LLMs is still not superior to a methodology which includes human knowledge.
The framework defined in this paper can be applied to help scientific content creators locate useful feedback in very large datasets, and can help such creators best direct their content creation procedures in order to maximize knowledge transfer. With this, content creators are able to better understand how to broadcast their knowledge to different audiences and through different avenues.
Social media data collection is not a trivial and straightforward task. Scientists and content creators do not tend to post educational content on X, preferring to use the platform to share important news and foster discussion. This characteristic of the social network led to difficulties in finding feedback on scientific communication, and, coupled with the closing of academic access to the API of X in March 2023, has led to complications for future studies that might need to perform data collection steps similar to the ones presented here.
With a large dataset of thousands of tweets classified by their relevance to our theme, the final result of our study was the fine-tuned BERT classifier. This classifier can be seen as the end goal of our study: a tool for scientific content creators to better understand what their audience seeks in their content. Following the framework stated here, scientific communication can be optimized to reach larger audiences. Our processing pipeline also allows for the exploration and analysis of other domain-specific themes with much higher precision. For future work, it would be interesting to monitor single channels to observe the efficiency of the method. Furthermore, a study in a domain other than reactions to science communication would be interesting. In addition, as more NLP technologies become available, tests with newer systems could also lead to improvements.
This work was funded by the Volkswagen Foundation in Germany (Volkswagenstiftung) with the grant A133902 (Project Information Behavior and Media Discourse during the Corona Crisis: An interdisciplinary Analysis – InDisCo). Further financial support was provided by the Coordination for the Improvement of Higher Education Personnel (CAPES) from Brazil.
Notes
1. Examples are https://twitter.com/luizacaires3, https://twitter.com/ocienciaetal
These authors contributed equally to this work.
Figure 1
The complete architecture and methodology applied in this study
[Figure omitted. See PDF]
Figure 2
NPMI coherence analysis by number of topics
[Figure omitted. See PDF]
Figure 3
Inter-topic distance mapping between topics found with LDA
[Figure omitted. See PDF]
Figure 4
The ten topics chosen from the approach using guided BERTopic based on their relevance toward the theme
[Figure omitted. See PDF]
Figure 5
Comparison of the techniques regarding the amount of relevant tweets found by each one in a similarly-sized dataset
[Figure omitted. See PDF]
Table 1
Performance metrics for all techniques for a sample of 1,500 documents
| Technique | Relevant tweets | Irrelevant tweets | Precision |
|---|---|---|---|
| LDA | 86 | 1,414 | 0.06 |
| Guided BERTopic | 393 | 1,107 | 0.26 |
| Word2Vec and list filter | 702 | 798 | 0.47 |
Source(s): Table by authors
Table 2
Metrics of the fine-tuned classifier model for a sample of 20,000 documents
| Class | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|
| Relevant | 84 | 86 | 85 |
| Irrelevant | 61 | 56 | 58 |
Source(s): Table by authors
