Abstract
Well-established cognitive models coming from anthropology have shown that, due to the cognitive constraints that limit our “bandwidth” for social interactions, humans organize their social relations according to a regular structure. In this work, we postulate that similar regularities can be found in other cognitive processes, such as those involving language production. In order to investigate this claim, we analyse a dataset containing tweets of a heterogeneous group of Twitter users (regular users and professional writers). Leveraging a methodology similar to the one used to uncover the well-established social cognitive constraints, we find regularities at both the structural and semantic levels. In the former, we find that a concentric layered structure (which we call ego network of words, in analogy to the ego network of social relationships) very well captures how individuals organise the words they use. The size of the layers in this structure regularly grows (approximately 2-3 times with respect to the previous one) when moving outwards, and the second-last and third-last layers consistently account for approximately 60% and 30% of the used words, irrespective of the number of layers of the user. For the semantic analysis, each ring of each ego network is described by a semantic profile, which captures the topics associated with the words in the ring. We find that ring #1 has a special role in the model. It is semantically the most dissimilar and the most diverse among the rings. We also show that the topics that are important in the innermost ring also have the characteristic of being predominant in each of the other rings, as well as in the entire ego network. In this respect, ring #1 can be seen as the semantic fingerprint of the ego network of words.
Citation: Ollivier K, Boldrini C, Passarella A, Conti M (2022) Structural invariants and semantic fingerprints in the “ego network” of words. PLoS ONE 17(11): e0277182. https://doi.org/10.1371/journal.pone.0277182
Editor: Diego Raphael Amancio, University of Sao Paulo, BRAZIL
Received: February 1, 2022; Accepted: October 21, 2022; Published: November 22, 2022
Copyright: © 2022 Ollivier et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data files are available from the OSF database (osf.io/gmpaz).
Funding: This work was partially funded by the SoBigData++, HumaneAI-Net, and SAI projects. The SoBigData++ project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871042. The HumaneAI-Net project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 952026. The SAI project is supported by the CHIST-ERA grant CHIST-ERA-19-XAI-010, by MUR (grant No. not yet available), FWF (grant No. I 5205), EPSRC (grant No. EP/V055712/1), NCN (grant No. 2020/02/Y/ST6/00064), ETAg (grant No. SLTAT21096), BNSF (grant No. КП-06-ДОО2/5). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
In humans, language production is a deliberate and conscious action. However, it relies on many invisible mental processes that allow the construction of sentences in a very short time. For example, these cognitive processes are at play during the word retrieval stage, when the brain has to efficiently process, in a few milliseconds, its lexicon in order to find the right word, among thousands of others, that best fits the concept that needs to be expressed [1]. In order to achieve this impressive performance, cognitive strategies that exploit language properties, such as word frequency (e.g., the most frequently used words are retrieved more quickly [2, 3]), are activated. In this paper, we set out to find traces of these cognitive patterns in written production with a data-driven approach. To this end, we rely on the ego network model, which has already uncovered the cognitive limits of another human activity: socialisation.
1.1 The social ego network model
Anthropologists have shown that the number of meaningful social relationships that humans can maintain is not only limited to 150 [4] (the famous Dunbar’s number) but is also stable over time. The discovery of this regularity in human activity stems from the observation that, in different species of primates, there exists a correlation between the size of the neocortex (the part of the brain dedicated to high-level cognitive functions such as socialisation, language, etc.) and the average size of groups in natural environments. Extrapolating the expected size of a human group from the dimension of the human brain, as well as studying historical data such as the maximum size before fission of autonomous communities [5], the Dunbar number consistently emerges. It was then shown that these 150 active social relationships can be further subdivided into 4 concentric circles [6, 7], the innermost one containing the most intimate social relationships [8], the outermost one enclosing all 150 social relationships. The typical size of these concentric circles is 5, 15, 50, and 150, respectively, with a constant scaling ratio of about 3 between consecutive circles. Note that the portion of a circle not included in its innermost ones is referred to as a ring. This hierarchical structure of social relationships is called “ego network”. Recent studies based on data collected from online social networks have shown that online relationships are subject to the same laws as offline ones: the size of the ego network (i.e., the total number of social relationships) remains in the same order of magnitude as Dunbar’s number, which indicates that the cognitive constraint yielding this number is not overridden by a communication medium that facilitates social interactions [8–11].
In OSNs (Online Social Networks), the typical number of circles is slightly higher than 4, due to the presence of an additional circle in the center of the ego network (containing about 1.5 people), but the scaling ratio is preserved at around 3 (Fig 1).
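As a quick numerical illustration, the near-constant scaling ratio can be checked directly from the canonical circle sizes mentioned above (an illustrative sketch only, not part of any analysis pipeline):

```python
# Canonical social ego network circle sizes: the ~1.5-person innermost
# circle observed in OSNs, followed by Dunbar's circles of 5, 15, 50, 150.
circle_sizes = [1.5, 5, 15, 50, 150]

# Scaling ratio between each circle and the circle it contains.
ratios = [outer / inner for inner, outer in zip(circle_sizes, circle_sizes[1:])]
print(ratios)  # every ratio falls close to 3
```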
[Figure omitted. See PDF.]
The green dot symbolizes the ego and the black dots the alters with whom the ego maintains an active social relationship. A layer also contains the alters of the inner layers, unlike the rings.
1.2 From social ego networks to ego networks of words
The ego network model highlights the regularity of the structure of social relations, in real life and in OSN. In this paper, we adopt an analogous approach to investigate the regularities and invariants manifesting cognitive constraints in language production. Specifically, we conjecture that a similar structure, which we call “ego network of words”, may also be used to describe the way humans use words, and that this structure may provide very significant information to characterise the peculiarities of individuals, similarly to the social dimension. In fact, it is known [12] that many traits of social behavior (resource sharing, collaboration, diffusion of information) are chiefly determined by the structural properties of social ego networks.
The motivation for this analogy is twofold. First, the use of words is, much like socialisation, a process that involves the use of cognitive resources, thus we conjecture that the ego network model may have larger applicability in describing how humans allocate cognitive resources, for example to language. Second, language is a social activity, whose emergence is potentially linked to the surge in active relationships from the approximately 50 of our closest primate relatives to the 150 of humans. This theory, known as the social gossip theory of language evolution [13], postulates that language facilitates grooming social relations by reaching several peers at the same time. In addition, there is already well-established knowledge of a number of empirical cognitive limits affecting language, such as the bounded size of our vocabulary (which is consistently limited to approximately 42,000 words for a native 20-year-old English speaker [14]), as well as Zipf’s law of words [15], which states that the frequency of a word is inversely proportional to its position in the frequency table for most human writings. We, therefore, choose to study the individual distribution of vocabulary, by forming concentric circles of words according to their frequency of use by the ego in question. Then, going beyond words as units of language, we focus on the topics to which the words refer. We thus complement the structural analysis with a semantic study, which completes our cognitive analysis framework. In the same way that the social ego network model has been used to provide a different perspective to social network analysis (such as for information diffusion [16]), we want to leverage the ego networks of words as microscopes to discover novel properties of language production.
1.3 Contribution and key findings
The main contribution of this work is the structural and semantic analysis of the ego networks of words for Twitter users. By using the ego network model, in this paper, we uncover complex structures showing that the cognitive effort to organise one’s vocabulary is limited in many ways. We choose a corpus of text made up of tweets because it allows us to work with a varied sample of “authors” (e.g. more varied than a corpus of newspaper articles). Moreover, as Twitter is dedicated to the exchange of very short messages (280 characters), it is a medium that is very favourable to spontaneous reactions, with a more natural style and a reduced writing time. This time constraint is more likely to reveal human behaviour, in analogy with the social domain, where time limitations have been shown to significantly affect social cognitive constraints [13]. For our data-driven analysis, we collected tweets from generic as well as specialised Twitter users (Section 3). Using the ego-network-of-words model, we are able to find evidence of a structural regularity in the frequency of word usage by each individual (Section 4). The semantic analysis (Section 5) also establishes the existence of additional invariants, but most importantly it uncovers the nature of the innermost layer as the semantic fingerprint of the whole ego network, i.e., this layer groups together the most important topics on which the user is active. This strengthens the analogy with the social version of the ego network model, where the innermost layers include the most important social relationships of a person.
The key findings of the paper are the following.
* Similarly to the social case, we found that a regular concentric, layered structure (which we call ego network of words in analogy to the ego networks of the social domain) very well captures how an individual organizes their cognitive effort in language production. Specifically, words can be typically grouped in between 5 and 7 layers of decreasing usage frequency moving outwards, regardless of the specific class of users (regular vs professional).
* One structural invariant is observed for the size of the layers, which approximately doubles when moving from layer i to layer i + 1. The only exception is the innermost layer, which tends to be approximately five times smaller than the next one. This suggests that the innermost layer, the one containing the most used words, may be drastically different from the others.
* A second structural invariant emerges for the external layers. Users with more layers organise their innermost layers differently, without significantly modifying the size of the most external ones. In fact, while the size of all layers beyond the first one linearly increases with the most external layer size, the second-last and third-last layers consistently account for approximately 60% and 30% of the used words, irrespective of the number of layers of the user.
* The semantic analysis of the words contained in the ego networks confirms that layer #1 is exceptional in the ego networks of words: it generates proportionally more topics than the other rings, these topics are more diverse, and its overall semantic profile is the most different with respect to those of other rings.
* In addition, topics that are important in ring #1 tend to be important in other rings as well (we call this the pulling power of ring #1). Thus, layer #1, despite being the smallest, can be seen as the semantic fingerprint of the ego network of words.
* The topics that are primary in some rings tend to be stronger than average among the primary and non-primary topics in the semantic profile of the other rings. This shows that, while layer #1 provides a particularly strong signal about prevalence in the ego networks, weaker signals show a more complex structure of influence among topics “resident” in different layers of the ego network of words.
This paper extends our prior publication in [17], where the structural analysis was carried out. Specifically, in this paper, we also present an extensive semantic analysis of the ego network of words. This allows us to provide a much more comprehensive understanding of the model, and highlight ways to characterise specificities of individuals as they emerge from their use of words, in addition to structural invariants observed through the structural properties of the ego networks.
2 Related work
To the best of our knowledge, no work has been published yet on models of individual word organisation similar in spirit to ours (i.e., by exploring the analogy with the social ego network model). However, some work has already been done on individual word frequency distribution by extending the notion of Zipf’s law [18]. Based on Zipf’s law, some have tried to find a generative model, grounded in human cognition, that could explain such a regularity [19], or studied how the limited capacities of our memory naturally constrain our long-term use of words [20]. More generally, vocabulary size is often studied in the context of language learning for both children and adults, as well as to detect possible cognitive impairments [21]. For the semantic part, we have not identified any previous work on modelling user interests with a stratified approach, such as ours, that relies on the ego network of words. Most publications are about topic recommendations (relying upon a wide range of techniques, such as hashtag analysis [22], LDA [23] or ontology databases [24]), and about the emergence and monitoring of trending topics on Twitter [25, 26].
3 The dataset
The analysis is built upon four datasets extracted from Twitter, using the official Search and Streaming APIs (note that the number of downloadable tweets—at the time of download—was limited to the most recent 3200 tweets per user). Each of them is based on the tweets issued by users in four distinct groups:
1. Journalists Extracted from a Twitter list containing New York Times journalists (https://twitter.com/i/lists/54340435), created by the New York Times itself. It includes 678 accounts, whose timelines have been downloaded on February 16th, 2018.
2. Science writers Extracted from a Twitter list created by Jennifer Frazer (https://twitter.com/i/lists/52528869), a science writer at Scientific American. The group is composed of 497 accounts and has been downloaded on June 20th, 2018.
3. Random users #1 This group has been collected by sampling among the accounts that posted a tweet or a retweet in English with the hashtag #MondayMotivation (at the download time, on January 16th, 2020). This hashtag is chosen in order to obtain a diversified sample of users: it is broadly used and does not refer to a specific event or a political issue. This group contains 5183 accounts after bot filtering.
4. Random users #2 This group has been collected by sampling among the accounts that posted a tweet or a retweet in English, from the United Kingdom (we set up a filter based on the language and country), at download time on February 11th, 2020. This group contains 2733 accounts after bot removal.
These four groups are chosen to cover different types of users: the first two contain accounts that use language professionally (journalists and science writers) and the other two contain regular users, who are expected to be more colloquial and less controlled in the language they use. Since the random user accounts are not handpicked as in the first two groups, we need to make sure that they represent real humans. The probability that an account is a bot is calculated with the Botometer service [27], which implements a state-of-the-art bot detection algorithm. This probability that the account is not human, which is called “complete automation probability” (CAP), is not only based on linguistic features such as grammatical tags, or the number of words in a tweet, but also on language-agnostic features like the number of followers or the tweeting frequency [28]. There is no standard CAP threshold to easily separate bots from humans: it depends on the expected balance of precision and recall. That is why we discard accounts with a CAP higher than 0.5, which considerably limits the number of false negatives (undetected bots). The Botometer service achieves a performance of 0.95 AUC on standard bot detection datasets [27]. With this configuration, the algorithm flags 29% of accounts as bots in the dataset of Random users #1 and 23% in the dataset of Random users #2.
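The bot-filtering step can be sketched as a simple threshold rule. The dictionary-based interface below is a hypothetical stand-in for illustration; the actual Botometer service exposes a different API:

```python
def remove_probable_bots(cap_scores, threshold=0.5):
    """Discard accounts whose complete automation probability (CAP)
    exceeds the threshold; return the kept accounts and the share removed.

    cap_scores: dict mapping account id -> CAP in [0, 1] (hypothetical input).
    """
    kept = {acc for acc, cap in cap_scores.items() if cap <= threshold}
    removed_share = 1 - len(kept) / len(cap_scores)
    return kept, removed_share

# Toy usage with made-up CAP values.
kept, share = remove_probable_bots({"a": 0.1, "b": 0.7, "c": 0.4, "d": 0.9})
print(kept, share)
```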
In our analysis, we only consider the timelines of active Twitter accounts, i.e., users that tweet regularly. Since this preprocessing step largely follows the standard approach in the related literature [8, 29], further details are left to the S1 Appendix. Please note that we discard retweets with no associated comments, as they do not include any text written by the target user, and tweets written in a language other than English (since most of the NLP tools needed for our analysis are optimised for the English language).
3.1 Extracting user timelines with the same observation period
As discussed above, for each user in our datasets we retrieved the most recent 3200 tweets (due to the Twitter API limitation), which constitute the observed timeline of the user. The time period covered by these tweets varies according to the frequency with which the account is tweeting: for very active users, the last 3200 tweets will only cover a short time span. Since random users are generally more active, their observation period is shorter, and this may create a significant sampling bias. In fact, the length of the observation period affects the measured word usage frequencies (specifically, we cannot observe frequencies lower than the inverse of the observation period). In order to guarantee a fair comparison across user categories and to be able to compare users with different tweeting activities without introducing biases, we choose to work on timelines with the same duration, by restricting to an observation window T. To obtain timelines that have the same observation window T (in years), we delete all those with a duration shorter than T and remove tweets written more than T years ago from the remaining ones.
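The trimming procedure can be sketched as follows; the (timestamp, text) timeline representation is an assumption for illustration, not the paper's actual data structure:

```python
from datetime import datetime, timedelta

def trim_timelines(timelines, T_years=1, now=None):
    """Keep only timelines spanning at least T years, and within each
    surviving timeline drop tweets written more than T years ago.

    timelines: dict mapping user id -> list of (timestamp, text) tuples
    (a hypothetical structure, not the paper's code).
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=365 * T_years)
    trimmed = {}
    for user, tweets in timelines.items():
        oldest = min(ts for ts, _ in tweets)
        if oldest > cutoff:  # timeline covers less than T years: discard user
            continue
        trimmed[user] = [(ts, txt) for ts, txt in tweets if ts >= cutoff]
    return trimmed
```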
Increasing T reduces the number of users we can keep for our analysis (see Fig 2): for a T larger than 2 years, that number is halved, and for a T larger than 3 years, it falls below 500 for all datasets. On the contrary, the average number of tweets per timeline increases linearly with T (Fig 3). The choice of an observation window will then result from a trade-off between a high number of timelines per dataset and a large average number of tweets per timeline. To simplify the choice of T, we only select round numbers of years. We can read in Table 1 that, beyond 3 years, the number of users falls below 100 for some datasets. On the other hand, the number of tweets for T = 1 year remains acceptable (> 500). Since we value the diversity of users (in order to limit any bias in the selection of Twitter accounts) over the number of tweets available, we make the choice of T = 1 year for the entire paper. Results with other T lengths can be found in [17]. We note that random users have a higher frequency of tweeting than others. This difference tends to smooth out when the observation period is longer (Table 1). This can be explained by the fact that the timelines with the highest tweeting frequency are excluded in that case because their observation period is too small (which further supports the fact that a smaller T reduces the selection bias of users).
[Figure omitted. See PDF.]
Number of selected timelines depending on the observation window.
[Figure omitted. See PDF.]
Average number of tweets depending on the observation window. The Pearson linear correlation coefficient is equal to or greater than .98 for the four datasets.
[Figure omitted. See PDF.]
Number of users and tweeting frequency at different observation windows.
4 Structural analysis of the ego network of words
In this section, we focus on the analysis of structural properties of the ego network of words, highlighting structural invariants in language production. Note that, in the social domain, pure structural properties of ego networks were instrumental [12] in characterising many traits of social behavior (resource sharing, collaboration, diffusion of information). For this reason, we believe it is important to assess them in the language domain as well, before moving on (Section 5) to more complex and domain-specific analyses.
We first describe the methodology we use for our analysis in Section 4.1, then we discuss the results in Section 4.2. For ease of reading, the notation used in this section is summarised in Table 2. The section reports only the most significant results obtained by analysing the structural properties of the ego network. Interested readers are referred to [17] for additional results.
[Figure omitted. See PDF.]
4.1 Methods
For each user, acting as ego, we want to build their ego network of words. To this aim, we first extract individual words from the user’s tweets (Section 4.1.1), then we build the actual ego network from these words (Section 4.1.2).
4.1.1 Word extraction.
Since the analysis focuses on words and their frequency of use, we take advantage of NLP techniques for extracting them. As a first step, all the syntactic marks that are specific to communication in online social networks (mentions with @, hashtags with #, links, emojis) are discarded (see S1 Appendix for a summary). Once the remaining words are tokenized (i.e., identified as words), those that are used to articulate the sentence (e.g., “with”, “a”, “but”) are dropped. In linguistics, this type of word is called a functional word, as opposed to lexical words, which have a meaning independent of the context. These two categories involve different cognitive processes (syntactic for functional words and semantic for lexical words), different parts of the brain [30], and probably different neurological organizations [31]. We are more interested in lexical words because their frequency in written production depends on the author’s intentions, as opposed to functional word frequencies, which depend on language characteristics. Functional words may also depend on the style of an author (and due to this they are often used in stylometry). Still, whether their usage requires a significant cognitive effort is arguable, hence in this work, we opted for their removal. Moreover, lexical words represent the biggest part of the vocabulary. Functional words are generally called stop-words in the NLP domain, and we simply used an already existing list from the library spaCy [32] to remove them.
As this work will leverage word frequencies as a proxy for discovering cognitive properties, we need to group words derived from the same root (e.g. “work” and “worked”) in order to calculate their number of occurrences. This operation can be achieved with two methods: stemming and lemmatization. Stemming algorithms generally remove the last letters of a word using heuristics, whereas lemmatization uses a dictionary and a real morphological analysis of the word to find its normalized form. Stemming is faster, but it may make mistakes, either overstemming or understemming. For this reason, we choose to perform lemmatization with the help of the package WordNetLemmatizer from the library NLTK [33] (which leverages the lexical database WordNet). Once we have obtained the number of occurrences for each word base, we remove all those that appear only once, to leave out the majority of misspelled words. The S1 Appendix contains examples of the entire preprocessing pipeline.
In the remainder of the paper, when we talk about the “words” of a user, we refer to the set of words left after removing functional words and after lemmatization.
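The whole preprocessing pipeline can be sketched as below. The stop-word set and lemma map are tiny hand-made stand-ins for spaCy's stop-word list and NLTK's WordNetLemmatizer, used only to keep the sketch self-contained:

```python
from collections import Counter

# Tiny stand-ins for spaCy's stop-word list and NLTK's WordNetLemmatizer
# (hypothetical subsets, not the resources actually used in the paper).
STOP_WORDS = {"a", "and", "but", "i", "the", "to", "with"}
LEMMAS = {"worked": "work", "working": "work", "words": "word"}

def preprocess(tokens):
    """Drop functional (stop) words, lemmatize, count occurrences,
    and discard words that appear only once (mostly misspellings)."""
    lemmas = [LEMMAS.get(t.lower(), t.lower())
              for t in tokens if t.lower() not in STOP_WORDS]
    counts = Counter(lemmas)
    return {w: n for w, n in counts.items() if n > 1}

print(preprocess(["I", "worked", "with", "words", "but", "the", "word", "work"]))
# {'work': 2, 'word': 2}
```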
4.1.2 Building the ego network of words.
Let us focus on a user j. When studying the social cognitive constraints [28], the contact frequency between two people was taken as a proxy for their intimacy and, as a result, for their cognitive effort in nurturing the relationship. Similarly, the frequency fi at which user j uses word i is considered here as a proxy of their “relationship”. Frequency fi is given by fi = nij/T, where nij denotes the number of occurrences of word i in user j’s timeline, and T denotes the observation window of j’s timeline in years (T = 1y in our case, as discussed in Section 3.1). Using this frequency definition, we now investigate whether the words of a user can be grouped into homogeneous classes and whether different users feature a similar number and sizes of classes. To this aim, for each user, we leverage a clustering algorithm to group words with a similar frequency. The selected algorithm is Mean Shift [34], because, as opposed to Jenks [35] or k-means [36], it is able to automatically detect the optimal number of clusters. In order to account for the long-tailed nature of frequencies, a standard log-transformation is applied to the frequency values prior to the Mean Shift run.
Thus, for each user, we feed the user’s words to Mean Shift. The output of the clustering process is one value τ(e) for each ego network e, which describes the optimal number of classes (clusters) into which the word frequencies can be split. We rank each cluster by its position in the frequency distribution: cluster #1 is the one that contains the most frequent words, and the last cluster is the one that contains the least used words. Following the convention of the social ego network model discussed in Section 1, these clusters can be mapped into concentric layers (or circles), which provide a cumulative view of word usage. Specifically, layer i includes all clusters from the first to the i-th. Layers provide a convenient grouping of words used at least at a certain frequency. We refer to this layered structure as the ego network of words. Note that, since layers in ego networks are cumulative (i.e., they include all words used at least at a certain frequency), we will use the term “ring” to refer to their non-overlapping portion: for example, ring #2 contains all words that are in layer #2 but not in layer #1 (see Table 4 for the general formula). For the sake of example, let us focus on the second cluster identified by Mean Shift: cluster #2 corresponds to ring #2 in the ego network, and the union of ring #1 and ring #2 corresponds to the 2nd layer of the ego network. Another typical metric that is analysed in the context of social cognitive constraints is the scaling ratio ρi between layers i and i − 1, which, as discussed earlier, corresponds to the ratio between the size of consecutive layers (see Table 4 for its formula). The scaling ratio is an important measure of regularity, as it captures a relative pattern across layers, beyond the absolute values of their size. Taken together, the optimal number of layers τ(e), the layer sizes, and the scaling ratios fully characterise the ego network e.
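The clustering and layer construction described above can be sketched with scikit-learn's MeanShift implementation. This is a simplified sketch of the procedure with hypothetical input, not the authors' exact code:

```python
import numpy as np
from sklearn.cluster import MeanShift

def ego_network_of_words(freqs):
    """Cluster log-transformed word frequencies with Mean Shift and derive
    cumulative layer sizes and scaling ratios.

    freqs: dict mapping each word to its yearly usage frequency.
    Returns (rings, layer_sizes, scaling_ratios); rings[0] holds the most
    frequent words, mirroring cluster #1 / ring #1 in the text.
    """
    words = list(freqs)
    # Log-transform to account for the long-tailed frequency distribution.
    X = np.log(np.array([freqs[w] for w in words])).reshape(-1, 1)
    labels = MeanShift().fit(X).labels_
    # Rank clusters so that ring #1 contains the highest-frequency words.
    centers = {lab: X[labels == lab].mean() for lab in set(labels)}
    order = sorted(centers, key=centers.get, reverse=True)
    rings = [[w for w, lab in zip(words, labels) if lab == o] for o in order]
    layer_sizes = list(np.cumsum([len(r) for r in rings]))  # cumulative layers
    scaling = [layer_sizes[i] / layer_sizes[i - 1] for i in range(1, len(rings))]
    return rings, layer_sizes, scaling
```

Since the number of clusters is found automatically, τ(e) is simply `len(rings)` for each ego.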
4.2 Results
Here we study the ego networks of words in our four datasets, following the methodology described above.
The histograms of the obtained optimal number of layers τ are shown in Fig 4. It is interesting to note that, despite the heterogeneity of users (in terms of tweeting frequency), the distributions are always quite narrow, with peaks appearing consistently between 5 and 7 clusters. Similarly to the social constraints case, also for language production, we observe a fairly regular and consistent structure. This is the first important result of the paper, hinting at the existence of structural invariants in cognitive processes.
[Figure omitted. See PDF.]
The clusters are obtained by applying Mean Shift to log-transformed frequencies. The most frequent number of clusters is highlighted in red.
We now study the size of the layers identified in Fig 4. For the sake of statistical reliability, we only consider those users whose optimal number of layers (as identified by Mean Shift) corresponds to the most popular number of layers (red bars) in Fig 4. This allows us to have a sufficient number of samples in each class. Fig 5 shows the average layer sizes for every dataset. For a given number of clusters, we observe again a striking regularity across the datasets, meaning that each layer has approximately the same size regardless of the category of users.
[Figure omitted. See PDF.]
Each panel captures egos with a different optimal number of clusters. Error bars correspond to the 95% confidence intervals.
Fig 6 shows the scaling ratio of the layers in language production. We can observe the following general behavior: the scaling ratio starts with a high value between layers #1 and #2, but always gets closer to 2–3 as we move outwards. This empirical rule is valid whatever the dataset (and whatever the observation period [17]). This is another significant structural regularity, quite similar to the one found for social ego networks, as a further hint of cognitive constraints behind the way humans organise the words they use.
[Figure omitted. See PDF.]
Each panel captures egos with a different optimal number of clusters. Error bars correspond to the 95% confidence intervals.
In order to further investigate the structure of the word clusters, we compute the linear regression coefficients between the total number of unique words used by each user (corresponding to the size of the outermost layer) and the individual layer sizes. Due to space limits, in Table 3 we only report the exact coefficients for the journalists’ dataset (but analogous results are obtained for the other categories), and in Fig 7 we plot the linear regression for all the user categories. Note that the size of the most external cluster is basically the total number of words used by an individual in the observation window. It is thus interesting to see what happens when this number increases, i.e., whether users who use more words distribute them uniformly across the clusters, or not. Table 3 shows two interesting features. First, it shows another regularity, as the size of all layers linearly increases with the most external cluster size, with the exception of the first one (Fig 7). Moreover, it is quite interesting to observe that the second-last and third-last layers consistently account for approximately 60% and 30% of the used words, irrespective of the number of clusters. This indicates that users with more clusters split, at a finer granularity, words used at the highest frequencies, i.e., they organise their innermost clusters differently, without significantly modifying the size of the most external ones.
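The regression step can be reproduced in miniature on synthetic data; the numbers below are fabricated purely to illustrate the fitting procedure and do not reproduce the paper's measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical outermost-layer sizes (total unique words per user) ...
total_words = rng.integers(2000, 8000, size=50).astype(float)
# ... and a second-last layer synthesised at ~60% of the total, plus noise.
second_last = 0.6 * total_words + rng.normal(0.0, 50.0, size=50)

# Fit layer size as a linear function of the outermost layer size.
slope, intercept = np.polyfit(total_words, second_last, 1)
print(f"slope = {slope:.2f}")  # close to 0.6, mirroring the reported ~60% share
```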
[Figure omitted. See PDF.]
The x-axis corresponds to the total number of unique words used by each user (corresponding to the size of the outermost layer), the y-axis to the individual layer sizes.
[Figure omitted. See PDF.]
We report the linear regression coefficients obtained for the journalists dataset with T = 1 year.
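The linear relation between a user's total vocabulary and the individual layer sizes can be checked with an ordinary least-squares fit. The sketch below uses made-up layer sizes (not the paper's data) chosen so that the second-last layer holds roughly 60% of the vocabulary, as reported in Table 3:

```python
def linfit(x, y):
    """Ordinary least squares for y = a*x + b, in closed form."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return a, my - a * mx

# Hypothetical data: total vocabulary size vs size of the second-last layer.
totals = [1000, 2000, 3000, 4000]
second_last = [610, 1190, 1820, 2390]
slope, intercept = linfit(totals, second_last)   # slope close to 0.6
```

The fitted slope (~0.6) plays the role of the per-layer regression coefficient in Table 3.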
As a final comment on Fig 6, please note that the innermost layer tends to be approximately five times smaller than the next one. This suggests that this layer, containing the most used words, may be drastically different from the others (as also evident from Table 3). The characterization of this special layer will be the main focus of the next section.
4.3 Discussion
We summarise below the main results of the section.
* Individual distributions of word frequencies are divided into a consistent number of groups. Since word frequencies impact the cognitive processes underlying word learning and retrieval in the mental lexicon [37], these groups can be an indirect trace of these processes’ properties. The number of groups is only marginally affected by the class (specialized or generic) the users belong to.
* Structural invariants in terms of layer sizes and scaling ratio are observed, similarly to the well-known results from the social domain [8]. Specifically, we found that the size of the layers approximately doubles when moving from layer i to layer i + 1, with the only exception of the first layer.
* Users with more layers organise their innermost layers differently, without significantly modifying the size of the most external ones: the second-last and third-last layers consistently account for approximately 60% and 30% of the used words, irrespective of the number of clusters of the user.
5 Semantic analysis of the ego network of words
We have treated words as simple tokens so far. However, words have meanings and they can be linked to specific topics. In this section, we want to go beyond words and investigate which topics they refer to and how they are distributed in the different rings of the ego network. The analysis of this section revolves around the concept of semantic profile of a ring (in the ego network of words), which captures the topics associated with the words in the ring. Once semantic profiles are obtained, we are able to address the following high-level question: are all rings similar in the topics they contain, or does the ego network organize the topics in its rings in a specific way?
For the convenience of the reader, we summarise in Table 4 the notation used throughout the section.
[Figure omitted. See PDF.]
5.1 How to build semantic profiles
In this section, we describe how we carry out the semantic analysis of the ego network of words. First, in Section 5.1.1, we motivate our selection of the BERTopic framework for topic extraction. Then, in Section 5.1.2, we illustrate the steps for topic extraction. At the end of this process, each word occurrence in the ego network is associated with a specific topic. Accounting for the popularity of each topic in the rings of the ego network, in Section 5.1.3 we build the semantic profile of each ego network ring as the topic distribution of the words in that ring.
5.1.1 Preliminaries.
To calculate a semantic profile, we choose to consider the meaning of each word in its context, rather than using a semantic dictionary [38] (a dataset where each word is mapped to a semantic category), which would not be able to detect more complex topics and would miss some meanings of polysemous words. We acknowledge that a lot of effort has been put into ontologies as a way to understand the interests of users more precisely, specifically on Twitter. Ontologies map knowledge of specific domains: examples include Athena [24], a semantic web database extracted from a news portal that can be used for news recommendation purposes [39], and the BBC ontologies extracted from the BBC corpus of news, which allow politically-oriented topic mining [40]. However, even if their drawbacks (such as the rigidity of the knowledge model) can be partly fixed by coupling them with embedding-based models [41], we prefer having the maximum freedom in the topic identification process: we use a transformers-based model, BERT [42], which is the current state of the art in text embedding, and then apply an unsupervised method to detect topics.
5.1.2 Extraction of the topics.
In order to avoid issues with polysemous words, we must consider the ring of an ego network not merely as a set of single words, each associated with a frequency of use, but as a set of words with a given number of occurrences (from which the frequency is derived), each occurrence belonging to a user’s tweet. We aim to associate each word occurrence with a topic. We first classify (in an unsupervised way) the tweets by topic using the BERTopic framework [43]; then, each word occurrence in a tweet is assigned the same topic as the tweet itself (Fig 8).
[Figure omitted. See PDF.]
(1) The ego network’s rings organize a user’s vocabulary based on the frequencies of the words. (2) For a given word, its occurrences in the user timeline most likely come from different tweets. (3) The tweets are classified by topic thanks to the BERTopic framework. (4) Each word occurrence is assigned the very same topic as the tweet it belongs to. (5) If we consider a ring as a multiset of words (with repetitions), the semantic profile is the distribution of the topics among those words.
For the current analysis, we chose to focus only on ego networks with six rings, the case covering the most users. As described in the following, the BERTopic framework sequentially uses BERT [42] for tweet embedding, UMAP [44] for dimensionality reduction, and HDBSCAN [45] for clustering the tweet embeddings in a low-dimensional subspace.
5.1.2.1 Tweet embedding with BERT. BERT [42], which achieves state-of-the-art performance for natural language understanding, is used to assign to each tweet a point in the embedding space, which is meant to be a vector representation of its semantic meaning. BERT is a bidirectional transformer developed by Google, trained on the BookCorpus [46] and English Wikipedia. It therefore relies on all the linguistic knowledge learned from a very large corpus to perform this task. BERT yields embeddings with 768 dimensions.
5.1.2.2 Dimensionality reduction with UMAP. In order to mitigate the curse of dimensionality (to which clustering algorithms based on k-nearest neighbors are particularly sensitive [47]), we use the UMAP dimensionality-reduction algorithm (with settings n_neighbors=15, n_components=5, metric='cosine', and the python package umap v0.1.1) to reduce the embedding space down to five dimensions, as recommended in the BERTopic framework [43]. UMAP, like the t-SNE [48] algorithm, is able to capture latent non-linear dimensions, but in a more scalable way.
5.1.2.3 HDBSCAN for clustering topics. HDBSCAN [45], like DBSCAN, is able to find non-linear cluster structures from the density, as well as outliers (Fig 9). However, instead of deciding the contours of a cluster based on a fixed density threshold, HDBSCAN uses hierarchical clustering (single linkage) to find the most stable partition. Here we use HDBSCAN with the following settings: min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True, with the python package hdbscan v0.8.26. Thanks to the BERT embedding, the clusters of tweets we obtain are semantically homogeneous, and therefore represent the dominant topics of the dataset. Under these conditions, we can consider that a cluster corresponds to a topic.
[Figure omitted. See PDF.]
265 clusters are found (they are the same in both cases). In the first case, each point is classified as either belonging to a single cluster (colored points) or as an outlier (grey points), whereas in the second case each point is assigned a likelihood of belonging to each cluster (points take the color of the cluster they most likely belong to).
Table 5 shows the percentage of outliers detected by HDBSCAN, which corresponds to the percentage of tweets that cannot be associated with a specific topic. Since this percentage is quite high, even with the most conservative configurations (those yielding the fewest outliers), we also assess the cluster configuration (i.e., the topic assignment) induced by a soft clustering approach. Indeed, HDBSCAN allows two types of clustering: hard clustering, which classifies each tweet in one and only one cluster (or as an outlier), and soft clustering, which measures the proximity of a tweet to several different clusters. The advantage is that this proximity can be obtained even for outliers, which allows us to integrate them into the analysis. When used for soft clustering, HDBSCAN provides, for each point (tweet) $m$, a probability distribution $P_m$ such that $P_m(c)$ is the likelihood that this point belongs to the cluster (topic) $c$, with $c \in \mathcal{C}$ ($\mathcal{C}$ being the set of topics). Thus, with soft clustering, the tweet is not assigned a single topic but a probability distribution over all the topics. For clarity, in the case of hard clustering (where the tweet $m$ is directly assigned one topic $c_m$), we use the same notation $P_m$, where $P_m(c_m)$ is equal to 1 and $P_m(c)$ is zero otherwise. We will use these two configurations (hard clustering and soft clustering) to build two separate semantic profiles for each ego network ring. In S1 Appendix we discuss in detail why hard clustering is better suited for our analysis.
[Figure omitted. See PDF.]
Each topic corresponds to a cluster identified by HDBSCAN.
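Under hard clustering, the distribution $P_m$ is degenerate: all the probability mass sits on the single assigned topic, and outliers carry no mass at all. A minimal sketch of this convention (the topic ids are invented; the outlier label -1 is HDBSCAN's convention):

```python
def hard_to_distribution(label, topics):
    """Degenerate distribution used for hard clustering:
    P_m(c_m) = 1 for the assigned topic, 0 elsewhere.
    HDBSCAN marks outliers with the label -1; under hard
    clustering they carry no topic mass."""
    if label == -1:
        return {c: 0.0 for c in topics}
    return {c: (1.0 if c == label else 0.0) for c in topics}

topics = [0, 1, 2]
pm = hard_to_distribution(1, topics)          # tweet assigned to topic 1
pm_outlier = hard_to_distribution(-1, topics)  # outlier: no mass anywhere
```

Representing hard labels in the same dictionary form as soft distributions lets both configurations feed the same profile-building code downstream.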
5.1.2.4 Reduction of the number of topics. As shown in Table 5, the different datasets feature a different number of topics. In order to be able to compare the datasets, we reduced the number of topics down to the same target (this reduced set of topics, which is different for each dataset, will be denoted as $\mathcal{C}'$ from now on). Let us denote with $\mathcal{C}$ the full set of topics. Our goal is to merge topics together until we obtain the target number. To do so, the following operation is repeated: merge the smallest cluster (in the hard clustered configuration) with the cluster to which it is semantically the closest. This semantic similarity is calculated as follows: all the tweets are grouped into a single document per cluster, then a TF-IDF vector is calculated for each of them. The similarity between two topics is the cosine of their TF-IDF representations. When two topics $c_1$ and $c_2$ are merged into a new topic $c'$, the probability of the new topic is accordingly updated, for each tweet $m$, as $P_m(c') = P_m(c_1) + P_m(c_2)$. When merging the clusters step by step, the average similarity between them increases, as can be seen in Fig 10. In the case of journalists and science writers, we see that exceeding 100 topics no longer allows the emergence of topics that are radically different from the others, while 100 still enables an acceptable number of topics to be isolated. Thus, in order to be able to compare the results related to the different datasets, we have chosen to limit the number of topics to 100 for each of them. For the sake of comparison, the 100 topics obtained for the hard clustering configuration are also used for topic reduction in the soft clustering case. This operation allows us to narrow the semantic fields addressed in the same dataset down to one hundred topics, while provoking the fewest changes in the topic reassignment.
[Figure omitted. See PDF.]
The threshold of one hundred topics is marked with the dashed red line. This threshold is situated at the end of the bend for specialized datasets, and in the middle of the bend for both random datasets.
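The merging loop can be sketched as follows. For brevity, this toy version represents each topic by a raw term-count vector rather than the TF-IDF vector of the grouped per-topic documents, and the topic labels, vectors, and sizes are invented:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def reduce_topics(vectors, sizes, target):
    """Repeatedly merge the smallest topic into its most similar peer
    (cosine over the topics' term vectors) until `target` topics remain."""
    vectors = {c: dict(v) for c, v in vectors.items()}
    sizes = dict(sizes)
    while len(vectors) > target:
        smallest = min(sizes, key=sizes.get)
        other = max((c for c in vectors if c != smallest),
                    key=lambda c: cosine(vectors[smallest], vectors[c]))
        for term, cnt in vectors[smallest].items():  # merge term vectors
            vectors[other][term] = vectors[other].get(term, 0) + cnt
        sizes[other] += sizes.pop(smallest)
        del vectors[smallest]
    return vectors, sizes

vectors = {'A': {'cat': 3, 'pet': 2}, 'B': {'dog': 3, 'pet': 2},
           'C': {'stock': 5}}
sizes = {'A': 10, 'B': 3, 'C': 8}
merged_vecs, merged_sizes = reduce_topics(vectors, sizes, 2)
```

Here the smallest topic 'B' shares the term 'pet' with 'A' and nothing with 'C', so it is absorbed by 'A'.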
5.1.3 Extraction of the semantic profile.
We define the semantic profile of an ego network ring as the distribution of the topics to which the word occurrences contained in the ring belong (multiple occurrences of the same word may come from different contexts and thus refer to different topics). Note that this analysis is carried out at the ring level, and not at the circle level, because circles are concentric and cumulative: the semantic profiles of circles would by construction include overlapping topics, hence biasing the analysis (similarly to counting topics twice). After the preprocessing described in the previous section, each word occurrence is associated with a topic (or several, in the soft clustered case); thus, we can compute, for each ego network ring, a topic distribution based on the word occurrences it contains.
Let $W_{e,r}$ be the set of word occurrences contained in ring $r$ of the ego network $e$, and $m(w)$ the tweet the word occurrence $w$ belongs to. The probability of observing topic $c$ in ring $r$ of ego network $e$ is defined as follows:
$$P_{e,r}(c) = \frac{1}{O(e,r)} \sum_{w \in W_{e,r}} P_{m(w)}(c), \quad (1)$$
where $O(e,r) = |W_{e,r}|$ is the number of word occurrences in the ring. More in general, we denote with $P_{e,r}$ the semantic profile of ring $r$ in ego network $e$ (depicted in Fig 11). For this reason, we will also refer to $P_{e,r}(c)$ as the share of $c$ in the semantic profile of $r$. This unique semantic profile will be the starting point for all subsequent analyses in this section. In S1 Appendix, we provide four tables (one for each dataset) that detail, for every topic, the most characteristic words and the average share in the rings.
[Figure omitted. See PDF.]
Each ring is associated with a topic distribution. Note: two different semantic profiles can be built, depending on whether topics are assigned using hard or soft clustering. In S1 Appendix we show that the use of soft clustering (and thus the inclusion of outliers) does not improve the reliability of the analysis: it gives too much importance to noisy data, which favors the emergence of very generalized “super topics” that dominate all semantic profiles. We therefore present in Section 5.3 only the results obtained with hard clustering. In S1 Appendix we discuss soft versus hard clustering in detail and motivate why hard clustering is better suited for our analysis.
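Under hard clustering, Eq (1) reduces to averaging the one-hot tweet distributions of the ring's word occurrences. A minimal sketch (the topic names and the three-occurrence ring are invented for illustration):

```python
def semantic_profile(ring_occurrences, topics):
    """Semantic profile of a ring (Eq 1): the per-topic average of the
    topic distributions P_{m(w)} of the tweets that each word
    occurrence in the ring belongs to."""
    n = len(ring_occurrences)
    return {c: sum(pm[c] for pm in ring_occurrences) / n for c in topics}

# Hypothetical ring with three word occurrences, hard-clustered:
topics = ['politics', 'sport']
ring = [{'politics': 1.0, 'sport': 0.0},
        {'politics': 1.0, 'sport': 0.0},
        {'politics': 0.0, 'sport': 1.0}]
profile = semantic_profile(ring, topics)  # politics: 2/3, sport: 1/3
```

Since each input is a probability distribution, the resulting profile is one too, which is what the entropy and divergence metrics below rely on.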
5.2 Metrics for the analysis of semantic profiles
After following the steps described in Section 5.1, we end up with a semantic profile for each ring of an ego network. In the following we discuss (i) how to characterise individual semantic profiles (Section 5.2.1), (ii) how to compare semantic profiles (Section 5.2.2), and (iii) how to leverage semantic profiles to investigate the role of the most important topics (Section 5.2.3).
5.2.1 Characterization of the semantic profile.
Let us consider a ring $r$ of ego network $e$ for which we have extracted the semantic profile $P_{e,r}$ as discussed above. The semantic profile tells us how many distinct topics the words in ring $r$ touch upon. Formally, the number of topics associated with a given ring can be calculated as follows:
$$N(e,r) = \sum_{c \in \mathcal{C}'} \mathbb{1}\left[P_{e,r}(c) > 0\right], \quad (2)$$
where we denoted with $P_{e,r}(c)$ the probability of observing topic $c$ in the semantic profile of ring $r$, $\mathcal{C}'$ is the reduced set of topics, and $\mathbb{1}[\cdot]$ is the indicator function. Note, though, that $N(e,r)$ may offer only a partial perspective. In fact, rings have very different sizes (as discussed in Section 4), and it is expected to be much easier for larger rings (i.e., rings containing many words) to span a larger range of topics. For this reason, we will compare $N(e,r)$ with its normalised version:
$$\hat{N}(e,r) = \frac{N(e,r)}{O(e,r)}, \quad (3)$$
where we weigh the number of topics “generated” by the ring by the number of word occurrences contained in the ring (denoted with $O(e,r)$).
$N(e,r)$ and $\hat{N}(e,r)$ account for the mere presence of topics, regardless of their frequency of use. To capture the latter dimension, we next measure the entropy of $P_{e,r}$. Recalling that $P_{e,r}$ is in fact a probability distribution, its Shannon entropy reflects its diversity: the entropy (and diversity) is maximum if a ring contains all topics equally (i.e., with the same values of $P_{e,r}(c)$), while the entropy is minimum if a ring contains only one topic. So, the greater the entropy, the greater the diversity. Denoting with $H(e, r)$ the entropy of the ring $r$ in ego $e$, its definition is as follows:
$$H(e,r) = -\sum_{c \in \mathcal{C}'} P_{e,r}(c) \log P_{e,r}(c). \quad (4)$$
For the 100 topics we consider, the minimum entropy is 0 and the maximum entropy is about 4.60 (i.e., $\log 100$ with the natural logarithm).
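Eq (4) with the natural logarithm indeed gives about 4.60 for a uniform profile over 100 topics. A quick check:

```python
import math

def ring_entropy(profile):
    """Shannon entropy (natural log) of a ring's semantic profile (Eq 4).
    Zero-probability topics contribute nothing, by convention."""
    return -sum(p * math.log(p) for p in profile.values() if p > 0)

uniform = {c: 1 / 100 for c in range(100)}  # 100 equally likely topics
h_max = ring_entropy(uniform)               # = ln(100) ~ 4.605
h_min = ring_entropy({'only_topic': 1.0})   # single topic: entropy 0
```

The two extremes match the bounds quoted in the text: 0 for a one-topic ring, about 4.60 for a perfectly even spread over the 100 topics.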
In Section 5.3, the average of $N(e,r)$, $\hat{N}(e,r)$, and $H(e, r)$ across all ego networks will be presented, i.e., $N(r) = \frac{1}{|E|}\sum_{e \in E} N(e,r)$, with $E$ the set of ego networks (analogously for the others).
5.2.2 Comparing the semantic profiles of different rings.
Once we know which topics are covered by each ring of an ego network, the first step is to find out whether the semantic profile differs from one ring to another or, instead, if the distribution is homogeneous over the whole ego network. Since all semantic profiles are based on the same 100 topics, it is easy to obtain a distance measure to compare the rings with one another. Recalling that the semantic profile is a probability distribution, for this purpose we can use the Jensen-Shannon (JS) divergence [49], which allows us to calculate the proximity between the 100-topic distributions that we obtained previously. Then, the corresponding JS distance is conventionally obtained as the square root of the JS divergence [50]. The JS divergence is basically a symmetric version of the well-known Kullback-Leibler (KL) divergence, which is a standard metric for capturing the distance between probability distributions. For a tagged ego $e$, the KL divergence $D_{KL}$ between two semantic profiles $P_{e,i}$ and $P_{e,j}$ of rings $i$ and $j$ for ego network $e$ can be computed as follows:
$$D_{KL}(P_{e,i} \,\|\, P_{e,j}) = \sum_{c \in \mathcal{C}'} P_{e,i}(c) \log_2 \frac{P_{e,i}(c)}{P_{e,j}(c)}. \quad (5)$$
From $D_{KL}$, the JS divergence can be obtained as:
$$D_{JS}(P_{e,i} \,\|\, P_{e,j}) = \frac{1}{2} D_{KL}(P_{e,i} \,\|\, M) + \frac{1}{2} D_{KL}(P_{e,j} \,\|\, M), \quad (6)$$
with $M = \frac{1}{2}(P_{e,i} + P_{e,j})$. Then we go from divergence $D_{JS}$ to distance $\delta_{JS}$ by taking the square root: $\delta_{JS} = \sqrt{D_{JS}}$. Note that the JS distance is bounded as $0 \le \delta_{JS} \le 1$.
Once we have obtained a $\delta_{JS}$ for each ego network, we compute its average across all ego networks in a standard way, i.e., $\delta_{JS}(i,j) = \frac{1}{|E|}\sum_{e \in E} \delta_{JS}^{(e)}(i,j)$.
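Eqs (5) and (6) can be sketched directly. With base-2 logarithms the resulting distance lies in [0, 1], reaching 0 for identical profiles and 1 for profiles with no topic in common:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q), base-2 logarithm (Eq 5)."""
    return sum(p[c] * math.log2(p[c] / q[c]) for c in p if p[c] > 0)

def js_distance(p, q):
    """JS distance: square root of the JS divergence (Eq 6),
    where each profile is compared against the mixture M = (p + q) / 2."""
    m = {c: (p[c] + q[c]) / 2 for c in p}
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

same = {'a': 0.5, 'b': 0.5}
disjoint_p = {'a': 1.0, 'b': 0.0}
disjoint_q = {'a': 0.0, 'b': 1.0}
```

Comparing against the mixture M is what makes the JS divergence finite even when one profile assigns zero probability to a topic the other uses, which plain KL cannot handle.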
5.2.3 Capturing important topics and their cross-rings effects.
Given a semantic profile $P_{e,r}$, we can check whether some topics are more important than others, and, if this is the case, whether they play a special role in the ego network’s rings. We consider whether topics can be divided in two classes, i.e., “important” and “not-important” topics for each ring. To do so, we cluster the topics according to their presence in the specific ring under study, i.e., according to the values of $P_{e,r}(c)$, where $c \in \mathcal{C}'$. To this aim, we use the Jenks algorithm [51], which allows finding natural breaks in the frequency distribution (similarly to k-means, we have to specify $k$, the number of groups we want to obtain). We rely on the Silhouette score [52] to validate the clustering results. Since we just want to find one natural break that separates important topics from the others, we set $k = 2$. Topics are thus split into two groups, one with high-frequency use and the other with low-frequency use. The former is the set of important (or primary) topics, referred to as $I_{e,r}$ (where $e$ is the ego network and $r$ is the ring number), and the latter is the set of non-important topics, referred to as $\bar{I}_{e,r}$.
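In one dimension, a single natural break (Jenks with k = 2) can be emulated by scanning the split point of the sorted shares that minimises the summed within-class variance. A sketch with invented topic shares:

```python
def two_class_break(values):
    """One natural break (Jenks-style, k = 2): split the sorted values at
    the point minimising the summed within-class variance, and return the
    two classes (low shares, high shares)."""
    xs = sorted(values)

    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((x - m) ** 2 for x in seg)

    best = min(range(1, len(xs)), key=lambda i: sse(xs[:i]) + sse(xs[i:]))
    return xs[:best], xs[best:]

# Hypothetical topic shares for one ring: a few dominant, many marginal.
shares = [0.01, 0.02, 0.01, 0.02, 0.30, 0.25, 0.39]
low, high = two_class_break(shares)  # high = the "primary" topics
```

For one-dimensional data this is equivalent to 2-means with optimally placed centroids, which is why the break cleanly separates the handful of dominant topics from the long tail.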
Once we have obtained $I_{e,r}$ and $\bar{I}_{e,r}$ for all ego networks and for all rings, we can investigate whether primary topics in one ring play a special role in other rings as well. Let us focus on two rings $r_x$ and $r_y$. We define the coverage $C_{r_x}(r_y)$ of $r_x$’s primary topics in ring $r_y$ as their cumulative presence in $r_y$:
$$C_{r_x}(r_y) = \sum_{c \in I_{e,r_x}} P_{e,r_y}(c). \quad (7)$$
Then, to capture the average individual strength of $r_x$’s primary topics in $r_y$, we define a complementary metric (with an averaging factor $\frac{1}{|I_{e,r_x}|}$) as follows:
$$S_{r_x}(r_y) = \frac{1}{|I_{e,r_x}|} \sum_{c \in I_{e,r_x}} P_{e,r_y}(c). \quad (8)$$
Basically, $S_{r_x}(r_y)$ measures the average share of each of $r_x$’s primary topics in another ring of the same ego network. Similarly, we can compute $S_{\bar{r}_x}(r_y)$ by replacing $I_{e,r_x}$ with $\bar{I}_{e,r_x}$ in the above equation. This approach can be generalized to more complex cases. For example, we can study the strength of topics that are important in both $r_x$ and $r_y$ in the semantic profile of ring $r_y$. This would be equivalent to the following:
$$S_{r_x \cap r_y}(r_y) = \frac{1}{|I_{e,r_x} \cap I_{e,r_y}|} \sum_{c \in I_{e,r_x} \cap I_{e,r_y}} P_{e,r_y}(c). \quad (9)$$
Analogously, we can study the opposite effect, i.e., what is the strength, in the semantic profile of $r_y$, of topics that are important in $r_x$ but not in $r_y$. In this case, the formula will be the following:
$$S_{r_x \setminus r_y}(r_y) = \frac{1}{|I_{e,r_x} \cap \bar{I}_{e,r_y}|} \sum_{c \in I_{e,r_x} \cap \bar{I}_{e,r_y}} P_{e,r_y}(c). \quad (10)$$
All the above metrics capture the pulling power of ring $r_x$ on ring $r_y$.
Another interesting perspective is whether topics that are primary elsewhere tend to be more or less dominant than the average topic in $I_{e,r_y}$ or $\bar{I}_{e,r_y}$. This effect can be measured as follows:
$$\Delta(r_x, r_y) = S_{r_x \cap r_y}(r_y) - S_{r_y}(r_y), \quad (11)$$
where we basically compute the difference between the strength of topics that are primary in both $r_x$ and $r_y$ and the average strength of all primary topics in $r_y$. The complementary perspective is whether topics that are primary elsewhere tend to be more or less dominant than the average non-primary topic in $r_y$. To this aim, we leverage the following:
$$\bar{\Delta}(r_x, r_y) = S_{r_x \setminus r_y}(r_y) - S_{\bar{r}_y}(r_y), \quad (12)$$
which follows the same line of reasoning as $\Delta(r_x, r_y)$.
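The coverage and strength metrics of Eqs (7) and (8) reduce to a sum and an average over a ring's profile. A sketch with an invented primary-topic set and ring profile:

```python
def coverage(primary, profile):
    """Cumulative share (Eq 7) of another ring's primary topics
    in the given semantic profile."""
    return sum(profile[c] for c in primary)

def strength(primary, profile):
    """Average per-topic share (Eq 8): coverage divided by the
    number of primary topics."""
    return coverage(primary, profile) / len(primary)

# Hypothetical: primary topics of ring 1, semantic profile of ring 4.
primary_r1 = {'politics', 'sport'}
profile_r4 = {'politics': 0.30, 'sport': 0.20, 'music': 0.35, 'food': 0.15}
cov = coverage(primary_r1, profile_r4)   # 0.50: half of ring 4's mass
s = strength(primary_r1, profile_r4)     # 0.25 per primary topic
```

The intersection/difference variants (Eqs 9 and 10) follow by restricting `primary` to the relevant set before calling the same functions.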
5.3 Results
In this section, we study the semantic profiles in the ego networks of the Twitter users in our four datasets (Section 3).
5.3.1 Ring #1 is special in the ego networks of words.
We start our analysis by studying how topics are associated with the different rings. For each ego network $e$, we compute the number of topics per ring ($N(e,r)$, and its normalized version $\hat{N}(e,r)$) and their entropy $H(e, r)$. These metrics are then averaged across all egos, as described in Section 5.2.1, and 95% confidence intervals are shown.
In Fig 12(a), we can observe that the number of topics grows towards the external rings (from about 11 in ring #1 to over 16 in ring #6). However, not all rings contain the same number of word occurrences (Fig 12(b)): as seen previously in Section 5.1, each word occurrence contributes equally and independently to the calculation of the topic distribution. Therefore, a ring containing more word occurrences is more likely to contain more different topics. When we normalise by word occurrences ($\hat{N}(e,r)$), the maximum of the normalised topic count (Fig 12(c)) is observed in the first ring. Thus, ring #1 stands out as the ring that generates proportionally more topics than the other rings.
[Figure omitted. See PDF.]
Average number of topics (a), number of word occurrences (b), and normalised number of topics (c) in each ring of the ego network. For “null” ego networks, we report only the normalised number of topics (d).
In order to validate this hypothesis, we need to rule out that this result is a mere side effect induced by the structure of the ego networks, and confirm it is a tell-tale sign of how humans pick the words in their innermost ring. In other words, we want to test whether keeping the ego network structure unchanged but swapping the words in the rings would still yield the same result regarding ring #1. To this aim, we designed a null model where the ego network structure remains the same but the words are shuffled (more details in the grey box below). In Fig 12(d), we show the normalised topic count for the null model of ego networks. Since its maximum is obtained at a different ring than in the previous case, we can deduce that ring #1 is special not just as a side effect of the ego network structure but due to the nature of the words it contains. To further confirm this finding, note also that the number of topics per word occurrence is significantly lower for innermost rings in the null model with respect to the outermost rings, whereas the opposite is true for real ego networks. This is a second element that hints at the peculiar role of innermost rings in real-life ego networks of words.
To extend our study beyond the mere number of topics per ring, we now investigate the diversity in the way topics are distributed, leveraging the entropy of the semantic profiles defined in Section 5.2.1. This is a way of calculating the semantic diversity of the words that compose a ring (much as a metric like the average pairwise semantic distance would), but based on the semantic profile that we have previously calculated. Fig 13 (left) shows different levels of entropy depending on the ring: $H(r)$ grows towards the outer rings and is significantly lower in the innermost ring (for all datasets). This means that the outermost rings are, on average, semantically richer than the innermost ones. Then, we compare these results with those obtained from the null model (Fig 13, right), to find out whether the differences in entropy are related to the intrinsic structure of the ego network. We find that the entropy of the null model is the same as in the original model for all rings except ring #1, where the null model entropy is lower. This means that, even if words are organized in the ego network such that the diversity of topics grows toward the outermost rings, the diversity in ring #1 is higher than what we could expect if words were randomly assigned to rings, which is consistent with the previous findings of this section.
[Figure omitted. See PDF.]
Real-life ego networks (left) vs null model ego networks (right).
Building a null model of an ego network.
In order to show that the result is not only determined by the structure of the ego network (independently of the word organization inside), we chose to build “null”, artificial ego networks based on those already existing. Let $o(w_u, e)$ be the number of occurrences of the word $w_u$ in ego $e$, such that the number of word occurrences in a ring $r$ of a given ego $e$ is defined as:
$$O(e,r) = \sum_{w_u \in U_{e,r}} o(w_u, e), \quad (13)$$
$U_{e,r}$ being the set of unique words in ring $r$. For each ego network, all the words are shuffled (i.e., a new $U'_{e,r}$ is defined) and the word occurrences are artificially changed (new $o'$ and $O'$ are defined) such that the ring sizes and the number of occurrences are kept unchanged:
$$|U'_{e,r}| = |U_{e,r}|, \qquad O'(e,r) = O(e,r). \quad (14)$$
The shuffling process can be considered as a succession of random swaps of words in the ego network. Let us consider a word $w_x$ with $X$ occurrences in ring $r_x$, and another word $w_y$ with $Y$ occurrences in ring $r_y$. During the shuffling process, assume the two words are swapped. In the new ego network, the number of occurrences of $w_x$ is forcibly set to the original number of occurrences of $w_y$, and vice versa:
$$o'(w_x, e) = Y, \qquad o'(w_y, e) = X. \quad (15)$$
That way, we can preserve Eq (14). Words are shuffled along with their topic distribution in the original dataset. The topic distribution $P_{w_u}$ associated to a unique word $w_u$ is calculated based on its occurrences. Each word occurrence $w$ is associated with a topic $c_w$ such that $P_{m(w)}(c_w) = 1$. Hence, $P_{w_u}(c)$ simply corresponds to the fraction of the occurrences of $w_u$ that are associated to $c$:
$$P_{w_u}(c) = \frac{\left|\{w : w \text{ is an occurrence of } w_u,\ c_w = c\}\right|}{o(w_u, e)}. \quad (16)$$
Then the new topic distribution of a given ring $r$ is the weighted average of the topic distributions of the unique words that compose that ring after shuffling:
$$P'_{e,r}(c) = \frac{1}{O(e,r)} \sum_{w_u \in U'_{e,r}} o'(w_u, e)\, P_{w_u}(c). \quad (17)$$
The full process is summarized with a toy example in Fig 14.
[Figure omitted. See PDF.]
The ring sizes and word occurrences are kept, the words are shuffled. In this toy example: O(e, r2) = 3 + 2, o(virus, e) = 5, o′(virus, e) = 1.
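The shuffling step can be sketched as follows: the per-ring occurrence slots stay fixed and only the words (carrying their topic distributions) are permuted over them, which preserves Eq (14) by construction. Ring ids and words mirror the toy example of Fig 14:

```python
import random

def shuffle_ego_network(rings):
    """Null model: permute the words of an ego network across its rings
    while keeping each ring's size and occurrence counts unchanged.
    `rings` maps a ring id to a list of (word, n_occurrences) pairs;
    the occurrence slots stay in place, only the words move."""
    words = [w for pairs in rings.values() for w, _ in pairs]
    random.shuffle(words)
    it = iter(words)
    return {r: [(next(it), n) for _, n in pairs]
            for r, pairs in rings.items()}

rings = {1: [('virus', 5)], 2: [('vote', 3), ('mask', 2)]}
null = shuffle_ego_network(rings)
```

Whatever the random permutation, the null network keeps one word with 5 occurrences in ring 1 and two words with 3 and 2 occurrences in ring 2, exactly the invariants of Eq (14).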
We now carry out a pairwise comparison of the semantic profiles of rings, using the JS distance described in Section 5.2.2; we plot the average JS distances in Fig 15. As one can expect, the diagonal is filled with zeros, since the distance is calculated between two identical semantic profiles, and the upper triangle mirrors the lower triangle, since the distance is symmetric. All datasets exhibit the same features:
* The first row and column always contain the highest values. This means that ring #1 (i.e., the innermost ring) is always the most distant from the other rings. In other words, ring #1 is the most characteristic ring.
* The lowest values are always the distances between rings #5 and #6. Thus, the pairs of most similar rings are always among the outermost ones.
* For any row or column, the lowest value always neighbours the diagonal: given a ring x, the least distant ring is always the previous ring x−1 or the following one x+1. This means that two adjacent rings are more likely to be similar.
[Figure omitted. See PDF.]
Average JS distance between the rings.
The first observation is very important because it shows that the topic distribution associated with the most used words of a Twitter user (those in the innermost ring) is different from that associated with the least used words. This makes ring #1 unique in two ways: it generates proportionally more topics than the other rings (Fig 12(c)), and its distribution is the furthest away from the others (Fig 15). This hints at a significantly higher “semantic generative role” of inner rings as opposed to outer ones: each word occurring in an inner ring is able to “generate” more topics on which the user engages. And these topics, on which the user focuses most (inner rings feature a higher frequency of use of words), generate a distribution that is quite distinct from the one at the outermost rings, on which the user engages far less.
Take home message for Section 5.3.1: Ring #1 is special in the ego network of words: it generates proportionally more topics than the other rings, its topic diversity is proportionally higher than expected, and its semantic profile is the most different with respect to the other rings. This suggests that ring #1 may be the semantic fingerprint of the ego network of words.
5.3.2 The role of primary topics from ring #1.
In the previous section, we discovered that ring #1 is special. It, therefore, makes sense to investigate which topics are most important in this ring and if they tend to be equally important in the other rings. This will allow the reader to familiarize themselves with the methodology as well, before generalizing the analysis to other rings in Section 5.3.3.
We measure the overall importance of $r_1$’s primary topics in another ring $r_y$ by computing their coverage (Eq (7) in Section 5.2.3), varying $r_y$ from the innermost to the outermost layer. Fig 16 shows the coverage of $r_1$’s primary topics in the other rings, across all the ego networks; it corresponds to the blue bars in the figure. These topics account for approximately 50% of each ring and of the whole ego network (last bar). This small (5–6, on average) set of topics, which fills almost the entire innermost ring, thus plays a big role in the entire ego network as well.
[Figure omitted. See PDF.]
Each bar stands for the semantic profile of each ring (and overall ego network, in the last bar), where the blue part represents the share covered by the most important topics of ring #1 (their average number is written in white).
To verify whether the reverse statement is true (i.e., if topics that are important in the whole ego network are also important in ring #1), we build a new set of topics $U_e$ grouping the most important topics in the whole ego network, and calculate their coverage in each ring. Fig 17 highlights the coverage of those topics across the rings. Although, in general, all primary topics at the level of the ego network are well represented in all rings, we observe a slight predominance in ring #1, as the innermost ring contains the biggest share of the most important topics of the ego network. This means that topics that are important to the ego network are over-represented in the innermost ring, i.e., an important topic discussed by a Twitter user is very likely to belong to the primary topics of ring #1.
[Figure omitted. See PDF.]
The blue part of the stacked bar represents the share covered by the important topics in Ue. The average number of topics |Ue| is specified in white.
Take home message for Section 5.3.2: Both results from Figs 16 and 17 indicate a close relation between important topics in ring #1 and those important for the whole ego network. This observation is all the more interesting as ring #1 is semantically the most different from all the others (Section 5.3.1), confirming the special role of this ring in the ego network of words.
5.3.3 Pulling power of primary topics.
Let us now focus on the primary topics in a generic ring $r_x$. They can also appear in another ring $r_y$, where they can be found among either the primary or the non-primary topics of $r_y$. In the first case, the topics are primary in both rings; in the latter, they are primary only in $r_x$. We now tackle the following problem: which is the ring whose primary topics are most dominant among the primary topics of another ring? This involves measuring the strength, in the semantic profile of $r_y$, of the topics that are important for both $r_y$ and $r_x$, for all possible pairs $r_x$, $r_y$ (using the notation of Section 5.2.3, this is the metric in Eq (9)). We show the resulting values on the left side of Table 6. The diagonal is left blank for the sake of clarity (we are interested in the results when $r_x \ne r_y$). For a given $r_y$, the largest value is written in bold. We can clearly observe that the primary topics that are also primary in $r_1$ almost always have the largest share in the semantic profiles of the rings. Beyond the fact that the set of important topics in ring #1 is, as a whole, also important in the other rings (Section 5.3.2), the table shows that these topics are on average the most likely to be important in all the other rings.
[Figure omitted. See PDF.]
On the left, the measured values for all rx, ry pairs in our datasets; on the right, the corresponding values for non-primary topics. In bold, the highest value per column, corresponding to the rx whose pulling power on ry is highest.
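A minimal sketch of this pairwise measure (the toy profiles and the names `primary` and `dominance` are ours, not the paper's notation): for each ordered pair (rx, ry), we sum, in ry's semantic profile, the strength of the topics that are primary in both rings.

```python
def primary(profile, k=2):
    """The k strongest topics of a profile (toy definition of 'primary')."""
    return set(sorted(profile, key=profile.get, reverse=True)[:k])

# Toy semantic profiles (topic -> normalized strength), ring 1 innermost.
profiles = {
    1: {"sports": 0.6, "music": 0.3, "food": 0.1},
    2: {"sports": 0.4, "music": 0.4, "politics": 0.2},
    3: {"music": 0.6, "food": 0.3, "sports": 0.1},
}

dominance = {}
for rx, px in profiles.items():
    for ry, py in profiles.items():
        if rx == ry:
            continue  # the diagonal is left blank, as in Table 6
        shared = primary(px) & primary(py)        # primary in both rings
        dominance[(rx, ry)] = sum(py[t] for t in shared)
# With these numbers, ring 1's primary topics take the largest share of
# ring 2's profile (0.8, vs 0.4 for ring 3's primary topics).
```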
Now we tackle the complementary question: what is the pulling power of the primary topics in one ring on the non-primary topics in another ring? We measure this via the corresponding quantity, shown in the right part of Table 6.
From the left side of Table 6, we know which ring's primary topics have the highest pulling power on the primary topics of the others. But do these topics have a higher-than-average strength with respect to the primary topics in the ring as a whole (i.e., regardless of whether they are primary in other rings or not)? To investigate this problem, we report the corresponding differences in Table 7. In the table, all the numbers are positive. This means that, on average, among the most important topics of a ring ry, if a topic also belongs to the important topics of another ring rx, its strength tends to be higher than the average strength of generic important topics in ry. A t-test has been performed to assess whether these differences are statistically significant: in all cases, we obtained p-value < .001. On the right side of the table, we report the analogous differences for the least important topics, which capture whether topics that are primary elsewhere but not in ry tend to have a higher share among the least important topics in ry. In this case, too, the numbers are positive: on average, among the least important topics of a given ring ry, a topic tends to have a higher strength if it belongs to the important topics of another ring rx. Again, the p-values are smaller than .001, confirming that these results are not due to statistical fluctuations.
[Figure omitted. See PDF.]
On the left, the differences for all rx, ry pairs in our datasets; on the right, the corresponding differences for the least important topics. The highest value per column is in bold.
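The significance check described above can be sketched with a hand-rolled one-sample t-test on per-user differences (strength of shared-primary topics minus the average strength of all primary topics). The numbers below are toy values, not the paper's measurements.

```python
import math
import statistics

def one_sample_t(diffs):
    """t-statistic for H0: the mean of the differences is zero."""
    n = len(diffs)
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)           # sample standard deviation
    return mean / (sd / math.sqrt(n))

# Toy per-user differences: (mean strength of topics primary in both rx and ry)
# minus (mean strength of all primary topics of ry).
diffs = [0.012, 0.018, 0.009, 0.015, 0.011, 0.020, 0.014, 0.016]

t = one_sample_t(diffs)
# A large positive t (here around 11) supports rejecting H0: shared-primary
# topics are systematically stronger than generic primary topics.
```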
Take home message for Section 5.3.3: Studying the role of primary topics, we have learned the following.
* Primary topics from ring #1 tend to dominate among the primary topics of other rings. This shows the pulling power of the innermost ring, confirming its special role in the ego network. Vice versa, primary topics from ring #1 do not seem to dominate among non-primary topics of other rings.
* The topics that are primary in some rings tend to be stronger than average among the primary and non-primary topics in the semantic profile of another ring. This effect is especially acute when considering primary topics from ring #1 with respect to generic primary topics in other rings.
5.3.4 Discussion.
The study of the semantic profile of the rings confirms the relevance of the ego network of words model. This model has allowed us to isolate the specific features of the topics associated with the words in the innermost ring. Indeed, the semantic profile of ring #1 is not only the most unique (the most semantically distant from the others), but it is also characterized, when compared with a null model, by a larger-than-expected entropy and a larger-than-expected number of generated topics. Moreover, the most important topics of ring #1 are not merely a subset of the important topics in the other rings: for every ring, an important topic is more likely to be predominant if it is also important in the innermost ring. Hence, despite the small number of unique words and word occurrences it contains, the innermost ring strongly “predicts” the most important topics of the entire ego network. In light of these results, we can conclude that the semantic profile of the innermost ring r1 is also the semantic fingerprint of the whole ego network of words.
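The semantic distance underlying this discussion can be illustrated with a Jensen-Shannon divergence, from the family of entropy-based divergences cited in [49]. The ring profiles below are toy examples, not measured data.

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two topic distributions
    (dicts topic -> probability); with base-2 logs the result is in [0, 1]."""
    topics = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in topics}
    def kl(a):
        # Kullback-Leibler divergence from a to the mixture m.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy profiles: ring 1 differs markedly, rings 2 and 3 are similar.
ring1 = {"sports": 0.7, "politics": 0.2, "food": 0.1}
ring2 = {"sports": 0.3, "politics": 0.3, "music": 0.4}
ring3 = {"sports": 0.35, "politics": 0.25, "music": 0.4}

d12 = jensen_shannon(ring1, ring2)   # ring 1 vs ring 2: large
d23 = jensen_shannon(ring2, ring3)   # ring 2 vs ring 3: small
```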
As has been done with social ego networks (using structural properties to study information diffusion [16] or to perform link prediction [53]), the structural and semantic invariants of the ego network of words can be used to investigate classical data science problems, with a focus on natural language processing. The semantic fingerprint could be used to identify specific Twitter users, or groups of users, with a non-trivial interest distribution for certain topics (e.g., a mix of important topics in the innermost rings and marginal topics in the outermost rings). It could also be used for link prediction, under the assumption that users sharing topics of interest in the innermost ego network circles are more likely to follow one another (the principle of homophily), or for word recommendation in a typing assistance tool. Since we have identified some semantic invariants (e.g., the role of important topics in ring #1), this property could be leveraged to identify outliers deviating from the standard and to detect non-human behaviors. Finally, within the context of topic mining, the fact that ring #1 contains the important topics of the entire ego network could be exploited to save time by considering only the words in this innermost ring.
6 Conclusion
Inspired by previous work modeling the cognitive constraints that regulate personal social relations, in this paper we have investigated, through a data-driven approach, whether a regular structure can also be found in the way people use words, as a symptom of cognitive constraints in their mental processes. Based on a corpus of tweets written by both regular and professional users, we have shown that, similarly to the social case, a concentric layered structure (which we name “ego network of words”) captures very well how an individual organizes their cognitive effort in language production, and reveals some structural invariants in the way people organise their own vocabulary. Among these invariants, we can list (i) the number of layers (between 5 and 7), (ii) their regular growth moving from the center of the word ego network outward (the innermost layer is approximately five times smaller than the following one, while each of the other layers is approximately twice the size of the preceding one), and (iii) the size of the external layers (which is remarkably stable, with the two penultimate layers accounting respectively for 30% and 60% of the words in the model, regardless of the total number of layers).
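The scaling invariant in point (ii) can be sanity-checked in a few lines; the layer sizes below are illustrative numbers chosen to match the reported ratios, not the paper's measurements.

```python
# Hypothetical layer sizes (number of words per concentric layer, innermost
# first), chosen to follow the reported growth pattern.
layer_sizes = [5, 25, 55, 120, 260, 560]

# Ratio of each layer's size to the previous (inner) one.
ratios = [outer / inner for inner, outer in zip(layer_sizes, layer_sizes[1:])]
# ratios[0] ~ 5: the innermost layer is about five times smaller than layer 2;
# the remaining ratios stay in the 2-3x range, i.e., sizes roughly double.
```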
Then, going beyond words as units of language, we performed a semantic analysis of the ego network of words. Each ring of each ego network is described by a semantic profile that captures the topics associated with the words in the ring. We have found that ring #1 has a special role in the model: it is semantically the most dissimilar out of the six, and also the one which generates proportionally the largest number of topics. We have also shown that the topics that are important in the innermost ring also have the characteristic of being predominant in each of the other rings, as well as in the entire ego network. In this respect, ring #1 can be seen as the semantic fingerprint of the ego network of words. Finally, we have found that the topics that are primary in some rings tend to be stronger than average among the primary and non-primary topics in the semantic profiles of the other rings. This shows that, while layer #1 provides a particularly strong signal about topic prevalence in the ego network, weaker signals reveal a more complex structure of influence among topics “resident” in different layers of the ego network of words.
Supporting information
S1 Appendix. Supplementary information on the structural and semantic analysis of word ego networks.
In this appendix we provide additional information regarding the data preprocessing, the soft clustering analysis, and we include additional tables to support the findings in the paper.
https://doi.org/10.1371/journal.pone.0277182.s001
(PDF)
About the Authors:
Kilian Ollivier
Roles: Conceptualization, Data curation, Formal analysis, Investigation, Software, Visualization, Writing – original draft
E-mail: [email protected]
Affiliation: CNR-IIT, Pisa, Italy
https://orcid.org/0000-0003-2881-5845
Chiara Boldrini
Roles: Conceptualization, Methodology, Writing – original draft
Affiliation: CNR-IIT, Pisa, Italy
https://orcid.org/0000-0001-5080-8110
Andrea Passarella
Roles: Conceptualization, Methodology, Writing – original draft
Affiliation: CNR-IIT, Pisa, Italy
Marco Conti
Roles: Conceptualization, Methodology
Affiliation: CNR-IIT, Pisa, Italy
1. Levelt WJ, Roelofs A, Meyer AS. A theory of lexical access in speech production. Behavioral and brain sciences. 1999;22(1):1–38. pmid:11301520
2. Broadbent DE. Word-frequency effect and response bias. Psychological review. 1967;74(1):1. pmid:5341440
3. Qu Q, Zhang Q, Damian MF. Tracking the time course of lexical access in orthographic production: An event-related potential study of word frequency effects in written picture naming. Brain and language. 2016;159:118–126. pmid:27393929
4. Dunbar R. The social brain hypothesis. Evolutionary Anthropology. 1998;9(10):178–190.
5. Dunbar RIM, Sosis R. Optimising human community sizes. Evolution and human behavior: official journal of the Human Behavior and Evolution Society. 2018;39(1):106–111. pmid:29333060
6. Hill RA, Dunbar RI. Social network size in humans. Human nature. 2003;14(1):53–72. pmid:26189988
7. Zhou WX, Sornette D, Hill RA, Dunbar RIM. Discrete hierarchical organization of social group sizes. Proceedings of the Royal Society B: Biological Sciences. 2005;272(1561):439–444. pmid:15734699
8. Dunbar RI, Arnaboldi V, Conti M, Passarella A. The structure of online social networks mirrors those in the offline world. Social networks. 2015;43:39–47.
9. Haerter JO, Jamtveit B, Mathiesen J. Communication dynamics in finite capacity social networks. Physical review letters. 2012;109(16):168701. pmid:23215144
10. Miritello G, Moro E, Lara R, Martínez-López R, Belchamber J, Roberts SGB, et al. Time as a limited resource: Communication strategy in mobile phone networks. Social Networks. 2013;35(1):89–95.
11. Gonçalves B, Perra N, Vespignani A. Modeling users’ activity on twitter networks: Validation of dunbar’s number. PloS one. 2011;6(8):e22656. pmid:21826200
12. Sutcliffe A, Dunbar R, Binder J, Arrow H. Relationships and the social brain: integrating psychological and evolutionary perspectives. British journal of psychology. 2012;103(2):149–168. pmid:22506741
13. Dunbar R. Theory of mind and the evolution of language. Approaches to the Evolution of Language. 1998;.
14. Brysbaert M, Stevens M, Mandera P, Keuleers E. How Many Words Do We Know? Practical Estimates of Vocabulary Size Dependent on Word Definition, the Degree of Language Input and the Participant’s Age. Frontiers in Psychology. 2016;7(JUL):1116. pmid:27524974
15. Zipf GK. Human behavior and the principle of least effort. Addison-Wesley Press; 1949.
16. Arnaboldi V, Conti M, Passarella A, Dunbar RI. Online social networks and information diffusion: The role of ego networks. Online Social Networks and Media. 2017;1:44–55.
17. Ollivier K, Boldrini C, Passarella A, Conti M. Structural Invariants in Individuals Language Use: The “Ego Network” of Words. In: Aref S, Bontcheva K, Braghieri M, Dignum F, Giannotti F, Grisolia F, et al., editors. Social Informatics. Cham: Springer International Publishing; 2020. p. 267–282.
18. Piantadosi ST. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic bulletin & review. 2014;21(5):1112–1130.
19. Anderson JR, Schooler LJ. Reflections of the environment in memory. Psychological science. 1991;2(6):396–408.
20. Graesser A, Mandler G. Limited processing capacity constrains the storage of unrelated sets of words and retrieval from natural categories. Journal of Experimental Psychology: Human Learning and Memory. 1978;4(1):86.
21. Aramaki E, Shikata S, Miyabe M, Kinoshita A. Vocabulary size in speech may be an early indicator of cognitive impairment. PloS one. 2016;11(5):e0155195. pmid:27176919
22. Abel F, Gao Q, Houben GJ, Tao K. Analyzing user modeling on twitter for personalized news recommendations. In: international conference on user modeling, adaptation, and personalization. Springer; 2011. p. 1–12.
23. Bhattacharya P, Zafar MB, Ganguly N, Ghosh S, Gummadi KP. Inferring user interests in the twitter social network. In: Proceedings of the 8th ACM Conference on Recommender systems; 2014. p. 357–360.
24. Frasincar F, Borsje J, Levering L. A semantic web-based approach for building personalized news services. International Journal of E-Business Research (IJEBR). 2009;5(3):35–53.
25. Arslan O, Xing W, Inan FA, Du H. Understanding topic duration in Twitter learning communities using data mining. Journal of Computer Assisted Learning. 2022;38(2):513–525.
26. Guille A, Favre C. Mention-anomaly-based event detection and tracking in twitter. In: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014). IEEE; 2014. p. 375–382.
27. Davis CA, Varol O, Ferrara E, Flammini A, Menczer F. Botornot: A system to evaluate social bots. In: Proceedings of the 25th international conference companion on world wide web; 2016. p. 273–274.
28. Varol O, Davis CA, Menczer F, Flammini A. Feature engineering for social bot detection. In: Feature engineering for machine learning and data analytics. CRC Press; 2018. p. 311–334.
29. Boldrini C, Toprak M, Conti M, Passarella A. Twitter and the press: an ego-centred analysis. In: Companion Proceedings of The Web Conference ’18; 2018. p. 1471–1478.
30. Diaz MT, McCarthy G. A comparison of brain activity evoked by single content and function words: an fMRI investigation of implicit word processing. Brain research. 2009;1282:38–49. pmid:19465009
31. Friederici AD, Opitz B, Von Cramon DY. Segregating semantic and syntactic aspects of processing in the human brain: an fMRI investigation of different word types. Cerebral cortex. 2000;10(7):698–705. pmid:10906316
32. Honnibal M, Montani I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing; 2017.
33. Loper E, Bird S. Nltk: The natural language toolkit. arXiv preprint cs/0205028. 2002;.
34. Fukunaga K, Hostetler L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on information theory. 1975;21(1):32–40.
35. Jenks GF. Optimal data classification for choropleth maps. Department of Geography, University of Kansas Occasional Paper. 1977;.
36. MacQueen J, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. vol. 1. Oakland, CA, USA; 1967. p. 281–297.
37. Perfetti CA, Wlotko EW, Hart LA. Word learning and individual differences in word learning reflected in event-related potentials. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2005;31(6):1281. pmid:16393047
38. Şenel LK, Utlu İ, Yücesoy V, Koç A, Çukur T. Semantic Structure and Interpretability of Word Embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2018;.
39. Jonnalagedda N, Gauch S. Personalized news recommendation using twitter. In: 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT). vol. 3. IEEE; 2013. p. 21–25.
40. Abu-Salih B, Wongthongtham P, Chan KY. Twitter mining for ontology-based domain discovery incorporating machine learning. Journal of Knowledge Management. 2018;.
41. Mežnar S, Bevec M, Lavrač N, Škrlj B. Link Analysis meets Ontologies: Are Embeddings the Answer? arXiv preprint arXiv:211111710. 2021;.
42. Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805. 2018;.
43. Grootendorst M. BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.; 2020. Available from: https://doi.org/10.5281/zenodo.4381785.
44. McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:180203426. 2018;.
45. McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering. 2017 IEEE International Conference on Data Mining Workshops (ICDMW). 2017.
46. Zhu Y, Kiros R, Zemel R, Salakhutdinov R, Urtasun R, Torralba A, et al. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE international conference on computer vision; 2015. p. 19–27.
47. Radovanovic M, Nanopoulos A, Ivanovic M. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research. 2010;11(sept):2487–2531.
48. Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of machine learning research. 2008;9(11).
49. Lin J. Divergence measures based on the Shannon entropy. IEEE Transactions on Information theory. 1991;37(1):145–151.
50. Osterreicher F, Vajda I. A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics. 2003;55(3):639–653.
51. Jenks GF. The data model concept in statistical mapping. International yearbook of cartography. 1967;7:186–190.
52. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics. 1987;20:53–65.
53. Toprak M, Boldrini C, Passarella A, Conti M. Harnessing the Power of Ego Network Layers for Link Prediction in Online Social Networks. IEEE Transactions on Computational Social Systems. 2022;.
© 2022 Ollivier et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.