Purpose
>Reorganising unstructured academic abstracts according to a certain logical structure not only helps scholars extract valid information quickly but also facilitates the faceted search of academic literature. This study aims to build a high-performance model for identifying the functional structures of unstructured abstracts in the social sciences.
Design/methodology/approach
>This study first investigated the structuring of abstracts in academic articles in the social sciences, using large-scale statistical analyses. Then, the functional structures of sentences in a corpus of more than 3.5 million abstracts were identified through sentence classification and sequence tagging, using several models based on either machine learning or deep learning, and the results were compared.
Findings
>The results demonstrate that the functional structures of sentences in social science abstracts comprise background, purpose, methods, results and conclusions. The experiments show that the bidirectional encoder representation from transformers (BERT) performed best, with an overall F1 score of 86.23%.
Originality/value
>A data set of annotated social science abstracts was generated and corresponding models were trained on it; both are available on GitHub (https://github.com/Academic-Abstract-Knowledge-Mining/SSCI_Abstract_Structures_Identification). Based on the optimised model, a Web application for the identification of the functional structures of abstracts and their faceted search in the social sciences was constructed to enable rapid and convenient reading, organisation and fine-grained retrieval of academic abstracts.
1. Introduction
With the increase in the amount of academic literature, people can now obtain massive amounts of information on various topics. However, 97% of the abstracts in the Social Science Citation Index (SSCI) for the period 2008–2020 were found to be unstructured, suggesting that the percentage of structured abstracts in the social sciences is low and that scholars are hindered from accessing the required information effectively. Moreover, identifying the functional structures of unstructured abstracts can make abstracts more readable and understandable. Therefore, there is a real need for the effective organisation and management of abstracts in the social sciences.
Abstracts in scientific literature are academic texts with a strict logic, and they contain extensive information. Sentences in abstracts have a certain functional structure, which serves to illustrate the research’s background, purpose, methods, results and conclusions. Research also shows that identifying the functional structure of sentences in abstracts is of great significance for the organisation and retrieval of academic literature (Guo et al., 2011; Tbahriti et al., 2006; Yepes et al., 2013).
In recent years, research on organising scientific literature and annotating its functional structures has been increasing, and a number of data sets have emerged, such as the data set of functional structures of unstructured abstracts in the field of COVID-19 (Huang et al., 2020) and structured scientific publication summarisation data sets (Gidiotis and Tsoumakas, 2019; Jaidka et al., 2018; Meng et al., 2021). Most of these data sets annotate the functional structures of unstructured abstracts as background, purpose, method, finding and other. In terms of discipline, these data sets mainly cover the natural sciences, and they are relatively small.
Because research on the identification of the functional structures of abstracts focuses mainly on biomedical science and computer science and uses small-scale data, few studies have addressed the social sciences, which cover many disciplines. This study extracted structured abstracts in the social sciences from 2008 to 2020 and, using sentence classification and sequence tagging, applied the latest pre-trained deep learning model, namely the bidirectional encoder representation from transformers (BERT), to the identification of their functional structures. Based on the optimised model, a Web application for the identification of the functional structures of abstracts and their faceted search in the social sciences was constructed to enable rapid and convenient reading, organising and fine-grained retrieval of abstracts.
2. Literature review
Previous research mainly involved two areas: methods for recognising the functional structures of sentences in abstracts, and applications of sentence functional structures in scientific literature.
2.1 Methods of identification of the functional structures of abstracts
The identification of the functional structures of abstracts is guided by genre analysis in linguistics, of which the IMRD (introduction, methods, results and discussion) four-step model is the most representative (Bhatia, 1993; Graetz, 1982; Swales, 1990). Reflecting the sequential pattern of scientific paper writing, the IMRD model follows the order of introduction, methods, results and discussion and corresponds to the most widely known four-part text structure of scientific and technical papers (Sollaci and Pereira, 2004). Relevant studies fall into two categories, sequence tagging and sentence classification, in both of which machine learning and deep learning techniques are widely used, and biomedical and computer science have long been the focus of researchers’ interest. From the perspective of sentence classification, earlier studies mostly integrated features such as lexical, syntactic and location attributes and keywords. The classifiers included machine learning models such as naïve Bayes and support vector machines (SVM) (Guo et al., 2010; Liu et al., 2013; McKnight and Srinivasan, 2003; Nam et al., 2016; Ruch et al., 2007; Yamamoto and Takagi, 2005). The data sets used in this research come mainly from a single discipline, and a large number of features extracted from abstracts are used in constructing the machine learning models. In terms of sequence tagging, sentences in an abstract follow the writing sequence of background-methods-results-conclusions. This dependency between sentences makes the sequence tagging model applicable to identifying the functional structure of sentences in the abstract.
Such studies mainly adopt the conditional random field (CRF) model (Hirohata et al., 2008; Kim et al., 2011; Lin et al., 2008) with features such as part-of-speech, entity, position, word frequency and semantics, and achieve good performance. Detailed and comprehensive feature templates are developed to suit the characteristics of the CRF model. Although the data sets used to train the CRF models are small on the whole, the overall performance is excellent. However, these machine learning models require many hand-crafted features, which is time-consuming and labour-intensive, and the features must be recreated whenever new data are introduced. With improvements in computer performance, deep learning has become the main method for identifying the functional structures of abstracts; its great advantages are that no hand-crafted features are needed and that its performance on large-scale data is superior to that of traditional machine learning models. Dernoncourt et al. (2016) designed a long short-term memory (LSTM) model and applied it to the identification of the functional structures of abstracts in medical science, with F1 scores of up to 89.9%. Gonçalves et al. (2020) proposed the Word-BiGRU model, which identifies the functional structures of abstracts in biomedical science and computer science from the perspective of sentence classification, with F1 scores of up to 91%. Building on deep learning models, Jin and Szolovits (2018) added a context enhancement mechanism and treated the identification of the functional structures of biomedical abstracts as sequence tagging, with F1 scores of up to 93.9%.
Compared with traditional machine learning, deep learning has considerable advantages in identifying the functional structures of abstracts, especially on large data sets. Moreover, integrating a pre-trained model into a deep learning approach can greatly improve the performance of the constructed model.
In summary, the methods of identifying the functional structures of abstracts range from machine learning methods with hand-crafted features to deep learning methods without them, and performance has improved continuously. However, current research focuses only on biomedical science and computer science and does not address the social sciences.
2.2 Functions and applications of sentences in scientific literature
The main applications of sentence functions in scientific literature include meeting users’ fine-grained retrieval needs, extractive summarisation, knowledge organisation and the analysis of scientific literature. The sentences studied are mainly citation sentences and their context, as well as sentences containing significant entity knowledge. Liu et al. (2014) extracted the contextual sentences of citations and provided search services based on citation context, finding that their retrieval outperformed Google Scholar and PubMed. By analysing citation contexts to rank articles, Doslu and Haluk (2016) used common ranking algorithms to find the most popular articles. Saini et al. (2011) used a metaheuristic evolutionary algorithm to identify the relevant sentences in articles from their citation context and to construct summaries of scientific documents in the computational linguistics domain.
In the field of computer science, Safder and Hassan (2019) extracted sentences describing algorithms from the ACL academic full text research corpus and provided services for searching academic papers in the field of computer science. For the retrieval of mathematical formulas in scientific literature, Greiner-Petter et al. (2020) extracted sentences describing mathematical formulas, applied word embedding to formulas, proposed math-word embedding, and found that this method has a certain feasibility in terms of formula similarity calculation and analogy. Cited sentence identification not only provides retrieval services but also enriches the calculation of impact factors and automatically generates better abstracts. Teufel et al. (2006) extracted the linguistic features of cited sentences by classifying their functions, and found that the functions had an association with sentiment classification. Ding et al. (2013) divided the structure of the text into the introduction, literature review, method, results and conclusions chapters, and they investigated the citation distribution of different chapter structures. By visualizing and analysing the distributions of citations in articles, including introduction, method, results and conclusions, Hu et al. (2013) investigated the distributions of citation locations. For improved abstract extraction, function classification and identification of cited sentences were performed from perspectives, such as language rhetoric, function structure and deep learning (Abu-Jbara and Radev, 2011; Dong and Schäfer, 2011; Teufel and Moens, 2002; Zerva et al., 2020).
Zhang (2012) found that organising literature according to a certain functional structure could enhance the effectiveness of the reading process. Zone identification can organise scientific literature effectively and automatically. Asadi et al. (2019) adopted a two-layer zone identification method based on machine learning and feature extraction and obtained good identification performance. Lu et al. (2018) used chapter titles and paragraphs, among other perspectives, to identify the functional structure of academic full texts in computer science and found that it was feasible to apply functional structure to academic information retrieval and keyword extraction. Heffernan and Teufel (2018) used a machine learning model to identify sentences describing research problems and solutions in scientific literature, providing researchers with a new way of reading and organising academic literature. To sum up, the applications of sentence functions in scientific literature, such as fine-grained literature retrieval and knowledge organisation, provide a reference for applying the functional structure of sentences in abstracts.
3. Data
3.1 Social science structured analysis
The academic abstracts in this research were obtained from Web of Science (WOS). The SSCI covers 2,474 mainstream social science academic journals around the world and can be divided into 57 disciplines (http://mjl.clarivate.com/cgi-bin/jrnlst/jlsubcatg.cgi?PC=SS). This research acquired a total of 3,510,332 academic abstracts included in SSCI from 2008 to 2020. There were only 121,524 structured abstracts. As shown in equation (1), by computing the ratio of the structured abstracts, it can be seen that 3.462% of the abstracts are structured, suggesting that the identification of the functional structures of abstracts in the social sciences is urgently needed.
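The ratio referred to as equation (1) is simply the share of structured abstracts in the corpus. Written out with the counts reported above (the symbol names are chosen here for illustration and are not the authors' notation):

```latex
\[
\text{ratio}_{\text{structured}}
  = \frac{N_{\text{structured}}}{N_{\text{total}}} \times 100\%
  = \frac{121{,}524}{3{,}510{,}332} \times 100\%
  \approx 3.462\%
\]
```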
As mentioned above, structured abstracts are widely popular in the medical field. As shown in Figure 1, in the fields of substance abuse and rehabilitation, the ratios of structured abstracts are 19.175% and 16.389% respectively, while for disciplines such as history of social sciences and mathematical psychology there are no structured abstracts.
3.2 Analysis of the functional structures of sentences in abstracts
Instead of annotating sentences manually, the existing functional structure tags in the WOS data set were classified and organised. A structured abstract has a tag before each sentence, which represents the functional structure of the sentences. In total, 121,524 structured abstracts, 1,378,276 sentences and 508 different tags were obtained. It is impossible to organise abstracts effectively by using these tags directly, as many tags have the same meaning. For instance, Methods, Design/methodology/approach and Design are all used to indicate that a sentence belongs to the research methods part of the abstract. Therefore, similar tags were merged manually. During annotation, a total of nine experts proofread the data, and two groups of postgraduates each merged the tags of the same data set independently. If their results were inconsistent, they were referred to a third party for arbitration. Finally, a data set with five functional elements, namely background, purpose, methods, results and conclusions, was built. An example of an annotated social science abstract from the data set is shown in Figure 2.
Table 1 shows the category distribution of the functional structures of abstracts, where the occurrence percentages of conclusions and methods are the highest, accounting for 24.38% and 24.37%, respectively, while the occurrence percentages of purpose and background are lower, 11.67% and 17.82%, respectively.
An analysis of the transition probability distribution of the functional structures of abstracts revealed their underlying pattern. As shown in Figure 3, “START” and “END” are special tags marking the beginning and end of an abstract; B denotes Background, P Purpose, M Methods, R Results and C Conclusions. It can be seen that the initial sentences of abstracts are generally Background or Purpose, and the ending sentences are Conclusions. The writing sequence resembles background-purpose-methods-results-conclusions or background-methods-results-conclusions, illustrating that the sentences of an abstract depend on one another. Therefore, this research built not only a sentence classification model but also a sequence tagging model that can segment abstracts and annotate their functional structure.
The sentence classification model first divides the abstract into sentences and then predicts the functional structure sentence by sentence. The vocabulary and syntax of each sentence can be used as important features for classification, but the model cannot take advantage of the dependency between sentences. By contrast, the sequence tagging model makes predictions over the entire abstract as a unit and can therefore learn the dependencies between sentences.
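A transition analysis of this kind can be sketched as follows. This is a minimal illustration, assuming each abstract is given as a list of one-letter sentence tags; the function name and the toy sequences are hypothetical and are not data from the study.

```python
from collections import Counter, defaultdict

def transition_probabilities(tag_sequences):
    """Estimate P(next tag | current tag) from per-abstract tag sequences."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        # Wrap every abstract in the special START and END tags.
        path = ["START", *tags, "END"]
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for current, nexts in counts.items()
    }

# Toy, made-up abstracts (one tag per sentence); not data from the paper.
toy = [
    ["B", "P", "M", "R", "C"],
    ["B", "M", "R", "R", "C"],
    ["P", "M", "R", "C"],
]
probs = transition_probabilities(toy)
print(probs["START"])  # distribution over the first sentence's tag
```

Each row of the resulting dictionary sums to one, so it can be read as the outgoing edge weights of one node in a diagram like Figure 3.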
4. Methods
4.1 Sequence tagging model
4.1.1 Conditional random field.
The CRF model is essentially a sequence tagging model. This research transformed the identification of the functional structures of abstracts into a sequence tagging task, where x denotes the abstract text and y the corresponding sequence of functional structure tags.
For a given x (i.e. given the abstract text), the CRF models the conditional probability of the tag sequence y, as shown in equations (2) and (3). Here t_k and s_l denote binary feature functions used to extract features from the abstract text; λ_k and μ_l are the weights of the feature functions, which are adjusted during training; and Z(x) is a normalisation factor that sums the scores of all possible tag sequences and ensures that the conditional probability P(y|x) lies in the interval 0–1.
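For reference, the standard linear-chain CRF parameterisation that uses these symbols is sketched below. It is assumed, not confirmed, that equations (2) and (3) follow this form; the position index i and the primed sequence y' are notation introduced here.

```latex
\begin{align}
P(y \mid x) &= \frac{1}{Z(x)} \exp\Big( \sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i)
              + \sum_{i,l} \mu_l\, s_l(y_i, x, i) \Big) \\
Z(x) &= \sum_{y'} \exp\Big( \sum_{i,k} \lambda_k\, t_k(y'_{i-1}, y'_i, x, i)
              + \sum_{i,l} \mu_l\, s_l(y'_i, x, i) \Big)
\end{align}
```

In this form t_k scores a transition between adjacent tags and s_l scores a single tag at position i, which is why Z(x), a sum over every candidate tag sequence, keeps P(y|x) between 0 and 1.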
4.1.2 Bidirectional long short-term memory conditional random field.
LSTM (Hochreiter and Schmidhuber, 1997) is a special recurrent neural network. Bidirectional (Bi) LSTM can extract context information and thereby improve the performance of sequence tagging. The formal description of LSTM is given in equations (4) through (9). Herein, i_t, o_t, f_t and c_t denote, respectively, the control matrices of the input gate, output gate, forget gate and memory cell at instant t, which govern the input, output, forgetting and memory of abstract text information. x_t and h_t denote the embedding vector of the t-th word of the abstract text and the hidden unit vector at instant t, respectively. w and b denote the trainable weight parameters, and σ denotes the sigmoid activation function.
To model the functional structure tag sequence and reduce the errors that arise from predicting each tag independently, this study added a CRF layer (Huang et al., 2015), in which the tag state score represents the label probability value corresponding to the nth word in the abstract. The formal description is given in equation (10).
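As a point of reference, the widely used LSTM cell with these gates can be written as below. This is the textbook formulation and an assumption about what equations (4) through (9) contain; the authors' exact parameterisation (for example, peephole connections) may differ. The candidate state \tilde{c}_t and the element-wise product \odot are notation introduced here.

```latex
\begin{align}
i_t &= \sigma\big(W_i\,[h_{t-1}, x_t] + b_i\big) \\
f_t &= \sigma\big(W_f\,[h_{t-1}, x_t] + b_f\big) \\
o_t &= \sigma\big(W_o\,[h_{t-1}, x_t] + b_o\big) \\
\tilde{c}_t &= \tanh\big(W_c\,[h_{t-1}, x_t] + b_c\big) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```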
4.1.3 Bidirectional gate recurrent unit conditional random field.
The gated recurrent unit (GRU) (Dey and Salemt, 2017) is a special recurrent neural network. Equations (11) through (14) show the neural unit calculation of the GRU, where z_t and r_t are the control matrices of the update gate and reset gate at instant t, respectively, used to update and reset abstract text information. x_t and h_t denote the embedding vector of the t-th word of the abstract text and the hidden unit vector at instant t, respectively. w and b denote the trainable weight parameters, and σ is the sigmoid activation function. The overall structure of the model is the same as that of Bi-LSTM-CRF, except that the LSTM layer is replaced with a GRU layer. Compared with the Bi-LSTM-CRF model, the GRU has fewer parameters and a shorter training time in this task, while achieving a performance similar to that of Bi-LSTM-CRF.
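For comparison with the LSTM, the commonly used GRU update is sketched below. This is the textbook form and an assumption about what equations (11) through (14) contain; some presentations swap the roles of z_t and 1 - z_t in the last line. The candidate state \tilde{h}_t and the element-wise product \odot are notation introduced here.

```latex
\begin{align}
z_t &= \sigma\big(W_z\,[h_{t-1}, x_t] + b_z\big) \\
r_t &= \sigma\big(W_r\,[h_{t-1}, x_t] + b_r\big) \\
\tilde{h}_t &= \tanh\big(W_h\,[r_t \odot h_{t-1}, x_t] + b_h\big) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{align}
```

With two gates and no separate memory cell, the unit has three weight blocks instead of the LSTM's four, which is the source of the smaller parameter count.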
4.2 Sentence classification model
4.2.1 Support vector machine.
The SVM model is a classification model based on statistical learning theory. The basic idea is to construct a hyperplane as a decision surface that maximises the margin between positive and negative examples. This model is widely applied in sentence classification. In this research, it was applied to the task of identifying sentences’ functional structure. First, the sentences were transformed into term frequency-inverse document frequency (TF-IDF) vectors. Then, the SVM model mapped the TF-IDF vectors to a high-dimensional feature space by nonlinear mapping, converting the nonlinearly separable problem in the original sample space into a linearly separable problem in this feature space; kernel functions were used to avoid the curse of dimensionality and reduce computational complexity. A margin maximisation learning strategy was applied to adjust the model parameters and classify the sentences’ functional structure.
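The two ingredients of this pipeline, TF-IDF vectors and a kernel that compares them, can be illustrated with a small standard-library sketch. It is not the authors' implementation: it uses the plain tf × log(N/df) weighting (library implementations often smooth this), whitespace tokenisation, and made-up sentences; the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (dict: term -> weight) per sentence."""
    tokenised = [s.lower().split() for s in sentences]
    n_docs = len(tokenised)
    # Document frequency: in how many sentences each term occurs.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return vectors

def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel exp(-gamma * ||u - v||^2) on sparse vectors."""
    terms = set(u) | set(v)
    squared_distance = sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms)
    return math.exp(-gamma * squared_distance)

# Made-up sentences, for illustration only.
sentences = [
    "we surveyed 200 students",
    "we found a positive effect",
    "we surveyed 300 teachers",
]
vecs = tfidf_vectors(sentences)
print(rbf_kernel(vecs[0], vecs[2]), rbf_kernel(vecs[0], vecs[1]))
```

The kernel value is larger for the two sentences that share "surveyed", which is the kind of similarity signal an SVM with an RBF kernel works from without ever constructing the high-dimensional feature space explicitly.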
4.2.2 Text convolutional neural network.
The text convolutional neural network (TextCNN) model (Kim, 2014) is a sentence classification model based on CNNs. First, a word embedding layer maps each word of a sentence to a word embedding, forming the sentence’s word embedding matrix. Then, a convolution layer with filters of different sizes scans this matrix with a sliding window, which is similar to extracting N-grams, and produces the sentence’s convolutional semantic vectors. Next, a pooling layer applies max pooling to keep the convolution features that are most effective for classifying the sentence’s functional structure, yielding the sentence’s pooled semantic vector. Finally, the Softmax layer classifies the pooled semantic vector to predict the sentence’s functional structure, and the TextCNN model parameters are adjusted according to the prediction results.
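The sliding-window convolution and max pooling steps can be sketched for a single filter as follows. This is a simplified illustration in plain Python, assuming a ReLU activation and no bias term; the function name, filter and embeddings are made up, and the sentence is assumed to be at least as long as the filter height.

```python
def conv_max_pool(embeddings, kernel):
    """Slide `kernel` (h x d) over `embeddings` (n x d); return the max response.

    Each window covers h consecutive words, much like an h-gram detector.
    """
    h = len(kernel)
    responses = []
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        # Element-wise product of the filter with the window, summed up.
        score = sum(
            w * x
            for kernel_row, word_vec in zip(kernel, window)
            for w, x in zip(kernel_row, word_vec)
        )
        responses.append(max(score, 0.0))  # ReLU activation (an assumption)
    return max(responses)  # max-over-time pooling keeps the strongest match

# Toy 4-word sentence with 2-dimensional embeddings (made-up numbers).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]  # one filter of height 2
pooled = conv_max_pool(sentence, bigram_filter)
print(pooled)
```

A full TextCNN repeats this for every filter and concatenates the pooled values into the vector that is passed to the Softmax layer.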
4.2.3 Bidirectional encoder representation from transformers.
BERT (Devlin et al., 2018) is an improved deep language representation model based on a bidirectional language model. It relies entirely on the transformer structure with its self-attention mechanism to model the sentences in abstracts. The advantage of BERT over other neural network models is that it is pre-trained on a large-scale unsupervised corpus. When it is applied to the task of identifying sentences’ functional structure, the initial parameters of the training process are taken from the pre-trained model. First, the word embedding, sentence embedding and position embedding vectors of an abstract sentence are summed; then, multiple self-attention layers produce the semantic vector of the sentence; finally, the Softmax layer classifies and predicts the functional structure of the sentence. The model parameters are fine-tuned according to the prediction results.
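The self-attention step can be illustrated in isolation. The sketch below is a deliberately reduced, single-head scaled dot-product attention in which queries, keys and values are the input vectors themselves; a real transformer layer adds learned projections, multiple heads, residual connections and normalisation. The function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head scaled dot-product self-attention with Q = K = V = vectors."""
    d = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, vectors))
            for j in range(d)
        ])
    return outputs

# Two made-up token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0]]
attended = self_attention(tokens)
print(attended)
```

Every output vector mixes information from all positions in the sentence, which is how the model builds context-dependent token representations.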
5. Experiment
5.1 Data processing
This research processed the WOS social science abstract data from the perspectives of sentence classification and sequence tagging. For the sentence classification model, the sentence splitter of the Stanford natural language processing toolkit (Manning et al., 2014) was used to divide the structured abstract paragraphs into sentences. A total of 1,378,276 sentences were derived. The functional structure of each sentence was annotated; that is, B denotes Background, P Purpose, M Methods, R Results and C Conclusions. Specific samples are shown in Figure 4; tags and sentences are separated by tabs.
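Reading such tab-separated samples can be sketched as follows. The column order (tag first, then sentence) is an assumption based on the description, and the example line is made up rather than taken from the data set.

```python
LABELS = {
    "B": "Background",
    "P": "Purpose",
    "M": "Methods",
    "R": "Results",
    "C": "Conclusions",
}

def parse_classification_line(line):
    """Split one 'tag<TAB>sentence' line into (tag, sentence)."""
    # Split on the first tab only, so tabs inside the sentence survive.
    tag, sentence = line.rstrip("\n").split("\t", 1)
    if tag not in LABELS:
        raise ValueError(f"unknown tag: {tag!r}")
    return tag, sentence

# Made-up example line, not taken from the data set.
tag, sentence = parse_classification_line("M\tWe surveyed 200 teachers.\n")
print(LABELS[tag], "|", sentence)
```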
As for the sequence tagging model, the entire abstract was treated as a sequence segmented by words. A total of 33,177,505 words were segmented. The authors used B-I-E prefixes to mark the starting word, middle word and ending word of each functional structure element. For example, B-P indicates the starting word of the “purpose” functional structure, I-P the middle word of “purpose” and E-P the ending word of “purpose”. Specific samples are shown in Figure 5, where the first column represents the word, and the second column contains the corresponding annotation of the word.
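The B-I-E scheme can be sketched as below. The sketch assumes whitespace tokenisation, one span per sentence, and that a one-word sentence receives only a B- tag; none of these details is stated in the text, and the example abstract is made up.

```python
def bie_tags(labelled_sentences):
    """Turn [(label, sentence), ...] into word-level (word, tag) pairs."""
    pairs = []
    for label, sentence in labelled_sentences:
        words = sentence.split()
        for position, word in enumerate(words):
            if position == 0:
                prefix = "B"  # starting word (also used for 1-word sentences)
            elif position == len(words) - 1:
                prefix = "E"  # ending word
            else:
                prefix = "I"  # middle word
            pairs.append((word, f"{prefix}-{label}"))
    return pairs

# Made-up two-sentence abstract: Purpose followed by Methods.
abstract = [
    ("P", "This study examines trust ."),
    ("M", "We surveyed teachers ."),
]
for word, tag in bie_tags(abstract):
    print(f"{word}\t{tag}")
```

The output has one word per line with its tag in the second column, matching the two-column layout described for Figure 5.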
5.2 Model parameters and experiment environments
CRF and SVM were used as baseline models in this research. The CRF was implemented with the open source toolkit CRF++0.58 (Kudo, 2005). The SVM was implemented using the sklearn library in Python (Pedregosa et al., 2011), with the radial basis function (RBF) as the kernel function, the penalty coefficient C set to 2.0, gamma set to 0.5 and TF-IDF as the text representation.
For the deep learning sequence tagging models (Bi-LSTM-CRF and Bi-GRU-CRF), the parameter settings were kept the same; the only difference was the recurrent unit, LSTM in one and GRU in the other. Each model mainly contained a word embedding layer, a bidirectional LSTM (or GRU) layer and a CRF layer. To avoid exploding and vanishing gradients, gradient clipping was set to 5.0, and the learning rate was initialised to 0.001; the word embedding dimension was 100, and the number of LSTM hidden units was set to 200. The number of layers was two; the batch size was fixed at 256; the maximum number of epochs was set to 200, and the optimiser was Adam. To avoid overfitting and to speed up training, the dropout rate was set to 0.5 and early stopping was used: training was stopped when the F1 score on the cross-validation set had not improved after 10 rounds of training.
TextCNN and BERT were the deep learning models for sentence classification. The main structure of the TextCNN model was a convolutional neural network; the dimension of word embedding was set to 128, the number of filters was 200 and the filter sizes were 3 × 3, 4 × 4 and 5 × 5. To prevent overfitting to the training data, the dropout rate was set to 0.5 and the number of epochs to 20. The batch size was 512. The main structure of the BERT model was the transformer; transfer learning was used to replace the output layer of Google’s pre-trained English BERT model so that it could be applied to sentence classification in the social sciences. The number of hidden units was set to 768, the number of self-attention heads was 12, the warm-up proportion was set to 0.1, the learning rate was 5.0 × 10⁻⁵, the batch size was 16, the maximum sequence length was 512 and the number of epochs was set to three.
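The early-stopping rule can be sketched as follows. The sketch interprets "10 times of training" as 10 consecutive epochs without improvement, which is an assumption; the list of validation F1 scores stands in for the real train-and-evaluate loop, and the function name and numbers are hypothetical.

```python
def train_with_early_stopping(f1_per_epoch, max_epochs=200, patience=10):
    """Return (best_epoch, best_f1), stopping after `patience` epochs without gain.

    `f1_per_epoch` stands in for "train one epoch, then score the validation set".
    """
    best_f1, best_epoch, stale = float("-inf"), 0, 0
    for epoch, f1 in enumerate(f1_per_epoch[:max_epochs], start=1):
        if f1 > best_f1:
            best_f1, best_epoch, stale = f1, epoch, 0  # new best: reset counter
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, best_f1

# Made-up validation scores: improvement stops after the third epoch.
scores = [0.60, 0.70, 0.75] + [0.74] * 15
best_epoch, best_f1 = train_with_early_stopping(scores)
print(best_epoch, best_f1)
```

In practice the model weights from the best epoch would be saved and restored, so that the final model is the one with the highest validation F1 rather than the last one trained.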
Because the neural network involves massive matrix calculations during the training process, to accelerate the training speed, the NVIDIA Tesla P40 GPU was used in this research to train the neural network. The main parameters of the testing machine were: CPU: 48 Intel (R) Xeon (R) CPU E5-2650 v4 @ 2.20 GHz; memory: 256 GB; GPU: 6 NVIDIA Tesla P40; memory: 24 GB; operating system: CentOS 3.10.0.
5.3 Experimental results
To prevent results from being biased due to a particular data division in the experiment, the authors used tenfold cross-validation for each model, and the training corpus and test corpus used in each fold were kept the same. Eventually, the average of 10 experimental results for each model was calculated as the final measurement result.
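The evaluation protocol can be sketched with the standard library alone. The fixed shuffling seed is what keeps the folds identical across models; the function names, the seed value and the toy scoring function are assumptions made for illustration.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed: same folds for every model
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n_items, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_items, k)]
    return sum(scores) / len(scores)

# Toy check: the "score" is just the test-fold size, averaged over ten folds.
print(cross_validate(25, lambda train, test: len(test)))
```

In a real run, `evaluate` would train a model on the training indices and return its F1 score on the test indices.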
As shown in Table 2, the sentence classification models included SVM, TextCNN and BERT, of which BERT performed best, with precision, recall and F1 score reaching 86.28%, 86.26% and 86.23%, respectively. Owing to the massive amount of data, the deep learning models outperformed the machine learning model: the F1 scores of the TextCNN and BERT models were 16.30 and 19.33 percentage points higher than that of the SVM model, respectively. Owing to pre-training on a large-scale unsupervised corpus and the transformer structure with its self-attention mechanism, the BERT model was 3.09, 3.01 and 3.03 percentage points higher than the TextCNN model in precision, recall and F1 score, respectively.
As shown in Table 2, the sequence tagging model includes CRF, Bi-LSTM-CRF and Bi-GRU-CRF. In terms of the evaluation of the sequence tagging model, the entity rather than a single word functions as the smallest unit. In this study, such an entity refers to the categorization labels for each abstract sentence.
Given the massive amount of data, the CRF model, a typical machine learning model, performed poorly, with precision, recall and F1 score of 64.22%, 60.70% and 62.41%, respectively. The F1 score of the CRF model was 20.57 and 20.65 percentage points lower than those of the Bi-LSTM-CRF and Bi-GRU-CRF models, respectively. Of these, the Bi-GRU-CRF achieved the highest precision and F1 score, reaching 83.36% and 83.06%, while the Bi-LSTM-CRF had the highest recall of 82.98%. Because of their similar network structures, the Bi-LSTM-CRF and Bi-GRU-CRF models performed similarly. However, the GRU is a simplification of the LSTM, which reduces the training time while achieving similar performance. The F1 score of the Bi-GRU-CRF model differed only slightly (0.08 percentage points) from that of the Bi-LSTM-CRF model.
Overall, BERT, the best sentence classification model, performed better than Bi-GRU-CRF, the best sequence tagging model. However, the sentence classification model requires additional sentence segmentation, which affects the performance of the classification.
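The CRF figures above can be checked against the usual definition of F1 as the harmonic mean of precision and recall. The sketch below assumes that definition; scores that are averaged over classes or folds need not satisfy it exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for the CRF model, in percent.
crf_f1 = f1_score(64.22, 60.70)
print(round(crf_f1, 2))
```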
To provide a better service for the identification of the functional structures of abstracts, the authors chose the optimal model of BERT with tenfold cross-validation and analysed the results produced by misclassification with a confusion matrix, as shown in Figure 6. The vertical axis represents the true tag, and the horizontal axis the predicted tag. It can be seen that Background and Purpose are often mistaken for one another.
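A confusion matrix of this kind can be computed as sketched below. The sketch assumes that the percentages are normalised per true tag (each row sums to one), which is an interpretation of the figure description; the function name and toy labels are made up.

```python
from collections import Counter

TAGS = ["B", "P", "M", "R", "C"]

def confusion_rates(true_tags, predicted_tags):
    """Row-normalised confusion matrix: rates[t][p] = share of true t predicted as p."""
    counts = Counter(zip(true_tags, predicted_tags))
    rates = {}
    for t in TAGS:
        row_total = sum(counts[(t, p)] for p in TAGS)
        rates[t] = {
            p: (counts[(t, p)] / row_total if row_total else 0.0) for p in TAGS
        }
    return rates

# Made-up gold and predicted tags for ten sentences.
true = ["B", "B", "B", "B", "P", "P", "M", "R", "C", "C"]
pred = ["B", "B", "B", "P", "B", "P", "M", "R", "C", "R"]
rates = confusion_rates(true, pred)
print(rates["B"]["P"], rates["P"]["B"])
```

Off-diagonal entries such as `rates["B"]["P"]` correspond to statements of the form "Background was wrongly predicted as Purpose in x% of the cases".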
Background was wrongly predicted as Purpose in 5.68% of the cases, and Purpose was wrongly predicted as Background in 22.69% of the cases. As Example 1 of Table 3 shows, sentence segmentation strips the second sentence of its context, and it is predicted as Background. Moreover, in social science abstracts, authors sometimes state the background without stating the research purpose, and vice versa.
Conclusions were wrongly predicted as Results in 8.29% of the cases, and the rate is 6.36% in the opposite case. Example 2 of Table 3 shows that sentences describing Results and Conclusions can have similar meanings, which leads to incorrect predictions. In addition, in the textual structure of social science abstracts, the boundary between Conclusions and Results is blurred in some cases, with the two parts overlapping each other.
Background was wrongly predicted as Conclusions in 6.69% of the cases, and Conclusions was wrongly predicted as Background in 6%. In Example 3 of Table 3, the second sentence should be predicted as Conclusions, but its meaning is similar to that of a Background sentence, so it is predicted as Background. In social science writing, a Background statement can also be read as a conclusion when viewed from another perspective.
Methods was wrongly predicted as Results in 3.29% of the cases, and Results was wrongly predicted as Methods in 2.96%. Examples 4 and 5 of Table 3 show that the cause is again sentence segmentation: without context, a sentence describing the method can be confused with a sentence describing the result, which leads to prediction errors. In addition, the way Methods are written differs in important respects between the natural sciences and the social sciences. This is especially true of questionnaire-based studies, in which the writing of Methods shows some similarities to that of Conclusions.
These examples show that the main cause of prediction errors is that conventional sentence segmentation prevents the model from making full use of a sentence's context, so that sentences with different functions can appear near-synonymous when read in isolation.
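Pairwise error rates of the kind quoted in this section can be obtained by normalising a confusion matrix by the number of sentences in each gold class. The following is a minimal sketch, assuming that the rates are computed per gold label and that gold and predicted labels are available as parallel lists; the function name `confusion_rates` and the toy data are illustrative and are not taken from the authors' code.

```python
from collections import Counter


def confusion_rates(gold, predicted):
    """Map (gold_label, predicted_label) to the percentage of sentences
    with that gold label that received that predicted label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pair_counts = Counter(zip(gold, predicted))
    gold_counts = Counter(gold)
    return {
        (g, p): 100.0 * n / gold_counts[g]
        for (g, p), n in pair_counts.items()
    }


# Toy labels invented for illustration only (not the paper's data).
gold = ["Purpose", "Purpose", "Purpose", "Purpose", "Background", "Background"]
pred = ["Purpose", "Background", "Purpose", "Purpose", "Background", "Purpose"]

rates = confusion_rates(gold, pred)
# Share of Purpose sentences that were predicted as Background.
print(rates[("Purpose", "Background")])
```

Because each rate is relative to its own gold class, the two directions of a confusion pair (for example, Purpose predicted as Background versus Background predicted as Purpose) are generally not symmetric.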
5.4 Model applications
This research identified the functional structures of abstracts from the perspectives of sequence tagging and sentence classification by training models on social sciences abstract data from the WOS database. To put these models to use, the authors used the Python Django Web framework to develop a platform for identifying the functional structures of SSCI abstracts and searching them. As shown in Figure 7, after a user enters an unstructured abstract in the text box and selects either the sequence tagging model or the sentence classification model, the platform labels the functional structure of each sentence in a different colour, helping researchers organise and read abstracts conveniently.
Based on the identification of the functional structures of abstracts in social sciences, a faceted search platform for social sciences abstracts was constructed through a detailed analysis of the structured abstracts relating to information science. After entering an information science query in the text box, one can choose to search within a particular functional structure. For example, restricting the search to Methods retrieves papers whose Methods sentences match the query, which satisfies the need for fine-grained retrieval, as shown in Figure 8.
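The colour labelling step can be illustrated with a small rendering helper. This is a sketch only, assuming that a model has already returned one label per sentence; the function `highlight_abstract`, the `LABEL_COLOURS` mapping and the colour values are hypothetical, since the source does not describe the platform's implementation or colour scheme.

```python
from html import escape

# Hypothetical colour scheme; the platform's actual colours are not stated.
LABEL_COLOURS = {
    "Background": "#cfe2f3",
    "Purpose": "#d9ead3",
    "Methods": "#fff2cc",
    "Results": "#f4cccc",
    "Conclusions": "#d9d2e9",
}


def highlight_abstract(labelled_sentences):
    """Render (sentence, label) pairs as colour-coded HTML spans."""
    spans = []
    for sentence, label in labelled_sentences:
        # Unknown labels fall back to a white background.
        colour = LABEL_COLOURS.get(label, "#ffffff")
        spans.append(
            f'<span style="background-color: {colour}" '
            f'title="{escape(label)}">{escape(sentence)}</span>'
        )
    return " ".join(spans)


# Toy input invented for illustration.
html_output = highlight_abstract([
    ("Online learning is growing.", "Background"),
    ("We surveyed <200> teachers.", "Methods"),
])
print(html_output)
```

Escaping the sentence text before embedding it in markup keeps user-supplied abstracts from being interpreted as HTML; in a Django template the same protection is normally provided by automatic escaping.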
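A faceted search over structured abstracts can be sketched as a filter on both the functional label and the query terms. The example assumes that each abstract is stored as a list of (sentence, label) pairs and uses plain substring matching; the function `faceted_search` and the toy records are illustrative assumptions, not the retrieval method used by the platform.

```python
def faceted_search(abstracts, query, facet):
    """Return (abstract_id, sentence) pairs in which a sentence labelled
    `facet` contains every term of `query` (case-insensitive)."""
    terms = query.lower().split()
    hits = []
    for abstract_id, sentences in abstracts.items():
        for sentence, label in sentences:
            text = sentence.lower()
            if label == facet and all(term in text for term in terms):
                hits.append((abstract_id, sentence))
    return hits


# Toy records invented for illustration.
abstracts = {
    "a1": [
        ("Online learning is growing.", "Background"),
        ("We ran a survey of 200 teachers.", "Methods"),
    ],
    "a2": [
        ("A survey showed mixed results.", "Results"),
        ("Interviews were coded thematically.", "Methods"),
    ],
}

method_hits = faceted_search(abstracts, "survey", "Methods")
print(method_hits)
```

In this toy data, the word "survey" also appears in a Results sentence of abstract `a2`, but that sentence is excluded because the search is restricted to the Methods facet.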
6. Conclusion
Because previous studies generally covered only a narrow range of disciplines or a small number of journals, this study took the social sciences as its research object and investigated the distribution of sentence functions in this field. Statistical analysis identified five main sentence functions: background, purpose, methods, results and conclusions. Models for identifying the functional structures of abstracts were constructed from the perspectives of sentence classification and sequence tagging, of which BERT produced the best performance. Specifically, its precision, recall and F1 score reached 86.28%, 86.26% and 86.23%, respectively. Furthermore, a platform for identifying the functional structures of abstracts in the social sciences was built on the trained model. In addition, a faceted abstract search service based on structured abstracts in the field of information science was provided for academic users to meet the need for fine-grained academic retrieval.
This research still has some limitations. Because abstracts are segmented into individual sentences, the sentence classification models cannot make effective use of contextual information when predicting functional structures. In the future, improved sentence segmentation methods, models and text features will be used to enhance the performance of the model.
The authors thank all the participants in the study and anonymous reviewers for their constructive comments.
The authors acknowledge the National Natural Science Foundation of China (Grant Number: 71974094) for financial support.
Figure 1. Distribution of structured abstracts in disciplines of the social sciences
Figure 2. Example of an annotated abstract in the social science data set
Figure 3. Transition probability distribution of the functional structures of abstracts in the social sciences
Figure 4. Data sample of the sentence classification model
Figure 5. Data sample of the sequence tagging model
Figure 6. BERT tenfold cross-validation confusion matrix
Figure 7. Web-based identification platform for the functional structures of abstracts
Figure 8. Platform for the faceted search of the functional structures of abstracts
Table 1. Category distribution of sentences' functional structure
| Functional category | Specific tags contained in functional structure | No. of sentences | Proportion (%) |
|---|---|---|---|
| Background | Background, context, introduction, etc. | 94,477 | 17.82 |
| Purpose | Purpose, originality/value, objectives, etc. | 61,896 | 11.67 |
| Methods | Design/methodology/approach, measurement(s), analysis, etc. | 129,248 | 24.37 |
| Results | Result(s), outcome measure, principal findings, etc. | 115,404 | 21.76 |
| Conclusions | Conclusion(s), discussion, limitations, implications, etc. | 129,289 | 24.38 |
Table 2. Comparison between the performance of the sentence classification models (SVM, TextCNN, BERT) and that of the sequence tagging models (CRF, Bi-LSTM-CRF, Bi-GRU-CRF)
| No. | SVM Precision (%) | SVM Recall (%) | SVM F1 (%) | TextCNN Precision (%) | TextCNN Recall (%) | TextCNN F1 (%) | BERT Precision (%) | BERT Recall (%) | BERT F1 (%) | CRF Precision (%) | CRF Recall (%) | CRF F1 (%) | Bi-LSTM-CRF Precision (%) | Bi-LSTM-CRF Recall (%) | Bi-LSTM-CRF F1 (%) | Bi-GRU-CRF Precision (%) | Bi-GRU-CRF Recall (%) | Bi-GRU-CRF F1 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 67.10 | 64.68 | 65.04 | 83.27 | 83.33 | 83.26 | 86.30 | 86.31 | 86.26 | 63.82 | 60.42 | 62.08 | 83.43 | 83.14 | 83.28 | 83.54 | 82.49 | 83.01 |
| 2 | 69.13 | 67.26 | 67.64 | 83.12 | 83.17 | 83.13 | 86.44 | 86.43 | 86.40 | 63.87 | 60.42 | 62.10 | 83.17 | 83.13 | 83.15 | 83.53 | 82.85 | 83.19 |
| 3 | 69.33 | 67.88 | 68.12 | 83.13 | 83.19 | 83.14 | 86.21 | 86.19 | 86.16 | 62.97 | 59.43 | 61.15 | 83.10 | 82.59 | 82.84 | 83.09 | 82.47 | 82.78 |
| 4 | 68.95 | 66.91 | 67.43 | 83.47 | 83.50 | 83.48 | 86.47 | 86.46 | 86.43 | 64.00 | 60.52 | 62.21 | 83.49 | 82.91 | 83.20 | 83.33 | 82.77 | 83.05 |
| 5 | 69.11 | 67.17 | 67.46 | 83.39 | 83.46 | 83.40 | 86.38 | 86.35 | 86.32 | 64.93 | 61.30 | 63.06 | 83.49 | 82.88 | 83.18 | 83.50 | 83.04 | 83.27 |
| 6 | 68.98 | 67.47 | 67.54 | 83.15 | 83.21 | 83.16 | 86.33 | 86.32 | 86.28 | 65.33 | 61.69 | 63.45 | 82.48 | 82.38 | 82.43 | 83.41 | 83.00 | 83.21 |
| 7 | 68.09 | 65.70 | 66.01 | 82.93 | 82.98 | 82.93 | 86.04 | 86.01 | 85.97 | 64.56 | 61.08 | 62.78 | 82.73 | 82.42 | 82.57 | 83.30 | 82.71 | 83.00 |
| 8 | 68.00 | 65.55 | 65.93 | 83.22 | 83.31 | 83.26 | 86.23 | 86.20 | 86.16 | 63.73 | 60.19 | 61.91 | 83.22 | 82.82 | 83.02 | 83.56 | 83.21 | 83.38 |
| 9 | 68.50 | 66.23 | 66.61 | 83.09 | 83.08 | 83.06 | 86.17 | 86.13 | 86.11 | 63.26 | 59.70 | 61.43 | 83.00 | 83.22 | 83.11 | 83.05 | 82.84 | 82.95 |
| 10 | 69.01 | 66.90 | 67.23 | 83.20 | 83.22 | 83.14 | 86.26 | 86.19 | 86.16 | 65.68 | 62.25 | 63.92 | 83.05 | 82.99 | 83.02 | 83.33 | 82.23 | 82.78 |
| Avg | 68.62 | 66.58 | 66.90 | 83.20 | 83.25 | 83.20 | 86.28 | 86.26 | 86.23 | 64.22 | 60.70 | 62.41 | 83.12 | 82.85 | 82.98 | 83.36 | 82.76 | 83.06 |
Table 3. Examples of prediction errors of BERT
| No. | Sentence | Predicted | Gold |
|---|---|---|---|
| 1 | To examine the degree to which the effects of corporal punishment are equivalent across neighbourhoods | Purpose | Purpose |
| | Specifically, is corporal punishment equivalently associated with child behaviour problems in neighbourhoods that are perceived to be unsafe or disadvantaged, as compared to neighbourhoods that are perceived to be less disadvantaged? | Background | Purpose |
| 2 | This review indicated that use of social constructivist theory in the KT literature was limited and haphazard | Conclusions | Results |
| | The lack of justification for the use of theory continues to represent a shortcoming of the papers reviewed | Conclusions | Results |
| 3 | Bladder cancer was the only cancer for which women had a significant disadvantage (RER = 1.23); this excess risk seemed to be restricted to the first 12–18 months after diagnosis | Results | Results |
| | The reasons behind sex-specific differences in cancer survival are not well understood | Background | Conclusions |
| 4 | Based on CRC three sources data, the total number of GPs was 53,630 in 2015–2016 | Methods | Results |
| | Distribution of GPs per 1,000 population among the provinces indicates that provinces of Kohgiluyeh and Boyer Ahmad, Mazandaran, Golestan and Yazd with ratios of 1.28, 1.28, 1.21 and 1.17 physicians rank the highest proportion of GPs and the provinces of Sistan and Baluchestan, Ilam, Zanjan, Alborz and North Khorasan with corresponding ratios of 0.24, 0.40, 0.40, 0.43 and 0.45 GPs ranked the lowest | Results | Results |
| 5 | Data were collected in the summer of 2003 using a prospective survey design | Methods | Methods |
| | The survey was mailed to active duty soldiers on modified work plans because of musculoskeletal injuries | Methods | Methods |
| | These soldiers were assigned to one Army installation in the USA | Results | Methods |
© Emerald Publishing Limited.
