COVID‐19 clinical medical relationship extraction


INTRODUCTION

In the field of medical research, clinical trials are one of the important means of promoting human health [1]. Since 2019, coronavirus disease 2019 (COVID-19), an acute respiratory infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has become an unprecedented public health crisis, posing a serious threat to human life [2, 3]. Owing to the severity of COVID-19, the World Health Organization raised its risk assessment to the highest level and declared it a global pandemic. According to clinical observation, typical symptoms include dry cough, dyspnoea, headache and fever; other symptoms include muscle pain, confusion, chest pain and diarrhoea. Severe cases may develop acute respiratory distress syndrome and septic shock, eventually leading to multiple organ failure and even death [4].

In clinical trial summaries [5], there are many medically related semantic relationships between entities. After the entity recognition task, the relationships between entities can be further extracted. These entity relationships not only reflect the latest diagnosis methods, drug designs, treatment plans, preventive measures and test purposes devised by clinical researchers studying related diseases but also contain rich clinical trial knowledge and rules. Therefore, performing entity relationship extraction on abstracts to construct a clinical knowledge graph [6] is of great significance in improving trial efficiency, summarising trial rules, customising personalised plans, understanding the latest clinical information, improving clinical trial design and saving clinical resources.

Taking COVID-19 as an example, this study uses an entity relationship extraction model to mine the text of a large number of relevant clinical trial registrations and extracts the clinical entities and relationships they contain. The extracted knowledge lays a foundation for follow-up research on drug recommendation, disease prediction, adverse drug reaction detection, intelligent medical question answering systems and so on.

The main contributions of this paper are summarised as follows:

First, based on the Unified Medical Language System and previous work, the relationship types for the clinical trial texts of this study were determined, and a COVID-19 clinical entity relationship extraction corpus was constructed.

Secondly, a pre-trained model is used to extract semantics and obtain dynamic word vectors, and the hidden deep features in the input vectors are extracted through a hierarchical bidirectional gated recurrent unit (BiGRU) network.

At the same time, the attention mechanism is introduced to capture the feature information of the sentence.

Finally, the resulting features are fed into a Conditional Random Field (CRF) model to obtain a more accurate conditional probability of the entity relationship. The experimental results show that the proposed model performs well on the COVID-19 clinical entity relationship extraction task.

RELATED WORKS

Long short-term memory

To solve the problem that the traditional recurrent neural network (RNN) [7] may suffer from exploding and vanishing gradients when trained on long sentences, Hochreiter et al. [8] proposed Long Short-Term Memory (LSTM), which can capture long-distance dependencies in long texts. Compared with the standard RNN model, LSTM adds two modules: a gating mechanism and a memory unit. The memory unit stores text features, and the gating mechanism filters the information stored in the memory unit. Zhao et al. [9] proposed a fault diagnosis method based on the LSTM neural network that can directly classify raw process data without specific feature extraction and classifier design. Li et al. [10] used an LSTM neural network to predict tourist flows and experimentally demonstrated that LSTM outperformed the autoregressive integrated moving average model and the back-propagation neural network; this was the first application of LSTM to tourism flow forecasting.

The LSTM model has an input gate, a forgetting gate and an output gate, which mitigate the problems the RNN model may encounter when processing long texts by accumulating and updating information. Its unit structure is shown in Figure 1.

[IMAGE OMITTED. SEE PDF]

The LSTM model is composed of the input word X_t, cell state C_t, temporary cell state ${\widetilde{C}}_{t}$ , hidden state h_t, forgetting gate f_t, input gate i_t and output gate o_t at time t. Among them, the forgetting gate determines which information from the previous step is retained or discarded; the input gate processes the input at the current sequence position; the output gate determines the next hidden state.

Calculate the forgetting gate according to formula (1), where the input is the hidden state h_t−1 at the previous moment and the input word X_t at the current moment. 1 ${f}_{t}=\sigma \left({W}_{f}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{f}\right)$

Calculate the value i_t of the input gate and the temporary cell state ${\widetilde{C}}_{t}$ . As shown in formula (2) and formula (3), where tanh is the hyperbolic tangent activation function. 2 ${i}_{t}=\sigma \left({W}_{i}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{i}\right)$ 3 ${\widetilde{C}}_{t}=\mathrm{tanh}\left({W}_{C}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{C}\right)$

Calculate the cell state C_t at time t, where the inputs are the value of the input gate i_t, the value of the forgetting gate f_t, the temporary cell state ${\widetilde{C}}_{t}$ and the cell state C_t−1 at the previous time. The calculation formula is as follows: 4 ${C}_{t}={f}_{t}\ast {C}_{t-1}+{i}_{t}\ast {\widetilde{C}}_{t}$ The value o_t of the output gate at time t is calculated from the hidden state h_t−1 at the previous time and the input word X_t at the current time, and the hidden state h_t is then obtained from o_t and the cell state C_t at the current time, as shown in formulas (5) and (6). 5 ${o}_{t}=\sigma \left({W}_{o}\cdot \left[{h}_{t-1},{X}_{t}\right]+{b}_{o}\right)$ 6 ${h}_{t}={o}_{t}\ast \mathrm{tanh}\left({C}_{t}\right)$ In the above steps, σ is the sigmoid function, tanh is the hyperbolic tangent activation function, ∗ denotes element-wise multiplication, and W and b respectively represent the weight matrix and bias vector of the corresponding gate. After the LSTM calculation, we finally obtain the hidden state sequence $\left\{{h}_{0},{h}_{1},\ldots ,{h}_{n-1}\right\}$ with the same length as the sentence.
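
The gate computations above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in this study: the parameter names (`W_f`, `b_f` and so on) and the helper functions are assumptions, and the cell state is updated in the standard way, C_t = f_t ∗ C_t−1 + i_t ∗ C̃_t, with element-wise products.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W·v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps hypothetical names (W_f, b_f, ...) to parameters."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]        # formula (1)
    i_t = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]        # formula (2)
    c_tilde = [math.tanh(a) for a in affine(p["W_C"], z, p["b_C"])]  # formula (3)
    # Cell state: keep part of the old memory and add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    o_t = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]        # formula (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]               # formula (6)
    return h_t, c_t


def run_lstm(inputs, hidden_size, p):
    """Return the hidden state sequence, one state per input vector."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    states = []
    for x_t in inputs:
        h, c = lstm_step(x_t, h, c, p)
        states.append(h)
    return states
```

Calling `run_lstm` on a sentence of n word vectors returns n hidden states, matching the statement that the hidden state sequence has the same length as the sentence.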

In natural language processing (NLP) tasks, especially sequence labelling tasks, context is particularly important, whether for words, phrases or characters. The common LSTM unit propagates forward only. In sequence problems, however, a forward LSTM cannot process the subsequent context, so the model cannot learn from what follows, which affects the final result. Bidirectional LSTM (BiLSTM) [11] captures both the preceding and the following context. It memorises information from both directions and improves the performance of the whole NLP model [12] by obtaining the outputs of both directions at the same time. Xu et al. [13] used BiLSTM for sentiment analysis; their comparative experiments show that the proposed method has higher accuracy, recall and F1 scores than LSTM, RNN and other models. The structure of the BiLSTM model is shown in Figure 2 [14].

[IMAGE OMITTED. SEE PDF]

Its steps are as follows: Start from the front and back respectively, then calculate the LSTM of different paths and then combine the LSTMs of two different directions to obtain the BiLSTM. The forward LSTM contains the past data information of the input sequence; the backward LSTM contains the future data information of the input sequence. Then the hidden state H_t of BiLSTM at time t includes forward $\overrightarrow{{h}_{t}}$ and backward $\stackrel{{\leftarrow}}{{h}_{t}}$ . The specific formula is as follows: 7 $\begin{array}{r}\hfill \overrightarrow{{h}_{t}}=\overrightarrow{LSTM}\left({h}_{t-1},{x}_{t},{c}_{t-1}\right),t\in [1,T]\\ \hfill \stackrel{{\leftarrow}}{{h}_{t}}=\stackrel{{\leftarrow}}{LSTM}\left({h}_{t+1},{x}_{t},{c}_{t+1}\right),t\in [T,1]\\ {H}_{t}=[\overrightarrow{{h}_{t}},\stackrel{{\leftarrow}}{{h}_{t}}]\hfill \end{array}$

Gated recurrent neural network

The gated recurrent unit (GRU) is a gating mechanism for RNNs [15], similar to other gating mechanisms such as LSTM. It aims to solve the gradient problems of the standard RNN while retaining the long-term information of the sequence. The difference is that GRU keeps only the elements of LSTM that are really necessary for learning: it combines the forgetting gate and input gate of LSTM into a single update gate, introduces a reset gate and removes the cell state, which reduces the number of model parameters and speeds up training. In practical applications, GRU and LSTM often perform similarly well, and other gated RNN variants rarely beat these two structures across a wide range of tasks. Hafiz et al. [16] used an attention-based GRU-LSTM statement-level defect prediction method to address the problem that software defect prediction cannot accurately predict failures. Zhou et al. [17] used the GRU model for the time series prediction of air pollutants, and their comparative experiments showed that the GRU-based model achieved higher prediction accuracy. The general structure of GRU is shown in Figure 3.

Specifically, assume the number of hidden units is h. Given the mini-batch input ${X}_{t}\in {\boldsymbol{R}}^{n\times d}$ (n is the number of samples, d is the number of inputs) at time step t and the hidden state ${H}_{t-1}\in {\boldsymbol{R}}^{n\times h}$ of the previous time step, the reset gate ${R}_{t}\in {\boldsymbol{R}}^{n\times h}$ and the update gate ${Z}_{t}\in {\boldsymbol{R}}^{n\times h}$ are calculated as follows: 8 ${R}_{t}=\sigma \left({X}_{t}{W}_{xr}+{H}_{t-1}{W}_{hr}+{b}_{r}\right)$ 9 ${Z}_{t}=\sigma \left({X}_{t}{W}_{xz}+{H}_{t-1}{W}_{hz}+{b}_{z}\right)$ where W_xr, ${W}_{xz}\in {\boldsymbol{R}}^{d\times h}$ and W_hr, ${W}_{hz}\in {\boldsymbol{R}}^{h\times h}$ are weight matrices, and b_r, ${b}_{z}\in {\boldsymbol{R}}^{1\times h}$ are bias vectors. σ is the logistic sigmoid function, so each element of the reset gate R_t and the update gate Z_t lies in the range [0,1].

[IMAGE OMITTED. SEE PDF]

The candidate hidden state ${\widetilde{H}}_{t}\in {\boldsymbol{R}}^{n\times h}$ calculated based on the reset gate is shown in formula (10), where ${W}_{xh}\in {\boldsymbol{R}}^{d\times h}$ and ${W}_{hh}\in {\boldsymbol{R}}^{h\times h}$ are weight matrices, ${b}_{h}\in {\boldsymbol{R}}^{1\times h}$ is the bias vector and ⊙ denotes element-wise multiplication (the Hadamard product). 10 ${\widetilde{H}}_{t}=\mathrm{tanh}\left({X}_{t}{W}_{xh}+\left({R}_{t}\odot {H}_{t-1}\right){W}_{hh}+{b}_{h}\right)$

The reset gate determines how the candidate hidden state at the current time depends on the hidden state at the previous time, which may contain the complete historical information of the time series. Therefore, the reset gate can be used to discard historical information that is irrelevant to the prediction.

Finally, the hidden state ${H}_{t}\in {\boldsymbol{R}}^{n\times h}$ at time step t is updated based on the update gate, as shown in formula (11): 11 ${H}_{t}={Z}_{t}\odot {H}_{t-1}+\left(1-{Z}_{t}\right)\odot {\widetilde{H}}_{t}$
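
As a minimal sketch of formulas (8) to (11), the following plain-Python function performs one GRU step for a single sample (n = 1), using the row-vector convention X_t W of the formulas. The dictionary keys and helper names are hypothetical and chosen only to mirror the symbols above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def vec_mat(v, W):
    """Row vector v (length d) times matrix W (d x h), giving a vector of length h."""
    return [sum(v[k] * W[k][j] for k in range(len(v))) for j in range(len(W[0]))]


def gate(x_t, h_in, W_x, W_h, b, activation):
    """Compute activation(x_t W_x + h_in W_h + b) element by element."""
    pre = zip(vec_mat(x_t, W_x), vec_mat(h_in, W_h), b)
    return [activation(a + c + d) for a, c, d in pre]


def gru_step(x_t, h_prev, p):
    """One GRU time step; p holds the weights under hypothetical key names."""
    r_t = gate(x_t, h_prev, p["W_xr"], p["W_hr"], p["b_r"], sigmoid)       # formula (8)
    z_t = gate(x_t, h_prev, p["W_xz"], p["W_hz"], p["b_z"], sigmoid)       # formula (9)
    reset_h = [r * h for r, h in zip(r_t, h_prev)]                         # R_t ⊙ H_{t-1}
    h_tilde = gate(x_t, reset_h, p["W_xh"], p["W_hh"], p["b_h"], math.tanh)  # formula (10)
    # Formula (11): interpolate between the old state and the candidate state.
    return [z * h + (1.0 - z) * g for z, h, g in zip(z_t, h_prev, h_tilde)]
```

When the update gate is close to 1 the previous hidden state is carried over almost unchanged, and when it is close to 0 the candidate state replaces it, which is how the GRU retains or overwrites long-term information.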

Conditional random field

Conditional Random Field is a discriminative undirected probabilistic graphical model based on the maximum entropy model and the hidden Markov model. It is a conditional probability model commonly used to label and segment sequential data. Using the CRF method, Li et al. [18] made full use of the temporal characteristics of music audio features to classify music regions. Its general definition is as follows:

Let X and Y be random variables and P(Y|X) be the conditional probability distribution of Y given X. Suppose that the random variable Y forms an undirected graph G = (V, E), where $Y=\left\{{Y}_{v}\vert v\in V\right\}$ is a set of random variables Y_v indexed by the vertices v of G. Given X, if each random variable Y_v obeys the Markov property, that is, formula (12) holds for any vertex v, then the conditional probability distribution P(Y|X) is called a CRF. 12 $P\left({Y}_{v}\vert X,{Y}_{w},w\ne v\right)=P\left({Y}_{v}\vert X,{Y}_{w},w\sim v\right)$ where w ~ v represents all vertices w connected to vertex v by an edge in graph G = (V, E); w ≠ v represents all vertices except vertex v in graph G = (V, E); Y_v and Y_w are the random variables corresponding to vertices v and w.

Attention mechanism

The attention mechanism was first applied to computer images and was then gradually adopted in speech recognition, natural language processing and other fields owing to its excellent performance [19–21]. Yan et al. [22] used a novel spatiotemporal attention mechanism in an encoder-decoder neural network for video captioning. The spatiotemporal attention mechanism takes into account the spatial and temporal structure of the video, enabling the decoder to automatically select the most relevant time segments of important regions for word prediction.

Its working principle is to calculate the similarity between the current input unit and the entire input sentence through a scoring function and then assign the result to the input sentence as a weight. Let α_i be the attention distribution (i.e. a probability distribution) and $f\left(Q,{K}_{i}\right)$ be the attention scoring function. Common scoring functions include the additive model, dot product model, scaled dot product model and bilinear model, as shown in formula (13): 13 $f\left(Q,{K}_{i}\right)=\left\{\begin{array}{lr}{Q}^{T}{K}_{i}\hfill & \hfill \text{dot}\,\text{product}\,\text{model}\\ {Q}^{T}W{K}_{i}\hfill & \hfill \text{bilinear}\,\text{model}\\ \frac{{Q}^{T}{K}_{i}}{\sqrt{d}}\hfill & \hfill \text{scaled}\,\text{dot}\,\text{product}\,\text{model}\\ {v}^{T}\mathrm{tanh}\left(WQ+U{K}_{i}\right)\hfill & \hfill \text{additive}\,\text{model}\end{array}\right.$ where v, W and U are parameters learnt during network training, d is the dimension of the key vectors, K is the key and Q is the query. The scores are then normalised with a Softmax function to obtain the probability distribution. (Softmax, also known as the normalised exponential function, extends the binary classification function sigmoid to multi-class classification; its purpose is to express multi-class results as probabilities.) The calculation formula is as follows: 14 ${\alpha }_{i}=Softmax\left(f\left(Q,{K}_{i}\right)\right)=\frac{\mathrm{exp}\left(f\left(Q,{K}_{i}\right)\right)}{{\sum }_{j}\mathrm{exp}\left(f\left(Q,{K}_{j}\right)\right)}$ The similarity between Q and the address K of each element in memory determines how much content is taken from the corresponding element (Value, V). The V value corresponding to each address K is extracted and weighted according to the similarity between Q and K. The final Attention value is obtained by weighted summation, as shown in formula (15): 15 $\,\text{Attention}\,(Q,K,V)=\sum\limits _{i}{\alpha }_{i}{V}_{i}$
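
The scoring, normalisation and weighted summation steps can be illustrated with the short sketch below. It covers only the dot product score and a scaled variant that divides by the square root of the key dimension (the usual definition, taken here as an assumption); all function names are illustrative.

```python
import math


def softmax(scores):
    """Normalise scores into a probability distribution, as in formula (14)."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_score(q, k):
    return sum(a * b for a, b in zip(q, k))


def scaled_dot_score(q, k):
    return dot_score(q, k) / math.sqrt(len(k))


def attention(q, keys, values, score=dot_score):
    """Return the Attention value of formula (15) together with the weights alpha."""
    alpha = softmax([score(q, k) for k in keys])
    dim = len(values[0])
    output = [sum(a * v[j] for a, v in zip(alpha, values)) for j in range(dim)]
    return output, alpha
```

If all keys are equally similar to the query, the weights are uniform and the Attention value reduces to the plain average of the values; a key that matches the query much better than the others dominates the sum.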

MPNet

MPNet is a pre-trained language model that builds on the respective characteristics of BERT and XLNet; it was jointly proposed by Nanjing University and Microsoft in 2020 [23]. Its main contribution is to integrate the advantages of the Masked Language Model (MLM) and the Permuted Language Model (PLM): it makes up for the deficiency that MLM cannot learn the dependency between predicted tokens and overcomes the problem that PLM cannot see the complete position information available in downstream tasks. The authors' experiments on various tasks show that MPNet substantially outperforms MLM and PLM, as well as previous powerful pre-trained models such as BERT, XLNet and RoBERTa.

The attention mask mechanism of the MPNet model is as follows. First, let the input sequence of length n = 6 be (x₁, x₂, x₃, x₄, x₅, x₆). If the randomly permuted sequence is (x₅, x₄, x₂, x₆, x₃, x₁) and the tokens to be predicted are x₆, x₃ and x₁, then the non-predicted part is expressed as (x₅, x₄, x₂, [mask], [mask], [mask]), corresponding to the position sequence $\left({\mathrm{P}}_{5},{\mathrm{P}}_{4},{\mathrm{P}}_{2},{\mathrm{P}}_{6},{\mathrm{P}}_{3},{\mathrm{P}}_{1}\right)$ . Second, to enable the [mask] of the prediction part to see the previously predicted tokens, MPNet uses the two-stream self-attention mechanism of PLM to perform autoregressive generation and sets different masking mechanisms for the content stream and the query stream. For example, when MPNet predicts x₃ in the above sequence, it can see (x₅ + P₅, x₄ + P₄, x₂ + P₂) in the non-prediction part and (x₆ + P₆) in the prediction part, thus avoiding the problem of missing dependencies in MLM. In addition, to ensure consistency between the input in pre-training and the input in the downstream task, MPNet adds mask symbols and position information ([mask] + P₆, [mask] + P₃, [mask] + P₁) to the non-prediction part, so that the model can see the positions of the complete sentence. When predicting x₃, the original (x₅ + P₅, x₄ + P₄, x₂ + P₂) and the ([mask] + P₃, [mask] + P₁) with additional tokens and position information can be seen in the non-prediction part, and the previously predicted (x₆ + P₆) can be seen in the prediction part. Compensating the positions of the query stream and the content stream in this way greatly reduces the input inconsistency between pre-training and fine-tuning.
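
To make the example concrete, the sketch below lists what is visible when each token of the prediction part is predicted, following the description above. It only enumerates the visibility rule with strings; it is not the attention mask implementation of MPNet, and the function name is hypothetical.

```python
def mpnet_visible(permutation, num_predicted):
    """Map each predicted position to the inputs visible when predicting it.

    permutation: positions in permuted order, e.g. (5, 4, 2, 6, 3, 1).
    num_predicted: how many trailing positions form the prediction part.
    """
    split = len(permutation) - num_predicted
    non_predicted = list(permutation[:split])
    predicted = list(permutation[split:])
    views = {}
    for step, target in enumerate(predicted):
        # Real tokens of the non-prediction part are always visible.
        visible = [f"x{p}+P{p}" for p in non_predicted]
        # Tokens predicted in earlier steps are visible with their positions.
        visible += [f"x{p}+P{p}" for p in predicted[:step]]
        # Tokens not yet predicted are visible only as [mask] plus position.
        visible += [f"[mask]+P{p}" for p in predicted[step:]]
        views[target] = visible
    return views


views = mpnet_visible((5, 4, 2, 6, 3, 1), num_predicted=3)
print(views[3])
```

For the target x₃ this prints the three real tokens of the non-prediction part, the previously predicted x₆ with its position, and the mask tokens carrying P₃ and P₁, as in the example in the text.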

ENTITY RELATION EXTRACTION MODEL BASED ON MPNet

The entity relationship extraction task is the basis for establishing the COVID-19 clinical knowledge graph. This section proposes an entity relationship extraction model suitable for COVID-19 clinical medical texts that incorporates MPNet, a pre-trained language model that has performed well recently.

This section first introduces the basic structure and theoretical background of the model, then determines the entity relationship types and relationship extraction tasks, completes the annotation of the experimental data sets and constructs a relationship extraction corpus based on COVID-19 clinical trial texts. To further improve the expressiveness of the model, the Dropout overfitting mitigation strategy is used to improve its generalisation ability, and a number of comparative experiments then verify that the model performs better on the clinical medical entity relationship extraction task.

Design of entity relation extraction model for clinical trials

Relation extraction (RE) is one of the most studied sub-tasks of information extraction. Its purpose is to extract the semantic relationship between two or more entities from text, so as to build the knowledge graph of the related field. Consider the RE training set $D=\left\{\left({E}_{1},{E}_{2},{R}_{e}\right),S\right\}$ , where S is the sample set, E₁ and E₂ are two entity sets and R_e represents the entity relationship set. Any d_i,j,k,l ∈ D is denoted as ${d}_{i,j,k,l}=\left\{\left({e}_{1i},{e}_{2j},{r}_{k}\right),{s}_{l}\right\}$ , meaning that the entities e_1i ∈ E₁ and e_2j ∈ E₂ stand in the relation r_k ∈ R_e in the sentence s_l ∈ S. The relation extraction task learns the relational mapping $\varphi \left({e}_{1},{e}_{2},s\right)\to r$ by training the model on set D and maximises the proportion of correctly mapped prediction samples in a given validation set V and test set D′ that have the same data distribution as the training set D.

Yang et al. [24] first proposed a hierarchical Attention network structure. By using two layers of Attention to encode words and sentences, it can distinguish between high-quality information and low-quality features, thus optimising the previous model architecture. Following this idea, the coding layer of this paper first extracts the forward and backward features of sequences through a bidirectional GRU (BiGRU) to capture context representations of entity relationships that contain semantic dependency and hierarchical structure information. Then, the multi-level Attention mechanism (MATT, including word-level Attention and sentence-level Attention) is introduced: word vectors are spliced, the self-attention weights are obtained through the self-attention mechanism, and the two are multiplied to obtain the sentence-level vector representation. Next, the semantic features between sentences are obtained through the sentence-level Attention layer, and the weights are spliced. Finally, the output vectors of the coding layer are weighted and summed to generate the sentence-level feature representation, which is fed into the CRF model of the output layer to complete the RE task.

The input unit includes the input of the text to be trained and the word vector representation of the corpus obtained through the embedding layer (MPNet). After inputting the samples to be trained, the word representation layer uses the MPNet pre-training language model to perform vector representation of words. Suppose there is a sentence sequence $S=\left[{x}_{1},{x}_{2},{x}_{3},\text{\ldots },{x}_{n}\right]$ in the training corpus V. For a word x_i in this sentence, the corresponding word vector ${e}_{{x}_{i}}\in {R}^{D}$ is obtained through matrix E ∈ R^D×|V| mapping. Finally, the sentence is converted to ${S}_{x}=\left[{e}_{1},{e}_{2},{e}_{3},\text{\ldots },{e}_{n}\right]$ through the model, where D is the dimension of the word vector, and |V| is the size of the feature matrix of the training corpus. The word vector dimension and position vector dimension of this model are set as 768 with reference to BERT_Base configuration. The text information combined with relative position and absolute position information is obtained by splicing the two vectors. The input calculation method accepted by the transformer is as follows, where pos is the position index, d_model is the vector dimension and i is the dimension index.

Formulas (16) and (17) give the 2i-th and (2i + 1)-th components of the encoding vector of position pos. The Embed in formula (18) is the embedding process of the transformer, which digitises all the information that needs to be given to the model, here mainly the position information. It can be understood as a function here (see Figure 4). 16 $Positio{n}_{\mathit{Embedding}}(pos,2i)=\mathrm{sin}\left(pos/1000{0}^{2i/{d}_{\mathit{model}}}\right)$ 17 $Positio{n}_{\mathit{Embedding}}(pos,2i+1)=\mathrm{cos}\left(pos/1000{0}^{2i/{d}_{\mathit{model}}}\right)$ 18 $Input=Embed(token)+Embed(position)$
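
The position encoding can be sketched as follows, assuming the standard sinusoidal form in which even components use sin(pos / 10000^(2i/d_model)) and odd components use the matching cosine. The function names are illustrative, and the token vector passed to `embed_input` stands in for the output of Embed(token).

```python
import math


def position_embedding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding of position pos."""
    encoding = []
    for k in range(d_model):
        i = k // 2  # components 2i and 2i + 1 share the same frequency
        angle = pos / (10000 ** (2 * i / d_model))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


def embed_input(token_vector, pos):
    """Formula (18): element-wise sum of token and position embeddings."""
    pos_vector = position_embedding(pos, len(token_vector))
    return [t + p for t, p in zip(token_vector, pos_vector)]
```

With the 768-dimensional setting used in this model, `position_embedding(pos, 768)` would give the vector added to the word vector at position pos.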

[IMAGE OMITTED. SEE PDF]

The RE model proposed in this study uses the BiGRU neural network in both the word-level and sentence-level encoding layers. BiGRU retains the long-term memory ability of LSTM [25] for long texts, learns a two-way encoding of the text and, at the same time, has a simpler internal structure than BiLSTM. By constructing a forward GRU and a reverse GRU, the hidden deep features in the input vector are extracted, and the contextual semantic relations in the input sequence are fully learnt and encoded. Given the current time t, BiGRU calculates the forward hidden state $\overrightarrow{{h}_{t}}$ from the forward hidden state $\overrightarrow{{h}_{t-1}}$ of the previous time t − 1 and the current sequence input I_t. At the same time, the reverse hidden state $\stackrel{{\leftarrow}}{{h}_{t}}$ is calculated from the reverse hidden state $\stackrel{{\leftarrow}}{{h}_{t+1}}$ of time t + 1 and the current sequence input I_t, and the weighted summation of the two state vectors then gives the hidden layer state h_t at the current moment. The calculation of $\overrightarrow{{h}_{t}}$ and $\stackrel{{\leftarrow}}{{h}_{t}}$ is shown in formulas (19) and (20), and the encoding output h_t of the word-level BiGRU model is shown in formula (21): 19 $\overrightarrow{{h}_{t}}=\overrightarrow{GRU}\left({I}_{t},\overrightarrow{{h}_{t-1}}\right),t\in [1,T]$ 20 $\stackrel{{\leftarrow}}{{h}_{t}}=\stackrel{{\leftarrow}}{GRU}\left({I}_{t},\stackrel{{\leftarrow}}{{h}_{t+1}}\right),t\in [T,1]$ 21 ${h}_{t}=[\overrightarrow{{h}_{t}}\cdot \stackrel{{\leftarrow}}{{h}_{t}}]$ Among them, the function GRU() performs a non-linear transformation on the output vector of the word representation layer, and T represents the maximum time.

Similarly, given the input sentence S_i, the sentence-level BiGRU coding layer encodes S_i, and the output h_i is calculated as follows, where L is the number of sentences: 22 $\overrightarrow{{h}_{i}}=\overrightarrow{GRU}\left({S}_{i},\overrightarrow{{h}_{i-1}}\right),i\in [1,L]$ 23 $\stackrel{{\leftarrow}}{{h}_{i}}=\stackrel{{\leftarrow}}{GRU}\left({S}_{i},\stackrel{{\leftarrow}}{{h}_{i+1}}\right),i\in [L,1]$ 24 ${h}_{i}=[\overrightarrow{{h}_{i}}\cdot \stackrel{{\leftarrow}}{{h}_{i}}]$
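
The bidirectional encoding used at both levels can be sketched generically: the same kind of recurrent step is run once from left to right and once from right to left, and the two states at each position are combined. In this sketch the combination is a concatenation, which is an assumption about the bracket notation in formulas (21) and (24); `step` is any function mapping an input and the previous state to a new state, for example a GRU step.

```python
def bidirectional_encode(inputs, step, initial_state):
    """Encode a sequence in both directions and pair the states per position.

    step(x, h) returns the next hidden state as a list of floats.
    """
    forward, h = [], initial_state
    for x in inputs:                 # t = 1 ... T
        h = step(x, h)
        forward.append(h)

    backward, h = [], initial_state
    for x in reversed(inputs):       # t = T ... 1
        h = step(x, h)
        backward.append(h)
    backward.reverse()               # realign with the original positions

    # Concatenate the forward and backward states at each position.
    return [f + b for f, b in zip(forward, backward)]


def running_sum(x, h):
    """Toy recurrent step used only to show which context each state has seen."""
    return [x[0] + h[0]]


encoded = bidirectional_encode([[1.0], [2.0], [3.0]], running_sum, [0.0])
```

With the toy step, the forward half of each state summarises the inputs up to that position and the backward half summarises the inputs from that position to the end, which is the property the BiGRU layer relies on.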

Based on the hidden state output sequence of the BiGRU module, the model introduces a Self-Attention mechanism to learn word-level features and merges them with the sentence-level feature vectors. For the spliced information, the sentence-level Attention layer calculates the attention [26] weight vector, which is then normalised to obtain the weight probability distribution assigned by the sentence-level Attention mechanism. Finally, the feature vector output by the Attention layer is obtained as the weighted sum of the output vectors of the BiGRU layer.

Given the hidden state word-level output sequence h_w = {h_w1, h_w2, …, h_wM} of the BiGRU layer, let the corresponding feature representation vector be H_w = {h_w1, h_w2, …, h_wM}, where M represents the character length of the sequence. Pass H_w into the Self-Attention module to calculate the weight vector α of the hidden state. The calculation process is as follows: 25 $M=\mathrm{tanh}\left({H}_{w}\right)$ 26 $\alpha =Softmax\left({w}^{T}M\right)$

By multiplying the weight vector α and H_w, the weighted sum m of the output vectors is calculated as follows: 27 $m={H}_{w}{\alpha }^{T}$ where H_w ∈ R^dw, dw is the dimension of each word vector output by the hidden layer, w is a model training parameter and w^T represents the transpose of w. The Attention layer then merges h_w into the feature representation of the sequence. The specific calculation formula is as follows: 28 ${h}_{w}^{\ast }=\mathrm{tanh}(m)$
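
Formulas (25) to (28) can be sketched as below for one sequence of hidden vectors. The sketch assumes that w is a vector with one weight per hidden dimension and that Softmax is taken over the positions of the sequence; names are illustrative.

```python
import math


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_level_attention(H_w, w):
    """H_w: list of hidden vectors (one per position); w: trainable vector.

    Returns the word-level feature h_w* and the weights alpha.
    """
    # Formulas (25) and (26): score each position with w^T tanh(h).
    scores = [sum(w_k * math.tanh(h_k) for w_k, h_k in zip(w, h)) for h in H_w]
    alpha = softmax(scores)
    # Formula (27): weighted sum of the hidden vectors.
    dim = len(H_w[0])
    m = [sum(a * h[k] for a, h in zip(alpha, H_w)) for k in range(dim)]
    # Formula (28): squash the summary vector.
    return [math.tanh(v) for v in m], alpha
```

Positions whose hidden vectors align with w receive larger weights, so the summary vector is dominated by the words the model has learnt to treat as informative.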

The given sentence sequence is encoded by BiGRU to obtain the sentence-level vector m_i, and the sentence-level Attention mechanism and the context vector m_k are then introduced. The calculation process is as follows: 29 ${\alpha }_{i}=\frac{\mathrm{exp}\left({m}_{i}^{T}{m}_{k}\right)}{{\sum }_{j}\mathrm{exp}\left({m}_{j}^{T}{m}_{k}\right)}$ 30 ${h}_{s}^{\ast }={{\Sigma }}_{i}{h}_{w}^{\ast }{\alpha }_{i}$ Among them, α_i represents the weight obtained after introducing the sentence-level Attention mechanism, and ${h}_{s}^{\ast }$ represents the sentence-level feature representation.
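
A minimal sketch of the sentence-level step follows. It assumes that each sentence i has its own vector m_i and its own word-level feature, that the context vector m_k is given, and that the weights are normalised over all sentences; these indexing choices and all names are assumptions made for illustration.

```python
import math


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_level_attention(sentence_vectors, context, word_features):
    """Weight per-sentence features by similarity to a context vector.

    sentence_vectors: the m_i, one per sentence.
    context: the context vector m_k.
    word_features: the word-level feature of each sentence.
    """
    # Formula (29): similarity of each sentence vector to the context vector.
    scores = [sum(a * b for a, b in zip(m_i, context)) for m_i in sentence_vectors]
    alpha = softmax(scores)
    # Formula (30): weighted sum of the word-level features.
    dim = len(word_features[0])
    h_s = [sum(a * f[k] for a, f in zip(alpha, word_features)) for k in range(dim)]
    return h_s, alpha
```

A sentence whose vector is close to the context vector contributes most of the final representation, while unrelated sentences are down-weighted.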

After BiGRU feature extraction and the MATT weighting operation, the CRF layer automatically adds constraint rules to the final predicted labels (the output of the BiGRU layer): (1) the first word of an entity should be labelled “B-” or “O”; (2) valid patterns are “O B-[Label]” and “B-[Label] I-[Label]”, whereas the pattern “O I-[Label]” is regarded as invalid. Finally, in the decoding stage, the Viterbi algorithm is used to obtain the label sequence with the highest total predicted score, which is used as the entity relationship classification result for the COVID-19 clinical trial registration text.
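
The effect of these constraints during decoding can be illustrated with a small Viterbi sketch. It is a simplification rather than the CRF layer of the model: a trained CRF learns real-valued transition scores, whereas here transitions are only allowed or forbidden, and allowing “I-[Label]” after “I-[Label]” is the usual BIO convention, taken as an assumption. The tag `Disease_Drug` is used purely as an example label.

```python
NEG_INF = float("-inf")


def is_valid_start(tag):
    """A sequence may start with O or B-[Label], never with I-[Label]."""
    return not tag.startswith("I-")


def is_valid_transition(prev, curr):
    """I-[Label] may only follow B-[Label] or I-[Label] of the same label."""
    if curr.startswith("I-"):
        label = curr[2:]
        return prev in ("B-" + label, "I-" + label)
    return True


def viterbi_decode(emissions, tags):
    """emissions[t][k] is the score of tags[k] at position t."""
    n = len(tags)
    score = [emissions[0][k] if is_valid_start(tags[k]) else NEG_INF for k in range(n)]
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n):
            best_j, best = 0, NEG_INF
            for j in range(n):
                if is_valid_transition(tags[j], tags[k]) and score[j] > best:
                    best_j, best = j, score[j]
            new_score.append(best + emissions[t][k])
            pointers.append(best_j)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    k = max(range(n), key=lambda idx: score[idx])
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    return [tags[k] for k in reversed(path)]
```

Choosing the highest-scoring tag at each position independently could produce a sequence that starts with “I-[Label]”; decoding over whole paths with the constraints rules such sequences out.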

Overfitting mitigation strategies

In practical applications, sampling errors are often mixed into the training data of machine learning models and are fitted during training, so the model often performs well only on the training set. This kind of poor generalisation, in which the model performs worse on test, validation or other datasets, is called overfitting. The main reasons for overfitting can be summarised as follows: (1) the model has too many parameters, its structure is too complex and its learning capacity is too strong; (2) there are too few training samples, so the model cannot fully learn all the features in the training set; (3) the distributions of training samples and test samples are imbalanced, which leads to unsatisfactory outputs in practice.

Theoretically, overfitting can be avoided by increasing the number of training samples so that the model learns more features and performs better on the test samples. In practice, however, the workload of the entity labelling task is extremely large; in particular, the overall number of COVID-19 clinical trial abstracts is not large, and they contain various complex medical terms, which makes labelling inefficient and ultimately leads to insufficient training samples. Although the model can achieve good performance on the training set, its classification results on the test set and validation set may be inaccurate. To address the problem of small-sample overfitting, this paper adopts the Dropout overfitting mitigation strategy proposed by Hinton et al. [27]. By ignoring some neurons with a certain probability in each training step, the model relies less on local features and its generalisation ability improves. Under normal circumstances, the Dropout value is selected in the range of 0.2–0.5. The model in this study uses the Dropout strategy in both the word representation layer and the encoding layer.
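
The Dropout operation itself is simple, as the sketch below shows. It uses the common "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at inference time; this variant and the function signature are assumptions, not details taken from this study.

```python
import random


def dropout(values, rate, training=True, rng=random):
    """Zero each activation with probability `rate` during training.

    Surviving activations are divided by (1 - rate) so that the expected
    value of each unit is unchanged (inverted dropout).
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


# Example with a rate inside the 0.2-0.5 range mentioned above.
generator = random.Random(0)
activations = dropout([1.0] * 8, rate=0.5, rng=generator)
```

At inference time the function is called with `training=False` and returns the activations unchanged.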

EXPERIMENTS

Introduction to experimental data

This article uses the COVID-19-related clinical trial registration data in the US clinical trials registry (CT). CT contains privately and publicly funded clinical research projects carried out by clinical researchers around the world, including information about medical research on human volunteers, such as disease, intervention measures, title of the study, experimental design, inclusion/exclusion criteria and location of the study. At present, many researchers have used this database for investigation and research.

Krishna et al. [28] evaluated the evidence characteristics and expected strength of the COVID-19 research registered on the platform and identified the problems of the relevant clinical trials and directions for improvement. Amy et al. [29] analysed the death cases in clinical trials, compared the papers published on the trials with the registered trial data and found that the records were inconsistent. Pradhan et al. [30] developed a Python-based software application (EXACT) that automatically extracts the data required for meta-analysis from the database in spreadsheet format. The tool extracted data with 100% accuracy, saving 60% of the time compared with manually extracting data from journal articles. Federer et al. [31] used a Python script to extract information from ClinicalTrials and then used regular expressions and drug dictionaries to process and structure the relevant information into a relational database, on which they conducted data mining and pattern analysis of adverse drug events. The database can be used as a tool to help researchers find relationships between drugs and adverse events, so as to develop and reposition drugs.

Entity relationship definition

The relationship extraction task based on COVID-19 clinical trial registration data is to determine the semantic relationship between every two medical entities in the abstract. When there is an association between two entities, the relationship extraction task is treated as a classification task. Before this, the most critical thing is to determine the type of clinical entity relationship, and then define the entity relationship in the sentence according to the meaning of the sentence.

Clinical medical records contain rich medical information. Meystre et al. [32] suggested that “problem oriented medical records” be adopted to collect information guiding diagnosis and nursing plans, as well as a series of medical behaviours for patients' problems. Uzuner et al. [33] classified the semantic relationships in clinical medical abstracts, created an entity relationship system targeting problem-oriented records according to the characteristics of the data and defined a series of semantic relationships involving patients' medical problems. Based on this previous work, this paper defines the following six clinical trial data relationship types: Disease-Treatment (Disease_Procedure); Disease-Test (Disease_Item); Disease-Biopharmaceutical (Disease_Drug); Disease-Extent (Disease_Severity); Severity_Drug; and Symptom_Drug. The entity relationship types defined in this study and their related concepts are listed in Table 1.

TABLE 1 Clinical trial data entity relationship type definitions.

No.	Entity relationship type	Description of related meanings
1	Disease_Procedure	Relationship between disease name and corresponding treatment
2	Disease_Item	The relationship between the disease name and various inspection items and physiological indicators involved in the test
3	Disease_Drug	Relationship between disease names and related biologics used in intervention trials
4	Disease_Severity	The relationship between the disease name and the severity of the patient's infection
5	Severity_Drug	Relationship between severity of patient infection and related biologics used in intervention trials
6	Symptom_Drug	The relationship between clinical manifestations of disease and corresponding use of related biological agents

Finally, the Brat tool is used to annotate the samples, and the annotation data are exported as ann format files. Table 2 [34] shows an example of the annotation results in an ann file. After the labelling task is complete, the labelling files are matched with the content of the text files and the data are preprocessed, focusing on removing blank lines and illegal characters. The samples are split using a combination of punctuation marks and sliding windows; the entities are combined into pairs, and for each pair a sentence sequence containing both entities is segmented. Overly long sentences are filtered out to avoid noise that would affect the training of the model.

TABLE 2 Example of ann file annotation results.

Relationship number	Entity relationship	Entity pair
R1	Disease_Drug	Arg1:T4 Arg2:T6
R2	Disease_Severity	Arg1:T4 Arg2:T7
R3	Disease_Drug	Arg1:T13 Arg2:T15
R4	Disease_Item	Arg1:T13 Arg2:T23
R5	Severity_Drug	Arg1:T8 Arg2:T7
R6	Symptom_Drug	Arg1:T17 Arg2:T19
R7	Disease_Drug	Arg1:T21 Arg2:T20
R8	Disease_Item	Arg1:T21 Arg2:T26
R9	Disease_Item	Arg1:T29 Arg2:T32
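As a rough illustration of how relation rows like those in Table 2 can be read back from an ann file, the sketch below parses relation lines in the Brat standoff layout (an ID, a tab, then the relation type and the two arguments). The helper names `Relation` and `parse_relations` are hypothetical, and the sample string simply reuses three rows from Table 2; it is not the authors' preprocessing code.

```python
from collections import Counter
from typing import NamedTuple


class Relation(NamedTuple):
    rel_id: str    # e.g. "R1"
    rel_type: str  # e.g. "Disease_Drug"
    arg1: str      # entity ID of the first argument, e.g. "T4"
    arg2: str      # entity ID of the second argument, e.g. "T6"


def parse_relations(ann_text):
    """Collect relation annotations (lines whose ID starts with 'R')."""
    relations = []
    for line in ann_text.splitlines():
        line = line.strip()
        if not line.startswith("R"):
            continue  # skip entity lines (T...), notes and blank lines
        rel_id, body = line.split("\t", 1)
        rel_type, arg1, arg2 = body.split()
        relations.append(
            Relation(rel_id, rel_type, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
        )
    return relations


# Three rows taken from Table 2, written in ann layout.
SAMPLE_ANN = (
    "R1\tDisease_Drug Arg1:T4 Arg2:T6\n"
    "R2\tDisease_Severity Arg1:T4 Arg2:T7\n"
    "R5\tSeverity_Drug Arg1:T8 Arg2:T7\n"
)

relations = parse_relations(SAMPLE_ANN)
type_counts = Counter(r.rel_type for r in relations)
```

Each parsed relation can then be joined with the entity lines of the same file (IDs such as T4 and T6) to recover the text spans of the entity pair.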

Evaluation indicators

The RE task [35] usually uses precision (P), recall (R) and the F₁ value as the evaluation indicators of the model. The indicators are calculated as follows:

$P=\frac{TP}{TP+FP}$ (31)

$R=\frac{TP}{TP+FN}$ (32)

${F}_{1}=\frac{2PR}{P+R}=\frac{2TP}{2TP+FP+FN}$ (33)

where TP is the number of positive examples correctly predicted as positive, FP is the number of negative examples incorrectly predicted as positive and FN is the number of positive examples incorrectly predicted as negative. The sum of TP and FN is the total number of positive examples. In the RE task, a relation prediction is regarded as positive if and only if the classification result of the entity-to-entity relation is correct.
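The three indicators can be computed directly from the counts. The following is a minimal sketch; the function name `precision_recall_f1` and the example counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here rather than something the paper specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical counts: 8 correct relations, 2 spurious ones, 4 missed ones.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```

Both forms of the F₁ formula agree: with these counts, 2PR/(P+R) and 2TP/(2TP+FP+FN) each give 16/22.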

Experimental parameter configuration

This experiment is built and run on Ubuntu 16.04; the Python and PyTorch versions are 3.7.0 and 1.7.1, respectively, and the server graphics card is an NVIDIA RTX3080 (16G). The hyperparameters of the MPNet pre-trained model are consistent with Bert_Base: the model is composed of 12 Transformer layers, the hidden layer dimension is 768, 12 attention heads are used and GELU is the activation function. In the training phase, the maximum sequence length is 256, the batch_size is 128, the MPNet learning rate is 3e-5, the Dropout is 0.1 and the model is trained with the Adam optimisation algorithm [36].
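For reference, the settings listed above can be gathered into a single configuration object. This is only a sketch: the class name `ExperimentConfig` and its field names are hypothetical and do not come from the authors' code; the values are those stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    # Encoder settings (stated to match Bert_Base)
    num_layers: int = 12
    hidden_size: int = 768
    num_attention_heads: int = 12
    activation: str = "gelu"
    # Training settings
    max_seq_length: int = 256
    batch_size: int = 128
    learning_rate: float = 3e-5
    dropout: float = 0.1
    optimizer: str = "adam"

    @property
    def head_dim(self):
        """Dimension handled by each attention head."""
        return self.hidden_size // self.num_attention_heads


config = ExperimentConfig()
```

With a hidden size of 768 split over 12 heads, each attention head works on a 64-dimensional slice.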

Parameter optimisation

To alleviate model overfitting, a Dropout selection experiment compares the evaluation indicators obtained with different Dropout values, so as to select the value best suited to the model. The experiment uses five Dropout values: 0.1, 0.2, 0.3, 0.4 and 0.5. The experimental results are shown in Figure 5, where the ordinate is the score of each indicator and the abscissa is the Dropout value set in the experiment [37]. The results show that the F1 score first increases and then decreases as the Dropout value increases. It reaches its highest point at a Dropout value of 0.3, where the model obtains the best F1 score.

[IMAGE OMITTED. SEE PDF]

Experimental results and analysis

To verify that the proposed medical RE model effectively improves classification on COVID-19 clinical trial registration data, the following five methods are used for comparison:

(1) BiGRU-CRF benchmark model. Word embedding is implemented through the glove layer on the SemEval 2014 dataset, and a BiGRU module serves as the backbone network, which reduces the complexity of the model and alleviates the overfitting that BiLSTM may cause. A CRF module is then connected to jointly train the attention vector and the label vector and obtain the entity relationship label corresponding to the text.
(2) BiGRU-Softmax model. To verify the effectiveness of the CRF sequence label layer, the CRF reasoning layer of the previous model is removed and the Softmax function output is used directly as the result. BiGRU-Softmax is also one of the main relationship extraction methods widely studied in earlier work.
(3) BiGRU-Att-CRF model. Word2vec word embedding is used as the input representation of the text. Building on the BiGRU-CRF benchmark model, a multi-layer attention mechanism is introduced to learn the impact of different features on relationship classification at both the word and sentence levels, and to improve training efficiency and recognition performance by reducing model parameters and setting different weight values.
(4) BERT-BiGRU-CRF model. The Bert pre-trained language model provides better context dependency and parallelism, and BiGRU is used to fine-tune the results generated by the upper layer, which extracts the effective features of the text more accurately.
(5) XLNet-BiGRU-CRF fusion model [38]. The well-performing XLNet model integrates the contextual features to obtain the semantic representation of the input sequence; the BiGRU-CRF network then completes the feature extraction and calculates the label sequence probability to output the final predicted label.
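The difference between a Softmax output layer and a CRF reasoning layer can be sketched with a toy decoder. A Softmax layer picks the best label for each token independently, whereas a CRF scores whole label sequences using label-to-label transition scores and decodes with the Viterbi algorithm. Everything in the example is an assumption made for illustration: the three-label set (0, 1, 2, read as "outside", "begin", "inside"), the emission scores and the transition matrix are hypothetical, and this is not the authors' implementation.

```python
def greedy_decode(emissions):
    """Softmax-style decoding: best label per token, ignoring neighbouring labels."""
    return [max(range(len(scores)), key=scores.__getitem__) for scores in emissions]


def viterbi_decode(emissions, transitions):
    """CRF-style decoding: best label sequence under emission + transition scores."""
    n_labels = len(transitions)
    score = list(emissions[0])   # best score of any path ending in each label
    backpointers = []
    for scores in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label i for a path that moves into label j.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + scores[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(n_labels), key=score.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for a two-token sequence; columns are labels 0, 1, 2.
emissions = [[2.0, 1.8, 0.1],
             [0.5, 0.2, 1.0]]
# Hypothetical transition scores: moving from label 0 to label 2 is heavily penalised.
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

greedy_path = greedy_decode(emissions)                  # [0, 2]
viterbi_path = viterbi_decode(emissions, transitions)   # [1, 2]
```

In this toy case the independent per-token choice produces the penalised sequence 0 → 2, while the sequence-level decoder prefers 1 → 2. Constraints of this kind between adjacent labels are what a CRF layer can capture and a plain Softmax output cannot.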

A number of comparative experiments were carried out in the same experimental environment. The results are shown in Table 3 and Figure 6. Compared with the other methods, the relationship extraction method based on the MPNet language model and the MAtt mechanism improves every evaluation indicator. The specific comparative analysis is as follows:

(1) In the sequence annotation layer, the benchmark model BiGRU-CRF is compared with BiGRU-Softmax. Softmax's label prediction is not as good as that of the CRF model: the F1 value of the former is 58.53%, while that of the latter is 61.85%. Introducing the CRF layer therefore raises the F1 value of the entity relationship extraction model by 3.32 percentage points, which shows that the CRF algorithm plays an important role in the text relationship classification of COVID-19 clinical trials.
(2) The BiGRU-Att-CRF model improves the overall performance by introducing an Attention layer into the structure of the benchmark model. Comparative experiments show that this method improves the precision by 2.04 percentage points and the recall by 2.01 percentage points, and the F1 value increases from 61.85% to 63.88%. This indicates that the Attention mechanism captures the relationship dependencies between contexts. By dynamically adjusting the weight of each element to focus on those most similar to the input elements, and by considering both global and local connections, it improves the expressiveness of the model while allowing parallel computation.
(3) In the text representation layer, BERT-BiGRU-CRF and XLNet-BiGRU-CRF, which incorporate pre-trained language models, are compared with the benchmark models. Table 3 shows that all indicators improve after a language model is introduced, and that the XLNet model has certain advantages in transfer learning. The MPNet-BiGRU-MAtt-CRF model proposed in this study is then compared with these two models. The RE model using the MPNet pre-trained model and the multi-layer Attention structure achieves higher precision and recall, and its F1 value is 5.42 and 3.18 percentage points higher than those of BERT-BiGRU-CRF and XLNet-BiGRU-CRF, respectively, which shows that the fusion model proposed in this study performs better.

TABLE 3 Comparison results of each experiment (%).

Model	Precision	Recall	F1 value
BiGRU-CRF	58.32	65.83	61.85
BiGRU-Softmax	54.51	63.20	58.53
BiGRU-Att-CRF	60.36	67.84	63.88
BERT-BiGRU-CRF	64.19	68.07	66.07
XLNet-BiGRU-CRF	66.03	70.75	68.31
MPNet-BiGRU-MAtt-CRF(Ours)	70.16	72.88	71.49
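The figures in Table 3 can be cross-checked with a few lines of code. The sketch below recomputes F1 from the first two numeric columns, treating the first as precision (the reading under which the reported F1 values are consistent), and derives the F1 differences quoted in the analysis. The names `TABLE_3`, `f1_from_pr` and `f1_gain` are hypothetical; the numbers are copied from the table.

```python
# (precision, recall, reported F1) in percent, copied from Table 3.
TABLE_3 = {
    "BiGRU-CRF": (58.32, 65.83, 61.85),
    "BiGRU-Softmax": (54.51, 63.20, 58.53),
    "BiGRU-Att-CRF": (60.36, 67.84, 63.88),
    "BERT-BiGRU-CRF": (64.19, 68.07, 66.07),
    "XLNet-BiGRU-CRF": (66.03, 70.75, 68.31),
    "MPNet-BiGRU-MAtt-CRF": (70.16, 72.88, 71.49),
}


def f1_from_pr(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def f1_gain(model, baseline):
    """Difference in reported F1 (percentage points) between two models."""
    return round(TABLE_3[model][2] - TABLE_3[baseline][2], 2)


recomputed = {name: round(f1_from_pr(p, r), 2) for name, (p, r, _) in TABLE_3.items()}

crf_gain = f1_gain("BiGRU-CRF", "BiGRU-Softmax")
attention_gain = f1_gain("BiGRU-Att-CRF", "BiGRU-CRF")
gain_over_bert = f1_gain("MPNet-BiGRU-MAtt-CRF", "BERT-BiGRU-CRF")
gain_over_xlnet = f1_gain("MPNet-BiGRU-MAtt-CRF", "XLNet-BiGRU-CRF")
```

The recomputed F1 values agree with the reported column to two decimal places, and the differences correspond to the gains discussed in points (1) to (3).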

[IMAGE OMITTED. SEE PDF]

CONCLUSIONS

This paper proposes a deep-learning-based RE model for COVID-19 clinical trial data. The model integrates the MPNet model, a BiGRU network, the MAtt mechanism and a CRF reasoning layer. The pre-trained language model addresses the problem that static word vectors cannot represent ambiguity. The BiGRU network replaces the commonly used BiLSTM structure to obtain feature vectors from the input and make full use of the preceding and following token information of each word; because the GRU is a simplified form of the LSTM network structure, it also improves the training efficiency of the model. In addition, word-level and sentence-level attention mechanisms are introduced so that the model learns different features and improves relation classification. Comparative experiments show that the deep learning model used in this study performs well in the task of entity-relationship classification of COVID-19 clinical texts.

AUTHOR CONTRIBUTIONS

Su Qianmin conceived the idea. Pan Wei and Cai Xiaoqiong designed the study. Cai Xiaoqiong did the analyses. Su Qianmin and Pan Wei wrote the main manuscript text. Cai Xiaoqiong prepared Tables 1 to 3 and Figures 4–6. Pan Wei prepared Figures 1–3. Huang Jihan provided expertise and guidance in neural networks. Ling Hongxing and Huang Jihan assisted in the research, including searching for papers and doing preliminary reading of papers. All authors reviewed the manuscript.

ACKNOWLEDGEMENT

This work was supported by the Science and Technology Innovation 2030 Major Project of “New Generation Artificial Intelligence” (2020AAAA0109300).

CONFLICT OF INTEREST STATEMENT

The authors report no conflicts of interest in this work.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available from the corresponding author upon reasonable request.

References

Yu, H., Liu, J.: Overview of international clinical trial registration. J. Integr. Chin. West. Med. 5(003), 234–242 (2007). [DOI: https://dx.doi.org/10.3736/jcim20070302]

Rongen, Y., Jiang, X., Dang, D.: Named entity recognition by using xlnet‐bilstm‐crf. Neural Process. Lett. 53(5), 3339–3356


© 2023. This work is published under http://creativecommons.org/licenses/by-nc/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

With the rapid development of biomedical research and information technology, the volume of clinical medical literature has increased exponentially. At present, COVID‐19 clinical text research faces problems such as a lack of corpora and poor annotation quality. In clinical medical literature, there are many medical semantic relationships between entities. After the entity recognition task, extracting the relationships between entities efficiently and accurately becomes critical. In this study, a COVID‐19 clinical trial data relationship extraction model based on deep learning is proposed. The model integrates the MPNet model, a bidirectional GRU (BiGRU) network, the MAtt mechanism and a Conditional Random Field inference layer, and uses the pre‐trained language model to address the problem that static word vectors cannot represent ambiguity. The BiGRU network replaces the commonly used bidirectional long short‐term memory (BiLSTM) structure and simplifies the long short‐term memory network structure to improve the training efficiency of the model. Comparative experiments show that the proposed method performs well in the COVID‐19 clinical text entity relation extraction task.

Details

Title

COVID‐19 clinical medical relationship extraction based on MPNet

Author

Qianmin, Su¹; Wei, Pan¹; Xiaoqiong, Cai¹; Hongxing, Ling²; Jihan, Huang³

¹ School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China
² Shanghai Business and Information College, Shanghai, China
³ Center for Drug Clinical Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China

Pages

119-129

Section

ORIGINAL RESEARCH

Publication year

2023

Publication date

Jun 1, 2023

Publisher

John Wiley & Sons, Inc.

e-ISSN

23983396

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1049/cps2.12049

ProQuest document ID

3092313297
