Exploiting Diverse Information in Pre-Trained

1. Introduction

Machine reading comprehension (MRC) [1,2,3,4] has attracted increasing interest in recent years. It is presented to evaluate how well a machine understands human language by asking the machine to answer questions according to a given passage. In particular, multi-choice machine reading comprehension [2] is one of the most difficult tasks in MRC since it usually requires different natural language processing skills such as word matching, syntactic structure matching, logical reasoning and summarization, compared to span-selection tasks [1]. In other words, addressing multi-choice reading comprehension tasks requires various information items due to the abundant diversity of the questions, options and passages. Therefore, researchers propose a variety of methods to provide richer information for the different reading comprehension samples. Among them, pre-trained language models such as BERT [5] and RoBERTa [6] have become an important trend.

A pre-trained language model usually uses several self-supervised pre-training tasks to train a deep transformer-based network [7] and learn rich information from large-scale corpora. It brings great success to MRC and achieves SOTA performance in multiple tasks simply by fine-tuning. Jawahar et al. [8] conducted a series of experiments on BERT and found that its intermediate layers learn a rich hierarchy of linguistic information. Specifically, BERT learns phrase-level information and surface information in the low layer, syntactic features in the middle layer and semantic features at the top layer. In other words, BERT naturally contains different information at different layers of the network.

However, existing methods generally use only the top-layer representation of pre-trained models to address MRC tasks [3,9,10]. Therefore, although the pre-trained models encode different information at different layers, current methods cannot directly exploit this diverse information for the various questions and passages. How to adaptively select the required information in a pre-trained model for an MRC task is still an open problem.

An intuitive idea to address the above problem is to build multiple decision modules on the outputs of different layers of the pre-trained model, and then to synthesize the multiple decisions to reach the final solution. In this way, the representations at intermediate layers can be exploited directly, which lets different samples use the required information from different layers. However, another problem arises. Since each decision module receives a similar supervisory signal, the output representations at those layers tend to become similar during fine-tuning. This damages the information diversity in different layers of the original pre-trained models. An effective method should make use of the multi-layer representations without damaging the information diversity.

This paper therefore proposes a simple but effective multi-decision based transformer model with learning rate decaying to address the above problem. For a pre-trained model with L stacked transformer layers, we divide the layers evenly into N blocks, each consisting of $L / N$ transformer layers. We then use the output of each block for decisions. For example, if $L = 12$ and $N = 3$, the representations of the 4th layer (the output of the 1st block), the 8th layer (the output of the 2nd block) and the 12th layer (the output of the 3rd block) are used. Each block answers the input question separately. The final answer is then chosen in one of two manners: “performance first” or “speed first”. When updating the parameters, to avoid damaging the information diversity, we weaken the influence of the supervisory signals on the lower-layer parameters. This is achieved by a learning rate decaying method which gradually reduces the learning rate from the top to the bottom of the transformer stack. Experimental results demonstrate that our model is able to answer the questions by extracting the required information from intermediate layers and to speed up the inference procedure while retaining considerable accuracy. The source code is available at https://github.com/bestbzw/Multi-Decision-Transformer.

To sum up, the main contributions of this paper are listed as follows.

We propose a simple but effective multi-decision based transformer model that adaptively selects information at different layers to answer different reading comprehension questions. To the best of our knowledge, this is the first work that explicitly makes use of the information at different layers of pre-trained language models for MRC.
We propose a learning rate decaying method to maintain the information diversity in different layers of pre-trained models, avoiding it being damaged by the multiple similar supervisory signals during fine-tuning.
We conduct a detailed analysis to show which types of reading comprehension task can be addressed by each block. Moreover, the experimental results on five public datasets demonstrate that our model increases the inference speed without sacrificing the accuracy.

2. Related Work

Pre-trained Language Model. Early research on pre-training neural representations of language mainly focuses on pre-trained context-free word embeddings, such as Word2Vec [11] and GloVe [12]. Since most Natural Language Processing (NLP) tasks go beyond the word level, researchers propose to pre-train sequential encoders, such as LSTM, to obtain contextual word embeddings [13].

Recently, the very deep pre-trained language models (PLMs) [5,6] have improved the performance on multiple natural language processing benchmarks [14], which is credited to the strong prior knowledge and the deep network structure. OpenAI GPT [15] proposes the first transformer-based PLM with a left-to-right language modeling and auto-encoder objective. BERT [5] improves GPT by using a Masked Language Model (MLM) pre-training objective to learn bidirectional encoder representations, and a Next Sentence Prediction (NSP) objective to learn the relation between sentence pairs. BERT encodes diverse information in different layers. Jawahar et al. [8] conduct a series of experiments to unpack the elements of English language structure learned by BERT. This work finds that BERT captures structural information about language. It learns phrase-level information and surface information in the low layer, syntactic features in the middle layer and semantic features at the top layer. Moreover, Guarasci et al. [16] also find that the middle layer of BERT, rather than the top layer, performs best on syntax task.

Following GPT and BERT, a lot of transformer-based PLMs are proposed. These models use more data [6] and parameters [17], modified word masking methods [6,18] and transformer structures [19,20] to significantly improve the SOTA performance on multiple NLP datasets.

Machine Reading Comprehension. Early Machine Reading Comprehension (MRC) systems use rule-based methods or feature engineering methods to extract answers from passages. Deep Read [21] proposes the first reading comprehension system. Given a story and a question, it uses sentence matching (bag-of-words) techniques and some linguistic processing methods, such as stemming and pronoun resolution, to select a sentence from the story as the answer to the question. Quarc [22] and Cqarc [23] are two rule-based reading comprehension systems which design hand-crafted heuristic rules to look for semantic clues and answer the given questions. These methods depend on human-defined features and are difficult to generalize to new datasets [24].

The rapid growth of MRC is largely thanks to the development of deep learning [25,26] and the availability of large-scale datasets [1,2]. The attentive reader [26] is the first neural MRC model based on LSTM and an attention mechanism; the same work also presents two cloze-style corpora.

Then the attentive reader is extended and enhanced in several ways. For the embedding module, BIDAF [27] introduces char-level embedding into the model that alleviates the out-of-vocabulary (OOV) problem. DrQA [28] improves the embedding representations of both question and passage with POS embedding, NER embedding and binary feature of exact match. For the encoder, QANet [29] replaces the LSTM encoder with convolution network to accelerate the training and inference processes. To better fuse the question and passage, BIDAF [27] proposes a bi-directional attention flow mechanism and GA [30] presents a static multi-hop attention architecture, both of which obtain a fine-grained question-aware passage representation. DFN [31] proposes a sample-specific network architecture with dynamic multi-strategy attention process, lending flexibility to adapt to various question types that require different comprehension skills. Hierarchical Attention Flow [32] models the interactions among passages, questions and candidate options to adequately leverage candidate options for multi-choice reading comprehension task. Besides these, there are also many works [27,29,33] that use an additional encoder to capture the query-aware contextual representation of passage after attention.

More recently, BERT-like models [5,6,19] have been applied in the MRC task and some even surpass human performance on multiple datasets [1,2]. Most models take BERT as a strong encoder and propose some task-specific modules upon it. DCMN [9] and DUMA [34] add an interaction attention layer to model the relationship among passage, question and options. NumNet [35] and RAIN [36] introduce number relation into the BERT-based MRC network for addressing the arithmetic question. However, most of these studies take PLM as a black box and only use the output hidden states at the top layer [3,9,10,34], wasting the rich information at other layers. As far as we know, there is still no work to leverage the different level representations of PLM to address MRC tasks.

Multi-Decision Network. Most of the previous work utilizes diverse information by building multiple decision modules. These modules are either located in the same layer of the neural network (parallel) [37], or located in different layers (serial) [31,38,39,40].

All these models are trained from a random initialization state, so they must take great pains to differentiate the input representation of each decision module. This goal is usually achieved by assigning different training samples to each decision module. Some works [31,38,39] employ a sample-specific network architecture with reinforcement learning to spontaneously determine which decision module to train. This method is sometimes unstable, and each sample is assigned to only one decision module, which reduces the size of the training corpus for each module. Zhou et al. [40] propose a soft approach with a weighted loss function to adjust the distribution of training samples over the different decision modules.

BERT-based multi-decision models are the most similar work to ours. They add classifiers to the output of each transformer layer and aim to reduce the inference time of BERT in sentence classification task. FastBERT [41] adopts a self-distillation mechanism that distills the predicted probability distribution of the top-layer classifier (as a teacher) to other classifiers at the lower layers (as students). The students simply imitate the teacher, but neglect learning the cases that only the lower layers could solve. This method damages the information diversity of the original BERT model. In contrast, DeeBERT [42] completely stops the gradient propagation from the classifiers at the lower layers to the transformer network. This keeps the information diversity but loses the performance at the lower layers.

3. Models

The overall structure of our model is illustrated in Figure 1. It consists of two parts: A bidirectional transformer-based encoder and multiple decision modules. The encoder is composed of a stack of L identical transformer layers. We divide the encoder equally into N blocks from bottom to top, so that each block contains $L / N$ stacked transformer layers (the output of nth block is the output of the $\frac{n * L}{N}$ th layer). A decision module is built for each block by making use of the output of the block.

Our model can also be easily built upon an existing pre-trained model as its backbone. For example, a BERT_base model with 12 transformer layers can be used as a backbone, where the outputs of three layers (the 4th, 8th and final layer) are chosen for building separate decision modules. Details of the model architecture and the training methods are given in the rest of this section.
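The division into blocks can be illustrated with a few lines of Python. This is a minimal sketch rather than the released implementation; the function name `block_output_layers` is hypothetical, and it assumes that L is divisible by N, as in the BERT_base example above.

```python
from typing import List


def block_output_layers(num_layers: int, num_blocks: int) -> List[int]:
    """Return the 1-based index of the transformer layer that closes each block.

    The n-th block ends at layer n * L / N, so its output is the output of
    that layer. Assumes L is divisible by N (an assumption of this sketch).
    """
    if num_layers % num_blocks != 0:
        raise ValueError("num_layers must be divisible by num_blocks")
    layers_per_block = num_layers // num_blocks
    return [n * layers_per_block for n in range(1, num_blocks + 1)]


# BERT_base backbone from the text: L = 12 layers, N = 3 blocks.
print(block_output_layers(12, 3))  # [4, 8, 12]
```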

3.1. Model Architecture

The input S of our model is a sequence with length $| S |$ . For different tasks, the input sequence and the decision modules are slightly different. We first introduce the basic multi-decision-based transformer architecture in this section and then expound it for specific tasks in Section 3.3.

3.1.1. Embedding

Firstly, our model transforms the symbolic input sequence S into distributed representation by the embedding function.

(1) $E = Embedding (S)$

The representation E is the summation of the token embeddings, the segmentation embeddings and the position embeddings.

3.1.2. Backbone

Next the embedding sequence is encoded by the L transformer layers sequentially. Each layer is composed of a multi-head self-attention sublayer and a position-wise, two-layer fully connected feed-forward network sublayer. The input of a layer is the output of the previous layer. Let $H_{i}^{B}$ be the output of the i-th layer; apparently, its input is the output of the $i - 1$ -th layer. The i-th layer is therefore defined as,

(2) $H_{i}^{B} = {TransformerLayer}_{i} (H_{i - 1}^{B})$

where $H_{i}^{B} \in R^{| S | \times d}$ is the output of the ith layer, $| S |$ is the length of the input sequence, and d is the hidden size of the transformer. For the first layer (bottom layer), the input $H_{0}^{B}$ is the embedding representation E. The specific implementation of ${TransformerLayer}_{i} (\cdot)$ is described in [7].

We denote $H_{n} = H_{L * n / N}^{B}$ as the output representation for the nth block. To conveniently describe the model, the encoder for each block is defined as,

(3) $H_{n} = f_{n} (H_{n - 1}, θ_{n}^{t})$

where the function $f_{n}$ is a stack of $\frac{L}{N}$ transformer layers and $θ_{n}^{t}$ denotes the parameters of the $\{\frac{(n - 1) L}{N} + 1, \dots, \frac{n L}{N}\}$ th transformer layers.

Following the encoding output of each block is the decision module. The score of the candidate answers in nth block is calculated by the decision module,

(4) $P_{n} (Y | S) = g_{n} (H_{n}, θ_{n}^{d})$

where $P_{n} (Y | S)$ is the probability distribution over the candidate answer set $C$ and $θ_{n}^{d}$ denotes the parameters of the decision function $g_{n} (H_{n}, θ_{n}^{d})$. The specific implementation of the decision module will be expounded in Section 3.3.

Then the predicted answer of this block is determined by the probability,

(5) $a_{n} = \underset{c \in C}{argmax} P_{n} (c | S)$

where $P_{n} (c | S)$ indicates the predicted probability of the candidate c.

3.1.3. Decision Module

With the above steps, each decision module outputs a probability distribution and an answer. When selecting the final answer from these intermediate prediction results, we apply two manners for performance improvement and inference acceleration, respectively.

Performance first. For each block, the probability of the predicted answer $a_{n}$ is denoted as $P_{n} (a_{n} | S)$. The answer with the highest probability among all blocks is selected as the final answer.

(6) $a = \underset{\{a_{n} \mid 1 \leq n \leq N\}}{argmax} P_{n} (a_{n} | S)$
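The “performance first” rule of Equations (5) and (6) can be sketched as follows. This is an illustrative sketch only: the function name `performance_first` is hypothetical, probability distributions are plain Python lists, and ties between blocks are resolved in favour of the lowest block, which is an assumption the text does not specify.

```python
from typing import Sequence, Tuple


def performance_first(block_probs: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Pick the final answer from the per-block distributions P_n(. | S).

    Returns (answer index, 1-based block index). Each block first picks its
    own answer a_n (Eq. 5); the answer with the highest P_n(a_n | S) over
    all blocks is then returned (Eq. 6).
    """
    best_answer, best_block, best_prob = -1, -1, float("-inf")
    for n, probs in enumerate(block_probs, start=1):
        a_n = max(range(len(probs)), key=lambda c: probs[c])  # Eq. (5)
        if probs[a_n] > best_prob:  # strict ">" keeps the lowest block on ties
            best_answer, best_block, best_prob = a_n, n, probs[a_n]
    return best_answer, best_block


# Three blocks, three candidates: block 2 is the most confident.
print(performance_first([[0.5, 0.3, 0.2], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3]]))  # (1, 2)
```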

Speed first. If a decision module is confident enough about its predicted answer, it is not necessary to execute the following inference steps. This is also beneficial in terms of reducing the inference time. Inspired by [41], we use a normalized entropy as the uncertainty to measure whether to terminate the current inference. The uncertainty of the nth decision module is defined as,

(7) $U_{n} = \frac{\sum_{c \in C} P_{n} (c | S) * log P_{n} (c | S)}{- log | C |}$

where $| C |$ is the number of candidate answers. During inference, if $U_{n} \leq v$, then the inference is stopped and $a_{n}$ is selected as the final answer; otherwise the model will move on to the next layer. $v \in (0, 1)$ is a threshold to control the inference speed. If all uncertainties of the decision modules are greater than the threshold, we adopt the first method to determine the final answer.
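The “speed first” rule can be sketched in the same style. The sketch below is illustrative and is not taken from the released code: the names `normalized_entropy` and `speed_first` are hypothetical, and the per-block distributions are passed as an iterable so that a lazy generator can stand in for the block-by-block forward pass. It assumes at least two candidates, since $log | C |$ is zero otherwise.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def normalized_entropy(probs: Sequence[float]) -> float:
    """Uncertainty U_n of Eq. (7): entropy of P_n divided by log|C|."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def speed_first(block_probs: Iterable[Sequence[float]], v: float) -> Tuple[int, int]:
    """Return (answer index, 1-based block index) with early exit.

    Blocks are consumed in order; inference stops at the first block whose
    uncertainty is at most the threshold v. If no block is confident enough,
    the "performance first" rule is applied to all blocks seen.
    """
    seen: List[Sequence[float]] = []
    for n, probs in enumerate(block_probs, start=1):
        seen.append(probs)
        if normalized_entropy(probs) <= v:
            return max(range(len(probs)), key=probs.__getitem__), n
    # Fallback: no block was confident enough, so use Eq. (6).
    best = max(range(len(seen)), key=lambda i: max(seen[i]))
    probs = seen[best]
    return max(range(len(probs)), key=probs.__getitem__), best + 1
```

A larger threshold makes the exit condition easier to satisfy, so more samples leave at lower blocks and fewer transformer layers are executed.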

3.2. Training with Learning Rate Decaying

3.2.1. Loss Function

There is no explicit annotation or rule to indicate which layer should be selected for a sample; therefore, we directly train each decision module on the whole dataset to avoid designing annotation rules.

For each decision module, the loss $L_{n}$ is the negative log-likelihood of the ground-truth answer $\hat{a}$.

(8) $L_{n} = - \frac{1}{| D |} \sum_{(S, \hat{a}) \in D} log P_{n} (\hat{a} | S)$

where $D$ is the training set. The final loss is the summation of the losses in all blocks.
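As a worked illustration of Equation (8) and of the summed final loss, the following sketch computes both on plain Python lists. The names `block_loss` and `total_loss` and the nesting `block_probs[n][s][c]` (block, sample, candidate) are assumptions made for this example.

```python
import math
from typing import Sequence


def block_loss(sample_probs: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """L_n of Eq. (8): mean negative log-likelihood of the gold answers."""
    return -sum(math.log(p[a]) for p, a in zip(sample_probs, gold)) / len(gold)


def total_loss(block_probs: Sequence[Sequence[Sequence[float]]], gold: Sequence[int]) -> float:
    """Final loss: the sum of L_n over all N blocks."""
    return sum(block_loss(sample_probs, gold) for sample_probs in block_probs)


# One sample, two blocks; the gold answer is candidate 0.
# L_1 = -log 0.5 and L_2 = -log 0.25, so the total is 3 * log 2.
loss = total_loss([[[0.5, 0.5]], [[0.25, 0.75]]], gold=[0])
print(abs(loss - 3 * math.log(2)) < 1e-12)  # True
```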

3.2.2. Learning Rate Decaying

While the above method exploits the representations at different layers, the decision modules in lower blocks have the same supervisory signal as the top block, so the output representations of the blocks tend to become similar. This damages the information diversity in the original pre-trained language models. We show this phenomenon in Section 4.7 by comparing the information similarity between the general BERT model and the BERT model fine-tuned with multiple supervisory signals.

Interrupting the gradient propagation from intermediate-block decision modules to the transformer network (equivalent to setting the learning rate to 0) is an effective way to maintain the information diversity; however, it gives up fine-tuning the transformer to fit the lower-block decision modules, causing their performance to drop. We hope to find a balance between adjusting the lower-block parameters and maintaining the original rich information. An intuitive method is to let the learning rate of the lower blocks be smaller than that of the top block but greater than zero.

We propose a learning rate decaying method to update the parameters in our model. Let $lr$ be the initial learning rate for the parameters in the Nth block (top block). As shown in Figure 2, for the nth block of the transformer, the learning rate is set to $\frac{lr}{α^{N - n}}$, where $α$ is a number greater than 1, called the decay factor. Moreover, the learning rates for all decision modules are $lr$ without any decay. In the above way, we build a multi-decision-based transformer which utilizes the rich and diverse information in pre-trained models without damaging the information diversity.
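The block-wise learning rates can be computed as in the sketch below. It is a minimal illustration, not the released training code: the function names are hypothetical, L is assumed to be divisible by N, and $α = 1$ is allowed only so that the no-decay setting can be expressed. The text does not state which rate the embedding parameters receive, so the sketch leaves them out.

```python
from typing import Dict, List


def block_learning_rates(lr: float, num_blocks: int, alpha: float) -> List[float]:
    """Learning rate lr / alpha^(N - n) for blocks n = 1..N (bottom to top)."""
    if alpha < 1.0:
        raise ValueError("alpha must be at least 1")
    return [lr / alpha ** (num_blocks - n) for n in range(1, num_blocks + 1)]


def layer_learning_rates(lr: float, num_layers: int, num_blocks: int, alpha: float) -> Dict[int, float]:
    """Map each 1-based transformer layer to the rate of the block it belongs to."""
    per_block = num_layers // num_blocks
    rates = block_learning_rates(lr, num_blocks, alpha)
    return {layer: rates[(layer - 1) // per_block] for layer in range(1, num_layers + 1)}


# lr = 2e-5, N = 3, alpha = 2: the three blocks train at lr/4, lr/2 and lr.
print(block_learning_rates(2e-5, 3, 2.0))
```

In a PyTorch setup, such a mapping would typically be turned into per-parameter-group learning rates for the optimizer, with the decision modules placed in a group that keeps the undecayed $lr$.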

3.3. For Specific Task

In this section, we introduce the specific model implementation for multi-choice MRC tasks such as RACE [2], Dream [43] and ReCO [3]. For each task, we simply plug the task-specific input and decision modules into our multi-decision-based transformer architecture.

RACE & Dream. In this task, each sample contains a passage P of text, a question sentence Q and k candidate answer sequences $\{C_{1}, \dots, C_{k}\}$. We concatenate each candidate answer with the corresponding question and passage as the inputs: “$[[CLS], Q, C_{i}, [SEP], P]$”.

At each block, we extract the encoded “$[CLS]$” representations of all sequences and concatenate them as a matrix $H_{n}^{C} \in R^{k \times d}$ to represent the k candidates. Then $H_{n}^{C}$ is fed into the decision module, which is a Feed-Forward Network (FFN), to predict the probability distribution,

(9) $\begin{matrix} P_{n} (Y | S) & = g_{n} (H_{n}, θ_{n}^{d}) \\ & = FFN (H_{n}^{C}, θ_{n}^{d}) \\ & = softmax (σ (H_{n}^{C} \cdot W_{1} + b_{1}) \cdot W_{2} + b_{2}) \end{matrix}$

where $σ$ is a tanh function, and $W_{1} \in R^{d \times d}$, $W_{2} \in R^{d \times 1}$, $b_{1} \in R^{d}$ and $b_{2} \in R$ are trainable parameters with $θ_{n}^{d} = \{W_{1}, W_{2}, b_{1}, b_{2}\}$.
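To make the shapes in Equation (9) concrete, the decision module can be written out with plain Python lists. This is a didactic sketch with a hypothetical function name `ffn_decision`; an actual implementation would use tensor operations, and the softmax is taken over the k candidate scores, which is how we read the equation.

```python
import math
from typing import List, Sequence


def ffn_decision(
    h_cls: Sequence[Sequence[float]],  # H_n^C, shape k x d
    w1: Sequence[Sequence[float]],     # W_1, shape d x d
    b1: Sequence[float],               # b_1, length d
    w2: Sequence[float],               # W_2, shape d x 1 stored as a flat list
    b2: float,                         # b_2, scalar
) -> List[float]:
    """Return P_n(Y | S): softmax over the k candidate scores of Eq. (9)."""
    d = len(b1)
    scores = []
    for h in h_cls:
        hidden = [math.tanh(sum(h[i] * w1[i][j] for i in range(d)) + b1[j]) for j in range(d)]
        scores.append(sum(hidden[j] * w2[j] for j in range(d)) + b2)
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# k = 2 candidates, d = 2: identity W_1 and W_2 = (1, 0)^T favour candidate 0.
print(ffn_decision([[1.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 0.0], 0.0))
```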

ReCO. ReCO is also a multi-choice task with a passage P, a question Q and three candidate answers $\{C_{1}, C_{2}, C_{3}\}$. Compared with RACE, the candidate answer in ReCO is shorter. Therefore, we directly concatenate all candidates with question and passage as the input sequence:

(10) $[[CLS], C_{1}, [CLS], C_{2}, [CLS], C_{3}, [SEP], Q, [SEP], P] .$

At each block, the three $[CLS]$ representations ${\tilde{H}}_{n}^{C} \in R^{3 \times d}$ are passed through an FFN to obtain the predicted probability.

(11) $P_{n} (Y | S) = g_{n} (H_{n}, θ_{n}^{d}) = FFN ({\tilde{H}}_{n}^{C}, θ_{n}^{d})$
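The two input layouts can be sketched on already-tokenized text. The helper names `build_race_inputs` and `build_reco_input` are hypothetical, tokens are plain strings, and truncation to a maximum length is omitted; the sketch only shows the ordering of segments and where the $[CLS]$ positions fall.

```python
from typing import List, Sequence, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_race_inputs(
    passage: Sequence[str], question: Sequence[str], candidates: Sequence[Sequence[str]]
) -> List[List[str]]:
    """RACE/Dream: one sequence [[CLS], Q, C_i, [SEP], P] per candidate."""
    return [[CLS, *question, *cand, SEP, *passage] for cand in candidates]


def build_reco_input(
    passage: Sequence[str], question: Sequence[str], candidates: Sequence[Sequence[str]]
) -> Tuple[List[str], List[int]]:
    """ReCO: a single sequence as in Eq. (10), plus the position of each [CLS]."""
    tokens: List[str] = []
    cls_positions: List[int] = []
    for cand in candidates:
        cls_positions.append(len(tokens))  # this [CLS] represents the candidate
        tokens.append(CLS)
        tokens.extend(cand)
    tokens += [SEP, *question, SEP, *passage]
    return tokens, cls_positions


tokens, positions = build_reco_input(["p1", "p2"], ["q1"], [["yes"], ["no"], ["uncertain"]])
print(tokens)
print(positions)  # [0, 2, 4]
```

For RACE and Dream, the k sequences are encoded separately and their $[CLS]$ vectors are stacked into $H_{n}^{C}$; for ReCO, the three $[CLS]$ vectors are read from a single encoded sequence at the returned positions.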

4. Experiments

4.1. Implementation Details

We use the Adam [44] algorithm with a batch size of 48 for optimization, and the initial learning rate is set to $2 \times 10^{-5}$. We use a linear warmup for the first 10% of steps followed by a linear decay to 0 for all parameters. The number of training epochs is set to 20 with early stopping. The model is implemented with PyTorch, and pre-trained models are from Huggingface’s transformer library [45]. To alleviate the problem of gradient explosion, the derivatives are clipped in the range of [−2.0, 2.0]. The decay factor $α$ is selected from {2, 3, 5} according to the performance on the development dataset. Finally, we conduct comprehensive experiments on BERT_base [5], RoBERTa_base, RoBERTa_large [6] and DUMA [34]. We run each experiment three times with different random seeds. For the “performance first” manner, we record its average accuracy and standard error, and for the “speed first” manner, we record the best results.
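The schedule and the clipping rule described above can be written as two small functions. This is a sketch under assumptions: the names `lr_scale` and `clip_derivative` are hypothetical, the schedule returns a multiplier that would be applied to each parameter group's base learning rate, and clipping is interpreted as element-wise clamping of each derivative.

```python
def lr_scale(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier for the base learning rate at a given optimization step.

    Rises linearly from 0 to 1 over the warmup steps, then decays linearly
    to 0 at total_steps.
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def clip_derivative(grad: float, limit: float = 2.0) -> float:
    """Clamp a single derivative into [-limit, limit]."""
    return max(-limit, min(limit, grad))


# With 1000 steps and a 10% warmup, the multiplier peaks at step 100.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_scale(step, warmup_steps=100, total_steps=1000))
```

Because the multiplier is shared by all parameters, the ratio between the block learning rates set by the decay factor $α$ stays constant throughout training.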

4.2. Datasets

We first evaluate our model on three public multi-choice MRC datasets.

RACE [2] is an MRC dataset collected from English exams for middle and high school. The questions and candidates are generated by experts to evaluate the human agent’s ability in reading comprehension. It contains two subsets with 98,000 questions in total, and includes four types of answers.

ReCO [3] is a recently released large scale Chinese MRC dataset on opinion. The passage and questions are collected from multiple resources. It contains 300,000 samples and includes three types of answers (yes/no/uncertain).

Dream [43] is an English dialogue-based multiple-choice MRC dataset. It is collected from examinations designed by human experts to evaluate the comprehension level of Chinese learners of English, where the passage is an in-depth multi-turn multi-party dialogue. It contains 10,000 questions and 6000 dialogues.

Moreover, we find that our multi-layer structure and learning rate decaying method are not only effective for MRC tasks; therefore, we also evaluate our method on two sentence classification datasets for detailed analysis:

Ag.news is an English sentence classification dataset from [46]. Ag.news contains around 120,000 training samples.

Book Review (https://github.com/autoliuweijie/FastBERT/tree/master/datasets/douban_book_review (accessed on 10 January 2021)) is a Chinese binary sentiment classification dataset. It has 40,000 training samples.

For these classification tasks, we follow the same settings as described in [5] to construct the input sequence and each decision module is a two-layer FFN. These models are trained by the training method described in Section 3.2.

4.3. Evaluation Metrics

For both the MRC task and the classification task, we use accuracy (acc) as the evaluation criterion.

(12) $acc = \frac{N^{+}}{N}$

where $N^{+}$ indicates the number of examples correctly answered by the model, and N denotes the total number of examples in the whole evaluation set.

4.4. Main Results

We first compare our “performance first” model with common BERT-like models on five test datasets. For easy comparison, three existing pre-trained models, BERT_base, RoBERTa_base and RoBERTa_large, are used as baselines and backbones of our models. Table 1 shows the experimental results of our model and the baselines on three MRC datasets, and Table 2 shows the results on two classification datasets. As we can see, on almost all datasets, our model outperforms the baselines by 0.1–2.7% accuracy. These results show that different questions can be handled by the representations in different blocks instead of the top block only. The representations in lower blocks can be as effective as those of the top layer. We also compare our method with DUMA [34], a representative MRC model with a complex decision module (for a fair comparison, both our model and DUMA are implemented with the BERT_base model). The results show that our model also works when combined with DUMA, indicating that the intermediate-layer information is useful for different decision modules. These results again show the effectiveness of our method.

Then we evaluate the accuracy and speed of our “speed first” model. Floating-point operations (FLOPs) (https://github.com/Lyken17/pytorch-OpCounter (accessed on 13 January 2021)) are used to measure the computational complexity. Table 3 and Table 4 show the accuracy and FLOPs. Increasing the threshold speeds up the inference process, since the inference is more likely to be terminated at lower layers. As the tables show, compared with the BERT_base model, our model not only accelerates the inference process but also achieves better accuracy in most cases. For the MRC task, our model is faster than the BERT_base model by 1.02–1.12 times on RACE, 1.11–1.39 times on ReCO and 1.00–1.02 times on Dream. For the classification task, our model accelerates the inference by 1.29–1.81 times on the Book Review dataset and 2.06–2.82 times on the Ag.news dataset. Moreover, our speed first model with $v = 0.1$ also achieves comparable accuracy to our performance first model ($v = 0.0$) with a higher inference speed.

4.5. Information Type Analysis

In this section, we explicitly analyze which types of reading comprehension task can be addressed by each layer. Generally speaking, answering different questions requires different information. We collect the results predicted by a 3-block model on the ReCO development dataset and split them into three groups according to the source of the final answer (that is, the layer from which it was selected). Then we randomly select 100 samples from each group and ask volunteers to label the information type required by each sample. The information types are mainly taken from ReCO [3] and the findings of Jawahar et al. [8]. ReCO presents seven information types, such as “Lexical Knowledge”, “Syntactic Knowledge” and “Specific Knowledge”. Jawahar et al. present four information types, such as “Syntactic Features” and “Semantic Features”. Considering that overly fine-grained categories are difficult for volunteers to distinguish, we define four information types (“Lexical Knowledge”, “Syntactic Knowledge”, “Semantic Knowledge” and “Specific Knowledge”) in our paper. To ensure that the labeled categories are accurate and not affected by other information, we randomly shuffle all samples and only provide the volunteers with passages, questions and options.

Table 5 shows the frequencies of the required information type in each layer. As we can see, more than half of the samples in the first block require lexical knowledge such as synonymy matching. In the second block, the model prefers answering the questions that require syntactic knowledge such as sentence structure information. The third block is devoted to addressing reasoning tasks such as logical reasoning and causal inference, which generally need deep semantic knowledge. These results show that the low layer and middle layer are good at solving MRC problems that need shallow information, while the top layer is skilled in addressing deep reasoning tasks. Moreover, there are also some questions that require specific external knowledge, and the third block performs best on them. One possible reason is that the third block contains more background knowledge due to the deeper network.

Some examples requiring different information are shown in Figure 3. The first item shows two passage-question cases requiring lexical knowledge. These two questions are easily answered by matching the similar words/phrases in the question and passage (“before meals” vs. “empty stomach” and “can purchase” vs. “purchase”). As we can see, our model tends to select the first block to answer this kind of question.

The questions requiring syntactic knowledge are shown in the second item. In this kind of problem, the syntactic structures of the question and the evidence are usually different. Here, “evidence” refers to the sentences in the passage that are most needed to answer the question. For instance, in the first example, the question uses the active voice while the evidence “Fever… be caused not by hernia” uses the passive voice. To answer this question, the model needs to find the correct subject–predicate–object tuples in the question and evidence.

The third item shows two examples that require semantic knowledge. This kind of example usually requires first understanding the semantic information of the passage and then answering the question with logical reasoning. Taking the first question in the third item as an example, the model first understands that “detention” is a worse outcome, then reasons that “the defendant must participate in the cross-examination after receiving court summons”, and finally gives the answer “no”.

The last item is some cases requiring external knowledge, such as “fish is a kind of seafood”. There are no specific rules for this kind of problem. Generally speaking, the more knowledge the model learns from the training corpus, the better the performance will be.

Moreover, we also show the predicted results of a common BERT baseline in Figure 3. It can be seen that our model outperforms the baseline on questions requiring semantic knowledge and external knowledge. It seems that this result contradicts the above analysis, but it is reasonable. The baseline learns to answer all questions requiring different knowledge at the same layer, leading the model to tend to learn shortcuts (shortcuts are the tricks that use partial evidence to produce answers to the expected comprehension challenges, e.g., co-reference resolution) [47]. Our method exploits a multi-decision structure and a learning rate decaying method that makes the top block attend to learning semantic knowledge and external knowledge, which shows that our approach is effective on these questions.

4.6. Ablation Study

To analyze the influence of the number of blocks and decay factors in detail, we conduct several ablation experiments. Table 6 shows the accuracy of our models (w/BERT_base) with different numbers of blocks on three datasets. The 1-block model indicates that we only use the top-layer representation of the BERT_base model without learning rate decaying, which is the same as the general BERT_base baseline. As we can see, the performance does not increase monotonically with the number of blocks. The model performs best when the number of blocks is 2 or 3, and as the number continues to increase, the accuracy falls. In particular, when the number of blocks increases to 12 (using the outputs of all layers), the performance is the worst. It is even worse than the 1-block model on the RACE dataset.

This is because the representations of adjacent layers are similar. As shown in Figure 4, for any two layers, the smaller their distance, the more similar their representations tend to be. In other words, for two layers that are close to each other, the samples they can correctly solve highly overlap. Simply increasing the number of blocks not only fails to address more samples, but also introduces additional noise. In addition, using too many blocks may cause the model to overfit. Therefore, it is very important to properly select the number of blocks. According to Table 6 and Figure 4, we think that setting the number of blocks to three is universally effective.

Next, we study the effectiveness of different decay factors in Figure 5. The model without learning rate decaying ($α = 1$) performs worst: both its third-block (top-layer) accuracy and its final accuracy are lower than those of the baseline, which uses only the top-layer representation. This shows that merely adding decision modules at different blocks not only fails to improve the final performance but also degrades the performance of the top layer. In contrast, all our models ($α > 1$) perform well, and the top-layer performance even grows slightly. This illustrates that our learning rate decaying method is effective for maintaining the top-layer performance and improving the final accuracy.

Moreover, the performance of the first block (bottom block) falls sharply as the decay factor $α$ increases. The most likely reason is that the first block has the smallest learning rate, so when the final performance on the development set peaks, the first block is still underfitting. As $α$ increases, the learning rate of the first block falls and the degree of underfitting rises, so the corresponding accuracy goes down. This drop does affect the final accuracy, but it does not mean that the first block is useless: when we remove the first-block decision module during testing, the accuracy decreases by 0.2%.
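
The relation between the decay factor and the per-block learning rates can be illustrated with a minimal sketch. The exact update rule is not restated in this section, so the geometric form below (the top block keeps the base learning rate and each lower block divides it by $α$ once more) is an assumption chosen only to match the behaviour described above: $α = 1$ means no decay, and the first block always has the smallest learning rate. The function names, the base learning rate of 2e-5, and the even split of 12 layers into blocks are hypothetical.

```python
def block_learning_rates(base_lr, alpha, num_blocks):
    """Return one learning rate per block, bottom block first.

    Assumed rule (not taken from the paper): the top block keeps base_lr and
    every block below it is divided by alpha one more time.
    """
    if alpha < 1:
        raise ValueError("alpha is expected to be >= 1")
    return [base_lr / alpha ** (num_blocks - 1 - k) for k in range(num_blocks)]


def layer_to_block(layer_index, num_layers=12, num_blocks=3):
    """Map a 0-based transformer layer index to a 0-based block index."""
    layers_per_block = num_layers // num_blocks
    return min(layer_index // layers_per_block, num_blocks - 1)


# Hypothetical base learning rate; alpha = 1 reproduces the "no decay" setting.
rates_no_decay = block_learning_rates(2e-5, alpha=1.0, num_blocks=3)
rates_decay = block_learning_rates(2e-5, alpha=2.0, num_blocks=3)

for block, lr in enumerate(rates_decay, start=1):
    print(f"block {block}: lr = {lr:.2e}")
```

In a real fine-tuning setup, each rate would be attached to the parameters of the layers that `layer_to_block` assigns to that block, for example through per-parameter-group options of the optimizer.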

Convergence analysis of each layer. Because of the decaying learning rate in our model, the decision modules converge at different speeds. Figure 6 shows how the accuracy of each block converges. The accuracy of the third block peaks first, followed by the second block, and the first block converges last. In addition, the accuracy of the third and second blocks stays stable after the seventh epoch, even though the bottom-block parameters are still being updated to fit the bottom-block decision module. Because the lower-block parameters update slowly, the upper-block parameters can be adjusted in time to maintain the performance. This ability to maintain performance is important for achieving a good final accuracy.

4.7. More Analysis

Information diversity analysis. We calculate the similarity between different blocks to show that our model successfully maintains the information diversity of the pre-trained models. The overall similarity is the average cosine similarity between the vector representations of every pair of different blocks on the development datasets. Generally speaking, lower similarity indicates that the vectors contain more different information; that is, the model has higher information diversity. As shown in Figure 7, the model without learning rate decaying has the highest similarity on all datasets. The similarity of our model lies between those of the other two models on the ReCO dataset and is lower than that of the baseline model on the RACE dataset, which indicates that our learning rate decaying method is effective for maintaining the information diversity. This experiment shows that the information of each block is different, which suggests that our model is able to address different reading comprehension samples in different blocks.
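
The overall similarity described above can be sketched as follows. This is an illustrative reading of the definition, not the authors' implementation: each sample is assumed to provide one summary vector per block, the cosine similarity is averaged over all pairs of different blocks, and the result is averaged over the samples. The function names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def block_similarity(block_vectors):
    """Average cosine similarity over every pair of different blocks (one sample)."""
    pairs = list(combinations(block_vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def overall_similarity(samples):
    """Mean of the per-sample block similarity over a development set."""
    return sum(block_similarity(s) for s in samples) / len(samples)


# Toy data: two samples, each with three block vectors (hypothetical values).
diverse_sample = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
redundant_sample = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
print(overall_similarity([diverse_sample, redundant_sample]))
```

A lower value means the block representations point in more different directions, which is the sense in which the text speaks of higher information diversity.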

5. Conclusions

In this paper, we propose a multi-decision-based transformer model that utilizes the representations at different layers in pre-trained models to answer more types of MRC questions. To the best of our knowledge, this idea has not been explored in previous work. To prevent the information diversity in different layers from being damaged during fine-tuning, we also propose a learning rate decaying method that weakens the influence of the supervisory signals on the lower-block parameters. Experimental results on five public datasets show that our model speeds up inference while retaining comparable accuracy. Moreover, we find that the low and middle blocks of our model are good at solving MRC problems requiring shallow information, while the top block is better at solving deep reasoning problems.

Author Contributions

Conceptualization, Z.B. and X.W.; methodology, Z.B.; software, Z.B. and J.L.; validation, Z.B. and M.W.; formal analysis, Z.B. and J.L.; investigation, Z.B.; resources, X.W. and C.Y.; writing—original draft preparation, Z.B.; writing—review and editing, X.W., C.Y. and M.W.; visualization, Z.B.; supervision, X.W.; funding acquisition, X.W. and C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by MoE-CMCC “Artificial Intelligence” Project (No. MCM20190701) and the Major Research Plan of National Natural Science Foundation of China (Grant No. 92067202).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures and Tables

Figure 1. This is the structure of our multi-decision based transformer model.

Figure 2. This figure shows the optimization method of each layer.

Figure 3. Examples requiring different information. Our model can explicitly exploit diverse information at different layers, and adaptively select the appropriate layer to answer the question. “✓” indicates the selected layer.

Figure 4. The similarity of the representations of each two layers. This is defined as the cosine similarity between the vector representations of two layers. In this paper, we use the representation for the first “[Forumla omitted. See PDF.]” token as the summary vector of the input sequence. The result is from a 12-block model on ReCO dataset.

Figure 5. The accuracy of a three-block model (w/BERT_base) with different decay factors on ReCO development dataset.

Figure 6. The change in accuracy during fine-tuning on ReCO. This is from a three-block model with BERT_base.

Figure 7. The similarity of a 1-block baseline and two 3-block models on multiple datasets. “w/o LrD” indicates a 3-block model without learning rate decaying. For all models, we use the “[Forumla omitted. See PDF.]” representations in 4th, 8th and 12th blocks to calculate the similarity.

Table 1

Comparison with the baselines on three MRC datasets. For all experiments in this table, we adopt the “performance first” manner and the number of blocks is set to 3. * indicates our re-implementation.

Datasets/Models	RACE (All)	RACE (Middle)	RACE (High)	ReCO	Dream
BERT_base	65.0	71.7	62.3	77.1	63.2
RoBERTa_base	74.1 *	78.1 *	72.5 *	79.8 *	70.0 *
RoBERTa_large	83.2	86.5	81.8	79.2	85.0
DUMA	66.1 *	71.8 *	63.7 *	79.1 *	62.0 *
Ours
BERT_base	65.5 ± 0.04	71.0 ± 0.07	63.3 ± 0.07	79.8 ± 0.02	63.5 ± 0.18
RoBERTa_base	74.3 ± 0.09	78.2 ± 0.10	72.7 ± 0.11	79.8 ± 0.11	70.1 ± 0.06
RoBERTa_large	83.5 ± 0.21	87.7 ± 0.17	81.8 ± 0.23	81.6 ± 0.11	85.3 ± 0.09
DUMA	65.9 ± 0.18	71.4 ± 0.22	63.6 ± 0.17	79.7 ± 0.02	63.4 ± 0.02

Table 2

Comparison with the baselines on two classification datasets. * indicates our re-implementation.

Datasets/Models	Ag.news	Book Review
BERT_base	94.5	86.9
RoBERTa_base	95.0 *	87.1 *
RoBERTa_large	95.4 *	88.3 *
Ours
BERT_base	95.0 ± 0.02	87.8 ± 0.04
RoBERTa_base	95.2 ± 0.01	88.3 ± 0.03
RoBERTa_large	95.5 ± 0.04	88.7 ± 0.02

Table 3

Comparison with the BERT_base model on MRC datasets. “Our model ($v = 0$)” is equivalent to the model with the “performance first” manner; the others adopt the “speed first” manner. $v$ is the uncertainty threshold.

Datasets/Models	RACE Acc	RACE FLOPs	ReCO Acc	ReCO FLOPs	Dream Acc	Dream FLOPs
BERT_base	65.0	173,975 M	77.1	43,486 M	63.2	130,464 M
Ours
$v = 0.0$	65.6	173,953 M	79.9	43,487 M	64.1	130,465 M
$v = 0.1$	65.6	171,344 M	79.9	39,306 M	64.1	130,465 M
$v = 0.3$	66.4	165,233 M	79.8	35,111 M	64.1	130,103 M
$v = 0.5$	66.3	155,256 M	79.2	31,200 M	63.9	128,441 M
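
The role of the uncertainty threshold $v$ in the “speed first” manner can be sketched as an early-exit loop. The uncertainty measure is not defined in this part of the paper, so the sketch borrows the normalized entropy used by FastBERT [41]; this choice, the fallback to the top block when no block is confident enough, and all names and probabilities below are assumptions for illustration only. With $v = 0$ no block can exit early, which matches the statement that $v = 0$ corresponds to using the full model.

```python
import math


def normalized_entropy(probs):
    """Entropy of a distribution divided by log(N), so the value lies in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def speed_first_predict(block_probs, v):
    """Return (predicted_option, blocks_used).

    block_probs holds one option distribution per block, bottom block first.
    Inference stops at the first block whose uncertainty is below v.
    """
    for used, probs in enumerate(block_probs, start=1):
        if normalized_entropy(probs) < v:
            return probs.index(max(probs)), used
    # No block was confident enough: fall back to the top block (an assumption).
    top = block_probs[-1]
    return top.index(max(top)), len(block_probs)


# Hypothetical option distributions from a three-block model for one question.
example = [
    [0.40, 0.30, 0.30],  # bottom block: very uncertain
    [0.90, 0.05, 0.05],  # middle block: fairly confident
    [0.98, 0.01, 0.01],  # top block: confident
]
print(speed_first_predict(example, v=0.5))
print(speed_first_predict(example, v=0.0))
```

In an actual model the distributions would be computed lazily, block by block, so that exiting early skips the remaining layers; that skipped computation is the source of the FLOPs reduction reported in the table.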

Table 4

Comparison with the BERT_base model on classification datasets.

Datasets/Models	Book Review Acc	Book Review FLOPs	Ag.news Acc	Ag.news FLOPs
BERT_base	86.9	21,785 M	94.5	21,785 M
Ours
$v = 0.0$	87.9	21,785 M	95.0	21,785 M
$v = 0.1$	87.8	16,856 M	95.0	10,584 M
$v = 0.3$	87.8	13,942 M	94.6	8705 M
$v = 0.5$	87.7	12,045 M	93.6	7738 M
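
To make the trade-off in Table 4 easier to read, the FLOPs can be converted into a speedup factor relative to BERT_base. The short sketch below does this for the Ag.news columns, using only the numbers in the table; the helper name `tradeoff` is hypothetical.

```python
def tradeoff(baseline_acc, baseline_mflops, rows):
    """Return (v, speedup, accuracy_change) for each (v, acc, mflops) row."""
    return [
        (v, baseline_mflops / mflops, round(acc - baseline_acc, 1))
        for v, acc, mflops in rows
    ]


# Ag.news values copied from Table 4 (FLOPs in millions).
ag_news_rows = [(0.1, 95.0, 10584), (0.3, 94.6, 8705), (0.5, 93.6, 7738)]
ag_news = tradeoff(94.5, 21785, ag_news_rows)

for v, speedup, delta in ag_news:
    print(f"v = {v}: {speedup:.2f}x fewer FLOPs, accuracy change {delta:+.1f}")
```

On this dataset the computation is roughly halved at $v = 0.1$ with no loss of accuracy, while $v = 0.5$ reduces it by a factor of about 2.8 at a cost of 0.9 accuracy points.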

Table 5

The frequencies of the required information in each block on the ReCO dataset. Some questions may require more than one type of information.

Information Type	1st Layer	2nd Layer	3rd Layer
Lexical Knowledge	51.0	32.0	10.0
Syntactic Knowledge	37.0	46.0	26.0
Semantic Knowledge	13.0	23.0	60.0
Specific Knowledge	1.0	4.0	10.0

Table 6

Comparing the performance of different numbers of blocks.

Datasets/Models	RACE (All)	RACE (Middle)	RACE (High)	ReCO	Ag.news
1-block	65.2	71.8	62.4	78.5	94.6
2-block	66.4	72.1	64.0	78.9	94.9
3-block	66.7	71.9	64.5	80.2	95.0
4-block	66.0	71.1	63.9	79.4	94.7
6-block	65.4	70.1	63.5	79.7	94.9
12-block	64.8	70.5	62.4	79.2	94.7

References

1. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. Proceedings of the EMNLP; Austin, TX, USA, 1–5 November 2016; pp. 2383-2392. [DOI: https://dx.doi.org/10.18653/v1/d16-1264]

2. Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E.H. RACE: Large-scale ReAding Comprehension Dataset From Examinations. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017; Copenhagen, Denmark, 9–11 September 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 785-794. [DOI: https://dx.doi.org/10.18653/v1/d17-1082]

3. Wang, B.; Yao, T.; Zhang, Q.; Xu, J.; Wang, X. ReCO: A Large Scale Chinese Reading Comprehension Dataset on Opinion. Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020; New York, NY, USA, 7–12 February 2020; AAAI Press: Menlo Park, CA, USA, 2020; pp. 9146-9153.

4. Cong, Y.; Wu, Y.; Liang, X.; Pei, J.; Qin, Z. PH-model: Enhancing multi-passage machine reading comprehension with passage reranking and hierarchical information. Appl. Intell.; 2021; 51, pp. 1-13. [DOI: https://dx.doi.org/10.1007/s10489-020-02168-3]

5. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the NAACL; Minneapolis, MN, USA, 2–7 June 2019; pp. 4171-4186. [DOI: https://dx.doi.org/10.18653/v1/n19-1423]

6. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv; 2019; arXiv: 1907.11692

7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017; Long Beach, CA, USA, 4–9 December 2017; pp. 5998-6008.

8. Jawahar, G.; Sagot, B.; Seddah, D. What Does BERT Learn about the Structure of Language? Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019; Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3651-3657. [DOI: https://dx.doi.org/10.18653/v1/p19-1356]

9. Zhang, S.; Zhao, H.; Wu, Y.; Zhang, Z.; Zhou, X.; Zhou, X. DCMN+: Dual Co-Matching Network for Multi-Choice Reading Comprehension. Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020; New York, NY, USA, 7–12 February 2020; AAAI Press: Menlo Park, CA, USA, 2020; pp. 9563-9570.

10. Zhang, Z.; Wu, Y.; Zhou, J.; Duan, S.; Zhao, H.; Wang, R. SG-Net: Syntax-Guided Machine Reading Comprehension. Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020; New York, NY, USA, 7–12 February 2020; AAAI Press: Menlo Park, CA, USA, 2020; pp. 9636-9643.

11. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. Proceedings of the 1st International Conference on Learning Representations, ICLR 2013; Scottsdale, AZ, USA, 2–4 May 2013.

12. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014; Doha, Qatar, 25–29 October 2014; Moschitti, A.; Pang, B.; Daelemans, W. A meeting of SIGDAT, a Special Interest Group of the ACL Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1532-1543. [DOI: https://dx.doi.org/10.3115/v1/d14-1162]

13. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018; New Orleans, LA, USA, 1–6 June 2018; Walker, M.A.; Ji, H.; Stent, A. Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; Volume 1 (Long Papers), pp. 2227-2237. [DOI: https://dx.doi.org/10.18653/v1/n18-1202]

14. Pota, M.; Ventura, M.; Fujita, H.; Esposito, M. Multilingual evaluation of pre-processing for BERT-based sentiment analysis of tweets. Expert Syst. Appl.; 2021; 181, 115119. [DOI: https://dx.doi.org/10.1016/j.eswa.2021.115119]

15. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018; Available online: https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf (accessed on 11 March 2022).

16. Guarasci, R.; Silvestri, S.; De Pietro, G.; Fujita, H.; Esposito, M. Assessing BERT’s ability to learn Italian syntax: A study on null-subject and agreement phenomena. J. Ambient. Intell. Humaniz. Comput.; 2021; [DOI: https://dx.doi.org/10.1007/s12652-021-03297-4]

17. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog; 2019; 1, 9.

18. Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; Wu, H. ERNIE: Enhanced Representation through Knowledge Integration. arXiv; 2019; arXiv: 1904.09223

19. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. Proceedings of the 8th International Conference on Learning Representations, ICLR 2020; Addis Ababa, Ethiopia, 26–30 April 2020.

20. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.G.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019; Vancouver, BC, Canada, 8–14 December 2019; pp. 5754-5764.

21. Hirschman, L.; Light, M.; Breck, E.; Burger, J.D. Deep Read: A Reading Comprehension System. Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, University of Maryland; College Park, MD, USA, 20–26 June 1999; Dale, R.; Church, K.W. Association for Computational Linguistics: Stroudsburg, PA, USA, 1999; pp. 325-332. [DOI: https://dx.doi.org/10.3115/1034678.1034731]

22. Riloff, E.; Thelen, M. A rule-based question answering system for reading comprehension tests. Proceedings of the ANLP-NAACL 2000 Workshop: Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems; Association for Computational Linguistics: Stroudsburg, PA, USA, 2000.

23. Hao, X.; Chang, X.; Liu, K. A Rule-based Chinese Question Answering System for Reading Comprehension Tests. Proceedings of the 3rd International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2007); Kaohsiung, Taiwan, 26–28 November 2007; Liao, B.; Pan, J.; Jain, L.C.; Liao, M.; Noda, H.; Ho, A.T.S. IEEE Computer Society: Los Alamitos, CA, USA, Washington, DC, USA, Brussels, Belgium, Tokyo, Japan, 2007; pp. 325-329. [DOI: https://dx.doi.org/10.1109/IIH-MSP.2007.59]

24. Wang, X.-J.; Bai, Z.-W.; Li, K.; Yuan, C.-X. Survey on Machine Reading Comprehension. J. Beijing Univ. Posts Telecommun.; 2019; 42, pp. 1-9.

25. Catelli, R.; Casola, V.; Pietro, G.D.; Fujita, H.; Esposito, M. Combining contextualized word representation and sub-document level analysis through Bi-LSTM+CRF architecture for clinical de-identification. Knowl. Based Syst.; 2021; 213, 106649. [DOI: https://dx.doi.org/10.1016/j.knosys.2020.106649]

26. Hermann, K.M.; Kociský, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching Machines to Read and Comprehend. Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015; Montreal, QC, Canada, 7–12 December 2015; pp. 1693-1701.

27. Seo, M.J.; Kembhavi, A.; Farhadi, A.; Hajishirzi, H. Bidirectional Attention Flow for Machine Comprehension. Proceedings of the 5th International Conference on Learning Representations, ICLR 2017; Toulon, France, 24–26 April 2017.

28. Chen, D.; Fisch, A.; Weston, J.; Bordes, A. Reading Wikipedia to Answer Open-Domain Questions. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017; Vancouver, BC, Canada, 30 July–4 August 2017; Barzilay, R.; Kan, M. Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; Volume 1: Long Papers, pp. 1870-1879. [DOI: https://dx.doi.org/10.18653/v1/P17-1171]

29. Yu, A.W.; Dohan, D.; Luong, M.; Zhao, R.; Chen, K.; Norouzi, M.; Le, Q.V. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. Proceedings of the 6th International Conference on Learning Representations, ICLR 2018; Vancouver, BC, Canada, 30 April–3 May 2018.

30. Dhingra, B.; Liu, H.; Yang, Z.; Cohen, W.W.; Salakhutdinov, R. Gated-Attention Readers for Text Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017; Vancouver, BC, Canada, 30 July–4 August 2017; Barzilay, R.; Kan, M. Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; Volume 1: Long Papers, pp. 1832-1846. [DOI: https://dx.doi.org/10.18653/v1/P17-1168]

31. Xu, Y.; Liu, J.; Gao, J.; Shen, Y.; Liu, X. Dynamic fusion networks for machine reading comprehension. arXiv; 2017; arXiv: 1711.04964

32. Zhu, H.; Wei, F.; Qin, B.; Liu, T. Hierarchical Attention Flow for Multiple-Choice Reading Comprehension. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18); New Orleans, LA, USA, 2–7 February 2018; McIlraith, S.A.; Weinberger, K.Q. AAAI Press: Menlo Park, CA, USA, 2018; pp. 6077-6085.

33. Wang, S.; Jiang, J. Machine Comprehension Using Match-LSTM and Answer Pointer. Proceedings of the 5th International Conference on Learning Representations, ICLR 2017; Toulon, France, 24–26 April 2017.

34. Zhu, P.; Zhao, H.; Li, X. DUMA: Reading comprehension with transposition thinking. arXiv; 2020; arXiv: 2001.09415 [DOI: https://dx.doi.org/10.1109/TASLP.2021.3138683]

35. Ran, Q.; Lin, Y.; Li, P.; Zhou, J.; Liu, Z. NumNet: Machine Reading Comprehension with Numerical Reasoning. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019; Hong Kong, China, 3–7 November 2019; Inui, K.; Jiang, J.; Ng, V.; Wan, X. Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 2474-2484. [DOI: https://dx.doi.org/10.18653/v1/D19-1251]

36. Bai, Z.; Li, K.; Chen, J.; Yuan, C.; Wang, X. RAIN: A Relation-based Arithmetic model with Implicit Numbers. Proceedings of the 2020 IEEE 6th International Conference on Computer and Communications (ICCC); Chengdu, China, 11–14 December 2020; pp. 2370-2375.

37. Ma, J.; Zhao, Z.; Yi, X.; Chen, J.; Hong, L.; Chi, E.H. Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018; London, UK, 19–23 August 2018; pp. 1930-1939. [DOI: https://dx.doi.org/10.1145/3219819.3220007]

38. Shen, Y.; Huang, P.; Gao, J.; Chen, W. ReasoNet: Learning to Stop Reading in Machine Comprehension. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Halifax, NS, Canada, 13–17 August 2017; pp. 1047-1055. [DOI: https://dx.doi.org/10.1145/3097983.3098177]

39. Yu, J.; Zha, Z.; Yin, J. Inferential Machine Comprehension: Answering Questions by Recursively Deducing the Evidence Chain from Text. Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019; Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1: Long Papers, pp. 2241-2251. [DOI: https://dx.doi.org/10.18653/v1/p19-1217]

40. Zhou, Q.; Wang, X.; Dong, X. Differentiated Attentive Representation Learning for Sentence Classification. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018; Stockholm, Sweden, 13–19 July 2018; pp. 4630-4636. [DOI: https://dx.doi.org/10.24963/ijcai.2018/644]

41. Liu, W.; Zhou, P.; Wang, Z.; Zhao, Z.; Deng, H.; Ju, Q. FastBERT: A Self-distilling BERT with Adaptive Inference Time. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020; Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6035-6044. [DOI: https://dx.doi.org/10.18653/v1/2020.acl-main.537]

42. Xin, J.; Tang, R.; Lee, J.; Yu, Y.; Lin, J. DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020; Online, 5–10 July 2020; Jurafsky, D.; Chai, J.; Schluter, N.; Tetreault, J.R. Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2246-2251. [DOI: https://dx.doi.org/10.18653/v1/2020.acl-main.204]

43. Sun, K.; Yu, D.; Chen, J.; Yu, D.; Choi, Y.; Cardie, C. DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension. Trans. Assoc. Comput. Linguistics; 2019; 7, pp. 217-231. [DOI: https://dx.doi.org/10.1162/tacl_a_00264]

44. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015; San Diego, CA, USA, 7–9 May 2015.

45. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M. et al. Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020- Demos; Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 38-45. [DOI: https://dx.doi.org/10.18653/v1/2020.emnlp-demos.6]

46. Zhang, X.; Zhao, J.J.; LeCun, Y. Character-level Convolutional Networks for Text Classification. Proceedings of the NeurIPS; Montreal, QC, Canada, 7–12 December 2015; pp. 649-657.

47. Lai, Y.; Zhang, C.; Feng, Y.; Huang, Q.; Zhao, D. Why Machine Reading Comprehension Models Learn Shortcuts? Proceedings of the Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021; Online Event, 1–6 August 2021; Zong, C.; Xia, F.; Li, W.; Navigli, R. Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 989-1002. [DOI: https://dx.doi.org/10.18653/v1/2021.findings-acl.85]

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Answering different multi-choice machine reading comprehension (MRC) questions generally requires different information due to the abundant diversity of the questions, options and passages. Recently, pre-trained language models which provide rich information have been widely used to address MRC tasks. Most of the existing work only focuses on the output representation at the top layer of the models; the subtle and beneficial information provided by the intermediate layers is ignored. This paper therefore proposes a multi-decision based transformer model that builds multiple decision modules by utilizing the outputs at different layers to confront the various questions and passages. To avoid the information diversity in different layers being damaged during fine-tuning, we also propose a learning rate decaying method to control the updating speed of the parameters in different blocks. Experimental results on multiple publicly available datasets show that our model can answer different questions by utilizing the representation in different layers and speed up the inference procedure with considerable accuracy.

Details

Title: Exploiting Diverse Information in Pre-Trained Language Model for Multi-Choice Machine Reading Comprehension
Author: Bai, Ziwei; Liu, Junpeng; Wang, Meiqi; Yuan, Caixia; Wang, Xiaojie
First page: 3072
Publication year: 2022
Publication date: 2022
Publisher: MDPI AG
e-ISSN: 2076-3417
Source type: Scholarly Journal
Language of publication: English
DOI: https://doi.org/10.3390/app12063072
ProQuest document ID: 2642347195
