1. Introduction
Semantic similarity evaluation is used in various fields such as machine translation, information retrieval, question answering, and plagiarism detection [1,2,3,4]. Semantic similarity is measured for two texts regardless of their length, the positions of the corresponding words, and their contexts. Having humans judge semantic similarity directly costs considerable time and money. To address this problem, past studies have used bilingual evaluation understudy (BLEU) [5] or the metric for evaluation of translation with explicit ordering (METEOR) [6]. However, both are vocabulary-based similarity evaluation methods, and they have difficulty capturing expressions that are similar rather than identical. Recent studies [2,3,4] have shown good performance using artificial neural networks, such as convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU), to evaluate the semantic similarity of sentence pairs.
These methods [2,3,4] have used the output of the last hidden state, which represents the whole sentence, to evaluate similarity. However, judging sentence similarity using only information that represents the entire sentence has a limitation, in that it is difficult to properly reflect the effect of local semantic similarity [7,8]. When a text is composed of multiple sentences, it is very important to estimate the semantic similarity between the individual sentence pairs. Consider the example below.
Sentence 1: I’m a 19-year-old. How can I improve my skills or what should I do to become an entrepreneur in the next few years?
Sentence 2: I am a 19 years old guy. How can I become a billionaire in the next 10 years?
The two texts above have similar expressions overall, but their second sentences have distinctly different meanings. Therefore, in this study, we propose a model that uses not only global features, i.e., information on the entire sentence, but also local features, i.e., information on local parts of the sentence.
Recently, the capsule network [9] was proposed and showed good performance using local features in the image classification task. This approach vectorizes the properties of an entity while considering spatial relationships. Capsule networks have also shown good performance in text classification tasks [10,11,12]. A capsule network considers the spatial relationships of local features, and more elaborate learning is achieved through dynamic routing, which sends input values to the appropriate upper-level capsules. However, in the case of the semantic similarity task, even if the order of words in the two texts differs, it cannot be asserted that the meaning of the two texts is different. Therefore, in this study, dynamic routing is not used, and the individual capsule vectors are used as local features.
We propose a model that uses global and local features together for the semantic similarity evaluation of sentence pairs. The two sentences are input into a bidirectional LSTM (Bi-LSTM), which has a forward-backward LSTM structure [13], and the last hidden states extracted in this way are used as global features. The sequence of hidden states extracted from the Bi-LSTM is entered into self-attention to express the sentence’s context information [8,14,15,16]. The context features from self-attention are then entered into a capsule network to which dynamic routing is not applied, and the resulting capsule vectors are used as local features. Finally, the similarity is evaluated by applying the Manhattan distance to the global and local features of the two sentences. All neural networks, except self-attention, have a Siamese network structure.
The contributions of this study are as follows:
To evaluate the semantic similarity of sentence pairs, we propose a model that simultaneously uses global features (entire-sentence information) and local features (localized sentence information). The proposed model can adjust whether to focus more on global information or local information, and its accuracy is higher than that of existing models that use only global information.
In this study, the effect of dynamic routing on similarity evaluation was investigated. Since the semantic similarity of sentences is relatively insensitive to the positions of the corresponding phrases, it was found that dynamic routing hinders correct evaluation. In addition, experiments were conducted on both English and Korean datasets to demonstrate the language independence of the proposed model.
This paper is organized as follows. Section 2 briefly describes the models of former studies and our proposed model for evaluating sentence similarity. Section 3 describes in detail the proposed model for global and local feature extraction. Section 4 describes the data used in this study, the hyperparameters used in the experiments, and the experimental results. Section 5 discusses the results. Finally, Section 6 summarizes this study and proposes future work.
2. Related Works
Sentence similarity evaluation is used in various fields, and recently, good performance has been achieved by utilizing deep learning [2,3,4]. Ref. [2] evaluates the similarity by applying the Siamese network structure to LSTM [13], a family of recurrent neural networks (RNN) that performs well on sequential data. The Siamese network is a structure in which two inputs are entered simultaneously into a single neural network [2]. Words constituting the sentences are represented as vectors through word embedding. The two sentences composed of word vectors are entered into the LSTM, and learning proceeds. The similarity is evaluated using the Manhattan-distance-based calculation exp(−‖h_a − h_b‖_1), where h_a and h_b are the output vectors of the last hidden states representing each sentence.
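As a concrete illustration, the following is a minimal numpy sketch of this Manhattan-distance-based similarity; the toy vectors stand in for the last hidden states of the two sentences and are not taken from any trained model.

```python
import numpy as np

def manhattan_similarity(h_a: np.ndarray, h_b: np.ndarray) -> float:
    """exp(-||h_a - h_b||_1): close to 1 for similar sentence vectors,
    close to 0 for dissimilar ones."""
    return float(np.exp(-np.sum(np.abs(h_a - h_b))))

h_a = np.array([0.20, -0.10, 0.40])    # toy sentence representations
h_b = np.array([0.25, -0.05, 0.35])
print(manhattan_similarity(h_a, h_b))  # ~0.86, i.e., judged similar
```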
Ref. [3] evaluates the similarity by applying the Siamese network structure to a CNN and an LSTM. Ref. [3] converts the two sentences into two vectors. Then, local features are extracted through the CNN, which captures information on adjacent words. The extracted local features are entered into the LSTM, and the LSTM is trained. The similarity between the two sentences is calculated by applying the Manhattan distance to the LSTM output vectors.
Ref. [4] evaluates the similarity by applying the Siamese network structure to a group CNN and a bidirectional GRU (Bi-GRU), a family of RNNs. Ref. [4] converts the two sentences into two vectors. Then, multiple local features are extracted through the group CNN, and the most representative local features are obtained by applying max-pooling to them. These extracted local features are concatenated with the word vectors and then entered into the Bi-GRU. The similarity is determined by applying the Manhattan distance to the Bi-GRU output vectors.
To consider whole-sentence information, the model proposed in this study uses the last hidden state as a global feature by applying the Siamese network structure to Bi-LSTM. Additionally, to use localized sentence information representing contextual information, the features extracted from the capsule network after applying self-attention are used as local features. Finally, sentence similarity is evaluated by applying the Manhattan distance to the global and local features.
3. Materials and Methods
Figure 1 shows the overall structure of the proposed model including word embedding, Bi-LSTM, self-attention, and a capsule network. In this model, global features are extracted from Bi-LSTM and local features from the capsule network. Global and local features are used to evaluate semantic similarity using the Manhattan distance.
All neural networks, except self-attention, have the Siamese network structure. The Siamese network is a structure in which two input values enter a single neural network, sharing the weights of the neural network. Features created through the trained Siamese network form vectors with short distances if the input sentences are similar and vectors with long distances if they are not [2,3,4].
3.1. Word Embedding
In the field of natural language processing, words constituting text are expressed as vectors through word embedding [2,3,4]. Word embedding refers to expressing a word’s meaning as dense vectors so that computers can understand human words [17,18]. This stems from the assumption in distributional semantics that words appearing in similar contexts have similar meanings. Based on this, words with similar meanings are expressed in similar vectors [19].
In this study, Word2Vec [17,18], one of the word-embedding techniques, is used to express words as vectors. Word2Vec vectorizes words using a target word and the surrounding words within a predetermined window size. Word2Vec updates its learning weights so that the dot product between the vector of the target word and the vectors of its surrounding words is maximized. In this way, words are expressed as vectors representing their meanings.
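As a usage illustration (not the authors’ training script), Word2Vec embeddings of this kind can be produced with gensim; the toy corpus and most hyperparameters below are placeholders, apart from the 300-dimensional vector size mentioned in Section 4.1.

```python
from gensim.models import Word2Vec

# toy corpus; the real experiments use the Google News and Kookmin corpora
sentences = [
    ["how", "can", "i", "become", "an", "entrepreneur"],
    ["how", "can", "i", "become", "a", "billionaire"],
]

# vector_size=300 matches the 300-dimensional embeddings used in the paper;
# window is the surrounding-word context size described above
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)

vec = model.wv["entrepreneur"]   # 300-dimensional word vector
print(vec.shape)                 # (300,)
print(model.wv.similarity("entrepreneur", "billionaire"))
```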
3.2. Bidirectional Long Short-Term Memory
RNN is a neural network with good performance for sequential data processing, but its learning ability degrades due to gradient vanishing, in which the gradient disappears during backpropagation as the timestep, which refers to the order of the input data, increases [7]. This problem is called long-term dependency. To solve this shortcoming of RNN, LSTM was proposed. LSTM has a structure in which a cell state (c_t) that can hold long-term memory is added to the hidden state (h_t) of RNN, and this solves the long-term dependency problem [7]. To obtain c_t and h_t, LSTM uses an input gate (i_t), a forget gate (f_t), an output gate (o_t), and a candidate cell (c̃_t), which analyze h_{t−1} of the previous timestep and x_t, the input information of the current timestep. f_t is a gate that determines how much information of c_{t−1} is preserved by calculating the Hadamard product of f_t and c_{t−1}. i_t is a gate that determines how much information of c̃_t is reflected by calculating the Hadamard product of i_t and c̃_t. o_t is a gate that determines h_t, the current hidden state, by calculating the Hadamard product of o_t and tanh(c_t). The equations of LSTM are as follows:
In Equation (1), the sigmoid function σ in i_t, f_t, and o_t adjusts how much of the corresponding information will be reflected. W_i, W_f, W_o, and W_c are the learning weight matrices connected to x_t; U_i, U_f, U_o, and U_c are connected to h_{t−1}; and b_i, b_f, b_o, and b_c are the biases of the corresponding layers. c_t is updated by adding the Hadamard product of f_t and c_{t−1} and the Hadamard product of i_t and c̃_t. Finally, h_t is created by the Hadamard product of o_t and tanh(c_t).
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)    (1)
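To make Equation (1) concrete, the sketch below runs one LSTM timestep in numpy; the weight shapes and random initialization are illustrative assumptions, not trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One timestep of Equation (1). W, U, b are dicts keyed by gate name."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell
    c_t = f_t * c_prev + i_t * c_tilde                          # Hadamard products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

d_in, d_hid = 300, 256   # word vector size and number of LSTM units
rng = np.random.default_rng(0)
W = {g: rng.normal(scale=0.01, size=(d_hid, d_in)) for g in "ifoc"}
U = {g: rng.normal(scale=0.01, size=(d_hid, d_hid)) for g in "ifoc"}
b = {g: np.zeros(d_hid) for g in "ifoc"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), W, U, b)
print(h.shape, c.shape)  # (256,) (256,)
```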
Bi-LSTM has a structure with a forward LSTM that sequentially processes the input from x_1 to x_L and a backward LSTM that sequentially processes it from x_L to x_1. The hidden states sequentially produced by Bi-LSTM are defined as follows:
Equation (2) is the concatenation of the hidden state h_t^f of the forward LSTM at timestep t and the hidden state h_t^b of the backward LSTM, where L refers to the length of the sequence. In Equation (3), H is the concatenation of the hidden states of all timesteps. H is then input to the attention mechanism.
h_t = [h_t^f ; h_t^b],  t = 1, 2, …, L    (2)
H = [h_1, h_2, …, h_L]    (3)
The global features that contain the information of the whole sentence are defined as follows:
Equation (4) defines the global feature h_global as the concatenation of the last hidden states of the forward and backward LSTMs. The global features are used together with the local features, described in the section on the capsule network, to further improve the accuracy of the model.
h_global = [h_L^f ; h_1^b]    (4)
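A minimal PyTorch sketch of Equations (2)–(4) follows, assuming a single-layer Bi-LSTM with 256 units per direction as listed in Table 3; the tensor shapes and variable names are illustrative, not the authors’ implementation. In the Siamese setting, the same module is simply applied to both sentences so that the weights are shared.

```python
import torch
import torch.nn as nn

# single-layer Bi-LSTM; 300-dim word vectors, 256 units per direction (Table 3)
bilstm = nn.LSTM(input_size=300, hidden_size=256,
                 batch_first=True, bidirectional=True)

x = torch.randn(1, 12, 300)          # one sentence of 12 word vectors
H, (h_n, _) = bilstm(x)              # H: (1, 12, 512), i.e., Equation (3)

# Equation (4): concatenate the final forward and backward hidden states
h_global = torch.cat([h_n[0], h_n[1]], dim=-1)   # (1, 512)
print(H.shape, h_global.shape)
```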
3.3. Attention Mechanism
The attention mechanism [8,14,15,16] is a method of correlating the words in a sentence. In this study, self-attention using only one sentence is used, and H, the sequence of hidden states of Bi-LSTM, is used as the input value. Self-attention is defined as follows:
In Equation (5), h_i is the hidden state of the current timestep, and h_j is the hidden state of any timestep, including the current one. W_i and W_j refer to the learning weights of the corresponding timesteps i and j, and b is the bias vector. In Equation (6), v is a learning weight that calculates the importance of each word in terms of the current word, and e_ij is a scalar value representing the importance of h_j in terms of h_i. The importance of words is normalized to a probability value a_ij by Equation (7). Then, C, containing context information, is extracted by Equation (8): by multiplying the hidden state of each word by its importance probability and adding all the results, the final context vector of a given word is obtained. C is an L × 2u matrix when the length of the sentence is L and the number of units of Bi-LSTM is u.
u_{ij} = tanh(W_i h_i + W_j h_j + b)    (5)
e_{ij} = v^T u_{ij}    (6)
a_{ij} = exp(e_{ij}) / Σ_{k=1}^{L} exp(e_{ik})    (7)
c_i = Σ_{j=1}^{L} a_{ij} h_j    (8)
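The self-attention of Equations (5)–(8) can be sketched in numpy as below; the parameter shapes (square weight matrices and a single score vector v) are assumptions made for this illustration, since the paper does not state them explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 12, 512                 # sentence length; Bi-LSTM hidden size (2 x 256 units)
H = rng.normal(size=(L, d))    # hidden states from Bi-LSTM, one row per word

# hypothetical parameter shapes
W_i = rng.normal(scale=0.01, size=(d, d))
W_j = rng.normal(scale=0.01, size=(d, d))
b = np.zeros(d)
v = rng.normal(scale=0.01, size=d)

# Equations (5)-(6): importance score e[i, j] of hidden state j w.r.t. i
e = np.empty((L, L))
for i in range(L):
    for j in range(L):
        u_ij = np.tanh(W_i @ H[i] + W_j @ H[j] + b)   # Equation (5)
        e[i, j] = v @ u_ij                            # Equation (6)

# Equation (7): softmax over j, so each row of a sums to 1
a = np.exp(e - e.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)

# Equation (8): context matrix C, one context vector per word
C = a @ H
print(C.shape)   # (12, 512)
```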
3.4. Capsule Network
In this study, a capsule network using CNNs is used to extract local features [10,11,12,13]. The capsule network consists of two CNNs, Conv1 and Conv2. In this study, C extracted from self-attention is used as the input value of the capsule network. Conv1 proceeds as follows:
In Equation (9), g refers to the activation function, i refers to the index of the word, and K refers to the kernel size. W^c refers to the learning weights of a convolution filter applied to a window of K rows of C, and b^c refers to the bias vector. Conv1 is a typical CNN, extracting combined information from as many adjacent words as K covers. Through this, a feature p_i is created for each position i. Finally, the Conv1 output P is an L × F matrix when the length of the sentence is L and the number of filters is F.
p_i = g(W^c · C_{i:i+K−1} + b^c)    (9)
Conv2 uses the features created by Conv1 as its input. Conv2 is processed by Equation (9), like Conv1, but it uses a filter size corresponding to the entire size of the input in order to subdivide the representation of the whole sentence [13,20]. The features extracted by Conv2 in this way form a 1 × F matrix, determined by the number of filters F. After that, PrimaryCaps with dimension d_c are created by reshaping these features. Here, PrimaryCaps means capsules that subdivide the entire sentence information into F/d_c pieces. To normalize the size of each capsule vector, PrimaryCaps uses squash, a nonlinear function [10,11,12,13]. Squash is defined as follows:
In Equation (10), s_j means one capsule. In this study, the PrimaryCaps v_j with squash applied are used as local features.
v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖)    (10)
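Below is a small numpy sketch of the PrimaryCaps construction and the squash function of Equation (10), using the Table 3 values of 256 Conv2 filters and 8-dimensional capsules; the Conv2 output itself is replaced by random numbers here.

```python
import numpy as np

def squash(s: np.ndarray, axis: int = -1, eps: float = 1e-8) -> np.ndarray:
    """Equation (10): shrink each capsule vector to length < 1
    while preserving its direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

# stand-in for the Conv2 output: 256 filter responses for the whole sentence
F, cap_dim = 256, 8
conv2_out = np.random.default_rng(0).normal(size=(F,))

# reshape into 32 PrimaryCaps of dimension 8 and apply squash
primary_caps = squash(conv2_out.reshape(F // cap_dim, cap_dim))
print(primary_caps.shape)                      # (32, 8)
print(np.linalg.norm(primary_caps, axis=-1))   # all capsule lengths below 1
```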
3.5. Similarity Measure
Global features extracted from Bi-LSTM and local features extracted from the capsule network are generated for each of the two sentences. In this study, the similarity between the two sentences is evaluated by applying the Manhattan distance to the global features and the local features. The Manhattan-distance-based similarity is close to 1 if the two vectors are similar and close to 0 if they are dissimilar [2,3,4]. The similarities of the global and local features are defined as follows:
In Equations (11) and (12), a and b denote the two sentences. In Equation (11), h_global is the global feature using the last hidden state of Bi-LSTM. In Equation (12), v denotes the PrimaryCaps, and N denotes the number of dimensions of the PrimaryCaps.
sim_global(a, b) = exp(−‖h_global^a − h_global^b‖_1)    (11)
sim_local(a, b) = exp(−‖v^a − v^b‖_1 / N)    (12)
The final similarity value is calculated with the alpha weight (α) as follows:
In Equation (13), α is a weight that adjusts which information to focus on between the global features and the local features. The value of α is determined experimentally. Finally, a predicted y of 0.5 or higher is judged as similar, and lower values as not similar.
y = α · sim_global + (1 − α) · sim_local    (13)
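Putting Equations (11)–(13) together, a hedged numpy sketch of the final decision step follows; the normalization used for the local similarity is an assumption of this sketch rather than a detail stated in the paper, and the random vectors only stand in for real network outputs.

```python
import numpy as np

def manhattan_sim(x, y):
    """exp of the negative L1 distance, as in Equation (11)."""
    return float(np.exp(-np.sum(np.abs(x - y))))

def predict_similarity(g_a, g_b, caps_a, caps_b, alpha=0.8):
    """Equation (13): blend global and local similarity with weight alpha
    (alpha was 0.8 for English and 0.5 for Korean in the experiments)."""
    sim_g = manhattan_sim(g_a, g_b)
    # local similarity with the L1 distance averaged over the PrimaryCaps
    # entries; this normalization is an assumption made for the sketch
    n = caps_a.size
    sim_l = float(np.exp(-np.sum(np.abs(caps_a - caps_b)) / n))
    y = alpha * sim_g + (1.0 - alpha) * sim_l
    return y, y >= 0.5     # threshold of 0.5 decides similar / not similar

rng = np.random.default_rng(0)
g_a, g_b = rng.normal(size=512), rng.normal(size=512)
caps_a, caps_b = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
print(predict_similarity(g_a, g_b, caps_a, caps_b, alpha=0.8))
```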
4. Experiments
In the experiments, the accuracy of the model proposed in this study is compared with that of past models. Experiments are conducted in English and Korean to show that our model improves accuracy regardless of language. To directly check the effect of dynamic routing, we built models with and without dynamic routing.
4.1. Dataset
In this experiment, a corpus for learning word representations and a corpus for learning the similarity of given sentence pairs are both needed. To learn the representations of English words, we use the Google News corpus [21]. For Korean, we use the raw Korean sentence corpus produced and distributed by Kookmin University in Korea [22]. Individual words are embedded by Word2Vec. These corpora have shown good results in existing sentence similarity evaluation research [2,3,23]. English and Korean words are represented as 300-dimensional vectors, so a sentence of L words is represented as an L × 300 matrix. The next two subsections describe the two datasets used for training the similarity evaluation models in English and Korean, respectively.
4.1.1. English Dataset
The English dataset used in this study is Quora Question Pairs [24]. Preprocessing is carried out in which stopwords and all special characters are deleted from all sentences, and upper-case letters are changed to lower case. In addition, since sentences could become very short after preprocessing, only sentences with more than five words are collected. Table 1 shows examples of the English sentence pairs and their labels collected for the experiment.
In Table 1, label 0 means that the sentence pair does not have a similar meaning, and label 1 means that the pair is similar in meaning. The number of items for label 0 is 50,000, and the number of items for label 1 is also 50,000, so a total of 100,000 sentence pairs are used in this study. The training, validation, and test sets are then split in an 8:1:1 ratio.
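A short scikit-learn sketch of the 8:1:1 split described above, using toy stand-ins for the Quora question pairs and labels (variable names and data are illustrative only):

```python
from sklearn.model_selection import train_test_split

# toy stand-ins for the question pairs and their 0/1 labels
pairs = [("how can i learn python", "what is the best way to learn python"),
         ("how old is the earth", "how can i become a billionaire")] * 50
labels = [1, 0] * 50

# 8:1:1 split: first hold out 20%, then split that portion half-and-half
train_p, rest_p, train_y, rest_y = train_test_split(
    pairs, labels, test_size=0.2, stratify=labels, random_state=42)
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
print(len(train_p), len(val_p), len(test_p))   # 80 10 10
```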
4.1.2. Korean Dataset
The Korean dataset used in this study is collected through translation of Quora Question Pairs [24] by Google Translate [25], Naver question pairs [26], the Exobrain Korean paraphrase corpus [27], and German translation pairs developed by Hankuk University of Foreign Studies [28]. To refine the Korean dataset, the Naver Korean spell checker [29] is used. The experiment is conducted after dividing words into morphemes, the smallest meaningful units of words, using Kkma, a Korean morphological analyzer [30]. Table 2 shows examples of the Korean sentence pairs and labels collected for the experiment.
In Table 2, the English rows are translations of the Korean sentence pairs, and the label of a sentence pair is marked as 1 if the meanings are similar and 0 if not. The number of items for label 0 is 5500, and the number of items for label 1 is 5500, so a total of 11,000 sentence pairs are used in this study. The training, validation, and test sets are then split in an 8:1:1 ratio.
4.2. Hyperparameters
Table 3 shows the hyperparameters for each neural network used in the experiment.
4.3. Accuracy Comparison According to α
Figure 2a,b show the performance comparison according to α of Equation (13) in Section 3.5. Figure 2a shows the accuracy on the English dataset and Figure 2b the accuracy on the Korean dataset. In the case of English, the higher the proportion of global features, the higher the accuracy, with the peak at a global-feature weight of 0.8. For Korean, accuracy also increases with the weight of the global features; however, it peaks when the weights of the two features are around 5:5. This is because the proportion of local features is relatively more important for Korean, which has freer word order than English.
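The experimental choice of α can be reproduced with a simple sweep such as the sketch below; the per-pair similarities and gold labels here are random placeholders for the real validation outputs of the trained network.

```python
import numpy as np

# hypothetical validation outputs: per-pair global and local similarities
# plus gold labels; in practice these come from the trained model
sim_g = np.random.default_rng(1).uniform(size=1000)
sim_l = np.random.default_rng(2).uniform(size=1000)
gold = np.random.default_rng(3).integers(0, 2, size=1000)

best_alpha, best_acc = None, 0.0
for alpha in np.arange(0.0, 1.01, 0.1):           # sweep alpha as in Figure 2
    pred = (alpha * sim_g + (1 - alpha) * sim_l) >= 0.5   # Equation (13)
    acc = float(np.mean(pred == gold))
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc
print(best_alpha, best_acc)
```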
4.4. Result
Table 4 shows the accuracy of various models, including the previously proposed semantic similarity evaluation models. Acc shows the average accuracy of 10 experiments using Manhattan distance. Here, English Acc means the accuracy using the English dataset, and Korean Acc means accuracy using the Korean dataset.
In Table 4, No. 1 to No. 4 are models using only global features. No. 1 to No. 3 are models proposed in the previous studies [2,3,4], and No. 4 and No. 5 are models using simple Bi-LSTM and CNN, respectively. These models are arbitrarily built to compare the accuracy of the RNN series model and the CNN series model. RNN shows higher accuracy than CNN in this task. No. 6 to No. 11 are models using only local features, i.e., the capsule network. No. 6 and No. 7 are experiments to verify the effect of dynamic routing when using a capsule network. In both English and Korean, it can be seen that dynamic routing causes performance degradation. No. 8 to No. 11 are models applying the capsule network to RNN-related models, but they do not use the last hidden state of LSTM. No. 10 and No. 11 are experiments in which self-attention is added to No. 8 and No. 9, respectively. Finally, No. 12 is the proposed model using both global and local features.
5. Discussion
Existing studies [2,3,4] have tried to estimate the similarity of sentences mainly by using global features. We were able to improve accuracy by up to 8.32%p (Korean) by also utilizing local features. In addition, an accuracy improvement of up to 3.03%p was achieved simply by changing the LSTM to a Bi-LSTM (compare No. 1 and No. 4). A model capturing forward and backward information simultaneously is more effective. Comparing No. 1 with No. 2, and No. 3 with No. 4, it can be seen that simply and mechanically combining global features and local features does not help improve performance; rather, it only decreases accuracy.
A capsule network is designed to solve the problem of CNN’s pooling method, but CNN’s pooling shows higher accuracy not only on the English dataset but also on the Korean dataset (see No. 5 and No. 7 in Table 4). Furthermore, the capsule network model that does not use dynamic routing (No. 6) has higher accuracy than both the CNN-based model (No. 5) and the capsule network with dynamic routing (No. 7). This confirms that dynamic routing, which considers the spatial relationships of local features, is inappropriate for the task of semantic similarity estimation. Unlike images, in which the relative positions of pixel chunks are fixed to some extent, the positions of words or phrases in a sentence are relatively free.
As shown by No. 8 to No. 11, which are RNN-related models, the accuracy of the capsule network model with self-attention applied to Bi-LSTM is the best. Comparing No. 4, No. 9, and No. 11, the accuracy improvement from the capsule network and self-attention is larger for Korean than for English. The semantics of Korean sentences is concentrated in more local areas than in English, so in order to grasp the entire meaning of a Korean sentence, it is important to accurately understand the local meanings.
No. 12 is the proposed model, in which α of Equation (13) is 0.8 for English and 0.5 for Korean. This shows that α, the weight on the global features, is lower for the Korean dataset than for the English dataset. Since a larger α places a higher weight on the global features, we can infer that Korean, which has freer word order than English, depends more on local features.
6. Conclusions
In this study, the semantic similarity of sentence pairs is evaluated using global and local sentence information. The proposed model consists of Bi-LSTM, self-attention, and a capsule network, and all neural networks, except self-attention, use the Siamese network structure. The model extracts global features through Bi-LSTM and local features through the capsule network. The extracted global and local features are used to evaluate the semantic similarity of sentence pairs with the Manhattan distance. This allows not only global and local features to be considered together, but also the adjustment, through α, of which information to focus on more closely. Comparing existing models, models using only global features, models using only local features, and the model using both at the same time, the model using both global and local features achieves the highest accuracy.
However, this study has a limitation. Even though α plays a very important role in determining the weights of the global and local features, we have not been able to devise a universal methodology for obtaining it. We will look for a way to integrate α into the entire network so that its optimal value is obtained through learning. We will also apply another language model instead of Word2Vec to obtain more sophisticated sentence pair representations, and update the current network structure after analyzing its strengths and weaknesses to further improve the accuracy of the model.
Author Contributions
Conceptualization, T.-S.H. and Y.-S.K.; Data curation, T.-S.H.; Formal analysis, T.-S.H., J.-D.K. and C.-Y.P.; Methodology, T.-S.H.; Project administration, Y.-S.K.; Resources, T.-S.H.; Software, T.-S.H.; Supervision, Y.-S.K.; Validation, J.-D.K. and C.-Y.P.; Writing—original draft, T.-S.H.; Writing—review and editing, Y.-S.K. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1A2C2006010).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Publicly available datasets were analyzed in this study. This data can be found here:
Conflicts of Interest
The authors declare no conflict of interest.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Figures and Tables
Figure 1. Model of global and local features extraction for semantic similarity evaluation.
Figure 2. (a) English dataset accuracy according to α; (b) Korean dataset accuracy according to α.
Table 1. Examples of English sentence pairs and labels.
Sentence 1 | Sentence 2 | Label |
---|---|---|
I’m a 19-year-old. How can I improve my skills or what should I do to become an entrepreneur in the next few years? | I am a 19 years old guy. How can I become a billionaire in the next 10 years? | 0 |
What are the life lessons that Batman teaches us? | What are the life lessons you can learn from the dark knight? | 1 |
Table 2. Examples of Korean sentence pairs and labels.
Language | Sentence 1 | Sentence 2 | Label |
---|---|---|---|
Korean | 참고 인내하며 때를 기다려야 하는 날입니다. | 일이 잘 풀릴 것 같은 날입니다. | 0 |
English | It is a day when you have to be patient and wait for the time. | It’s a day that seems to be going well. | |
Korean | 가장 저렴한 방법으로 치아 미백 효과를 낼 수 있는 방법은? | 치아를 미백할 수 있는 저렴하고 효율적인 방법은? | 1 |
English | What is the cheapest way to produce teeth whitening effect? | What is an inexpensive and efficient way to whiten teeth? |
Table 3. Hyperparameters used in the experiment.
Features | Description | Hyperparameter |
---|---|---|
Global | The number of units of bidirectional long short-term memory (Bi-LSTM) | 256 |
Local | Activation function | Rectified Linear Unit |
 | Kernel size of Conv1 | 3 |
 | The number of filters of Conv1 | 256 |
 | Padding | None |
 | Kernel size of Conv2 | The total size of the input |
 | The number of filters of Conv2 | 256 |
 | The number of dimensions used in the PrimaryCaps | 8 |
Table 4. Model accuracy comparison according to experiments.
No. | Model | English Acc | Korean Acc |
---|---|---|---|
1 | LSTM [2] | 80.99% | 87.25% |
2 | Convolutional neural networks (CNN) + LSTM [3] | 80.44% | 83.76% |
3 | Group CNN + bidirectional gated recurrent unit (Bi-GRU) [4] | 82.18% | 89.52% |
4 | Bi-LSTM | 82.9% | 90.28% |
5 | CNN + Max-Pooling | 80.58% | 88.26% |
6 | Capsule Network | 81.57% | 88.85% |
7 | Capsule Network + Dynamic Routing | 78.82% | 86.48% |
8 | LSTM + Capsule Network | 82.07% | 90.25% |
9 | Bi-LSTM + Capsule Network | 82.15% | 90.56% |
10 | LSTM + Self-Attention + Capsule Network | 82.78% | 90.61% |
11 | Bi-LSTM + Self-Attention + Capsule Network | 82.9% | 91.41% |
12 | Proposed Model | 83.51% | 92.08% |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Abstract
Semantic similarity evaluation is used in various fields such as question answering and plagiarism testing, and many studies have been conducted on this problem. In previous studies using neural networks to evaluate semantic similarity, similarity has been measured using the global information of sentence pairs. However, since sentences do not have only one meaning but a variety of meanings, using only global information can have a negative effect on performance. Therefore, in this study, we propose a model that uses global information and local information simultaneously to evaluate the semantic similarity of sentence pairs. The proposed model can adjust whether to focus more on global information or local information through a weight parameter. As a result of the experiments, the proposed model shows higher accuracy than existing models that use only global information.