Cloze-Style Data Augmentation for Few-Shot Intent

1. Introduction

Intent recognition is a branch of text classification in natural language understanding (NLU), focusing on identifying users’ potential purposes from their utterances. For instance, it could recognize the intention “bill balance” from the utterance “What is my bill for water and electricity?”. Intent recognition plays an important role in various downstream tasks, such as dialog systems [1,2] and recommender systems [3,4].

However, in real-world applications, new intent categories emerge rapidly and come with only limited well-labeled data, making it difficult to use them directly to optimize existing deep neural networks. These networks typically include a pretrained language model, such as BERT [5] or RoBERTa [6], as the backbone that encodes the text data into continuous low-dimensional vectors. Such models have complex architectures with many layers, and therefore a considerable number of parameters. If a small amount of training data is used directly to update the parameters of such a model under the traditional training paradigm, the model captures only local features, which leads to poor generalization and overfitting, i.e., good performance on the training set but poor performance on the test set. To handle this problem, few-shot learning (FSL) strategies were proposed by Snell et al. [7], Vinyals et al. [8] and Wang et al. [9] to help the model generalize from only limited data. Under this strategy, few-shot intent recognition is regarded as a meta-learning problem in which few-shot scenarios are simulated through a series of small meta tasks. This method is widely employed in few-shot text classification [10], such as relation classification [11,12], event detection [13,14,15,16] and intent detection [17,18].

A central challenge is that meta-learning-based few-shot methods can still easily overfit the biased distribution caused by the limited training samples [19]. Some researchers have attempted to prevent overfitting via data augmentation. One of the key ideas is back-translation [20]: translating the input text into another language and then back into the original language. Another common approach is to leverage an external knowledge base to obtain expressions that are semantically similar to the original sentence [21,22]. Specifically, Dopierre et al. [22] introduced several knowledge bases to generate diverse paraphrases of the original inputs rather than reordering the tokens. However, although back-translation can generate diverse expressions for the same semantics, it performs poorly on short texts: the generated expressions are often similar or even identical to the original input sentences. As for paraphrase generation, we argue that it is not suitable for text augmentation in all domains, since it is not always possible to find a corresponding external knowledge base.

To address the aforementioned issues, we propose a data augmentation approach that is suitable for short texts and does not rely on any external knowledge base. In this work, we consider the pre-trained language model itself as a knowledge base, since it has been trained on a large text corpus and thus can perform some simple tasks. Taking a BERT-derived model as an example, one of its pre-training tasks is masked language modeling (MLM), i.e., predicting the token of the given sentence at the “[MASK]” position, which is a blank to be filled. Inspired by this task, we constructed a similar cloze task for data augmentation to make full use of the knowledge of the pre-trained language model itself. We employ the hidden state vector of the “[MASK]” token produced by the model as the data augmentation result for the input sentence, rather than a true sentence consisting of a series of tokens. Furthermore, to keep the augmentation results from being meaningless noise, we leverage an unsupervised learning method to make them semantically similar to the original input. Next, to maximize the use of the small number of samples in a meta task, we employ a supervised contrastive learning strategy that pulls samples of the same category closer and pushes samples of different categories farther apart in the embedding space.

To verify the effectiveness of our proposal, we conducted comprehensive experiments on the public datasets CLINC-150 [23] and BANKING-77 [24] with two types of meta tasks. The experiment results demonstrate that our proposal achieves obvious improvements over competitive baselines. In addition, the ablation study further verified the effectiveness of both unsupervised and contrastive learning strategies in our proposal.

In summary, the main contributions in this work can be briefly listed as follows:

We only employ the knowledge of the pre-trained language model itself to augment the data, avoiding the dependence on external knowledge.
We use an unsupervised learning strategy to leverage original input samples for meaningful data augmentation.
We apply contrastive learning at different granularities to make full use of the limited number of available instances in a meta task.

2. Related Work

2.1. Few-Shot Intent Recognition

Intent recognition is an important task in the field of natural language understanding (NLU) [25] and is well studied in multiple applications, such as conversational agents [24] and task-oriented dialogue systems [26]. Various neural architectures have been applied to the intent recognition task. For instance, Sarikaya et al. [27] proposed to leverage recurrent neural networks (RNN) to deal with the task. In addition, Liu and Lane [28] utilized an attention-based method, and Chen et al. [29] attempted to settle the problem with transformer models. As a pre-task of slot filling, intent recognition can help dialogue systems to clarify a user’s intention in each turn and then respond by marking the word containing the intention in the utterance [30].

Since the task-oriented dialogue systems are domain-specific, intent recognition is a challenging task because of the well-labeled data scarcity and the number of categories it involves [18,31]. Although the traditional methods for intent recognition can obtain satisfactory performance in conventional application scenarios, the process of labeling data can no longer reach the speed at which new intent categories emerge, which leads to a data-starved situation for traditional methods [32]. Therefore, there is only a small amount of well-labeled data in a considerable number of categories, which is called a low-resource dilemma. When facing the low-resource dilemma, traditional methods are easily trapped in the overfitting morass, manifesting as high accuracy in the training phase but much worse performance in the testing phase. Several existing few-shot learning approaches try to address this problem, mainly from two aspects, i.e., task-adaptive training with pre-trained models and data augmentation. For the task-adaptive methods, Casanueva et al. [24] leveraged related conversational pre-trained models trained from a huge dialogue corpus to tackle few-shot intent detection. For the latter ones, Zhang et al. [33] proposed a data augmentation schema pretraining a model on an annotated pair from natural language inference (NLI) datasets and designed the nearest neighbor classification schema to adopt transfer learning and classify user intents.

However, the previous data augmentation-related methods, such as the one proposed by Liu et al. [34], are inefficient for training and hard to scale to tasks with lots of intents.

2.2. Text Data Augmentation for Few-Shot Learning

Text data augmentation can be simply regarded as the process of generating a large amount of data from few data. In some cases, the amount of given data of one category is so small that the model trained using them only grasps local features. The model treats these local features as global features for this category, resulting in a poor model. Therefore, we need to generate a series of different derived data based on these limited data, which can cover more features so that the distribution of the new dataset is closer to the real data distribution. Moreover, deep neural networks have numerous parameters, which require substantial training data to make them work normally. However, in reality, the available data are not enough to train a deep neural framework. Hence, data augmentation becomes a good option to address the data scarcity issue.

Currently, there are two kinds of mainstream approaches for data augmentation in the natural language processing (NLP) field: raw text-oriented methods and text representation-oriented methods. Among the former, Wei and Zou [35] proposed a group of easy data augmentation (EDA) techniques consisting of synonym replacement, random insertion, random swap, and random deletion. However, such approaches may disrupt the syntactic structure and coherence of the original text. In addition, Yu et al. [36] generated new data by translating sentences into French and back into English, which has a high implementation cost relative to the performance gain. Among text representation-oriented methods, Xie et al. [37] utilized data noising as smoothing to generate new representations in the embedding space. However, these models do not cope well with real scenarios in which few-shot intent recognition involves many fine-grained intents, especially semantically similar ones. To tackle such few-shot learning scenarios, Dopierre et al. [31] proposed PROTAUGMENT, a meta-learning algorithm that leverages the pretrained sequence-to-sequence model BART [38] for the classification of short texts.

We argue that the above methods either destroy the syntactic structure or add meaningless noise to the sentence representations, which is not conducive to the task of intent recognition where intent categories are very similar. Therefore, inspired by Schick and Schütze [39], we propose to generate new data through a template without predicting the real word tokens to avoid introducing meaningless information.

3. Approach

We first describe the task formulation in Section 3.1. Then, we introduce the unsupervised cloze-style data augmentation strategy in Section 3.2. In Section 3.3, we discuss the metric-based contrastive learning. The overall framework of our proposal is shown in Figure 1.

3.1. Task Formulation

Given an m-word utterance $x = \{x_{1}, x_{2}, \dots, x_{m}\}$, intent recognition is defined as the task of identifying the corresponding intent label $y$, i.e., $x \to y$.

Since human annotation cannot keep up with the perpetual emergence of novel intent labels, the amount of well-labeled data cannot meet the requirements of traditional models in the training phase. As a result, traditional models are trapped in the low-resource dilemma: they perform satisfactorily in the training phase but much worse in the testing phase.

Few-shot intent recognition aims at identifying users’ potential intents based on the semantics of their utterances with only limited learnable samples. To address the low-resource dilemma, the few-shot learning paradigm organizes the training phase as a learning process for a great number of meta tasks (or “episodes” [7]). Specifically, a meta task $T$ usually consists of two parts: a support set $S$ and a query set $Q$, i.e., $T = \{S, Q\}$. The support set $S$ strictly abides by the “N-way, K-shot” format, which can be formulated as:

(1) $S = \bigcup_{n = 1}^{N} \{(x_{n, k}, y_{n})\}_{k = 1}^{K},$

and the query set $Q$ is defined in the following format:

(2) $Q = \{(x_{n, k}, y_{n})\}_{k = 1}^{K_{Q}},$

where the label set in $Q$ must be the same as in $S$, and the instances of the two do not overlap; $K_{Q}$ is the number of queries. In this way, each meta task can be regarded as an imitation of a few-shot generalization from training to testing.

Few-shot intent recognition models are usually first trained on a series of meta tasks $\{T_{train}^{(1)}, T_{train}^{(2)}, \dots\}$, then directly tested on another set of meta tasks $\{T_{test}^{(1)}, T_{test}^{(2)}, \dots\}$ that include only unseen categories, without fine-tuning. It should be pointed out that “unseen” means that the label sets of the training and testing stages do not intersect. The performance of few-shot intent recognition models is measured by the accuracy on the unseen meta tasks in the testing stage.
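As an illustration of the “N-way, K-shot” format above, the following is a minimal sketch of how one meta task could be sampled. The function name `sample_episode`, the toy intent names and the choice of drawing the queries per category are assumptions made for this example; the text above only requires that $S$ and $Q$ share the same label set and have no instance in common.

```python
import random


def sample_episode(data_by_label, n_way, k_shot, k_query, rng):
    """Sample one N-way K-shot meta task: a support set S and a query set Q.

    data_by_label maps each intent label to a list of utterances.
    k_query is taken here as the number of queries per category (an assumption).
    """
    labels = rng.sample(sorted(data_by_label), n_way)
    support, query = [], []
    for label in labels:
        # Draw K + K_Q distinct utterances so that S and Q never share an instance.
        picked = rng.sample(data_by_label[label], k_shot + k_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Toy data: six hypothetical intents with ten placeholder utterances each.
toy = {f"intent_{i}": [f"utt_{i}_{j}" for j in range(10)] for i in range(6)}
support, query = sample_episode(toy, n_way=5, k_shot=1, k_query=3, rng=random.Random(0))
```

Training and testing episodes would be drawn from two dictionaries with disjoint label sets, so that every test category is unseen.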

3.2. Unsupervised Cloze-Style Data Augmentation

We chose the widely used pre-trained masked language model BERT [5] as the feature extractor $F (\cdot)$ in our proposal. It is a model pre-trained on a large corpus that can be fine-tuned for downstream natural language processing tasks. One of the pre-training tasks is called masked language modeling. It simply masks some percentage of the input tokens with a “[MASK]” token, and then the model predicts these masked tokens according to the semantics of their context.

Following the format of the pre-training tasks, we designed an approach for data augmentation based on semantics. Specifically, inspired by Schick and Schütze [39], we introduce an auxiliary cloze-style template T to construct the pattern $Pat$ of data augmentation as follows:

(3) $\begin{matrix} T = \text{The sentence: ‘\_\_\_’ means [MASK].} \\ Pat (T, x) = \text{The sentence: ‘} x \text{’ means [MASK].} \end{matrix}$

where $x$ is the input sentence and [MASK] is the token that needs to be predicted. Since feature extractors such as BERT are already pre-trained on a large corpus, they can fill the mask according to the semantics of the sentence, i.e., the context of the [MASK] token. Thus, we transform the data augmentation task into a cloze task.
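Constructing the pattern amounts to filling the blank of the template with the utterance. A minimal sketch follows; the constant `TEMPLATE`, the function `build_pattern`, and the exact spacing and quote characters are illustrative assumptions rather than the authors' implementation.

```python
# Cloze-style template from Equation (3); "{x}" marks the blank for the utterance.
# Straight quotes and this exact spacing are assumptions made for the example.
TEMPLATE = "The sentence: '{x}' means [MASK]."


def build_pattern(utterance, template=TEMPLATE):
    """Return Pat(T, x): the template with the utterance placed in the blank."""
    return template.format(x=utterance)


pattern = build_pattern("What is my bill for water and electricity?")
```

The resulting string still contains exactly one “[MASK]” token, whose hidden state is what the approach uses as the augmented representation.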

In detail, the feature extractor $F$ encodes the pattern $Pat (T, x)$ into hidden vector representations after adding two special tokens, “[CLS]” and “[SEP],” to represent the beginning and end of $Pat (T, x)$ , respectively. It can be formulated as follows:

(4) $F (Pat (T, x)) = [h_{Pat (T, x)}^{[CLS]}, h_{Pat (T, x)}^{1}, \dots, h_{Pat (T, x)}^{[MASK]}, h_{Pat (T, x)}^{[SEP]}],$

where the hidden vector $h_{Pat (T, x)}^{[MASK]}$ is considered as the representation of the masked token [MASK]. Instead of predicting a real word based on the vector $h_{Pat (T, x)}^{[MASK]}$, as in Devlin et al. [5], we leverage it directly to represent the semantics of sentence $x$. As the pre-trained language model can “understand” the semantics of the pattern, the pattern can also guide the model to generate a proper vector that fits the context. In this way, we can regard $h_{Pat (T, x)}^{[MASK]}$ as a representation of sentence $x$ generated by the model based on the pattern $Pat (T, x)$, which is semantically similar to the input sentence $x$. Repeating Equation (4) over all input samples, we can get the corresponding data augmentation result.

However, pre-trained language models do not always generate vectors that fully match the semantics of the input sentence. Therefore, we need to devise a method to constrain the model to weaken this mismatch and finally obtain proper data augmentation results. Without introducing any external knowledge and labels, we designed an unsupervised learning method that utilizes the model’s own semantic understanding ability to force it to produce appropriate results as often as possible. We feed the sentence x into the model to obtain its low-dimensional vector representation, which can be formulated as:

(5) $F (x) = [h_{x}^{[CLS]}, h_{x}^{1}, \dots, h_{x}^{m}, h_{x}^{[SEP]}],$

where the hidden vector $h_{x}^{[CLS]}$ is chosen as the representation of the whole sentence $x$.

To constrain the model to produce an appropriate representation of [MASK], we designed a loss function that reduces the distance between $h_{Pat (T, x)}^{[MASK]}$ and $h_{x}^{[CLS]}$ by maximizing their cosine similarity:

(6) $L_{uns} (h_{Pat (T, x)}^{[MASK]}, h_{x}^{[CLS]}) = - \frac{h_{Pat (T, x)}^{[MASK]} \cdot h_{x}^{[CLS]}}{\lVert h_{Pat (T, x)}^{[MASK]} \rVert \cdot \lVert h_{x}^{[CLS]} \rVert} .$
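Equation (6) is the negative cosine similarity between the two vectors, so it ranges from -1 (same direction) to 1 (opposite direction). The sketch below evaluates it on plain Python lists; the function names and toy vectors are assumptions for illustration, and a real implementation would operate on the encoder's hidden states.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def unsupervised_loss(h_mask, h_cls):
    """Equation (6): negative cosine similarity between the [MASK] and [CLS] vectors."""
    return -cosine(h_mask, h_cls)


aligned = unsupervised_loss([1.0, 2.0], [2.0, 4.0])      # same direction
orthogonal = unsupervised_loss([1.0, 0.0], [0.0, 1.0])   # unrelated directions
opposite = unsupervised_loss([1.0, 0.0], [-1.0, 0.0])    # opposite directions
```

Minimizing this loss pushes the [MASK] representation toward the direction of the sentence representation without requiring any label.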

3.3. Contrastive Learning at Different Granularities

3.3.1. Metric-Based Prototypical Classifier

After the unsupervised cloze-style data augmentation, following Snell et al. [7], we employ metric-based prototypical networks as classifiers to examine the effect of data augmentation. Prototypical networks first calculate the average representation of samples in the same category as their prototype:

(7) $c_{i} = \frac{1}{K_{i}} \sum_{k = 1}^{K_{i}} h_{x_{i, k}^{S}}^{[CLS]}, \quad (i = 1, 2, \dots, N)$

where $c_{i}$ denotes the prototype of category $i$, $K_{i}$ denotes the number of samples in category $i$ in the support set $S$ of the current meta task $T$ and $h_{x_{i, k}^{S}}^{[CLS]}$ denotes the representation of the $k$-th sentence in category $i$. By doing so, the samples in the same category can have the shortest mean distances to their center [7,40]. Similarly, we can obtain the augmented prototype $c_{i}^{'}$ with Equation (7) based on the corresponding $h_{Pat}^{[MASK]}$. Moreover, in order to make the final prototype more fully cover the common features of its category, we weight the prototype of the input samples and the prototype of results of data augmentation, which can be formulated as follows:

(8) $c_{i}^{*} = \alpha \times c_{i} + (1 - \alpha) \times c_{i}^{'}, \quad \alpha \in (0, 1)$

where $\alpha$ is a trade-off factor to control the corresponding contributions from the original input data and augmented data.
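Equations (7) and (8) can be sketched as two averages followed by a convex combination. The helper names and toy vectors below are assumptions for illustration; the default `alpha=0.7` is the value reported in the model configuration.

```python
def mean_vector(vectors):
    """Equation (7): element-wise mean of the vectors of one category."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fused_prototype(cls_vectors, mask_vectors, alpha=0.7):
    """Equation (8): weight the original prototype against the augmented prototype."""
    c = mean_vector(cls_vectors)        # prototype from the [CLS] sentence vectors
    c_aug = mean_vector(mask_vectors)   # prototype from the [MASK] augmentation vectors
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(c, c_aug)]


# Toy 2-shot category with 2-dimensional embeddings.
cls_vectors = [[1.0, 0.0], [3.0, 0.0]]
mask_vectors = [[0.0, 2.0], [0.0, 4.0]]
prototype = fused_prototype(cls_vectors, mask_vectors, alpha=0.5)
```

With `alpha` close to 1 the final prototype relies mostly on the original sentences, while smaller values give the augmented vectors more influence.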

Given a score function $s (\cdot, \cdot)$ , prototypical networks predict a label for a query instance $x^{Q}$ by calculating the softmax distribution over similarities between the query embedding vector and the prototypes, as in Equation (9):

(9) $P (y = j | x^{Q}) = \mathrm{softmax} (s (c_{j}^{*}, h_{x^{Q}}^{[CLS]})), \quad (x^{Q} \in Q)$

where $y$ is the predicted label; $x^{Q}$ is the query instance in the query set $Q$ of the current meta task $T$; $j$ is the ground-truth label; $c_{j}^{*}$ denotes the final prototype based on the initial and augmented data of category $j$; and we chose cosine similarity as $s (\cdot, \cdot)$. Furthermore, learning proceeds by minimizing the negative log-probability:

(10) $L_{cls} = - \log P (y = j | x^{Q}) .$
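The prediction rule of Equation (9) and the loss of Equation (10) can be sketched as follows, with cosine similarity as the score function. The function names and the toy prototypes are assumptions for illustration; no temperature or scaling factor is applied because none is described here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def predict_proba(query_vec, prototypes):
    """Equation (9): softmax over similarities between the query and each prototype."""
    scores = [cosine(c, query_vec) for c in prototypes]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def classification_loss(query_vec, prototypes, true_index):
    """Equation (10): negative log-probability of the ground-truth category."""
    return -math.log(predict_proba(query_vec, prototypes)[true_index])


# Two toy prototypes; the query coincides with the first one.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
probs = predict_proba([1.0, 0.0], prototypes)
loss = classification_loss([1.0, 0.0], prototypes, true_index=0)
```

Because cosine similarity is bounded in [-1, 1], the resulting distribution is never very peaked, which is one reason the spacing between prototypes matters for this classifier.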

Since the prototypical networks predict the label by measuring the distances between query instances and prototypes, a proper distance distribution is critical to improving the intent recognition performance. Therefore, we propose the following two methods to obtain a distance distribution that is as satisfactory as possible.

3.3.2. Prototype-Level Contrastive Learning

Considering that the prototype is computed with all samples of the corresponding category in the current meta task, the prototype can represent the common features of the samples in this category. Meanwhile, considering the metric-based prototypical networks, an intuitive way to improve the classification accuracy is to increase the distances between prototypes of different categories in embedding space.

Therefore, we introduce a contrastive-based loss for prototype-level learning in order to separate prototypes from different categories as much as possible. Specifically, our goal is to make the similarity of the prototype embeddings from different categories as small as possible, which can be formulated as follows:

(11) $L_{pro} = - \mathbb{E} \left[ \log \frac{\exp (s (c_{i}, c_{i}))}{\sum_{i^{'} = 1}^{N} \exp (s (c_{i}, c_{i^{'}}))} \right],$

where $s (\cdot, \cdot)$ is the same similarity metric function as that in Equation (9). Therefore, the value of $s (c_{i}, c_{i})$ is a constant, 1. Furthermore, up to an additive constant that does not affect optimization, Equation (11) can be simplified to the following form:

(12) $L_{pro} = - \mathbb{E} \left[ \log \frac{1}{e + \sum_{i^{'} = 1}^{N} \exp (s (c_{i}, c_{i^{'}}))} \right], \quad (i^{'} \neq i)$

where $e = \exp (1)$ is a constant that comes from the term with $i^{'} = i$. With the prototype-level contrastive loss $L_{pro}$, we expect that the prototypes of different categories can be kept away from each other. However, conducting contrastive learning directly at the prototype level can only keep the mean representations of different classes away from each other. Such a method does not guarantee that samples of the same category are close, so the improvement in intent recognition accuracy is limited.
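A minimal sketch of the simplified prototype-level loss in Equation (12) is given below, taking the expectation as a plain average over the prototypes of one meta task (an assumption) and using cosine similarity as the score function. The function names and toy prototypes are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def prototype_contrastive_loss(prototypes):
    """Equation (12): -log(1 / (e + sum over other prototypes of exp(similarity)))."""
    losses = []
    for i, c_i in enumerate(prototypes):
        others = sum(
            math.exp(cosine(c_i, c_j)) for j, c_j in enumerate(prototypes) if j != i
        )
        # -log(1 / (e + others)) is the same as log(e + others).
        losses.append(math.log(math.e + others))
    return sum(losses) / len(losses)


separated = prototype_contrastive_loss([[1.0, 0.0], [0.0, 1.0]])  # orthogonal prototypes
collapsed = prototype_contrastive_loss([[1.0, 0.0], [1.0, 0.0]])  # identical prototypes
```

The loss is lower for the orthogonal pair than for the identical pair, which matches the stated goal of keeping the prototypes of different categories apart.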

3.3.3. Instance-Level Contrastive Learning

To further improve intent recognition performance, we introduce instance-level contrastive learning. This strategy not only pushes instances of different categories far away from each other, but also pulls instances of the same category closer together. This can be formulated as follows:

(13) $L_{ins} = - \mathbb{E} \left[ \log \frac{\sum_{k} \exp (s (h_{x_{i, k}^{S}}^{[CLS]}, h_{x_{+}^{S}})) + \sum_{k} \exp (s (h_{x_{i, k}^{S}}^{[MASK]}, h_{x_{+}^{S}}))}{\sum_{i^{'} = 1}^{N} \sum_{k} \exp (s (h_{x_{i, k}^{S}}^{[CLS]}, h_{x_{i^{'}, k}^{S}})) + \sum_{i^{'} = 1}^{N} \sum_{k} \exp (s (h_{x_{i, k}^{S}}^{[MASK]}, h_{x_{i^{'}, k}^{S}}))} \right],$

where $h_{x_{+}^{S}}$ denotes a positive instance for $h_{x_{i, k}^{S}}^{[CLS]}$ and $h_{x_{i, k}^{S}}^{[MASK]}$, i.e., an embedding of an original utterance or an augmented embedding that belongs to the same category as them. By minimizing the loss $L_{ins}$, the similarity between the vector representations of samples in the same category can be increased, while the similarity of sample vectors in different categories can be reduced.

We apply the results of data augmentation to instance-level contrastive learning as well. We expect that this strategy can not only help to obtain a more discriminative representation, but also optimize the effect of data augmentation from another perspective. Specifically, we constructed three types of positive pairs, one consisting only of sentence embeddings in the same class, one consisting of sentence embeddings and data augmentation vectors in the same class and one consisting only of data augmentation results of the same class.

In addition, the contrastive learning strategy itself is also a method of data augmentation. Even though it does not increase the absolute number of instances, it enables interactions between them. Therefore, it highlights the commonalities between samples of the same class and the distinguishing features between samples of different classes. Hence, we can obtain a better distance distribution in the embedding space, with which the performance of the metric-based classification method can be improved.
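The instance-level objective can be illustrated with a supervised contrastive loss over a pool that contains both the sentence embeddings and the augmented embeddings of the support set. This is one plausible reading of Equation (13) rather than the authors' exact implementation: for each anchor, same-category instances form the numerator and all other instances form the denominator, which covers the three types of positive pairs described above. The function names and toy vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def instance_contrastive_loss(embeddings, labels):
    """Average over anchors of -log(sum over positives / sum over all other instances).

    embeddings pools the [CLS] sentence vectors and the [MASK] augmentation vectors;
    labels gives the category of each pooled vector.
    """
    losses = []
    for a, (h_a, y_a) in enumerate(zip(embeddings, labels)):
        positives, everything = 0.0, 0.0
        for b, (h_b, y_b) in enumerate(zip(embeddings, labels)):
            if a == b:
                continue  # an instance is never compared with itself
            weight = math.exp(cosine(h_a, h_b))
            everything += weight
            if y_a == y_b:
                positives += weight
        if positives > 0.0:  # skip anchors that have no positive partner
            losses.append(-math.log(positives / everything))
    return sum(losses) / len(losses) if losses else 0.0


# Four toy vectors forming two tight clusters.
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
well_grouped = instance_contrastive_loss(vectors, [0, 0, 1, 1])  # labels follow the clusters
mixed = instance_contrastive_loss(vectors, [0, 1, 0, 1])         # labels cut across the clusters
```

The loss is smaller when vectors of the same category lie close together, which is the behavior the instance-level strategy is meant to encourage.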

4. Experiments

4.1. Datasets and Metrics

We employed two public intent recognition datasets to evaluate the capabilities of our proposal and the baselines: CLINC-150 [23] and BANKING-77 [24]. CLINC-150 consists of 150 intent categories from 10 daily life domains, with 150 samples in each category. In addition, there are also some intent sentences labeled “out of scope” in the dataset, which are considered as noise with multiple unknown categories. In order to accurately examine the performances of the discussed models, we removed the samples labeled “out of scope” and only leveraged the well-labeled samples for training and testing. BANKING-77 is a single-domain dataset for intent recognition containing 13,083 samples in 77 categories in the banking domain. The statistics of CLINC-150 and BANKING-77 are provided in Table 1.

Moreover, we employed about 2/3 of the categories in each dataset for training to learn the general knowledge, and the remaining 1/3 were equally divided into the validation set and the test set. Following previous work [23,24], we chose recognition accuracy as the evaluation metric in this study.
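The category-level split can be sketched as follows. The function `split_categories`, the rounding rule and the random seed are assumptions for illustration; the exact category counts used in the experiments are those reported in Table 1, not the output of this sketch.

```python
import random


def split_categories(labels, rng):
    """Split intent categories into about 2/3 train and two equal halves of the rest."""
    shuffled = sorted(labels)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * 2 / 3)
    n_val = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# 150 placeholder category names, matching the number of intents in CLINC-150.
categories = [f"intent_{i}" for i in range(150)]
train, val, test = split_categories(categories, random.Random(0))
```

Splitting by category rather than by sample is what keeps the test categories unseen during training.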

4.2. Model Summary

We validated the effectiveness of our proposed model by comparing it with the following competitive baselines:

Prototypical Networks [7]: A metric-based model for few-shot classification which employs the distances between samples in embedding space to measure their similarity. It considers the label of the prototype closest to the query sample as the prediction of its class.
GCN [41,42]: A graph convolutional network-based approach for few-shot classification, which regards few-shot learning as a supervised message passing task that can be trained in an end-to-end way.
Matching Networks [8]: A framework for few-shot classification which trains a network mapping a small labeled support set and unlabeled instances to their labels and avoids depending on fine-tuning to adapt to new categories.

4.3. Research Questions

To validate the effectiveness of our methods, we address the following research questions:

RQ1 Can our proposal beat the competitive few-shot learning baselines for the intent recognition task?
RQ2 Which module in CDA contributes the most to the recognition accuracy?
RQ3 What are the impacts of different templates on model performance?

4.4. Model Configuration

Following the common practice for few-shot learning experiments [7], we discuss two types of meta tasks with different numbers of shots, including “5-way 1-shot” and “5-way 5-shot.” For all discussed models, we applied the same feature extractor (i.e., bert-base-uncased) to encode the input sentences.

In the training phase, stochastic gradient descent (SGD) with a step-decay learning rate was employed to optimize our model. Moreover, we applied early stopping in training when the loss no longer decreased. Each model was trained with 100 iterations, each of which contained 20 episodes. After each training iteration, we sampled 100 meta tasks from the test set to evaluate the model’s performance. Furthermore, the hyperparameter $\alpha$ was searched over the values 0.5, 0.6, 0.7 and 0.8, and was finally fixed to 0.7.

5. Results and Discussion

5.1. Overall Evaluation

To answer RQ1, we examine the intent recognition models’ capabilities for two types of meta tasks on CLINC-150 and BANKING-77. The overall intent recognition performances of all discussed models are shown in Table 2.

Obviously, we can see that all the models performed better for the meta task with a larger shot number regardless of the dataset. This is because as the shot number increases, the number of samples available to the models increases, and the common features obtained from the samples are more similar to the real common features.

Now, we concentrate on the performances of baselines. We can see that MatchNet obtained the highest accuracy among the three discussed baselines on “5-way 1-shot” meta tasks on both datasets, and ProtoNet achieved the best performance on “5-way 5-shot” meta tasks on both datasets. The advantages of MatchNet on the “1-shot” meta task can be explained by the fact that the ad hoc similarity matching computation can well boost the model’s performance. For ProtoNet, the reason why it has the advantage in the “5-shot” meta task is that it can fuse the features of instances in the same category to obtain their commonality.

Next, we focus on the performance of our proposals. Comparing the baseline models and the CDA model, we can see that on CLINC-150, both CDA-PC and CDA-IC outperformed all the discussed baselines. However, on BANKING-77, the performances of CDA-PC in “5-way 1-shot” and “5-way 5-shot” were weaker than those of MatchNet and ProtoNet, respectively. This is because the samples in the same class in CLINC-150 are short sentences and more similar to each other than those in BANKING-77. Thus, the results of data augmentation are similar to the initial input sentences, which helps the model obtain their common features. In contrast, since BANKING-77 is more specialized than CLINC-150, the pre-trained language model has less knowledge related to it. Therefore, directly using the augmented samples to calculate the category prototypes is equivalent to introducing noise, which weakens the characteristics of the category itself and thereby reduces the recognition performance.

Aiming at the problems in the application of CDA-PC, CDA-IC leverages an instance-level contrastive learning strategy to improve the performance of few-shot intent recognition. The advantages of CDA-IC can be explained by the fact that the instance-level contrastive learning strategy places both the initial data and the corresponding augmented data in the same category as positive examples. Such a method enables each sample to interact with more data, which can not only shorten the distance between the initial input data of the same category in space but also make the augmented data semantically similar to the original data of the same category.

The improvements of CDA-IC over the best baseline model in terms of accuracy were 4.36% in the “5-way 1-shot” meta task and 4.91% in the “5-way 5-shot” meta task on the CLINC-150 dataset. Accuracy improvements were 1.69% in the “5-way 1-shot” meta task and 1.86% in the “5-way 5-shot” meta task on the BANKING-77 dataset.

5.2. Ablation Study

To answer RQ2, we analyzed the importance of different modules in our CDA-IC model by removing its two fundamental components one at a time, i.e., the instance-level contrastive learning module and the unsupervised learning module.

The ablation results are shown in Table 3.

Clearly, removing any component of CDA-IC leads to performance degradation, which demonstrates that the unsupervised learning module and the instance-level contrastive learning module are both important for improving the few-shot intent recognition ability. In particular, removing the instance-level contrastive learning module caused the largest performance drop in both types of meta task, regardless of dataset. For instance, the CDA-IC model without the instance-level contrastive learning module showed performance degradations of 3.63% and 3.82% in the “5-way 1-shot” meta task and the “5-way 5-shot” meta task on CLINC-150, respectively. For the BANKING-77 dataset, the CDA-IC model without the instance-level contrastive learning module saw 4.16% and 4.44% performance drops in the “5-way 1-shot” meta task and the “5-way 5-shot” meta task, respectively.

In addition, it is worth noting that each module makes its own contribution. In detail, the performance degradation caused by removing the unsupervised learning module was larger in the “5-way 1-shot” meta task than in the “5-way 5-shot” meta task, which illustrates that this module plays a more obvious role when features are insufficient. Furthermore, the instance-level contrastive learning module plays a more important role in the “5-way 5-shot” meta task than in the “5-way 1-shot” meta task. This phenomenon can be explained by the fact that the bottleneck limiting performance in the 5-shot setting is no longer the lack of features, but the mining of the commonality of the same category and the uniqueness of different categories. The instance-level contrastive learning module addresses exactly this point: it shortens the distance between samples of the same category in the embedding space and increases the distance between embeddings of different categories.

5.3. Impacts of Different Templates

To answer RQ3, we designed three different templates and applied them to the pattern for data augmentation. All types of discussed templates are shown in Table 4.

Since our proposal is based on a pre-trained language model, it needs to utilize templates for the generation of semantically similar data. As different templates use different words and punctuation, i.e., tokens, the semantic vectors obtained from the pre-trained language model are also different. The details are illustrated in Figure 2.

Clearly, different templates do cause obvious changes in model performance. Specifically, in the “5-way 1-shot” meta task performed on the CLINC-150 dataset, the performance difference between different templates was nearly 1%. In addition, as shown in Figure 2b, in the “5-way 5-shot” meta task performed on the BANKING-77 dataset, the performance difference between different templates reached as high as 1.3%.

From the overall trend, the length of the template is not directly related to the effect of data augmentation. In detail, although Template 2 is the shortest one, it does not perform the worst on the CLINC-150 dataset. Its performance on the “5-way 1-shot” meta task is very similar to the performance of Template 3 and better than that of Template 1. It is worth noting that compared to the other two templates, Template 3 performs the best on all tasks on both CLINC-150 and BANKING-77 datasets. This phenomenon can be explained by the fact that Template 3 is the most explicit about the semantic guidance of the [MASK] token. When the original input sentence is placed into the template, Template 3 clearly states that [MASK] represents the intent of the input sentence. Therefore, the generated semantic embedding vector is more directional.
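As an illustration of how an input sentence is placed into a template, the sketch below fills the three templates of Table 4 with an utterance. The template strings follow Table 4; the dictionary, the `{utterance}` placeholder, and the function name `build_cloze_input` are assumptions introduced here, and the step of encoding the result with a pre-trained language model to read off the [MASK] representation is deliberately left out.

```python
# Templates from Table 4, with the blank written as a named placeholder.
# The placeholder name and this dictionary are illustrative assumptions.
TEMPLATES = {
    1: "The sentence: ‘{utterance}’ means [MASK].",
    2: "‘{utterance}’ means [MASK].",
    3: "The intent in ‘{utterance}’ means [MASK].",
}


def build_cloze_input(utterance, template_id):
    """Wrap an utterance in the chosen cloze template."""
    if template_id not in TEMPLATES:
        raise ValueError(f"unknown template id: {template_id}")
    return TEMPLATES[template_id].format(utterance=utterance)


example = "What is my bill for water and electricity?"
for template_id in sorted(TEMPLATES):
    print(build_cloze_input(example, template_id))
```

Only Template 3 names the intent explicitly in the text surrounding [MASK], which is the property the paragraph above credits for its stronger results.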

In summary, the design of the templates has an obvious impact on the performance of data augmentation. A good template can provide appropriate semantic guidance to effectively improve the performance of data augmentation.

6. Conclusions

We proposed a cloze-style data augmentation (CDA) model for few-shot intent recognition. Inspired by the pre-training task of language models, we designed an unsupervised template-based strategy for data augmentation that aims to generate meaningful data without breaking syntactic structure or adding noise. Furthermore, to make full use of the limited data and obtain separable embeddings, we performed contrastive learning between the original data and the augmented data. Thus, each sample can interact with samples of all remaining categories, which separates the embeddings of different categories in the embedding space. Experimental results obtained on the CLINC-150 and BANKING-77 datasets illustrate the effectiveness of the model compared to all discussed baselines. In addition, an ablation study showed that the contrastive learning module is the most important component of the whole model.

In future work, we will study how to design good templates, so as to reduce the impact of template choice on data augmentation. The templates shown in this paper can be called “hard templates”, which are composed of real tokens. Such “hard templates” conform to human language habits but are not necessarily easy for pretrained language models to understand. We envision that if the templates are initialized as sets of trainable vectors rather than real words, they could gradually adapt to the usage scenario during training; such templates could then be called “soft templates”.

Author Contributions

Funding acquisition, C.C.; project administration, H.C.; validation, M.J.; writing—original draft, X.Z.; writing—review and editing, J.Z. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures and Tables

Figure 1. The cloze-style data augmentation model. In contrastive learning, the dashed lines mean moving points away from each other, and solid lines mean moving points closer to each other.

Figure 2. Performances of different templates in “5-way 1-shot” and “5-way 5-shot” meta tasks on CLINC-150 and BANKING-77 datasets.

Table 1

Statistics of CLINC-150 and BANKING-77.

Dataset	# Categories	# Samples	# Domains
CLINC-150	150	22,500	10
BANKING-77	77	13,083	1

Table 2

Overall performance in terms of accuracy (%) and 95% confidence interval on the test set for two types of meta tasks. The results produced by the best performer in each column are boldfaced. The results produced by the best baseline are underlined.

Model	CLINC-150 5-Way 1-Shot	CLINC-150 5-Way 5-Shot	BANKING-77 5-Way 1-Shot	BANKING-77 5-Way 5-Shot
ProtoNet	85.96 ± 0.18	90.46 ± 0.13	68.14 ± 1.22	79.48 ± 0.92
MatchNet	86.63 ± 0.25	89.73 ± 0.60	68.88 ± 1.31	77.60 ± 1.02
GCN	84.96 ± 0.17	90.03 ± 0.64	68.39 ± 1.21	77.66 ± 0.98
CDA-PC (Ours)	88.73 ± 0.51	94.59 ± 0.27	68.50 ± 1.20	77.84 ± 1.07
CDA-IC (Ours)	90.99 ± 0.52	95.37 ± 0.27	70.57 ± 0.76	81.34 ± 0.92

Table 3

Ablation studies of CDA-IC for the 5-way 1-shot and 5-way 5-shot meta tasks on CLINC-150 and BANKING-77. The biggest drop caused by removing a single module in each column is marked with ▾.

Model	CLINC-150 5-Way 1-Shot	CLINC-150 5-Way 5-Shot	BANKING-77 5-Way 1-Shot	BANKING-77 5-Way 5-Shot
CDA-IC w/o unsupervised learning	88.46 ± 0.47 (−2.53%)	94.13 ± 0.51 (−1.24%)	68.81 ± 1.11 (−1.76%)	79.91 ± 0.95 (−1.40%)
CDA-IC w/o contrastive learning	87.36 ± 0.17 ▾ (−3.63%)	91.55 ± 0.35 ▾ (−3.82%)	66.41 ± 1.21 ▾ (−4.16%)	76.90 ± 0.92 ▾ (−4.44%)
CDA-IC w/o both modules	83.00 ± 0.57 (−7.99%)	88.47 ± 0.75 (−6.90%)	66.14 ± 1.32 (−4.43%)	76.41 ± 1.00 (−4.93%)
CDA-IC	90.99 ± 0.52	95.37 ± 0.27	70.57 ± 0.76	81.34 ± 0.92

Table 4

Three types of template for cloze-style data augmentation.

No.	Template
1	The sentence: ‘___’ means [MASK].
2	‘___’ means [MASK].
3	The intent in ‘___’ means [MASK].

References

1. Jolly, S.; Falke, T.; Tirkaz, C.; Sorokin, D. Data-Efficient Paraphrase Generation to Bootstrap Intent Classification and Slot Labeling for New Features in Task-Oriented Dialog Systems. Proceedings of the 28th International Conference on Computational Linguistics: Industry Track; Online, 12 December 2020; pp. 10-20.

2. Zhou, S.; Jia, J.; Wu, Z. Inferring Emotion from Large-scale Internet Voice Data: A Semi-supervised Curriculum Augmentation based Deep Learning Approach. Proceedings of the AAAI Conference on Artificial Intelligence; Virtual, 2–9 February 2021; pp. 6039-6047.

3. Vargas, S.; Castells, P.; Vallet, D. Intent-oriented diversity in recommender systems. Proceedings of the SIGIR’11: 34th International ACM SIGIR conference on research and development in Information Retrieval; Beijing, China, 24–28 July 2011; pp. 1211-1212.

4. Wang, X.; Huang, T.; Wang, D.; Yuan, Y.; Liu, Z.; He, X.; Chua, T. Learning Intents behind Interactions with Knowledge Graph for Recommendation. Proceedings of the WWW ’21: The Web Conference 2021; Ljubljana, Slovenia, 19–23 April 2021; pp. 878-887.

5. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Minneapolis, MN, USA, 2–7 June 2019; pp. 4171-4186.

6. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv; 2019; arXiv: 1907.11692

7. Snell, J.; Swersky, K.; Zemel, R.S. Prototypical Networks for Few-shot Learning. Proceedings of the 31st International Conference on Neural Information Processing Systems; Long Beach, CA, USA, 4–9 December 2017; pp. 4077-4087.

8. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching Networks for One Shot Learning. Proceedings of the 30th International Conference on Neural Information Processing Systems; Barcelona, Spain, 5–10 December 2016; pp. 3630-3638.

9. Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a Few Examples: A Survey on Few-shot Learning. ACM Comput. Surv.; 2020; 53, pp. 63:1-63:34. [DOI: https://dx.doi.org/10.1145/3386252]

10. Zheng, J.; Cai, F.; Chen, H.; de Rijke, M. Pre-train, Interact, Fine-tune: A novel interaction representation for text classification. Inf. Process. Manag.; 2020; 57, 102215. [DOI: https://dx.doi.org/10.1016/j.ipm.2020.102215]

11. Gao, T.; Han, X.; Liu, Z.; Sun, M. Hybrid Attention-Based Prototypical Networks for Noisy Few-Shot Relation Classification. Proceedings of the AAAI Conference on Artificial Intelligence; Honolulu, HI, USA, 28–30 January 2019; pp. 6407-6414.

12. Bansal, T.; Jha, R.; Munkhdalai, T.; McCallum, A. Self-Supervised Meta-Learning for Few-Shot Natural Language Classification Tasks. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020; Online, 16–20 November 2020; pp. 522-534.

13. Zheng, J.; Cai, F.; Chen, W.; Lei, W.; Chen, H. Taxonomy-aware Learning for Few-Shot Event Detection. Proceedings of the 30th Web Conference; Virtual Event/Ljubljana, Slovenia, 19–23 April 2021; pp. 3546-3557.

14. Lai, V.D.; Nguyen, M.V.; Nguyen, T.H.; Dernoncourt, F. Graph Learning Regularization and Transfer Learning for Few-Shot Event Detection. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval; Virtual, 11–15 July 2021; pp. 2172-2176.

15. Zheng, J.; Cai, F.; Chen, H. Incorporating Scenario Knowledge into a Unified Fine-tuning Architecture for Event Representation. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval; Virtual Event, China, 25–30 July 2020; pp. 249-258.

16. Zheng, J.; Cai, F.; Ling, Y.; Chen, H. Heterogeneous Graph Neural Networks to Predict What Happen Next. Proceedings of the 28th International Conference on Computational Linguistics; Barcelona, Spain (Online), 8–13 December 2020; pp. 328-338.

17. Hou, Y.; Lai, Y.; Wu, Y.; Che, W.; Liu, T. Few-shot Learning for Multi-label Intent Detection. Proceedings of the AAAI Conference on Artificial Intelligence; Virtual, 22 February–1 March 2022; AAAI Press: Palo Alto, CA, USA, 2021; pp. 13036-13044.

18. Dopierre, T.; Gravier, C.; Subercaze, J.; Logerais, W. Few-shot Pseudo-Labeling for Intent Detection. Proceedings of the 28th International Conference on Computational Linguistics; Barcelona, Spain, 8–13 December 2020; pp. 4993-5003.

19. Yang, S.; Liu, L.; Xu, M. Free Lunch for Few-shot Learning: Distribution Calibration. Proceedings of the International Conference on Learning Representations; Vienna, Austria, 3–7 May 2021.

20. Abdulmumin, I.; Galadanci, B.S.; Isa, A. Enhanced Back-Translation for Low Resource Neural Machine Translation Using Self-training. Proceedings of the Information and Communication Technology and Applications; Minna, Nigeria, 24–27 November 2020; Volume 1350, pp. 355-371.

21. Goyal, T.; Durrett, G. Neural Syntactic Preordering for Controlled Paraphrase Generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020; Online, 5–10 July 2020; pp. 238-252.

22. Dopierre, T.; Gravier, C.; Logerais, W. ProtAugment: Intent Detection Meta-Learning through Unsupervised Diverse Paraphrasing. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021); Virtual, 1–6 August 2021; pp. 2454-2466.

23. Cavalin, P.R.; Ribeiro, V.H.A.; Appel, A.P.; Pinhanez, C.S. Improving Out-of-Scope Detection in Intent Classification by Using Embeddings of the Word Graph Space of the Classes. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); Online, 16–20 November 2020; pp. 3952-3961.

24. Casanueva, I.; Temčinas, T.; Gerz, D.; Henderson, M.; Vulić, I. Efficient Intent Detection with Dual Sentence Encoders. Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI; Online, 9 July 2020; pp. 38-45. [DOI: https://dx.doi.org/10.18653/v1/2020.nlp4convai-1.5]

25. Abro, W.A.; Qi, G.; Ali, Z.; Feng, Y.; Aamir, M. Multi-turn intent determination and slot filling with neural networks and regular expressions. Knowl. Based Syst.; 2020; 208, 106428. [DOI: https://dx.doi.org/10.1016/j.knosys.2020.106428]

26. Weld, H.; Huang, X.; Long, S.; Poon, J.; Han, S.C. A survey of joint intent detection and slot-filling models in natural language understanding. arXiv; 2021; arXiv: 2101.08091 [DOI: https://dx.doi.org/10.1145/3547138]

27. Sarikaya, R.; Hinton, G.E.; Ramabhadran, B. Deep belief nets for natural language call-routing. Proceedings of the ICASSP; Prague, Czech Republic, 22–27 May 2011; pp. 5680-5683.

28. Liu, B.; Lane, I.R. Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling. Proceedings of the INTERSPEECH; San Francisco, CA, USA, 8–12 September 2016; pp. 685-689.

29. Chen, Q.; Zhuo, Z.; Wang, W. BERT for Joint Intent Classification and Slot Filling. arXiv; 2019; arXiv: 1902.10909

30. Liu, H.; Zhang, X.; Fan, L.; Fu, X.; Li, Q.; Wu, X.; Lam, A.Y.S. Reconstructing Capsule Networks for Zero-shot Intent Classification. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing; Hong Kong, China, 3–7 November 2019; pp. 4798-4808.

31. Dopierre, T.; Gravier, C.; Logerais, W. PROTAUGMENT: Unsupervised diverse short-texts paraphrasing for intent detection meta-learning. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Online, 1–6 August 2021.

32. Zhang, X.; Cai, F.; Hu, X.; Zheng, J.; Chen, H. A Contrastive learning-based Task Adaptation model for few-shot intent recognition. Inf. Process. Manag.; 2022; 59, 102863. [DOI: https://dx.doi.org/10.1016/j.ipm.2021.102863]

33. Zhang, J.; Hashimoto, K.; Liu, W.; Wu, C.; Wan, Y.; Yu, P.S.; Socher, R.; Xiong, C. Discriminative Nearest Neighbor Few-Shot Intent Detection by Transferring Natural Language Inference. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020; Online, 16–20 November 2020; pp. 5064-5082.

34. Liu, Z.; Fan, Z.; Wang, Y.; Yu, P.S. Augmenting Sequential Recommendation with Pseudo-Prior Items via Reversely Pre-training Transformer. Proceedings of the SIGIR: 44th International ACM SIGIR Conference on Research and Development in Information Retrieval; Virtual, 11–15 July 2021; pp. 1608-1612.

35. Wei, J.W.; Zou, K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. Proceedings of the EMNLP/IJCNLP (1): Association for Computational Linguistics; Hong Kong, China, 3–7 November 2019; pp. 6381-6387.

36. Yu, A.W.; Dohan, D.; Luong, M.; Zhao, R.; Chen, K.; Norouzi, M.; Le, Q.V. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. Proceedings of the ICLR (Poster); Vancouver, BC, Canada, 30 April–3 May 2018.

37. Xie, Z.; Wang, S.I.; Li, J.; Lévy, D.; Nie, A.; Jurafsky, D.; Ng, A.Y. Data Noising as Smoothing in Neural Network Language Models. Proceedings of the ICLR (Poster); Toulon, France, 24–26 April 2017.

38. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the ACL: Association for Computational Linguistics; Online, 5–10 July 2020; pp. 7871-7880.

39. Schick, T.; Schütze, H. Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference. Proceedings of the EACL: Association for Computational Linguistics; Online, 19–23 April 2021; pp. 255-269.

40. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res.; 2005; 6, pp. 1705-1749.

41. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. Proceedings of the ICLR (Poster); Toulon, France, 24–26 April 2017.

42. Satorras, V.G.; Estrach, J.B. Few-Shot Learning with Graph Neural Networks. Proceedings of the ICLR (Poster); Vancouver, BC, Canada, 30 April–3 May 2018.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract

Intent recognition aims to identify users’ potential intents from their utterances and is a key component of task-oriented dialog systems. A real challenge, however, is that the number of intent categories has grown faster than human-annotated data, resulting in only a small amount of data being available for many new intent categories. This lack of data leads traditional deep neural networks to overfit on a small amount of training data, which seriously affects practical applications. Hence, some researchers have proposed few-shot learning to address the data-scarcity issue. One efficient approach is text augmentation, which, however, often generates noisy or meaningless data. To address these issues, we propose leveraging the knowledge in pre-trained language models and construct the cloze-style data augmentation (CDA) model. We employ unsupervised learning to force the augmented data to be semantically similar to the initial input sentences, and contrastive learning to enhance the uniqueness of each category. Experimental results on the CLINC-150 and BANKING-77 datasets show the effectiveness of our proposal, which outperforms competitive baselines. In addition, we conducted an ablation study to verify the function of each module in our models, and the results illustrate that the contrastive learning module plays the most important role in improving recognition accuracy.

Details

Title: Cloze-Style Data Augmentation for Few-Shot Intent Recognition
Author: Zhang, Xin; Jiang, Miao; Chen, Honghui; Chen, Chonghao; Zheng, Jianming
First page: 3358
Publication year: 2022
Publication date: 2022
Publisher: MDPI AG
e-ISSN: 22277390
Source type: Scholarly Journal
Language of publication: English
DOI: https://doi.org/10.3390/math10183358
ProQuest document ID: 2716571456
