1. Introduction
Intent recognition is a branch of text classification in natural language understanding (NLU), focusing on identifying users’ potential purposes from their utterances. For instance, it could recognize the intention “bill balance” from the utterance “What is my bill for water and electricity?”. Intent recognition plays an important role in various downstream tasks, such as dialog systems [1,2] and recommender systems [3,4].
However, in real-world applications, new intent categories emerge rapidly and come with only limited well-labeled data, which makes it difficult to use them directly to optimize existing deep neural networks. These networks typically include a pretrained language model, such as BERT [5] or RoBERTa [6], as the backbone that encodes text into continuous low-dimensional vectors. Such models have complex, many-layered architectures and therefore a considerable number of parameters. If a small amount of training data is used directly to update the parameters of a deep neural network under the traditional training paradigm, the model captures only local features, which leads to poor generalization and overfitting, i.e., good performance on the training set but poor performance on the test set. To handle this problem, few-shot learning (FSL) strategies were proposed by Snell et al. [7], Vinyals et al. [8] and Wang et al. [9] to help models generalize from limited data. These works regard few-shot intent recognition as a meta-learning problem, in which few-shot scenarios are simulated through a series of small meta tasks. This approach is widely employed in few-shot text classification [10], including relation classification [11,12], event detection [13,14,15,16] and intent detection [17,18].
A cornerstone challenge is that meta-learning-based few-shot learning methods can still easily overfit to the biased distribution caused by the limited training samples [19]. Some researchers have attempted to prevent overfitting via data augmentation. One of the key ideas is back-translation [20]: translating the input text into another language and then back into the original language. Another common approach is to leverage an external knowledge base to obtain expressions that are semantically similar to the original sentence [21,22]. Specifically, Dopierre et al. [22] introduced several knowledge bases to generate diverse paraphrases of the original inputs rather than reordering the tokens. However, although back-translation can generate diverse expressions with the same semantics, it performs poorly on short texts: the generated expressions are often similar or even identical to the original input sentences. As for paraphrase generation, we argue that it is not suitable for text augmentation in all domains, since a corresponding external knowledge base is not always available.
To address the aforementioned issues, we propose a data augmentation approach that is suitable for short texts and requires no external knowledge base. In this work, we treat the pre-trained language model itself as a knowledge base, since it has been trained on a large text corpus and can thus perform some simple tasks. Taking a BERT-derived model as an example, one of its pre-training tasks is masked language modeling (MLM), i.e., predicting the token of the given sentence at the “[MASK]” position.
To verify the effectiveness of our proposal, we conducted comprehensive experiments on the public datasets CLINC-150 [23] and BANKING-77 [24] with two types of meta tasks. The experiment results demonstrate that our proposal achieves obvious improvements over competitive baselines. In addition, the ablation study further verified the effectiveness of both unsupervised and contrastive learning strategies in our proposal.
In summary, the main contributions in this work can be briefly listed as follows:
We only employ the knowledge of the pre-trained language model itself to augment the data, avoiding the dependence on external knowledge.
We use an unsupervised learning strategy to leverage original input samples for meaningful data augmentation.
We apply contrastive learning at different granularity to make full use of the limited amount of available instances in a meta task.
2. Related Work
2.1. Few-Shot Intent Recognition
Intent recognition is an important task in the field of natural language understanding (NLU) [25] and is well studied in multiple applications, such as conversational agents [24] and task-oriented dialogue systems [26]. Various neural architectures have been applied to the intent recognition task. For instance, Sarikaya et al. [27] proposed to leverage recurrent neural networks (RNNs) for the task. In addition, Liu and Lane [28] utilized an attention-based method, and Chen et al. [29] approached the problem with transformer models. As a pre-task of slot filling, intent recognition helps dialogue systems clarify a user’s intention in each turn and then respond by marking the words that carry the intention in the utterance [30].
Since task-oriented dialogue systems are domain-specific, intent recognition is challenging because of the scarcity of well-labeled data and the number of categories involved [18,31]. Although traditional methods for intent recognition obtain satisfactory performance in conventional application scenarios, the process of labeling data can no longer keep up with the speed at which new intent categories emerge, which leads to a data-starved situation for traditional methods [32]. As a result, a considerable number of categories have only a small amount of well-labeled data, which is called the low-resource dilemma. When facing this dilemma, traditional methods easily overfit, showing high accuracy in the training phase but much worse performance in the testing phase. Several existing few-shot learning approaches try to address this problem, mainly from two aspects, i.e., task-adaptive training with pre-trained models and data augmentation. Among the task-adaptive methods, Casanueva et al. [24] leveraged conversational pre-trained models trained on a huge dialogue corpus to tackle few-shot intent detection. Among the data augmentation methods, Zhang et al. [33] proposed a scheme that pre-trains a model on annotated pairs from natural language inference (NLI) datasets and designed a nearest-neighbor classification scheme to adopt transfer learning and classify user intents.
However, the previous data augmentation-related methods, such as the one proposed by Liu et al. [34], are inefficient for training and hard to scale to tasks with lots of intents.
2.2. Text Data Augmentation for Few-Shot Learning
Text data augmentation can be simply regarded as the process of generating a large amount of data from few data. In some cases, the amount of given data of one category is so small that the model trained using them only grasps local features. The model treats these local features as global features for this category, resulting in a poor model. Therefore, we need to generate a series of different derived data based on these limited data, which can cover more features so that the distribution of the new dataset is closer to the real data distribution. Moreover, deep neural networks have numerous parameters, which require substantial training data to make them work normally. However, in reality, the available data are not enough to train a deep neural framework. Hence, data augmentation becomes a good option to address the data scarcity issue.
Currently, there are two mainstream kinds of data augmentation approaches in the natural language processing (NLP) field: raw text-oriented methods and text representation-oriented methods. Among the raw text-oriented methods, Wei and Zou [35] proposed a group of easy data augmentation (EDA) techniques consisting of synonym replacement, random insertion, random swap, and random deletion. However, such approaches may disrupt the syntactic structure and coherence of the original text. In addition, Yu et al. [36] generated new data by translating sentences into French and back into English, which has a high implementation cost relative to the performance gain. Among the text representation-oriented methods, Xie et al. [37] utilized data noising as smoothing to generate new representations in the embedding space. However, these methods do not handle real scenarios well, in which few-shot intent recognition becomes more challenging when there are many fine-grained and, especially, semantically similar intents. To tackle such few-shot learning scenarios, Dopierre et al. [31] proposed PROTAUGMENT, a meta-learning algorithm that leverages the pretrained sequence-to-sequence model BART [38] for the classification of short texts.
We argue that the above methods either destroy the syntactic structure or add meaningless noise to the sentence representations, which is not conducive to the task of intent recognition where intent categories are very similar. Therefore, inspired by Schick and Schütze [39], we propose to generate new data through a template without predicting the real word tokens to avoid introducing meaningless information.
3. Approach
We first describe the task formulation in Section 3.1. Then, we introduce the unsupervised cloze-style data augmentation strategy in Section 3.2. In Section 3.3, we discuss the metric-based contrastive learning. The overall framework of our proposal is shown in Figure 1.
3.1. Task Formulation
Given an m-word utterance $x = \{w_1, w_2, \ldots, w_m\}$, intent recognition is defined as the task of identifying the corresponding intent label $y$, i.e., $x \rightarrow y$.
Since human annotation cannot cover the perpetual emergence of novel intent labels, the amount of well-labeled data cannot meet the requirements of traditional models in the training phase. As a result, traditional models are trapped in the low-resource dilemma: they perform satisfactorily in the training phase but much worse in the testing phase.
Few-shot intent recognition aims at identifying users’ potential intents based on the semantics of their utterances with only limited learnable samples. To address the low-resource dilemma, the few-shot learning paradigm organizes the training phase as a learning process over a great number of meta tasks (or “episodes” [7]). Specifically, a meta task $\mathcal{T}$ usually consists of two parts: a support set $S$ and a query set $Q$, i.e., $\mathcal{T} = (S, Q)$. The support set $S$ strictly abides by the “N-way, K-shot” format, which can be formulated as:
$$S = \{(x_i^k, y_i) \mid i = 1, \ldots, N;\ k = 1, \ldots, K\} \tag{1}$$
and the query set $Q$ is defined in the following format:
$$Q = \{(q_t, y_t) \mid t = 1, \ldots, M\} \tag{2}$$
where the label set in $Q$ must be the same as in $S$, and the instances of the two do not overlap; $M$ is the number of queries. In this way, each meta task can be regarded as an imitation of a few-shot generalization from training to testing.
Few-shot intent recognition models are usually first trained on a series of meta tasks $\{\mathcal{T}_{\mathrm{train}}\}$, then directly tested, without fine-tuning, on another set of meta tasks $\{\mathcal{T}_{\mathrm{test}}\}$ that includes only unseen categories. Here, “unseen” means that the label sets in the training and testing stages do not intersect. The performance of few-shot intent recognition models is measured by their accuracy on the unseen meta tasks in the testing stage.
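The episode construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the sampling code used in the experiments: the function name `sample_episode`, the per-category query count `n_query` and the toy utterances are assumptions introduced here.

```python
import random
from collections import defaultdict


def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot meta task as (support set, query set)."""
    by_label = defaultdict(list)
    for utterance, label in dataset:
        by_label[label].append(utterance)

    # Only categories with enough utterances can fill both sets.
    eligible = sorted(l for l, xs in by_label.items() if len(xs) >= k_shot + n_query)
    labels = rng.sample(eligible, n_way)

    support, query = [], []
    for label in labels:
        picked = rng.sample(by_label[label], k_shot + n_query)
        # The first K utterances form the support set and the rest the query
        # set, so both sets share labels but never share instances.
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Toy data (hypothetical): 8 intents with 10 utterances each.
toy = [(f"utterance {i} of intent {c}", f"intent_{c}") for c in range(8) for i in range(10)]
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=3, rng=random.Random(0))
```

Training and testing episodes would be drawn by calling the same function on disjoint category pools, which is what keeps the test categories unseen.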
3.2. Unsupervised Cloze-Style Data Augmentation
We chose the widely used pre-trained masked language model BERT [5] as the feature extractor in our proposal. It is pre-trained on a large corpus and can be fine-tuned for downstream natural language processing tasks. One of its pre-training tasks is masked language modeling, which simply masks some percentage of the input tokens with a “[MASK]” token and then predicts the masked tokens.
Following the format of the pre-training tasks, we designed an approach for data augmentation based on semantics. Specifically, inspired by Schick and Schütze [39], we introduce an auxiliary cloze-style template T to construct the pattern of data augmentation as follows:
(3)
where x is the input sentence.
In detail, the feature extractor encodes the pattern into hidden vector representations after adding two special tokens, “[CLS]” and “[SEP]”:
(4)
where the hidden vector at the “[MASK]” position is considered as the representation of the masked token.
However, pre-trained language models do not always generate vectors that fully match the semantics of the input sentence. Therefore, we need a method that constrains the model to reduce this mismatch and finally obtain proper data augmentation results. Without introducing any external knowledge or labels, we designed an unsupervised learning method that utilizes the model’s own semantic understanding ability to force it to produce appropriate results as often as possible. We feed the sentence x into the model to obtain its low-dimensional vector representation, which can be formulated as:
(5)
where the hidden vector is chosen as the representation of the whole sentence x.
To constrain the model to produce an appropriate representation of the “[MASK]” token, we use the following unsupervised objective:
(6)
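As a rough illustration of the cloze-style pattern, the sketch below wraps an utterance in one of the templates from Table 4 and scores how well a masked-token vector agrees with a sentence vector. Several parts are assumptions made only for this example: the mask token is appended at the end of each template, the agreement is measured with a cosine distance (the unsupervised objective may take a different form), and the three-dimensional vectors stand in for real encoder outputs.

```python
import math

MASK = "[MASK]"

# Templates from Table 4; placing the mask token at the end is an assumption.
TEMPLATES = {
    1: "The sentence: '{x}' means {mask}",
    2: "'{x}' means {mask}",
    3: "The intent in '{x}' means {mask}",
}


def build_pattern(x, template_id=1):
    """Wrap utterance x in a cloze template that ends with a masked slot."""
    return TEMPLATES[template_id].format(x=x, mask=MASK)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(h_mask, h_sentence):
    """Hypothetical unsupervised objective: 1 - cosine similarity."""
    return 1.0 - cosine(h_mask, h_sentence)


pattern = build_pattern("What is my bill for water and electricity?", template_id=3)
# Toy vectors standing in for the masked-token and sentence representations.
loss = alignment_loss([0.2, 0.9, 0.1], [0.25, 0.8, 0.0])
```

Minimizing such a loss would pull the masked-token vector toward the sentence vector without using any label, which is the role the unsupervised constraint plays in this section.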
3.3. Contrastive Learning at Different Granularities
3.3.1. Metric-Based Prototypical Classifier
After the unsupervised cloze-style data augmentation, following Snell et al. [7], we employ metric-based prototypical networks as classifiers to examine the effect of data augmentation. Prototypical networks first calculate the average representation of the samples in the same category as its prototype:
$$c_i = \frac{1}{n_i} \sum_{k=1}^{n_i} h_k^i \tag{7}$$
where $c_i$ denotes the prototype of category $i$, $n_i$ denotes the number of samples in category $i$ in the support set $S$ of the current meta task and $h_k^i$ denotes the representation of the $k$-th sentence in category $i$. By doing so, the samples in the same category can have the shortest mean distances to their center [7,40]. Similarly, we can obtain the augmented prototype $c_i^{aug}$ with Equation (7) based on the corresponding augmented representations. Moreover, in order to make the final prototype cover the common features of its category more fully, we weight the prototype of the input samples and the prototype of the data augmentation results, which can be formulated as follows:
$$\hat{c}_i = \lambda\, c_i + (1 - \lambda)\, c_i^{aug} \tag{8}$$
where $\lambda$ is a trade-off factor that controls the contributions from the original input data and the augmented data.
Given a score function $s(\cdot,\cdot)$, prototypical networks predict a label for a query instance by calculating the softmax distribution over the similarities between the query embedding vector and the prototypes, as in Equation (9):
$$p(y = j \mid q) = \frac{\exp\left(s(h_q, \hat{c}_j)\right)}{\sum_{i=1}^{N} \exp\left(s(h_q, \hat{c}_i)\right)} \tag{9}$$
where $y$ is the predicted label; $q$ is a query instance in the query set $Q$ of the current meta task, $q \in Q$, and $h_q$ is its embedding; $j$ is the ground-truth label; $\hat{c}_j$ denotes the final prototype of category $j$, based on the initial and augmented data; and we chose cosine similarity as $s(\cdot,\cdot)$. Furthermore, learning proceeds by minimizing the negative log-probability:
$$\mathcal{L}_{ce} = -\log p(y = j \mid q) \tag{10}$$
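The prototype computation and the similarity-based prediction can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the fused prototype is taken to be a convex combination with a trade-off weight named `lam` (0.7 is used only as an illustrative default), and the two-dimensional vectors replace real sentence embeddings.

```python
import math


def mean_vector(vectors):
    """Average a list of equally sized vectors (the class prototype)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fused_prototype(originals, augmented, lam=0.7):
    """Weight the original prototype against the augmented one (assumed form)."""
    c = mean_vector(originals)
    c_aug = mean_vector(augmented)
    return [lam * a + (1.0 - lam) * b for a, b in zip(c, c_aug)]


def predict_proba(h_q, prototypes):
    """Softmax over cosine similarities between a query and every prototype."""
    scores = [cosine(h_q, c) for c in prototypes]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nll(h_q, prototypes, gold):
    """Negative log-probability of the ground-truth category."""
    return -math.log(predict_proba(h_q, prototypes)[gold])


# Toy 2-way episode: two original and two augmented vectors per category.
protos = [
    fused_prototype([[1.0, 0.0], [0.9, 0.1]], [[0.8, 0.2], [1.0, 0.1]]),
    fused_prototype([[0.0, 1.0], [0.1, 0.9]], [[0.2, 0.8], [0.1, 1.0]]),
]
query_vec = [0.9, 0.2]
probs = predict_proba(query_vec, protos)
```

Because cosine similarity is bounded in [-1, 1], the resulting distribution is fairly soft; implementations often divide the scores by a temperature, which is left out here to stay close to the description in the text.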
Since the prototypical networks predict the label by measuring the distances between query instances and prototypes, a proper distance distribution is critical to improving intent recognition performance. Therefore, we propose the following two methods to obtain a distance distribution that is as satisfactory as possible.
3.3.2. Prototype-Level Contrastive Learning
Considering that the prototype is computed with all samples of the corresponding category in the current meta task, the prototype can represent the common features of the samples in this category. Meanwhile, considering the metric-based prototypical networks, an intuitive way to improve the classification accuracy is to increase the distances between prototypes of different categories in embedding space.
Therefore, we introduce a contrastive-based loss for prototype-level learning in order to separate prototypes from different categories as much as possible. Specifically, our goal is to make the similarity of the prototype embeddings from different categories as small as possible, which can be formulated as follows:
(11)
where the similarity metric function is the same as that in Equation (9), i.e., cosine similarity. Therefore, the similarity between a prototype and itself is a constant, 1. Furthermore, Equation (11) can be simplified to the following form:
(12)
where the self-similarity term is a constant. With the prototype-level contrastive loss, we expect the prototypes of different categories to be kept away from each other. However, conducting contrastive learning directly at the prototype level only keeps the mean representations of different classes apart. It does not guarantee that samples of the same category are close to each other, so the accuracy of intent recognition is not improved enough.
3.3.3. Instance-Level Contrastive Learning
To further improve intent recognition performance, we introduce instance-level contrastive learning. This strategy not only pushes instances of different categories away from each other, but also pulls instances of the same category closer together. This can be formulated as follows:
(13)
where the positive instances of a sentence embedding or an augmented embedding are the embeddings, of original utterances and of augmented data alike, that belong to the same category. By minimizing this loss, the similarity between the vector representations of samples in the same category is increased, while the similarity between sample vectors of different categories is reduced.
We apply the results of data augmentation to instance-level contrastive learning as well. We expect that this strategy not only helps to obtain more discriminative representations, but also improves the effect of data augmentation from another perspective. Specifically, we constructed three types of positive pairs: pairs of sentence embeddings from the same class, pairs of a sentence embedding and a data augmentation vector from the same class, and pairs of data augmentation vectors from the same class.
In addition, the contrastive learning strategy itself is also a method of data augmentation. Even though it does not increase the absolute number of instances, it enables interactions between them. Therefore, it highlights the commonalities between samples of the same class and the distinguishing features between samples of different classes. Hence, we can obtain a better distance distribution in the embedding space, with which the performance of the metric-based classification method can be improved.
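The two granularities of contrastive learning discussed in Sections 3.3.2 and 3.3.3 can be illustrated with the sketch below. The exact loss definitions are not reproduced here; the code assumes an InfoNCE-style loss at the prototype level (each prototype is its own positive, with self-similarity 1 under cosine similarity) and a supervised contrastive loss at the instance level, where original and augmented embeddings of the same class are all positives. The function names, the `temperature` parameter and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prototype_contrastive_loss(prototypes, temperature=1.0):
    """InfoNCE-style loss in which each prototype's only positive is itself."""
    total = 0.0
    for c_i in prototypes:
        sims = [cosine(c_i, c_j) / temperature for c_j in prototypes]
        self_sim = cosine(c_i, c_i) / temperature  # equals 1 / temperature
        total += math.log(sum(math.exp(s) for s in sims)) - self_sim
    return total / len(prototypes)


def instance_contrastive_loss(embeddings, labels, temperature=1.0):
    """Supervised contrastive loss over original and augmented embeddings."""
    total, anchors = 0.0, 0
    for i, h_i in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        sims = {j: cosine(h_i, embeddings[j]) / temperature for j in others}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += sum(log_denom - sims[j] for j in positives) / len(positives)
        anchors += 1
    return total / max(anchors, 1)


# Toy embeddings: two tight clusters in 2-D.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_clustered = instance_contrastive_loss(points, [0, 0, 1, 1])
loss_scrambled = instance_contrastive_loss(points, [0, 1, 0, 1])
loss_far = prototype_contrastive_loss([[1.0, 0.0], [0.0, 1.0]])
loss_near = prototype_contrastive_loss([[1.0, 0.0], [0.9, 0.1]])
```

In this toy setting the instance-level loss is lower when same-class embeddings are clustered than when the labels are scrambled, and the prototype-level loss is lower for well-separated prototypes than for nearly identical ones, which matches the behaviour the two objectives are meant to encourage.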
4. Experiments
4.1. Datasets and Metrics
We employed two public intent recognition datasets to evaluate the capabilities of our proposal and the baselines: CLINC-150 [23] and BANKING-77 [24]. CLINC-150 consists of 150 intent categories from 10 daily life domains, with 150 samples in each category. In addition, the dataset contains some intent sentences labeled “out of scope”, which are considered as noise with multiple unknown categories. In order to accurately examine the performances of the discussed models, we removed the samples labeled “out of scope” and only leveraged the well-labeled samples for training and testing. BANKING-77 is a single-domain dataset for intent recognition containing samples in 77 categories in the banking domain. The statistics of CLINC-150 and BANKING-77 are provided in Table 1.
Moreover, we employed about 2/3 of the categories in each dataset for training to learn the general knowledge, and the remaining 1/3 were equally divided into the validation set and the test set. Following previous work [23,24], we chose recognition accuracy as the evaluation metric in this study.
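A category-level split of this kind can be sketched as below. The helper `split_categories`, the rounding rule and the placeholder category names are assumptions for illustration; the exact category counts used in the experiments are not stated in the text.

```python
import random


def split_categories(categories, rng):
    """Split intent categories into disjoint train/validation/test pools."""
    cats = sorted(categories)
    rng.shuffle(cats)
    n_train = round(len(cats) * 2 / 3)    # about two thirds for training
    n_val = (len(cats) - n_train) // 2    # remainder halved
    train = cats[:n_train]
    val = cats[n_train:n_train + n_val]
    test = cats[n_train + n_val:]
    return train, val, test


rng = random.Random(0)
clinc_split = split_categories([f"clinc_intent_{i}" for i in range(150)], rng)
banking_split = split_categories([f"banking_intent_{i}" for i in range(77)], rng)
```

With this rounding rule, 150 categories are divided into 100/25/25 and 77 categories into 51/13/13, and no category appears in more than one pool.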
4.2. Model Summary
We validated the effectiveness of our proposed model by comparing it with the following competitive baselines:
Prototypical Networks [7]: A metric-based model for few-shot classification which employs the distances between samples in embedding space to measure their similarity. It considers the label of the prototype closest to the query sample as the prediction of its class.
GCN [41,42]: A graph convolutional network-based approach for few-shot classification, which regards few-shot learning as a supervised message-passing task that can be trained end to end.
Matching Networks [8]: A framework for few-shot classification which trains a network mapping a small labeled support set and unlabeled instances to their labels and avoids depending on fine-tuning to adapt to new categories.
4.3. Research Questions
To validate the effectiveness of our methods, we address the following research questions:
RQ1 Can our proposal beat the competitive few-shot learning baselines for the intent recognition task?
RQ2 Which module in CDA contributes the most to the recognition accuracy?
RQ3 What are the impacts of different templates on model performance?
4.4. Model Configuration
Following the common practice for few-shot learning experiments [7], we discuss two types of meta tasks with different numbers of shots, including “5-way 1-shot” and “5-way 5-shot.” For all discussed models, we applied the same feature extractor (i.e., bert-base-uncased) to encode the input sentences.
In the training phase, stochastic gradient descent (SGD) with a step-decay learning rate was employed to optimize our model. Moreover, we applied early stopping when the loss no longer decreased. Each model was trained for 100 iterations, each of which contained 20 episodes. After each training iteration, we sampled 100 meta tasks from the test set to evaluate the model’s performance. Furthermore, the hyperparameter was searched over 0.5, 0.6, 0.7 and 0.8, and was finally fixed to 0.7.
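For readers unfamiliar with step decay, the following sketch shows the shape of such a schedule together with the episode budget implied by the configuration above. Only the counts (100 iterations, 20 episodes per iteration) come from the text; the base learning rate, decay factor and step size are hypothetical values chosen for illustration.

```python
N_ITERATIONS = 100            # from the configuration above
EPISODES_PER_ITERATION = 20   # from the configuration above


def step_decay_lr(iteration, base_lr=0.1, gamma=0.5, step_size=20):
    """Multiply the learning rate by gamma every step_size iterations.

    base_lr, gamma and step_size are illustrative assumptions.
    """
    return base_lr * gamma ** (iteration // step_size)


schedule = [step_decay_lr(it) for it in range(N_ITERATIONS)]
total_training_episodes = N_ITERATIONS * EPISODES_PER_ITERATION
```

Under these assumed values the learning rate is halved four times over the run, and a model that is not stopped early sees 2000 training episodes in total.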
5. Results and Discussion
5.1. Overall Evaluation
To answer RQ1, we examine the intent recognition models’ capabilities for two types of meta tasks on CLINC-150 and BANKING-77. The overall intent recognition performances of all discussed models are shown in Table 2.
All the models performed better on the meta task with the larger shot number, regardless of the dataset. This is because, as the shot number increases, more samples are available to the models, and the common features obtained from the samples come closer to the real common features.
Now, we concentrate on the performances of baselines. We can see that MatchNet obtained the highest accuracy among the three discussed baselines on “5-way 1-shot” meta tasks on both datasets, and ProtoNet achieved the best performance on “5-way 5-shot” meta tasks on both datasets. The advantages of MatchNet on the “1-shot” meta task can be explained by the fact that the ad hoc similarity matching computation can well boost the model’s performance. For ProtoNet, the reason why it has the advantage in the “5-shot” meta task is that it can fuse the features of instances in the same category to obtain their commonality.
Next, we focus on the performance of our proposals. Comparing the baseline models and the CDA models, we can see that on CLINC-150, both CDA-PC and CDA-IC outperformed all the discussed baselines. On BANKING-77, however, the performances of CDA-PC in “5-way 1-shot” and “5-way 5-shot” were weaker than those of MatchNet and ProtoNet, respectively. This is because the samples of the same class in CLINC-150 are short sentences and are more similar to each other than those in BANKING-77. Thus, the results of data augmentation are similar to the initial input sentences, which helps the model obtain their common features. In contrast, since BANKING-77 is more specialized than CLINC-150, the pre-trained language model has less knowledge related to it. Therefore, directly using the augmented samples to calculate the category prototypes is equivalent to introducing noise, which weakens the characteristics of the category itself and thereby reduces the recognition performance.
Aiming at the problems in the application of CDA-PC, CDA-IC leverages an instance-level contrastive learning strategy to improve the performance of few-shot intent recognition. The advantages of CDA-IC can be explained by the fact that the instance-level contrastive learning strategy places both the initial data and the corresponding augmented data in the same category as positive examples. Such a method enables each sample to interact with more data, which can not only shorten the distance between the initial input data of the same category in space but also make the augmented data semantically similar to the original data of the same category.
According to Table 2, the absolute accuracy improvements of CDA-IC over the best baseline model were 4.36 percentage points in the “5-way 1-shot” meta task and 4.91 percentage points in the “5-way 5-shot” meta task on the CLINC-150 dataset. On the BANKING-77 dataset, the absolute improvements were 1.69 percentage points in the “5-way 1-shot” meta task and 1.86 percentage points in the “5-way 5-shot” meta task.
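The gaps between CDA-IC and the strongest baseline can be recomputed directly from the mean accuracies in Table 2. The short script below does this as absolute differences in percentage points; the variable names are introduced here for illustration.

```python
SETTINGS = [
    "CLINC-150 5-way 1-shot",
    "CLINC-150 5-way 5-shot",
    "BANKING-77 5-way 1-shot",
    "BANKING-77 5-way 5-shot",
]

# Mean accuracies (%) copied from Table 2, in the order of SETTINGS.
BASELINES = {
    "ProtoNet": [85.96, 90.46, 68.14, 79.48],
    "MatchNet": [86.63, 89.73, 68.88, 77.60],
    "GCN": [84.96, 90.03, 68.39, 77.66],
}
CDA_IC = [90.99, 95.37, 70.57, 81.34]

gains = {}
for col, setting in enumerate(SETTINGS):
    # Pick the strongest baseline separately for each setting.
    best_name = max(BASELINES, key=lambda name: BASELINES[name][col])
    gains[setting] = (best_name, round(CDA_IC[col] - BASELINES[best_name][col], 2))
```

The strongest baseline is MatchNet in both 1-shot settings and ProtoNet in both 5-shot settings, and the gains on CLINC-150 are more than twice those on BANKING-77.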
5.2. Ablation Study
To answer RQ2, we analyzed the importance of different modules in our CDA-IC model by separately removing its two fundamental components, i.e., the instance-level contrastive learning module and the unsupervised learning module.
The ablation results are shown in Table 3.
Clearly, removing any component of CDA-IC leads to performance degradation, which demonstrates that the unsupervised learning module and the instance-level contrastive learning module are both important for improving the few-shot intent recognition ability. In particular, removing the instance-level contrastive learning module caused the largest performance drop of a single module in both types of meta task, regardless of the dataset. For instance, on CLINC-150, the CDA-IC model without the instance-level contrastive learning module showed absolute accuracy drops of 3.63 and 3.82 percentage points in the “5-way 1-shot” and “5-way 5-shot” meta tasks, respectively (Table 3). On BANKING-77, the corresponding drops were 4.16 and 4.44 percentage points.
In addition, it is worth noting that each module has its unique contribution. In detail, the performance degradation caused by removing the unsupervised learning module in the “5-way 1-shot” meta task was more than that in the “5-way 5-shot” meta task, which illustrates that it plays a more obvious role in the case of insufficient features and is more conducive to improving the performance of few-shot intent recognition. Furthermore, the instance-level contrastive learning module plays a more important role in the “5-way 5-shot” meta task than in the “5-way 1-shot” meta task. This phenomenon can be explained by the fact that the bottleneck limiting performance in this position is no longer the lack of features, but the mining of commonality of the same category and uniqueness of different categories. The instance-level contrastive learning module can not only shorten the distance between the samples of the same category in the embedding space, but also increase the distance between embeddings of different categories, i.e., mining the commonality of the same category and the uniqueness of different categories.
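The per-module comparisons made in this ablation discussion can be checked against the mean accuracies in Table 3. The sketch below computes absolute drops relative to the full model; the dictionary keys are shorthand introduced here.

```python
# Mean accuracies (%) from Table 3, ordered as
# [CLINC 1-shot, CLINC 5-shot, BANKING 1-shot, BANKING 5-shot].
FULL = [90.99, 95.37, 70.57, 81.34]
ABLATED = {
    "w/o unsupervised": [88.46, 94.13, 68.81, 79.91],
    "w/o contrastive": [87.36, 91.55, 66.41, 76.90],
}

drops = {
    name: [round(full - acc, 2) for full, acc in zip(FULL, accs)]
    for name, accs in ABLATED.items()
}

# Removing the unsupervised module hurts more with 1 shot than with 5 shots.
unsup_hurts_more_in_1shot = (
    drops["w/o unsupervised"][0] > drops["w/o unsupervised"][1]
    and drops["w/o unsupervised"][2] > drops["w/o unsupervised"][3]
)
# Removing the contrastive module hurts more with 5 shots than with 1 shot.
contrastive_hurts_more_in_5shot = (
    drops["w/o contrastive"][1] > drops["w/o contrastive"][0]
    and drops["w/o contrastive"][3] > drops["w/o contrastive"][2]
)
```

Both flags evaluate to true on these numbers, so the table is consistent with the two trends described above.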
5.3. Impacts of Different Templates
To answer RQ3, we designed three different templates and applied them to the pattern for data augmentation. All types of discussed templates are shown in Table 4.
Since our proposal is based on a pre-trained language model, it needs to utilize templates for the generation of semantically similar data. As different templates use different words and punctuation, i.e., tokens, the semantic vectors obtained from the pre-trained language model are also different. The details are illustrated in Figure 2.
Clearly, different templates do cause obvious changes in model performance. Specifically, in the “5-way 1-shot” meta task performed on the CLINC-150 dataset, the performance difference between different templates was nearly . In addition, as shown in Figure 2b, in the “5-way 5-shot” meta task performed on the BANKING-77 dataset, the performance difference between different templates reached as high as .
From the overall trend, the length of the template is not directly related to the effect of data augmentation. In detail, although Template 2 is the shortest one, it does not perform the worst on the CLINC-150 dataset. Its performance on the “5-way 1-shot” meta task is very similar to that of Template 3 and better than that of Template 1. It is worth noting that, compared to the other two templates, Template 3 performs the best on all tasks on both the CLINC-150 and BANKING-77 datasets. This phenomenon can be explained by the fact that Template 3 is the most explicit about the semantic guidance of the “[MASK]” token.
In summary, the design of the templates has an obvious impact on the performance of data augmentation. A good template can provide appropriate semantic guidance to effectively improve the performance of data augmentation.
6. Conclusions
We proposed a cloze-style data augmentation (CDA) model for few-shot intent recognition. Inspired by the pre-training task of language models, we designed an unsupervised template-based strategy for data augmentation that aims to generate meaningful data without breaking syntactic structure or adding noise. Furthermore, to make full use of the limited data and obtain separable embeddings, we perform contrastive learning between the original data and the augmented data. Thus, each sample can interact with samples of all remaining categories, thereby distinguishing the embeddings of different categories in the embedding space. Experimental results obtained on the CLINC-150 and BANKING-77 datasets illustrate its effectiveness compared to all discussed baselines. In addition, an extensive ablation study showed that the contrastive module is the most important component in the whole model.
In future work, we will study how to design good templates so as to reduce the impact of the template choice on data augmentation. The templates shown in this paper can be called “hard templates”, which are composed of real tokens. Such “hard templates” conform to human language habits but are not necessarily easy for pretrained language models to understand. We envision that if the templates are initialized as sets of trainable vectors rather than real words, they could gradually adapt to the usage scenario during training; such templates could then be called “soft templates”.
Funding acquisition, C.C.; project administration, H.C.; validation, M.J.; writing—original draft, X.Z.; writing—review and editing, J.Z. All authors have read and agreed to the published version of the manuscript.
The data presented in this study are available on request from the corresponding author.
The authors declare that they have no known competing financial interest or personal relationships that could have appeared to influence the work reported in this paper.
Figure 1. The cloze-style data augmentation model. In contrastive learning, the dashed lines mean moving points away from each other, and solid lines mean moving points closer to each other.
Figure 2. Performances of different templates in “5-way 1-shot” and “5-way 5-shot” meta tasks on CLINC-150 and BANKING-77 datasets.
Table 1. Statistics of CLINC-150 and BANKING-77.
| Dataset | # Categories | # Samples | # Domains |
|---|---|---|---|
| CLINC-150 | 150 | 22,500 | 10 |
| BANKING-77 | 77 | 13,083 | 1 |
Table 2. Overall performance in terms of accuracy (%); entries are given as value ± margin.
| Model | CLINC-150, 5-way 1-shot | CLINC-150, 5-way 5-shot | BANKING-77, 5-way 1-shot | BANKING-77, 5-way 5-shot |
|---|---|---|---|---|
| ProtoNet | 85.96 ± 0.18 | 90.46 ± 0.13 | 68.14 ± 1.22 | 79.48 ± 0.92 |
| MatchNet | 86.63 ± 0.25 | 89.73 ± 0.60 | 68.88 ± 1.31 | 77.60 ± 1.02 |
| GCN | 84.96 ± 0.17 | 90.03 ± 0.64 | 68.39 ± 1.21 | 77.66 ± 0.98 |
| CDA-PC (Ours) | 88.73 ± 0.51 | 94.59 ± 0.27 | 68.50 ± 1.20 | 77.84 ± 1.07 |
| CDA-IC (Ours) | 90.99 ± 0.52 | 95.37 ± 0.27 | 70.57 ± 0.76 | 81.34 ± 0.92 |
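The “5-way 1-shot” and “5-way 5-shot” settings refer to meta tasks (episodes) that contain five intent categories with one or five labeled support examples each. The sketch below shows one common way such an episode could be sampled. The function name `sample_episode`, the `n_query` parameter, and the data layout are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import defaultdict


def sample_episode(samples, n_way=5, k_shot=1, n_query=5, seed=None):
    """Sample one N-way K-shot episode from (utterance, intent_label) pairs.

    Returns (support, query), two lists of (utterance, intent_label) pairs.
    `n_query` (query examples per class) is an illustrative assumption.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)

    # Only classes with enough utterances can fill both support and query sets.
    eligible = sorted(l for l, texts in by_label.items()
                      if len(texts) >= k_shot + n_query)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with sufficient samples")

    support, query = [], []
    for label in rng.sample(eligible, n_way):
        picked = rng.sample(by_label[label], k_shot + n_query)
        support += [(t, label) for t in picked[:k_shot]]
        query += [(t, label) for t in picked[k_shot:]]
    return support, query
```

With `n_way=5` and `k_shot=1`, the support set holds five utterances in total, which illustrates how little supervision the model receives per episode.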
Ablation studies of CDA-IC for the 5-way 1-shot and 5-way 5-shot meta tasks on CLINC-150 and BANKING-77. In each column, the largest drop caused by removing a single module is marked with ▾.
| Model | CLINC-150 (5-Way 1-Shot) | CLINC-150 (5-Way 5-Shot) | BANKING-77 (5-Way 1-Shot) | BANKING-77 (5-Way 5-Shot) |
|---|---|---|---|---|
| CDA-IC w/o unsupervised learning | 88.46 ± 0.47 | 94.13 ± 0.51 | 68.81 ± 1.11 | 79.91 ± 0.95 |
| CDA-IC w/o contrastive learning | 87.36 ± 0.17 ▾ | 91.55 ± 0.35 ▾ | 66.41 ± 1.21 ▾ | 76.90 ± 0.92 ▾ |
| CDA-IC w/o both modules | 83.00 ± 0.57 | 88.47 ± 0.75 | 66.14 ± 1.32 | 76.41 ± 1.00 |
| CDA-IC | 90.99 ± 0.52 | 95.37 ± 0.27 | 70.57 ± 0.76 | 81.34 ± 0.92 |
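The ▾ markers can be checked with simple arithmetic: subtract each ablated variant's accuracy from that of the full CDA-IC model and compare the two single-module rows. The short script below does this using the mean accuracies from the ablation table; the variable and function names are illustrative only.

```python
# Mean accuracy (%) copied from the ablation table; column order is fixed below.
COLUMNS = ["CLINC-150 1-shot", "CLINC-150 5-shot",
           "BANKING-77 1-shot", "BANKING-77 5-shot"]
FULL_MODEL = [90.99, 95.37, 70.57, 81.34]
ABLATIONS = {
    "w/o unsupervised learning": [88.46, 94.13, 68.81, 79.91],
    "w/o contrastive learning": [87.36, 91.55, 66.41, 76.90],
    "w/o both modules": [83.00, 88.47, 66.14, 76.41],
}
SINGLE_MODULE = ["w/o unsupervised learning", "w/o contrastive learning"]


def accuracy_drops():
    """Accuracy drop (percentage points) of each variant versus the full model."""
    return {name: [round(full - acc, 2) for full, acc in zip(FULL_MODEL, accs)]
            for name, accs in ABLATIONS.items()}


def largest_single_module_drop():
    """For each column, the single-module ablation with the largest drop."""
    drops = accuracy_drops()
    result = {}
    for i, column in enumerate(COLUMNS):
        name = max(SINGLE_MODULE, key=lambda n: drops[n][i])
        result[column] = (name, drops[name][i])
    return result
```

Removing contrastive learning costs 3.63, 3.82, 4.16 and 4.44 points across the four columns, against 2.53, 1.24, 1.76 and 1.43 points for removing unsupervised learning, which agrees with the ▾ markers.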
Three types of template for cloze-style data augmentation.
| No. | Template |
|---|---|
| 1 | The sentence: ‘___’ means |
| 2 | ‘___’ means |
| 3 | The intent in ‘___’ means |
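As a rough illustration of how these hard templates could wrap an utterance, the sketch below fills the blank with the input sentence and appends a mask placeholder for the language model to complete. The `[MASK]` token, its position at the end of the prompt, and the function name `build_cloze_prompt` are assumptions for illustration; this section does not specify them.

```python
# The three templates from the table, with "{}" standing in for the blank.
TEMPLATES = {
    1: "The sentence: ‘{}’ means",
    2: "‘{}’ means",
    3: "The intent in ‘{}’ means",
}

MASK_TOKEN = "[MASK]"  # assumption: a BERT-style mask token


def build_cloze_prompt(utterance, template_id=1, mask_token=MASK_TOKEN):
    """Insert the utterance into a template and append the mask placeholder."""
    return f"{TEMPLATES[template_id].format(utterance)} {mask_token}"
```

For example, `build_cloze_prompt("What is my bill for water and electricity?", 3)` produces a prompt that ends in `means [MASK]`, leaving the masked position for a pretrained language model to fill.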
References
1. Jolly, S.; Falke, T.; Tirkaz, C.; Sorokin, D. Data-Efficient Paraphrase Generation to Bootstrap Intent Classification and Slot Labeling for New Features in Task-Oriented Dialog Systems. Proceedings of the 28th International Conference on Computational Linguistics: Industry Track; Online, 12 December 2020; pp. 10-20.
2. Zhou, S.; Jia, J.; Wu, Z. Inferring Emotion from Large-scale Internet Voice Data: A Semi-supervised Curriculum Augmentation based Deep Learning Approach. Proceedings of the AAAI Conference on Artificial Intelligence; Virtual, 2–9 February 2021; pp. 6039-6047.
3. Vargas, S.; Castells, P.; Vallet, D. Intent-oriented diversity in recommender systems. Proceedings of the SIGIR’11: 34th International ACM SIGIR conference on research and development in Information Retrieval; Beijing, China, 24–28 July 2011; pp. 1211-1212.
4. Wang, X.; Huang, T.; Wang, D.; Yuan, Y.; Liu, Z.; He, X.; Chua, T. Learning Intents behind Interactions with Knowledge Graph for Recommendation. Proceedings of the WWW ’21: The Web Conference 2021; Ljubljana, Slovenia, 19–23 April 2021; pp. 878-887.
5. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Minneapolis, MN, USA, 2–7 June 2019; pp. 4171-4186.
6. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv; 2019; arXiv: 1907.11692.
7. Snell, J.; Swersky, K.; Zemel, R.S. Prototypical Networks for Few-shot Learning. Proceedings of the 31st International Conference on Neural Information Processing Systems; Long Beach, CA, USA, 4–9 December 2017; pp. 4077-4087.
8. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching Networks for One Shot Learning. Proceedings of the 30th International Conference on Neural Information Processing Systems; Barcelona, Spain, 5–10 December 2016; pp. 3630-3638.
9. Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a Few Examples: A Survey on Few-shot Learning. ACM Comput. Surv.; 2020; 53, pp. 63:1-63:34. [DOI: https://dx.doi.org/10.1145/3386252]
10. Zheng, J.; Cai, F.; Chen, H.; de Rijke, M. Pre-train, Interact, Fine-tune: A novel interaction representation for text classification. Inf. Process. Manag.; 2020; 57, 102215. [DOI: https://dx.doi.org/10.1016/j.ipm.2020.102215]
11. Gao, T.; Han, X.; Liu, Z.; Sun, M. Hybrid Attention-Based Prototypical Networks for Noisy Few-Shot Relation Classification. Proceedings of the AAAI Conference on Artificial Intelligence; Honolulu, HI, USA, 28–30 January 2019; pp. 6407-6414.
12. Bansal, T.; Jha, R.; Munkhdalai, T.; McCallum, A. Self-Supervised Meta-Learning for Few-Shot Natural Language Classification Tasks. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020; Online, 16–20 November 2020; pp. 522-534.
13. Zheng, J.; Cai, F.; Chen, W.; Lei, W.; Chen, H. Taxonomy-aware Learning for Few-Shot Event Detection. Proceedings of the 30th Web Conference; Virtual Event/Ljubljana, Slovenia, 19–23 April 2021; pp. 3546-3557.
14. Lai, V.D.; Nguyen, M.V.; Nguyen, T.H.; Dernoncourt, F. Graph Learning Regularization and Transfer Learning for Few-Shot Event Detection. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval; Virtual, 11–15 July 2021; pp. 2172-2176.
15. Zheng, J.; Cai, F.; Chen, H. Incorporating Scenario Knowledge into a Unified Fine-tuning Architecture for Event Representation. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval; Virtual Event, China, 25–30 July 2020; pp. 249-258.
16. Zheng, J.; Cai, F.; Ling, Y.; Chen, H. Heterogeneous Graph Neural Networks to Predict What Happen Next. Proceedings of the 28th International Conference on Computational Linguistics; Barcelona, Spain (Online), 8–13 December 2020; pp. 328-338.
17. Hou, Y.; Lai, Y.; Wu, Y.; Che, W.; Liu, T. Few-shot Learning for Multi-label Intent Detection. Proceedings of the AAAI Conference on Artificial Intelligence; Virtual, 22 February–1 March 2022; AAAI Press: Palo Alto, CA, USA, 2021; pp. 13036-13044.
18. Dopierre, T.; Gravier, C.; Subercaze, J.; Logerais, W. Few-shot Pseudo-Labeling for Intent Detection. Proceedings of the 28th International Conference on Computational Linguistics; Barcelona, Spain, 8–13 December 2020; pp. 4993-5003.
19. Yang, S.; Liu, L.; Xu, M. Free Lunch for Few-shot Learning: Distribution Calibration. Proceedings of the International Conference on Learning Representations; Vienna, Austria, 3–7 May 2021.
20. Abdulmumin, I.; Galadanci, B.S.; Isa, A. Enhanced Back-Translation for Low Resource Neural Machine Translation Using Self-training. Proceedings of the Information and Communication Technology and Applications; Minna, Nigeria, 24–27 November 2020; Volume 1350, pp. 355-371.
21. Goyal, T.; Durrett, G. Neural Syntactic Preordering for Controlled Paraphrase Generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020; Online, 5–10 July 2020; pp. 238-252.
22. Dopierre, T.; Gravier, C.; Logerais, W. ProtAugment: Intent Detection Meta-Learning through Unsupervised Diverse Paraphrasing. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021); Virtual, 1–6 August 2021; pp. 2454-2466.
23. Cavalin, P.R.; Ribeiro, V.H.A.; Appel, A.P.; Pinhanez, C.S. Improving Out-of-Scope Detection in Intent Classification by Using Embeddings of the Word Graph Space of the Classes. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); Online, 16–20 November 2020; pp. 3952-3961.
24. Casanueva, I.; Temčinas, T.; Gerz, D.; Henderson, M.; Vulić, I. Efficient Intent Detection with Dual Sentence Encoders. Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI; Online, 9 July 2020; pp. 38-45. [DOI: https://dx.doi.org/10.18653/v1/2020.nlp4convai-1.5]
25. Abro, W.A.; Qi, G.; Ali, Z.; Feng, Y.; Aamir, M. Multi-turn intent determination and slot filling with neural networks and regular expressions. Knowl. Based Syst.; 2020; 208, 106428. [DOI: https://dx.doi.org/10.1016/j.knosys.2020.106428]
26. Weld, H.; Huang, X.; Long, S.; Poon, J.; Han, S.C. A survey of joint intent detection and slot-filling models in natural language understanding. arXiv; 2021; arXiv: 2101.08091. [DOI: https://dx.doi.org/10.1145/3547138]
27. Sarikaya, R.; Hinton, G.E.; Ramabhadran, B. Deep belief nets for natural language call-routing. Proceedings of the ICASSP; Prague, Czech Republic, 22–27 May 2011; pp. 5680-5683.
28. Liu, B.; Lane, I.R. Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling. Proceedings of the INTERSPEECH; San Francisco, CA, USA, 8–12 September 2016; pp. 685-689.
29. Chen, Q.; Zhuo, Z.; Wang, W. BERT for Joint Intent Classification and Slot Filling. arXiv; 2019; arXiv: 1902.10909.
30. Liu, H.; Zhang, X.; Fan, L.; Fu, X.; Li, Q.; Wu, X.; Lam, A.Y.S. Reconstructing Capsule Networks for Zero-shot Intent Classification. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing; Hong Kong, China, 3–7 November 2019; pp. 4798-4808.
31. Dopierre, T.; Gravier, C.; Logerais, W. PROTAUGMENT: Unsupervised diverse short-texts paraphrasing for intent detection meta-learning. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Online, 1–6 August 2021.
32. Zhang, X.; Cai, F.; Hu, X.; Zheng, J.; Chen, H. A Contrastive learning-based Task Adaptation model for few-shot intent recognition. Inf. Process. Manag.; 2022; 59, 102863. [DOI: https://dx.doi.org/10.1016/j.ipm.2021.102863]
33. Zhang, J.; Hashimoto, K.; Liu, W.; Wu, C.; Wan, Y.; Yu, P.S.; Socher, R.; Xiong, C. Discriminative Nearest Neighbor Few-Shot Intent Detection by Transferring Natural Language Inference. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020; Online, 16–20 November 2020; pp. 5064-5082.
34. Liu, Z.; Fan, Z.; Wang, Y.; Yu, P.S. Augmenting Sequential Recommendation with Pseudo-Prior Items via Reversely Pre-training Transformer. Proceedings of the SIGIR: 44th International ACM SIGIR Conference on Research and Development in Information Retrieval; Virtual, 11–15 July 2021; pp. 1608-1612.
35. Wei, J.W.; Zou, K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. Proceedings of the EMNLP/IJCNLP (1): Association for Computational Linguistics; Hong Kong, China, 3–7 November 2019; pp. 6381-6387.
36. Yu, A.W.; Dohan, D.; Luong, M.; Zhao, R.; Chen, K.; Norouzi, M.; Le, Q.V. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. Proceedings of the ICLR (Poster); Vancouver, BC, Canada, 30 April–3 May 2018.
37. Xie, Z.; Wang, S.I.; Li, J.; Lévy, D.; Nie, A.; Jurafsky, D.; Ng, A.Y. Data Noising as Smoothing in Neural Network Language Models. Proceedings of the ICLR (Poster); Toulon, France, 24–26 April 2017.
38. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the ACL: Association for Computational Linguistics; Online, 5–10 July 2020; pp. 7871-7880.
39. Schick, T.; Schütze, H. Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference. Proceedings of the EACL: Association for Computational Linguistics; Online, 19–23 April 2021; pp. 255-269.
40. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res.; 2005; 6, pp. 1705-1749.
41. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. Proceedings of the ICLR (Poster); Toulon, France, 24–26 April 2017.
42. Satorras, V.G.; Estrach, J.B. Few-Shot Learning with Graph Neural Networks. Proceedings of the ICLR (Poster); Vancouver, BC, Canada, 30 April–3 May 2018.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Intent recognition aims to identify users’ potential intents from their utterances and is a key component of task-oriented dialog systems. A real challenge, however, is that the number of intent categories has grown faster than the amount of human-annotated data, so that only a small amount of data is available for many new intent categories. This lack of data causes traditional deep neural networks to overfit the small training set, which seriously affects practical applications. Hence, some researchers have proposed few-shot learning to address the data-scarcity issue. One efficient approach is text augmentation, which, however, often generates noisy or meaningless data. To address these issues, we propose leveraging the knowledge in pre-trained language models and construct the cloze-style data augmentation (CDA) model. We employ unsupervised learning to force the augmented data to be semantically similar to the initial input sentences, and contrastive learning to enhance the uniqueness of each category. Experimental results on the CLINC-150 and BANKING-77 datasets show the effectiveness of our proposal, which outperforms competitive baselines. In addition, we conducted an ablation study to verify the function of each module in our models; the results illustrate that the contrastive learning module plays the most important role in improving recognition accuracy.