Abstract
With the prevalence of pre-trained language models (PLMs) and the pre-training–fine-tuning paradigm, it has been continuously shown that larger models tend to yield better performance. However, as PLMs scale up, fine-tuning and storing all the parameters is prohibitively costly and eventually becomes practically infeasible. This necessitates a new branch of research focusing on the parameter-efficient adaptation of PLMs, which optimizes a small portion of the model parameters while keeping the rest fixed, drastically cutting down computation and storage costs. In general, this line of work demonstrates that large-scale models can be effectively stimulated by optimizing only a few parameters. Despite their various designs, we discuss and analyse these approaches here under a more consistent and accessible term, ‘delta-tuning’, where ‘delta’, a mathematical notation often used to denote changes, is borrowed to refer to the portion of parameters that are ‘changed’ during training. We formally describe the problem and propose a unified categorization criterion for existing delta-tuning methods to explore their correlations and differences. We also discuss the theoretical principles underlying the effectiveness of delta-tuning and interpret them from the perspectives of optimization and optimal control. Furthermore, we provide a holistic empirical study on over 100 natural language processing tasks and investigate various aspects of delta-tuning. With comprehensive study and analysis, our research demonstrates the theoretical and practical properties of delta-tuning in the adaptation of PLMs.
Training a deep neural network can be costly, but training time is reduced when a pre-trained network can be adapted to different use cases. Ideally, only a small number of parameters needs to be changed during this fine-tuning process, so that the resulting parameter updates can be stored and distributed more easily. In this Analysis, different methods of fine-tuning with only a small number of parameters are compared on a large set of natural language processing tasks.
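As a minimal, illustrative sketch of the core idea (not the specific method evaluated in the paper), the following PyTorch snippet freezes a pre-trained linear layer and trains only a small low-rank ‘delta’ added on top of it, in the style of LoRA-like delta-tuning. The class name LoRALinear, the 768-dimensional toy layer, the rank and scaling values, and the dummy objective are all hypothetical stand-ins chosen for illustration.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank 'delta' (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights fixed
        # Low-rank factors that constitute the 'delta' to be trained
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)  # down-projection
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))        # up-projection, starts at zero
        self.scaling = alpha / rank

    def forward(self, x):
        # output = frozen base transform + scaled low-rank update (the 'delta')
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

# Toy usage: wrap a stand-in for one PLM sub-layer and optimize only the delta parameters.
pretrained_layer = nn.Linear(768, 768)
layer = LoRALinear(pretrained_layer, rank=8)
delta_params = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(delta_params, lr=1e-3)

x = torch.randn(4, 768)
loss = layer(x).pow(2).mean()  # dummy objective for illustration only
loss.backward()
optimizer.step()
print(sum(p.numel() for p in delta_params), "trainable parameters out of",
      sum(p.numel() for p in layer.parameters()))

Because only lora_a and lora_b receive gradients, the optimizer state and the stored update scale with the rank of the delta rather than with the full model, which is the property the abstract refers to as parameter-efficient adaptation.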
Details
…; Qin, Yujia 1; Yang, Guang 2; Wei, Fuchao 2; Yang, Zonghan 2; Su, Yusheng 1; Hu, Shengding 1; Chen, Yulin 3; Chan, Chi-Min 2; Chen, Weize 1; Yi, Jing 1; Zhao, Weilin 1; Wang, Xiaozhi 2; Liu, Zhiyuan 1; Zheng, Hai-Tao 3; Chen, Jianfei 2; Liu, Yang 2; Tang, Jie 1; Li, Juanzi 2; Sun, Maosong 1
1 Tsinghua University, Department of Computer Science and Technology, Beijing, China (GRID:grid.12527.33) (ISNI:0000 0001 0662 3178); Beijing Academy of Artificial Intelligence, Beijing, China (GRID:grid.511045.4)
2 Tsinghua University, Department of Computer Science and Technology, Beijing, China (GRID:grid.12527.33) (ISNI:0000 0001 0662 3178)
3 Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China (GRID:grid.12527.33) (ISNI:0000 0001 0662 3178)




