Abstract

We present an unsupervised domain adaptation (UDA) method for lip reading, an image-based speech recognition task. Most conventional UDA methods cannot be applied when the adaptation data contain unknown classes, such as out-of-vocabulary words. In this paper, we propose a cross-modal knowledge distillation (KD)-based domain adaptation method in which the intermediate-layer output of an audio-based speech recognition model serves as a teacher for the unlabeled adaptation data. Because the audio signal contains more information for recognizing speech than lip images do, the knowledge of the audio-based model can act as a powerful teacher when the unlabeled adaptation data consist of audio-visual parallel data. In addition, because the proposed intermediate-layer-based KD expresses the teacher as a sub-class (sub-word)-level representation, the method allows data of unknown classes to be used for adaptation. Through experiments on an image-based word recognition task, we demonstrate that the proposed approach not only improves UDA performance but can also exploit unknown-class adaptation data.
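The core idea of intermediate-layer cross-modal KD can be sketched as matching the visual (student) model's hidden features to the audio (teacher) model's hidden features on unlabeled parallel data, so no class labels are needed. The sketch below is a minimal illustration under assumed shapes and a plain mean-squared-error objective; the function name and the frame-level feature layout are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def intermediate_kd_loss(student_feat: np.ndarray, teacher_feat: np.ndarray) -> float:
    """Frame-level distillation loss between intermediate-layer outputs.

    student_feat: (T, D) features from the image-based (lip-reading) model.
    teacher_feat: (T, D) features from the audio-based model on the same
                  time-aligned parallel utterance.
    Hypothetical sketch: uses mean-squared error; the paper's loss may differ.
    """
    assert student_feat.shape == teacher_feat.shape, "features must be time-aligned"
    # No labels are used: the teacher's sub-word-level representation is the
    # target, which is why unknown-class (e.g., out-of-vocabulary) data works.
    return float(np.mean((student_feat - teacher_feat) ** 2))
```

Because the target is a continuous intermediate representation rather than a class posterior, the loss is defined even for words outside the training vocabulary.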

Details

Title
Unsupervised domain adaptation for lip reading based on cross-modal knowledge distillation
Author
Takashima Yuki 1; Takashima Ryoichi 1; Tsunoda Ryota 1; Aihara Ryo 2; Takiguchi Tetsuya 1; Ariki Yasuo 1; Motoyama Nobuaki 2

1 Kobe University, Graduate School of System Informatics, Kobe, Japan (GRID:grid.31432.37) (ISNI:0000 0001 1092 3077)
2 Mitsubishi Electric Corporation, Information Technology R&D Center, Ofuna, Japan (GRID:grid.462605.3) (ISNI:0000 0001 0662 3151)
Publication year
2021
Publication date
Dec 2021
Publisher
Springer Nature B.V.
ISSN
1687-4714
e-ISSN
1687-4722
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2608953515
Copyright
© The Author(s) 2021. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.