Full text

Turn on search term navigation

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Speaker diarization systems aim to find ‘who spoke when?’ in multi-speaker recordings. The dataset usually consists of meetings, TV/talk shows, telephone and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique, which finds the active speaker through audio-visual synchronization model for diarization. A pre-trained audio-visual synchronization model is used to find the synchronization between a visible person and the respective audio. For that purpose, short video segments comprised of face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two streamed network which matches audio frames with their respective visual input segments. On the basis of high confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps in generating speaker specific clusters with high probability. We tested our approach on a popular subset of AMI meeting corpus consisting of 5.4 h of recordings for audio and 5.8 h of different set of multimodal recordings. A significant improvement is noticed with the proposed method in term of DER when compared to conventional and fully supervised audio based speaker diarization. The results of the proposed technique are very close to the complex state-of-the art multimodal diarization which shows significance of such simple yet effective technique.

Details

Title
Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model
Author
Ahmad, Rehan 1   VIAFID ORCID Logo  ; Zubair, Syed 2   VIAFID ORCID Logo  ; Alquhayz, Hani 3   VIAFID ORCID Logo  ; Ditta, Allah 4 

 Department of Electrical Engineering, International Islamic University, Islamabad 44000, Pakistan 
 Analytics Camp, Islamabad 44000, Pakistan; [email protected] 
 Department of Computer Science and Information, College of Science in Zulfi, Majmaah University, Al-Majmaah 11952, Saudi Arabia; [email protected] 
 Division of Science & Technology, University of Education, Township, Lahore 54770, Pakistan; [email protected] 
First page
5163
Publication year
2019
Publication date
2019
Publisher
MDPI AG
e-ISSN
14248220
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2535541353
Copyright
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.