A large TV dataset for speech and music activity detection

Abstract

Automatic speech and music activity detection (SMAD) is an enabling task that can help segment, index, and pre-process audio content in radio broadcast and TV programs. However, due to copyright concerns and the cost of manual annotation, the limited availability of diverse and sizeable datasets hinders the progress of state-of-the-art (SOTA) data-driven approaches. We address this challenge by presenting a large-scale dataset containing Mel spectrogram, VGGish, and MFCC features extracted from around 1600 h of professionally produced audio tracks and their corresponding noisy labels indicating the approximate location of speech and music segments. The labels are derived from several sources, such as subtitles and cue sheets. A test set curated by human annotators is also included as a subset for evaluation. To validate the generalizability of the proposed dataset, we conduct several experiments comparing various model architectures and their variants under different conditions. The results suggest that our proposed dataset can serve as a reliable training resource and leads to SOTA performance on various public datasets. To the best of our knowledge, this dataset is the first large-scale, open-sourced dataset that contains features extracted from professionally produced audio tracks and their corresponding frame-level speech and music annotations.

Details

Title

A large TV dataset for speech and music activity detection

Author

Hung, Yun-Ning¹; Wu, Chih-Wei²; Orife, Iroro²; Hipple, Aaron²; Wolcott, William²; Lerch, Alexander¹

¹ Georgia Institute of Technology, Music Informatics Group, Atlanta, USA (GRID:grid.213917.f) (ISNI:0000 0001 2097 4943)
² Netflix, Inc., Los Gatos, USA (GRID:grid.497062.e) (ISNI:0000 0004 6358 9334)

Publication year

2022

Publication date

Dec 2022

Publisher

Springer Nature B.V.

ISSN

16874714

e-ISSN

16874722

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1186/s13636-022-00253-8

ProQuest document ID

2709397595

© The Author(s) 2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
