Abstract
The field of music information retrieval has seen a proliferation of deep learning techniques, with an apparent consensus around convolutional neural networks operating on spectrogram input. This is surprising, as the data contained in spectrograms violates the underlying assumptions imposed by the convolution operation. The literature shows that researchers have tried to work around this problem by changing the input data or applying recurrence, but with limited success. Attention mechanisms have been shown to outperform recurrence when tracking long-range dependencies. Despite such dependencies occurring naturally in music, researchers have been slow to adopt attention for music tasks.
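To make the contrast with convolution concrete, the sketch below shows scaled dot-product self-attention (Vaswani et al., 2017) in its most minimal form: every position attends to every other position in a single step, regardless of distance, whereas a convolution only mixes information within its kernel. This is an illustrative sketch, not code from the thesis; a full implementation would add learned query/key/value projections.

```python
# Minimal sketch of scaled dot-product self-attention.
# Illustrative only: Q, K, V would normally come from learned projections.
import math
import torch

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq_len, dim). Returns an output of the same shape."""
    d = x.size(-1)
    q, k, v = x, x, x                                  # no projections, to stay minimal
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)    # (batch, seq, seq) similarities
    weights = scores.softmax(dim=-1)                   # each row sums to 1
    return weights @ v                                 # weighted mix of all positions
```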
This work focuses on the novel application of attention-augmented convolutional neural networks to musical instrument recognition. A single-source, single-label setting is investigated first to evaluate the impact of the attention augmentation under clean conditions: the network is tasked with identifying which of nineteen possible orchestral instruments is present in an audio file. The proposed architecture augments the final convolution modules of a baseline convolutional template with attention mechanisms, enhancing the network's ability to extract a model of sound generation from spectrograms. The ratio of convolution to attention is varied, and two spectrogram types, the constant-Q transform (CQT) and the short-time Fourier transform (STFT), are used to assess the efficacy of applying attention to this task. Networks with a 25% attention augmentation were found to outperform both their non-augmented and further-augmented counterparts.
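A plausible reading of "augmenting a convolution module with attention" is the scheme of Bello et al. (2019), where a fraction of a layer's output channels is produced by multi-head self-attention over the spectrogram's time-frequency positions and the rest by an ordinary convolution. The sketch below follows that scheme; the 25% default mirrors the attention ratio reported in the abstract, while the channel counts, head count, and omission of relative positional encodings are illustrative assumptions, not details from the thesis.

```python
# Hedged sketch of an attention-augmented 2-D convolution block
# (in the spirit of Bello et al., 2019). Illustrative assumptions throughout.
import torch
import torch.nn as nn

class AttentionAugmentedConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, attn_ratio: float = 0.25, heads: int = 4):
        super().__init__()
        self.attn_ch = int(out_ch * attn_ratio)   # channels produced by attention
        conv_ch = out_ch - self.attn_ch           # channels produced by convolution
        assert self.attn_ch % heads == 0, "attention channels must split across heads"
        self.conv = nn.Conv2d(in_ch, conv_ch, kernel_size=3, padding=1)
        # A 1x1 conv projects the input to concatenated queries/keys/values.
        self.to_qkv = nn.Conv2d(in_ch, 3 * self.attn_ch, kernel_size=1)
        self.heads = heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)   # each (b, attn_ch, h, w)

        # Flatten the time-frequency grid so every position attends to every other.
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.reshape(b, self.heads, self.attn_ch // self.heads, h * w)

        q, k, v = split(q), split(k), split(v)
        scale = (self.attn_ch // self.heads) ** -0.5
        attn = (q.transpose(-2, -1) @ k * scale).softmax(dim=-1)   # (b, heads, hw, hw)
        out = (v @ attn.transpose(-2, -1)).reshape(b, self.attn_ch, h, w)
        # Concatenate convolutional and attentional feature maps along channels.
        return torch.cat([self.conv(x), out], dim=1)

# Usage on a hypothetical batch of spectrogram feature maps:
block = AttentionAugmentedConv(in_ch=32, out_ch=64, attn_ratio=0.25)
features = block(torch.randn(8, 32, 96, 96))   # -> (8, 64, 96, 96)
```

Under this scheme, sweeping `attn_ratio` from 0 to 1 reproduces the abstract's "ratio of convolution to attention" experiment: 0 is a plain convolution, 1 is pure self-attention.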
Predominant musical instrument identification is investigated next: identifying an arbitrary number of instruments deemed predominant in a multi-source, multi-label setting. The impact of the attention augmentation is assessed across various networks and input representations that build on the first phase to enhance representational power while reducing dimensionality. Specifically, networks are trained and tested across five different attention amounts, with center loss, with and without two auxiliary classification tasks, and using four different input spectrogram representations; in total, 60 networks are created, trained, and tested for comparison. Once again, networks augmented with 25% attention outperform their convolution-only and further-augmented counterparts, achieving micro and macro F1 scores of 0.6743 and 0.5514, respectively.
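The micro and macro F1 figures quoted above are the standard multi-label metrics and differ in how they aggregate: micro-F1 pools true/false positives over all instrument labels (so common instruments dominate), while macro-F1 averages the per-instrument F1 scores (so rare instruments weigh equally). The sketch below shows the computation with scikit-learn; the toy label matrices are invented for illustration only.

```python
# Micro vs. macro F1 on a toy multi-label instrument matrix (illustrative data).
import numpy as np
from sklearn.metrics import f1_score

# rows = audio clips, columns = instrument labels (1 = predominant)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

print("micro F1:", f1_score(y_true, y_pred, average="micro"))  # pooled counts
print("macro F1:", f1_score(y_true, y_pred, average="macro"))  # per-label average
```

The gap between the reported 0.6743 (micro) and 0.5514 (macro) is consistent with this distinction: performance on frequent instruments pulls the micro score above the unweighted per-instrument average.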