Abstract

The field of music information retrieval has seen a proliferation of deep learning techniques, with an apparent consensus on convolutional neural networks operating on spectrogram inputs. This is surprising, as the data contained in spectrograms breaks the underlying assumptions imposed by the convolution operation. The literature shows that researchers have tried to work around this problem by changing the input representation or applying recurrence, but with limited success. Attention mechanisms have been shown to outperform recurrence at tracking long-range dependencies; despite such dependencies occurring naturally in music, researchers have been slow to adopt attention for music tasks.

This work focuses on the novel application of attention-augmented convolutional neural networks to musical instrument recognition tasks. A single-source, single-label setting is investigated first to evaluate the impact of the attention augmentation under clean conditions. The network is tasked with identifying one of nineteen possible orchestral instruments from an audio file. The proposed architecture augments the final convolution modules of a baseline convolutional template with attention mechanisms, enhancing the network's ability to extract the model of sound generation from spectrograms. The ratio of convolution to attention is varied, and two spectrogram types, CQT and STFT, are used to assess the efficacy of applying attention to this task. Networks with a 25% attention augmentation were found to outperform both their non-augmented and more heavily augmented counterparts.
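The abstract does not give the dissertation's exact layer definitions, but the idea of replacing a fraction of a convolution module's output channels with self-attention channels can be sketched as follows. This is a minimal NumPy illustration, not the author's implementation: a 1x1 convolution (a per-pixel matrix multiply) stands in for the real convolution, single-head attention stands in for whatever attention variant the dissertation uses, and all dimensions are made up for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_augmented_block(x, w_conv, w_q, w_k, w_v):
    """Concatenate convolutional and self-attention feature channels.

    x       : (H, W, C_in) spectrogram feature map
    w_conv  : (C_in, C_conv) 1x1-conv weights (stands in for a real conv)
    w_q/w_k : (C_in, d_k) query/key projections
    w_v     : (C_in, C_attn) value projection; the attention fraction is
              C_attn / (C_conv + C_attn), e.g. 4 / 16 = 25%.
    """
    h, w, _ = x.shape
    conv_out = x @ w_conv                      # (H, W, C_conv)

    # Self-attention over all H*W time-frequency positions.
    tokens = x.reshape(h * w, -1)              # (N, C_in)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N)
    attn_out = (weights @ v).reshape(h, w, -1)         # (H, W, C_attn)

    # 75% conv channels + 25% attention channels in one feature map.
    return np.concatenate([conv_out, attn_out], axis=-1)

# Illustrative usage: 12 conv channels + 4 attention channels = 16 total.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 32))
out = attention_augmented_block(
    x,
    rng.normal(size=(32, 12)),   # conv weights
    rng.normal(size=(32, 8)),    # w_q
    rng.normal(size=(32, 8)),    # w_k
    rng.normal(size=(32, 4)),    # w_v
)
print(out.shape)  # (8, 8, 16)
```

Because the attention weights span every position in the feature map, the attention channels can relate distant time-frequency regions that a convolution's local receptive field cannot, which is the motivation the abstract gives for the augmentation.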

Predominant musical instrument identification is investigated next: identifying an arbitrary number of instruments deemed predominant in a multi-source, multi-label setting. The impact of the attention augmentation is assessed across various networks and input representations that build on the first phase to enhance representational power while reducing dimensionality. Specifically, networks are trained and tested across five different attention amounts, with center loss, with and without two auxiliary classification tasks, and using four different input spectrogram representations; in total, 60 networks are created, trained, and tested for comparison. Once again, networks augmented with 25% attention outperform their convolution-only and more heavily augmented counterparts, achieving micro and macro F1 scores of 0.6743 and 0.5514, respectively.
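The gap between the reported micro F1 (0.6743) and macro F1 (0.5514) is typical of multi-label instrument data with imbalanced classes: micro F1 pools true/false positives and negatives over all labels, so frequent instruments dominate, while macro F1 averages per-instrument F1 scores with equal weight. A minimal sketch of the two aggregates (the data here is illustrative, not the dissertation's):

```python
import numpy as np

def f1_scores(y_true, y_pred):
    """Micro and macro F1 for multi-label boolean matrices (samples x labels)."""
    tp = (y_true & y_pred).sum(axis=0)    # per-label true positives
    fp = (~y_true & y_pred).sum(axis=0)   # per-label false positives
    fn = (y_true & ~y_pred).sum(axis=0)   # per-label false negatives

    # Micro: pool counts across all labels first, then compute one F1.
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

    # Macro: per-label F1, then an unweighted mean over labels.
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    return micro, per_label.mean()

# Toy example: 2 clips, 3 instrument labels. One rare label is missed
# entirely, which drags macro F1 down more than micro F1.
y_true = np.array([[1, 1, 0],
                   [1, 0, 1]], dtype=bool)
y_pred = np.array([[1, 0, 0],
                   [1, 0, 1]], dtype=bool)
micro, macro = f1_scores(y_true, y_pred)
print(micro, macro)  # ~0.857 vs ~0.667
```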

Details

Title
Paying Attention to Music: Applying Attention Mechanisms to Automatic Music Transcription
Author
Wise, Andrew J.
Publication year
2024
Publisher
ProQuest Dissertations & Theses
ISBN
9798383179406
Source type
Dissertation or Thesis
Language of publication
English
ProQuest document ID
3073193482
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.