
Abstract

We present a novel encoder-only Transformer model for symbolic music harmony generation, based on a fixed time-grid representation of melody and harmony. Inspired by denoising diffusion processes, our model progressively unmasks harmony tokens over a sequence of discrete stages, learning to reconstruct the full harmonic structure from partial context. Unlike autoregressive models, this formulation enables flexible, non-sequential generation and supports explicit control over harmony placement. The model is stage-aware, receiving timestep embeddings analogous to diffusion timesteps, and is conditioned on both a binary piano roll and a pitch class roll to capture melodic context. We explore two unmasking schedules—random token revealing and midpoint doubling—both requiring a fixed and significantly reduced number of model calls at inference time. While our approach achieves competitive performance with strong autoregressive baselines (GPT-2 and BART) across several harmonic metrics, its key advantages lie in controllability, structured decoding with fixed inference steps, and alignment with musical structure. Ablation studies further highlight the role of stage awareness and pitch class conditioning. Our results position this method as a viable and interpretable alternative for symbolic harmony generation and a foundation for future work on structured, controllable musical modeling.
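The abstract's two unmasking schedules can be sketched as plain scheduling functions. This is a minimal illustration, not the authors' code: the function names, the stage layout, and the choice to seed endpoints first in the midpoint scheme are assumptions; the key property shown is that both schedules reveal every position in a fixed number of stages (and hence a fixed number of model calls).

```python
import random

def random_reveal_schedule(num_positions, num_stages, seed=0):
    """Random token revealing: shuffle all grid positions, then
    reveal them in num_stages roughly equal chunks."""
    rng = random.Random(seed)
    order = list(range(num_positions))
    rng.shuffle(order)
    chunk = -(-num_positions // num_stages)  # ceiling division
    return [order[i * chunk:(i + 1) * chunk] for i in range(num_stages)]

def midpoint_doubling_schedule(num_positions):
    """Midpoint doubling: reveal the endpoints first, then the
    midpoint of every revealed span, roughly doubling the number of
    revealed positions per stage (about log2(n) stages in total)."""
    revealed = {0, num_positions - 1}
    stages = [sorted(revealed)]
    while len(revealed) < num_positions:
        new = set()
        pts = sorted(revealed)
        for a, b in zip(pts, pts[1:]):
            if b - a > 1:          # a gap remains between revealed points
                new.add((a + b) // 2)
        revealed |= new
        stages.append(sorted(new))
    return stages
```

For a 16-position grid, `random_reveal_schedule(16, 4)` always makes 4 passes, while `midpoint_doubling_schedule(16)` finishes in about five, independent of generation order; at each stage, the model would be called once to fill in the newly revealed positions given everything already unmasked.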

Details

Title
Diffusion-Inspired Masked Language Modeling for Symbolic Harmony Generation on a Fixed Time Grid
Author
Kaliakatsos-Papakostas Maximos 1; Makris Dimos 2; Soiledis Konstantinos 2; Tsamis Konstantinos-Theodoros 2; Katsouros Vassilis 3; Cambouropoulos Emilios 4

1 Department of Music Technology and Acoustics, Hellenic Mediterranean University, 74100 Rethymno, Greece; [email protected] (D.M.); [email protected] (K.S.); [email protected] (K.-T.T.); [email protected] (V.K.); Institute of Language and Speech Processing, Athena RC, 15125 Marousi, Greece; Archimedes, Athena RC, 15125 Marousi, Greece
2 Department of Music Technology and Acoustics, Hellenic Mediterranean University, 74100 Rethymno, Greece; Archimedes, Athena RC, 15125 Marousi, Greece
3 Department of Music Technology and Acoustics, Hellenic Mediterranean University, 74100 Rethymno, Greece; Institute of Language and Speech Processing, Athena RC, 15125 Marousi, Greece
4 School of Music Studies, Aristotle University of Thessaloniki, 57001 Thessaloniki, Greece; [email protected]
First page
9513
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
2076-3417
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3249675488
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.