
Abstract

We present a novel encoder-only Transformer model for symbolic music harmony generation, based on a fixed time-grid representation of melody and harmony. Inspired by denoising diffusion processes, our model progressively unmasks harmony tokens over a sequence of discrete stages, learning to reconstruct the full harmonic structure from partial context. Unlike autoregressive models, this formulation enables flexible, non-sequential generation and supports explicit control over harmony placement. The model is stage-aware, receiving timestep embeddings analogous to diffusion timesteps, and is conditioned on both a binary piano roll and a pitch class roll to capture melodic context. We explore two unmasking schedules—random token revealing and midpoint doubling—both requiring a fixed and significantly reduced number of model calls at inference time. While our approach achieves competitive performance with strong autoregressive baselines (GPT-2 and BART) across several harmonic metrics, its key advantages lie in controllability, structured decoding with fixed inference steps, and alignment with musical structure. Ablation studies further highlight the role of stage awareness and pitch class conditioning. Our results position this method as a viable and interpretable alternative for symbolic harmony generation and a foundation for future work on structured, controllable musical modeling.
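The abstract's two unmasking schedules can be sketched as plain scheduling functions. This is a minimal illustration, not the authors' code: the function names, the stage layout, and the choice to seed endpoints first in the midpoint scheme are assumptions; the key property shown is that both schedules reveal every position in a fixed number of stages (and hence a fixed number of model calls).

```python
import random

def random_reveal_schedule(num_positions, num_stages, seed=0):
    """Random token revealing: shuffle all grid positions, then
    reveal them in num_stages roughly equal chunks."""
    rng = random.Random(seed)
    order = list(range(num_positions))
    rng.shuffle(order)
    chunk = -(-num_positions // num_stages)  # ceiling division
    return [order[i * chunk:(i + 1) * chunk] for i in range(num_stages)]

def midpoint_doubling_schedule(num_positions):
    """Midpoint doubling: reveal the endpoints first, then the
    midpoint of every revealed span, roughly doubling the number of
    revealed positions per stage (about log2(n) stages in total)."""
    revealed = {0, num_positions - 1}
    stages = [sorted(revealed)]
    while len(revealed) < num_positions:
        new = set()
        pts = sorted(revealed)
        for a, b in zip(pts, pts[1:]):
            if b - a > 1:          # a gap remains between revealed points
                new.add((a + b) // 2)
        revealed |= new
        stages.append(sorted(new))
    return stages
```

For a 16-position grid, `random_reveal_schedule(16, 4)` always makes 4 passes, while `midpoint_doubling_schedule(16)` finishes in about five, independent of generation order; at each stage, the model would be called once to fill in the newly revealed positions given everything already unmasked.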

Details

Title
Diffusion-Inspired Masked Language Modeling for Symbolic Harmony Generation on a Fixed Time Grid
Author
Kaliakatsos-Papakostas Maximos 1; Makris Dimos 2; Soiledis Konstantinos 2; Tsamis Konstantinos-Theodoros 2; Katsouros Vassilis 3; Cambouropoulos Emilios 4

1 Department of Music Technology and Acoustics, Hellenic Mediterranean University, 74100 Rethymno, Greece; [email protected] (D.M.); [email protected] (K.S.); [email protected] (K.-T.T.); [email protected] (V.K.); Institute of Language and Speech Processing, Athena RC, 15125 Marousi, Greece; Archimedes, Athena RC, 15125 Marousi, Greece
2 Department of Music Technology and Acoustics, Hellenic Mediterranean University, 74100 Rethymno, Greece; Archimedes, Athena RC, 15125 Marousi, Greece
3 Department of Music Technology and Acoustics, Hellenic Mediterranean University, 74100 Rethymno, Greece; Institute of Language and Speech Processing, Athena RC, 15125 Marousi, Greece
4 School of Music Studies, Aristotle University of Thessaloniki, 57001 Thessaloniki, Greece; [email protected]
First page
9513
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
2076-3417
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3249675488
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.