
Abstract

Advances in music emotion prediction are fueling AI-based algorithmic composition, enabling the generation of complex melodies. However, bridging the neural and auditory domains remains challenging due to the semantic gap between brain-derived low-level features and high-level musical concepts, which makes alignment computationally demanding. This study proposes a deep learning framework for generating MIDI sequences aligned with labeled emotion predictions through supervised feature extraction from the neural and auditory domains. EEGNet is employed to process neural data, while an autoencoder-based piano algorithm handles auditory data. To address modality heterogeneity, Centered Kernel Alignment is incorporated to enhance the separation of emotional states. Furthermore, regression between feature domains is applied to reduce intra-subject variability in the extracted Electroencephalography (EEG) patterns, followed by clustering of the latent auditory representations into denser partitions to improve MIDI reconstruction quality. Evaluation on real-world data using musical metrics shows that the proposed approach improves emotion classification along the arousal and valence dimensions and enhances the system's ability to produce MIDI sequences that better preserve temporal alignment, tonal consistency, and structural integrity. Subject-specific analysis reveals that subjects who performed better on the imagery paradigm produced higher-quality MIDI outputs, as their neural patterns aligned more closely with the training data; subjects with weaker performance yielded less consistent auditory data.
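The abstract names Centered Kernel Alignment (CKA) as the mechanism used to relate the EEG and auditory feature spaces, but this record does not reproduce the paper's formulation. As a point of reference, the sketch below computes the standard linear-CKA similarity between two feature matrices; the eeg_feats and audio_feats arrays, their shapes, and the linear_cka helper are illustrative stand-ins, not the authors' code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices.

    X: (n_samples, p) features from one modality (e.g., EEG embeddings)
    Y: (n_samples, q) features from the other modality (e.g., audio latents)
    Returns a similarity score in [0, 1].
    """
    # Center each feature dimension before comparing the modalities
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # Standard linear-CKA formula:
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Toy usage with random stand-ins for EEG and audio latent features
rng = np.random.default_rng(0)
eeg_feats = rng.normal(size=(128, 64))    # hypothetical EEGNet embeddings
audio_feats = rng.normal(size=(128, 32))  # hypothetical autoencoder latents
print(f"CKA(EEG, audio) = {linear_cka(eeg_feats, audio_feats):.3f}")
```

A score near 1 indicates closely aligned representations across modalities. Note that the paper applies CKA in a supervised setting to sharpen the separation of emotional states, which this plain, unsupervised similarity score does not capture.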

Details

Title
EEG-Based Music Emotion Prediction Using Supervised Feature Extraction for MIDI Generation
Author
Gomez-Morales, Oscar 1; Perez-Nastar, Hernan 2; Álvarez-Meza, Andrés Marino 2; Torres-Cardona, Héctor 3; Castellanos-Dominguez, Germán 2

1 Faculty of Systems and Telecommunications, Universidad Estatal Península de Santa Elena, Avda. La Libertad, Santa Elena 7047, Ecuador; [email protected]
2 Signal Processing and Recognition Group, Universidad Nacional de Colombia, Manizales 170003, Colombia; [email protected] (A.M.Á.-M.); [email protected] (G.C.-D.)
3 Transmedia Research Center, Universidad de Caldas, Manizales 170003, Colombia; [email protected]
First page
1471
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
1424-8220
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3176350326
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).