1. Introduction
The technology of recognizing human emotion is essential to developing a human computing interface (HCI) for human–robot interactions. Furthermore, emotion recognition is attracting attention in other fields, including artificial intelligence. In recent studies, emotion recognition can be divided into two areas according to the type of data used. One area is external emotion recognition based on voice or image. Because voice information is sparse in time, there is a fundamental limitation in extracting continuous emotions from it; research on emotion recognition has therefore recently been conducted using dialogue context and voice tone [1,2]. On the other hand, emotion recognition research based on image information has shown the best performance. Emotions are categorized by recognizing changes in facial expressions, based mainly on features obtained from facial images [3,4]. Recently, mechanisms for classifying emotions end-to-end using a convolutional neural network without an additional feature extraction process have been developed and achieve high accuracy [5,6].
In contrast, internal emotion recognition uses changes in bio-signals to recognize emotions without considering external changes. Emotion recognition is conducted using the electroencephalogram (EEG), which records the electrical signals generated in the human brain [7,8]. Frequency-domain features, such as power spectral density (PSD), are extracted, and emotions are recognized using typical learning algorithms. Recent studies have improved the accuracy of emotion recognition by extracting features of the asymmetry between EEG channels and applying deep learning algorithms [2,9,10]. On the other hand, emotion recognition using bio-signals presents difficulties in acquiring data, so public datasets, such as DEAP [11] and MAHNOB-HCI [12], are often used. The type of data strongly influences the results of bio-signal-based emotion recognition studies. Previous studies used Western-based datasets and various EEG features [13,14].
This study aimed to improve the accuracy of emotion classification by searching for the models, features, and channels suited to EEG-based emotion recognition, using a Korean dataset instead of a public dataset. The features were extracted from the time, frequency, and time–frequency domains for comparison. A genetic algorithm (GA) was used to select valid features, and a learning algorithm based on long short-term memory (LSTM) was trained. Based on the LSTM model results, all features were weighted through repeated feature evaluations by the GA. To evaluate all features consistently, training was repeated up to the maximum number of generations with randomly mixed feature subsets, so that features were not prematurely removed by the GA. After completing the proposed GA–LSTM process, the models, channels, and features of the EEG signal that were valid for emotion recognition could be distinguished by the weighting of the model parameters and features. Experiments using the MAHNOB-HCI dataset with continuous-time annotation applied to the valence label, and further experiments using the DEAP dataset, showed that the proposed method outperformed the continuous emotion recognition method using the representative PSD feature [15] and recent emotion classification methods [16,17,18]. To appraise an Asian-oriented dataset, the multimodal emotion for the robot interface using Asian physiological signals (MERTI-Apps) dataset, which is similar to the MAHNOB-HCI dataset, was constructed. By applying the model to the MERTI-Apps dataset, this study verified that the proposed technique provides effective model selection and feature group selection.
The following are the key contributions of this paper:
- A GA for a deep-learning network was proposed to select effective network models and search for valid EEG channels and features. The GA was combined with LSTM to select effective LSTM models and remove features that interfere with emotion recognition. The proposed GA-LSTM architecture outperforms previous methods [15,16,17,18] in terms of regression performance (RMSE) and classification performance (accuracy).
- An Asian physiological signals (MERTI-Apps) dataset that provides valence and arousal annotation labeling and various feature sets (time, frequency, time–frequency domain, and brain lateralization feature) was established to perform emotion recognition based on the EEG signals.
- The brain lateralization feature further enhances the outcome of emotional recognition.
This paper is organized as follows. Section 2 reviews the related work, and Section 3 describes the MAHNOB-HCI and DEAP public datasets and the MERTI-Apps dataset. Section 4 details the proposed method, and Section 5 presents the experimental results and analysis of further experiments on brain deflection. Finally, the conclusions are presented in Section 6.
2. Related Works
This section introduces the HCI and previous studies using public datasets for biological signals and emotion recognition.
2.1. Emotion Recognition for Human Computing Interaction
Human emotions have long been equated with facial expressions. Therefore, methods based on facial images have been developed intensively for emotion recognition. Tong et al. detected the areas where movement is activated in a facial image and determined the action unit (AU) according to the facial position and shape [19]. Facial expressions were classified according to the AU type and combination. On the other hand, as the performance of algorithms for extracting landmarks has improved dramatically, new methods for determining facial expressions have emerged [20]. In this context, landmarks are defined as the location information of crucial points in facial images, and facial expressions are characterized by the location of each landmark and the distances between them. In an actual environment, human emotions may differ from those in the datasets. For example, the same emotions may have different means of expression, or emotions may change without being revealed by facial expressions. Therefore, it is desirable to use additional information, such as bio-signals as well as facial images, for accurate emotion recognition in an actual environment.
An EEG records the electrical signals generated by the human brain, which can be used for internal emotion recognition. The state of emotion can be estimated by analyzing the positional and waveform characteristics of the EEG. The DEAP and MAHNOB-HCI datasets, which measure the EEG in the context of human emotion, were released in 2012. Soleymani et al. [15] and Kim et al. [21] developed representative EEG-based emotion recognition techniques. Soleymani et al. proposed a method for regressing positive/negative valence values in real time using the MAHNOB-HCI dataset annotated in the continuous-time domain. Kim et al. reported that the channels activated in the EEG signals vary with the state of emotion and that the features extracted using the connectivity between EEG channels are significant. Convolutional LSTM was applied to the extracted features to classify the emotional states in the valence and arousal domains [7,22].
A survey of recent research [14] showed that classifier research is dominant. Classifier research [16,17,18,23] is still active, and image, voice, and EEG modalities show good accuracy in that order. On the other hand, the accuracy of EEG-based recognition is still lower than that of the video and audio fields, which points to the need for regression and for fusion with other biological signals. In addition to emotion recognition, the field of brain–machine interface systems analyzes EEG signals to improve performance. Technology that understands the driver's condition and applies it to future automotive technology is very important. Gao et al. [24] measured a driver's fatigue and applied features, including PSD, to the model to improve performance.
Because the analysis of EEG is difficult, there is active research on finding effective features and channels for emotion recognition [18,25,26]. Hao Chao et al. [25] reported an improvement in performance using the maximal information coefficient (MIC) feature. Li et al. [26] adopted a multiple-feature fusion approach to combine compensative activation and connection information for emotion recognition; they selected the optimal feature subsets among the feature combinations after determining the combinations of the feature sets. Wang et al. [18] selected effective channels using an EEG spectrogram and normalized mutual information. Becker et al. [23] provided a database of high-resolution (HR) EEG recordings for analyzing emotions elicited by film excerpts, together with other simultaneously recorded physiological measurements: galvanic skin response (GSR), electrocardiogram (ECG), respiration, peripheral oxygen saturation, and pulse rate. The classification accuracy using brain source reconstruction was above 70% or 75%. They analyzed the frequency of feature selection for each channel/brain region, averaged over all subjects; however, they did not compare the classification performance with other databases, nor did they consider a feature selection technique. Halim et al. [27] presented a method to dynamically identify drivers' emotions and applied an ensemble of Support Vector Machine (SVM), Random Forest (RF), and Neural Network (NN) classifiers; among them, the classification accuracy using SVM was 88.75%. They also applied feature selection using principal component analysis (PCA), mutual information (MI), and random subset feature selection (RSFS). Krishna et al. [28] presented an emotion classification using a generalized mixture model and achieved 89% classification accuracy using the confusion matrix for happy, sad, bored, and neutral in terms of valence. On the other hand, they only collected an EEG dataset from six mentally impaired subjects and used features such as Mel-frequency cepstrum coefficients, without applying feature or channel selection.
This study did not predetermine the initial models and feature groups. All models, feature groups, and channels were evaluated by the GA to explore the optimal models and feature subsets. Through the GA, the models, features, and channels that interfere with emotion recognition were removed, yielding better results.
2.2. Biosignals and Dataset
With the rapid development of deep learning technology in emotion recognition, research is actively being conducted to classify emotions by combining biosignals other than EEG. As shown in Figure 1, the central nervous system (CNS) signal refers to the EEG signal generated from the brain. The peripheral nervous system (PNS) signals can be observed using the electrooculogram (EOG), electromyogram (EMG), photoplethysmogram (PPG), and galvanic skin response (GSR), which capture the electrical signals generated by muscle and eye movements. The EEG is the most critical signal for internal emotion recognition. Nevertheless, EEG signals are highly susceptible to disturbances, so severe artifacts can result from data acquisition, noise, temperature, and individual differences. The PNS signals generalize differently from the EEG signal, and research often uses PNS signals as an aid because it is challenging to classify emotions using only EEG signals. One study [29] used heart rate information obtained from an electrocardiogram (ECG) and assessed facial muscle movement through an EMG signal to recognize emotion. Another study [30] measured facial expressions through facial muscle activity to read changes in emotion. In those studies, the PSD features of the EEG signals were used; few studies have defined useful features other than PSD for emotion recognition.
Bio-signal data, including EEG signals, require specialized equipment, and the experimental conditions under which emotions are felt are difficult to reproduce. Therefore, most studies use an open dataset. Table 1 lists the datasets according to the research purpose. On the other hand, it is unclear whether the experimental results apply equally to an Asian population because most datasets are Western-based, including MAHNOB-HCI.
3. Multimodal Datasets
Three datasets were used in this study: the MAHNOB-HCI dataset [12], which is an open dataset with observer tagging of valence for continuous emotion recognition; the DEAP dataset [11], which is a multimodal dataset for analyzing people's discrete emotions; and the MERTI-Apps dataset produced for this study for use with Asian populations. This section describes the three datasets and introduces the annotation-labeling program used for MERTI-Apps.
3.1. MAHNOB-HCI and DEAP Datasets
Twenty videos were used in the MAHNOB-HCI dataset [12] (hereafter, the MAHNOB dataset) to induce continuous emotions. In a preliminary study, the participants assisted in video selection by reporting their feelings through self-assessment. The 20 source video excerpts include emotions such as disgust, amusement, joy, fear, sadness, and neutral. The length of each video clip was approximately 34 to 117 s. The participants comprised 27 healthy subjects: 11 males and 16 females. The EEG signals were obtained using 32 active electrodes positioned according to the criteria of the Biosemi ActiveTwo system [34] and the 10–20 International System [35]. The faces of the participants were filmed in a 720 × 580 video at 60 Hz and synchronized with the EEG signal. Table 2 provides a detailed description of the dataset production. Two hundred and thirty-nine records were produced, each with corresponding label information. In addition, five trained annotators provided continuous comments on the participants' facial expressions, and the valence of the frontal facial expressions was determined using FEELTRACE [36] and joysticks.
The DEAP dataset [11] is a multimodal dataset for analyzing people's discrete emotions. The dataset contains the EEG and peripheral physiological signals of 32 participants, recorded while each watched 40 one-minute music video excerpts. As shown in Table 3, the participants were 32 subjects: 16 males and 16 females. The participants rated each video in terms of arousal, valence, like/dislike, dominance, and familiarity. The data collected were EEG, EMG, EOG, GSR, blood volume pulse (BVP), temperature, and breathing data. Each video was trimmed to a 1 min segment through highlight detection. The EEG data were collected from 32 electrodes at a sampling rate of 512 Hz, and the PNS bio-signals were collected from 13 electrodes comprising four EOG, four EMG, two GSR, one BVP, temperature, and breathing signals.
3.2. MERTI-Apps Dataset
The MERTI-Apps dataset contains multimodal recordings of participants responding to fifteen emotional videos. Initially, a classification system with eight emotions based on the arousal and valence domains (happiness, excitement, sadness, boredom, disgust, anger, calm, and comfort) was established to build an experimental video set. The videos presented to induce emotions were searched randomly and collected on YouTube using emotional vocabulary keywords. Five to seven research assistants reviewed the collected videos, and videos judged to induce emotions were classified by search keyword. To select videos optimized for emotion induction, the content validity of the emotion inducement suitability (emotion type, intensity, etc.) was then checked through a field test. As a result, videos with a Content Validity Index (CVI) of 2.5 or lower were excluded, and the final emotion collection videos were selected. The final 32 videos were selected from the four valence–arousal quadrants (high arousal positive valence: HAPV, high arousal negative valence: HANV, low arousal positive valence: LAPV, low arousal negative valence: LANV) and the neutral domain. Among them, 15 were selected considering the participants' concentration time. Each video spanned from 60 to 206 s, with an average length of 114.6 s. Sad emotions were difficult to induce quickly, so the sad videos were longer than those in other public databases. The recruitment process, consent, laboratory environment, emotion-inducing videos, participant treatment, follow-up management, and other measurement items showed a CVI value of 0.92 between experts. Expert CVI measurements secured the validity, and a pilot test was conducted. This video selection process was conducted by a research team at the Inha University Department of Social Welfare. Table 4 lists the three experiments. Experiment 1 was performed using EMG and EOG, excluding EEG, to analyze peripheral nervous signals other than EEG. Experiment 2 considered only emotion recognition using the CNS, such as the EEG signal, and measured the EEG, EMG, and EOG signals together to remove the noise caused by movements of the head and facial muscles using the EMG and EOG signals; in this case, the problem of the EMG electrodes covering the face was identified. In experiment 3, the PPG and GSR signals, which come from the peripheral nervous system, were synchronized with the EEG signal, excluding the EMG signal, and the different nerve signals, such as CNS and PNS, were measured in a multimodal form.
The participants consisted of 62 healthy subjects: 28 males and 34 females aged from 21 to 29 years. Figure 2 shows the data collection procedure. The EEG signals were obtained with 12 active electrodes placed in accordance with the 10–20 International System using a BIOPAC MP150 instrument. Along with the EEG signal, a video of the participant's face was recorded at 1080p and 30 frames per second. Because the video runtime was short, an EOG channel was used to remove the artifacts caused by eye blinks. Records inconsistent with the participant's self-emotional questionnaire were excluded. The valid records in the MERTI-Apps dataset were 283, 312, and 236 in experiments 1, 2, and 3, respectively. Initially, in each experiment, 320 records were produced using 62 participants and five videos, and records with serious artifacts affecting the induced emotion were excluded. In experiment 3, which was designed to avoid discomfort from facial movement, 236 records were used for annotation labeling. Each record contained the corresponding valence and arousal label information. As with the MAHNOB dataset, trained commentators commented continuously on the participants' facial expressions, and the valence and arousal of the frontal facial expressions were evaluated using the program. The measurement program related to annotation labeling is shown in Figure 3.
3.3. Annotation Labeling
Annotation labeling was performed by observers to evaluate the emotion in the bio-signals of the MERTI-Apps dataset. Only one emotion was induced per video, and it was confirmed that the induced emotion and the participant's emotion matched. The participant's valence and arousal were evaluated using the recorded facial video. After viewing the video, the participant produced a self-assessment label indicating the emotions they felt. The participant's self-emotional questionnaire and data inconsistent with the evaluation were excluded from the experiment. Arousal is a measure of the level of excitement of an emotion. A smaller arousal value indicates calmer emotions, corresponding to boredom, comfort, and sleepiness; high-arousal emotions include excitement, anger, and fear. Valence represents the positive or negative degree of an emotion: fear has a very negative valence, boredom or excitement has a moderate valence, and happiness or comfort has a positive valence. Because it is difficult to subdivide emotions further using valence alone, this study also evaluated arousal, from high to low. Both valence and arousal were evaluated in the range of −100 to 100, and data collection occurred at 0.25 s intervals, in accordance with the bio-signal data.
The most important aspect of annotation labeling is that observers should make the most objective evaluation possible. Therefore, an in-house annotation-labeling program was used. The observers were five males and females aged 22–25 years. After training and coordinating the observers' opinions on the evaluation of the participants' emotions through a pilot experiment video, the labeling work of the observers began in earnest. In case of disagreement, two additional observers excluded the abnormal values, and the average valence or arousal of the remaining observers was used. The video and labeling data were matched by displaying the bio-signal data and marking the start and end points of the participant's face video. The observers watched the facial video and evaluated the valence and arousal through a scroll bar on the right side of the program at 0.25 s intervals. The five observers recorded the labeling data (see Figure 3d), which were used as the target data. To find data with the same emotion, this study checked whether the emotion-inducing video and the actual emotion felt by the participant were the same; only data whose labeling values matched were used in the experiment.
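The observer aggregation described above can be sketched as follows, assuming the five observers' traces are stored as a NumPy array of shape (observers, timesteps) sampled at 4 Hz; the outlier rule (dropping votes far from the per-timestep median before averaging) and all names are hypothetical, since the paper does not give the exact procedure.

```python
import numpy as np

def aggregate_observer_labels(labels, outlier_threshold=20.0):
    """Combine per-observer valence/arousal traces into one target trace.

    labels: array of shape (n_observers, n_timesteps), values in [-100, 100],
            sampled every 0.25 s (4 Hz), as in the MERTI-Apps labeling program.
    outlier_threshold: hypothetical cut-off; observers deviating from the
            per-timestep median by more than this value are excluded.
    """
    labels = np.asarray(labels, dtype=float)
    median = np.median(labels, axis=0)                    # robust centre per timestep
    keep = np.abs(labels - median) <= outlier_threshold   # mask of non-outlier votes
    # Average only the retained observers at each timestep.
    summed = np.where(keep, labels, 0.0).sum(axis=0)
    counts = keep.sum(axis=0)
    return summed / np.maximum(counts, 1)

# Example: five observers annotating 8 s of video (32 samples at 4 Hz).
rng = np.random.default_rng(0)
traces = rng.uniform(-100, 100, size=(5, 32))
target = aggregate_observer_labels(traces)
print(target.shape)  # (32,)
```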
4. Proposed Method
The features of the EEG signal were extracted, as shown in Table 5, to derive an effective group of features for emotion recognition through deep learning. The proposed deep learning model combines a GA with LSTM (GA–LSTM), as shown in Figure 4. The initialization step was randomized so that effective features could be passed on continuously through the GA.
4.1. Feature Extraction
The active features for emotion recognition were selected by extracting the EEG features from three domains (time, frequency, and time–frequency), as summarized in Table 5. Through the weighting of features, the proposed method can distinguish which channels and features of the EEG signals are valid for emotion recognition. This is why various feature sets were used in the present study, whereas the previous work [15] used only PSD features. Thirty-seven EEG features were extracted per channel, some of which were used in addition to the PSD in other studies [9,37]. A variety of factors, such as small movements of the participants, sweat, body temperature, and tension, act as noise in the EEG signal. Therefore, the notch filter of the BIOPAC MP150 equipment was used to eliminate very-low-frequency bands and frequency bands above 50 Hz. The eye-blink pattern was also removed via the EOG signal. A fast Fourier transform (FFT) was used in the frequency domain to divide the EEG signal into slow alpha, alpha, beta, and gamma waves. The features were extracted, converted to a one-dimensional vector, and used as the input data. The final feature dimension was calculated as (number of channels) × (number of features): for the MAHNOB dataset, 32 × 37 = 1184, and for MERTI-Apps, 12 × 37 = 444. In addition, brain lateralization features consisting of 1 × 5, 5 × 1, 3 × 3, and 5 × 5 electrode pairings were used. Because the data correspond to a continuous-time annotation, each feature was extracted from the preceding 2 s data stream.
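The windowing and feature-vector assembly described above can be sketched as follows; extract_channel_features is a hypothetical placeholder for the 37 per-channel features of Table 5, and only the window length (2 s), step (0.25 s), and the channels × features flattening follow the paper.

```python
import numpy as np

def sliding_windows(eeg, fs=256, win_s=2.0, step_s=0.25):
    """Yield 2 s windows of a (channels, samples) EEG array every 0.25 s,
    matching the 4 Hz continuous-time annotation used in this work."""
    win, step = int(win_s * fs), int(step_s * fs)
    for start in range(0, eeg.shape[1] - win + 1, step):
        yield eeg[:, start:start + win]

def extract_channel_features(window_1ch):
    """Placeholder for the 37 per-channel features of Table 5; only a few
    time-domain statistics are computed here for illustration."""
    return np.array([window_1ch.mean(), window_1ch.min(), window_1ch.max()])

def window_to_vector(window):
    """Flatten per-channel features into the one-dimensional LSTM input,
    i.e., (number of channels) x (number of features)."""
    return np.concatenate([extract_channel_features(ch) for ch in window])

eeg = np.random.randn(32, 256 * 10)          # 10 s of 32-channel EEG at 256 Hz
vectors = np.stack([window_to_vector(w) for w in sliding_windows(eeg)])
print(vectors.shape)                          # (n_windows, 32 * n_features)
```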
Time domain features. The characteristics of the time domain are represented by the change in the EEG signal over time. In the time domain, the mean, minimum, maximum, 1st difference, and normalized 1st difference were used to recognize changes in emotions over time [33].
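A minimal sketch of the five time-domain statistics is shown below; the paper does not define the 1st difference and normalized 1st difference exactly, so the common forms (mean absolute first difference, and the same divided by the signal's standard deviation) are assumed.

```python
import numpy as np

def time_domain_features(x):
    """x: one channel of a 2 s EEG window (1-D array)."""
    diff = np.abs(np.diff(x)).mean()                 # assumed: mean absolute 1st difference
    norm_diff = diff / (x.std() + 1e-12)             # assumed: normalized by standard deviation
    return np.array([x.mean(), x.max(), x.min(), diff, norm_diff])

print(time_domain_features(np.random.randn(512)))    # 5 features per channel
```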
Frequency domain features. In previous studies, components of the frequency domain were used because of the excellent temporal resolution of the EEG. Outputs appearing in different frequency bands are good indicators for detecting different emotional states. In this study, the PSD features were extracted from four bands: slow alpha (8–10 Hz), alpha (8–12.9 Hz), beta (13–29.9 Hz), and gamma (30–50 Hz). The mean, maximum, and integral values of the PSD were computed in each band.
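A sketch of the band-wise PSD features is given below. Welch's method is used as one FFT-based PSD estimate; the estimator, window length, and sampling rate are assumptions, while the band edges and the mean/max/integral statistics follow the text.

```python
import numpy as np
from scipy.signal import welch

BANDS = {"slow_alpha": (8, 10), "alpha": (8, 12.9),
         "beta": (13, 29.9), "gamma": (30, 50)}

def psd_band_features(x, fs=256):
    """Mean, max, and integral of the PSD in each band of one EEG channel."""
    f, pxx = welch(x, fs=fs, nperseg=min(len(x), fs))
    feats = []
    for lo, hi in BANDS.values():
        m = (f >= lo) & (f <= hi)
        feats += [pxx[m].mean(), pxx[m].max(), np.trapz(pxx[m], f[m])]
    return np.array(feats)                            # 4 bands x 3 statistics = 12 features

print(psd_band_features(np.random.randn(512)).shape)
```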
Time–frequency domain features. The time–frequency domain was divided into four frequency bands, and five characteristics were selected. A discrete wavelet transform (DWT) decomposes a signal into sub-bands over time; this feature has been used in the speech field [38] and in the field of emotion recognition [39]. The mean, maximum, and absolute values were used in each frequency range of the DWT, and the log and absolute log values of the power were also used.
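A sketch of the DWT-based features is shown below using PyWavelets; the wavelet family, decomposition level, and the exact grouping into four frequency ranges are assumptions, while the five statistics (mean, max, absolute, log, and absolute log of the power) follow Table 5.

```python
import numpy as np
import pywt  # PyWavelets

def dwt_features(x, wavelet="db4", level=4):
    """Time-frequency features from a DWT of one EEG channel.
    The wavelet family and level are assumptions made for illustration."""
    coeffs = pywt.wavedec(x, wavelet, level=level)    # [cA_L, cD_L, ..., cD_1]
    feats = []
    for c in coeffs[1:]:                              # detail sub-bands only
        power = c ** 2
        log_power = np.log(power.mean() + 1e-12)
        feats += [power.mean(), power.max(), np.abs(c).mean(),
                  log_power, np.abs(log_power)]
    return np.array(feats)

print(dwt_features(np.random.randn(512)).shape)       # level x 5 statistics
```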
Brain lateralization features. The power spectra of a pair of electrodes are compared, and the lateralization value of the two electrodes lies in the range of 0 to 1. A value close to 1 means that the two electrodes are strongly connected, and a value close to 0 means that the correlation between the two electrodes is small. In this study, the brain lateralization features were extracted in four ways, as depicted in Figure 5.
- 1 × 5: when a lateralization change from the left-brain center to the right brain occurs, measure the lateralization between one left-brain center electrode and five right-brain electrodes
- 5 × 1: when a lateralization change from the left brain to the right-brain center occurs, measure the lateralization between five left-brain electrodes and one right-brain center electrode
- 3 × 3, 5 × 5: when a lateralization change between the left brain and the right brain occurs, measure the lateralization between the same number of electrodes in the left and right brains
Figure 5 shows the 10–20 system, and the locations of the channels used are the same as in Figure 6. Figure 5 is simplified because it is difficult to see when all electrodes are displayed.
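Because the paper does not give the exact formula for the lateralization value, the sketch below assumes a normalized correlation between the averaged power spectra of a left-hemisphere and a right-hemisphere electrode group, mapped to the stated [0, 1] range; the channel indices are arbitrary examples.

```python
import numpy as np
from scipy.signal import welch

def band_power_vector(x, fs=256, fmin=8, fmax=50):
    """Power spectrum of one electrode restricted to the 8-50 Hz range."""
    f, pxx = welch(x, fs=fs, nperseg=min(len(x), fs))
    return pxx[(f >= fmin) & (f <= fmax)]

def lateralization(window, left_idx, right_idx, fs=256):
    """Deflection value in [0, 1] between a left and a right electrode group,
    e.g., 1 x 5 (C3 vs. F8, C4, T4, T6, O2). A normalized correlation of the
    averaged power spectra is assumed here."""
    left = np.mean([band_power_vector(window[i], fs) for i in left_idx], axis=0)
    right = np.mean([band_power_vector(window[i], fs) for i in right_idx], axis=0)
    r = np.corrcoef(left, right)[0, 1]                # correlation in [-1, 1]
    return (r + 1.0) / 2.0                            # map to [0, 1]

window = np.random.randn(12, 512)                     # 12-channel 2 s window (example)
print(lateralization(window, left_idx=[2], right_idx=[5, 3, 7, 9, 11]))
```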
4.2. Genetic Algorithm
The GA is a heuristic search algorithm that mimics biological evolution as a problem-solving strategy. Algorithm 1 lists the detailed pseudocode of the genetic algorithm. In the first stage, the GA was used to select an effective LSTM model suitable for the given data, as shown in Figure 4. Forty percent of the population was chosen randomly to produce parent objects. After learning, the top 20% of the parent objects were selected and moved to the next-generation model. In addition, child objects were produced using selection, mutation, and crossover. This process was repeated until the 10th-generation model was produced or the RMSE of the current model no longer improved. The probability of mutation was set to 10%, and the mutation process only updated the epoch, LSTM_cell, and drop_out coefficients of the models. The epoch, LSTM_cell, drop_out, activation, and optimizer parameters of the models were set randomly at initialization.
Algorithm 1. The pseudo-code of the genetic algorithm.

Genetic Algorithm
  Input data: EEG_data
  Output data: model_result, features_result;
  set parameters NoOfPopulations, NoOfGenerations, threshold;

  GA-Model_Selection:
    Generate model_populations randomly using Epochs, lstm_cells, Activation, Dropout, Optimizer; repeat this step until NoOfPopulations is reached;
    Learn model_populations using the input data in the LSTM, save only the top 20% of results for the next generation, and repeat this step until NoOfGenerations is reached;
    if (result_RMSE > threshold) {
      Crossover();   // create child objects by mixing parent objects
      Mutate();      // re-determine Epochs, lstm_cells, and Dropout with probability 10%
      repeat GA-Model_Selection;
    }
    else
      save model_result;

  GA-Feature_Selection:
    Generate feature_populations randomly using Input_data;
    Learn feature_populations using the input data and model_result, save only the top 20% of results for the next generation, and repeat this step until NoOfGenerations is reached;
    if (result_RMSE > threshold) {
      Crossover();   // create child objects by mixing parent objects
      Mutate();      // re-determine feature_list using input_data with probability 10%
    }
    else
      save features_result;
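The model-selection stage of Algorithm 1 can be illustrated with the runnable sketch below. It keeps the top 20% of candidate models, applies crossover, and mutates the epoch, LSTM-cell, and dropout parameters with a 10% probability, as described above; the hyperparameter ranges and the surrogate evaluate_rmse function (standing in for actual LSTM training and validation) are illustrative assumptions, not the paper's exact settings.

```python
import random

SEARCH_SPACE = {
    "epochs":     [20, 50, 100],
    "lstm_cells": [32, 64, 128, 256],
    "dropout":    [0.0, 0.02, 0.1, 0.2, 0.5],
    "activation": ["tanh", "relu"],
    "optimizer":  ["adam", "rmsprop", "sgd"],
}

def random_model():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def evaluate_rmse(model):
    """Surrogate for 'train the LSTM with these parameters and return the
    validation RMSE'; replace with real training in practice."""
    return random.uniform(0.01, 0.10)

def crossover(p1, p2):
    return {k: random.choice([p1[k], p2[k]]) for k in SEARCH_SPACE}

def mutate(model, rate=0.10):
    # Only epochs, lstm_cells, and dropout are re-drawn, as in Algorithm 1.
    for k in ("epochs", "lstm_cells", "dropout"):
        if random.random() < rate:
            model[k] = random.choice(SEARCH_SPACE[k])
    return model

def ga_model_selection(n_pop=20, n_gen=10, threshold=0.02):
    population = [random_model() for _ in range(n_pop)]
    best = None
    for _ in range(n_gen):
        scored = sorted(((evaluate_rmse(m), m) for m in population),
                        key=lambda t: t[0])
        best = scored[0]
        if best[0] <= threshold:                        # RMSE low enough: stop
            break
        parents = [m for _, m in scored[: max(2, n_pop // 5)]]   # keep top 20%
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(n_pop - len(parents))]
        population = parents + children
    return best

print(ga_model_selection())
```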
In the second stage, the dominant feature set was selected using the GA again with the predetermined LSTM model. This stage randomly selects 25–50% of the features in the whole record. Invalid features may confuse the training model if all features are used for LSTM training. As shown in Figure 6, approximately 20% of all the features in MAHNOB-HCI performed well. If all of the well-performing features are already in the set, it is difficult to add and evaluate new features; therefore, 25–50% of the features were used as the dominant features. To produce the next-generation features, the GA consisted of three main tasks: selection, crossover, and mutation. The feature set was an array of integers, where each integer represented a weight. Selection chose features with high weights from the dominant features to enter the next dominant feature set. Crossover swapped features remaining after selection from the unused features into the high-weight features. Mutation randomly selected dominant features and mutated them into new features to prevent cases where there were too few random selections.
After learning, the model selects the top 20% of parent objects and moves them to the next-generation feature group. When crossover selects features and produces child objects for the next generation, the crossover ratio of the fetched features was 8:2. The probability of mutation was set to 10%, which maintains genetic diversity by reintroducing features that were not selected. This process was repeated until the 10th-generation feature group was formed or the RMSE of the current model no longer improved. Choosing the right fitness function is important for the effectiveness and efficiency of a GA. The child gene values were produced from the parent gene values as
V_new = βp_1 + (1 − β)p_2, (1)
where β represents a random number from −0.25 to 1.25, p_1 and p_2 represent the parent gene values, and V_new represents the child gene value. When the two parental genes were similar, multiple similar traits were evaluated each time in repeated inheritance, so randomness was introduced when a new offspring gene was produced. According to Equation (1), each parent gene contributes to a new child gene, the characteristics are randomly scaled during crossover, and the child gene is determined by the GA.
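The second GA stage can be sketched as follows. The feature set is an integer-weight array, roughly 25–50% of the features are kept active, the blend crossover of Equation (1) is applied, and 10% of the genes are mutated; the concrete sizes and the interpretation of the 8:2 crossover ratio are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
N_FEATURES = 444                       # e.g., 12 channels x 37 features (MERTI-Apps)

def random_weights():
    """Integer weight per feature; roughly 25-50% of features are active."""
    w = rng.integers(0, 10, size=N_FEATURES)
    active = rng.random(N_FEATURES) < rng.uniform(0.25, 0.50)
    return np.where(active, w, 0)

def blend_crossover(p1, p2):
    """Equation (1): v_new = beta * p1 + (1 - beta) * p2, beta in [-0.25, 1.25]."""
    beta = rng.uniform(-0.25, 1.25, size=p1.shape)
    child = beta * p1 + (1.0 - beta) * p2
    # 8:2 ratio (assumed interpretation): 80% of genes take the blended value,
    # the remaining 20% are copied from the first parent.
    keep_parent = rng.random(p1.shape) < 0.2
    return np.where(keep_parent, p1, np.rint(child)).astype(int)

def mutate(weights, rate=0.10):
    """Re-draw a random 10% of the feature weights to preserve diversity."""
    mask = rng.random(weights.shape) < rate
    return np.where(mask, rng.integers(0, 10, size=weights.shape), weights)

parents = [random_weights() for _ in range(2)]
child = mutate(blend_crossover(parents[0], parents[1]))
print(int((child > 0).sum()), "active features in the child feature set")
```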
4.3. GA–LSTM
The feature extraction step extracts the three domain features as a one-dimensional feature vector from each EEG signal. Regression was then performed through the GA–LSTM. The LSTM-fully connected (FC) model applied to the GA–LSTM consists of three LSTM layers and two FC layers, as illustrated in Figure 6 and Figure 7. The input layer was configured by the GA outputs (for example, eight timestamps, 142 features, 0.02 drop_out, LSTM_cells, activation function, and optimizer). The output layer produced a value between −0.1 and 0.1 using the activation function of the FC layer, with one neuron in the last layer to predict one valence value. As LSTM input data, the EEG features obtained in the feature extraction step were input at 0.25 s intervals in 2 s windows. The model was the same as a general LSTM model but included a recurrent neural network capable of maintaining state; the state learned in a single batch was transmitted to the next batch. The LSTM was stacked in three layers, allowing deeper inference than a single layer, and a dropout layer was added between the LSTM layers to prevent overfitting.
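A minimal PyTorch sketch of the LSTM-FC backbone described above is given below: three stacked LSTM layers with dropout, followed by two fully connected layers and a single output neuron bounded to [−0.1, 0.1]. The hidden sizes and the scaled tanh used to bound the output are assumptions, because the actual epoch, cell, dropout, activation, and optimizer values are chosen by the GA.

```python
import torch
import torch.nn as nn

class LstmFc(nn.Module):
    def __init__(self, n_features=142, hidden=64, dropout=0.02):
        super().__init__()
        # Three stacked LSTM layers with dropout between them.
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                            num_layers=3, dropout=dropout, batch_first=True)
        self.fc1 = nn.Linear(hidden, 32)    # hidden sizes are assumptions
        self.fc2 = nn.Linear(32, 1)         # one neuron for one valence value

    def forward(self, x):
        # x: (batch, timestamps, features), e.g., 8 timestamps of GA-selected features.
        out, _ = self.lstm(x)
        h = torch.relu(self.fc1(out[:, -1, :]))        # last timestep
        return 0.1 * torch.tanh(self.fc2(h))           # bounded to [-0.1, 0.1]

model = LstmFc()
x = torch.randn(4, 8, 142)                             # batch of 4 sequences
print(model(x).shape)                                  # torch.Size([4, 1])
```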
Based on the one-dimensional feature vector, the initial feature group was selected randomly through the GA to be used as the input of the LSTM. One valence value was regressed from the one-dimensional feature vector extracted from each segment in the sequence. Before being processed by the LSTM, the feature groups were adjusted by the GA. The hidden state vector output from the LSTM allowed new feature groups to be recruited before the next LSTM. The final output of the GA–LSTM was the valence. Both valence and arousal experiments were performed using the MERTI-Apps dataset.
5. Experimental Results
This study used 239 records from the MAHNOB dataset validated in the previous study and 236 records from experiment 3 of the MERTI-Apps dataset. The sequence of each record was annotated with the valence value at 4 Hz. When the emotional state was recognized, it was estimated using the data of the previous 2 s, a scenario suited to an actual HCI situation. The learning setup of the proposed algorithm was as follows. To compare the recognition performance using the PSD features alone and with all additional features, the LSTM-FC model re-implementing the model of the previous study was examined first. The performance of each of the three domain features and the BL feature of the GA–LSTM model was then confirmed. The maximum number of epochs was set to 100. The feature extraction step was implemented using MATLAB, and the deep learning algorithm was implemented using PyTorch [40]. The learning environment was an Intel i5 7th-generation CPU with an RTX 2080 GPU.
5.1. MAHNOB-HCI Dataset
For the MAHNOB-HCI dataset, all data except the continuous-time annotation information are publicly disclosed. The analysis was performed after receiving the continuous-time annotations from the authors, to ensure that the experiment was conducted in the same environment and on the same dataset as Soleymani et al. [15]. Two hundred and thirty-nine records were used. The test accuracy was measured using 10-fold validation for comparison with Soleymani et al. For EEG, 2 s of 256 Hz sampling across 32 channels was used as the unit data. For the EEG features, information extracted from 2 s intervals of feature data was used as the LSTM input in the form of a one-dimensional feature vector.
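The 10-fold evaluation protocol can be sketched as follows, with a simple ridge regressor standing in for the trained GA–LSTM so that the example stays self-contained; the data are random placeholders with the MAHNOB feature dimensionality.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

# Dummy data standing in for the 1184-dimensional MAHNOB feature vectors
# (32 channels x 37 features) and their 4 Hz valence annotations.
rng = np.random.default_rng(0)
X = rng.standard_normal((239, 1184))
y = rng.uniform(-0.1, 0.1, size=239)

rmses = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    reg = Ridge().fit(X[train_idx], y[train_idx])      # placeholder for GA-LSTM training
    pred = reg.predict(X[test_idx])
    rmses.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))

print(f"10-fold RMSE: {np.mean(rmses):.4f} +/- {np.std(rmses):.4f}")
```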
Table 6 summarizes the results of the valence experiments on the MAHNOB dataset using only the PSD feature, as in the previous study, and using all extracted features proposed in this study. "N/A" denotes not-applicable cells. The GA–LSTM model took approximately six times longer to evaluate the final features through the GA than the existing learning model [15]. Using only the PSD feature, the LSTM-FC model shows a 3% performance improvement over the previous study [15]. A root-mean-square error (RMSE) of 0.0156 with the GA–LSTM model yields a 24% improvement over [15] when the brain lateralization (BL) feature is added. This suggests that feature selection using the GA and a large feature set is useful for emotion recognition. Figure 8 gives an example of the test result of annotation labeling using the MAHNOB-HCI dataset. The solid line indicates the change in target valence over time, and the valence predicted by GA–LSTM is indicated by the dotted line. The thresholds for annotation labeling ranged from −0.1 to 0.1. Figure 8a shows positive valence changes over time, and Figure 8b shows negative valence changes over time. The target and GA–LSTM results tend to be similar. Notably, after a positive or negative valence expression, the GA–LSTM valence tends to return to the neutral value faster than the target valence. The reason is the imbalance of the annotation data: because the number of positive annotations is greater than the number of negative annotations, positive annotations affect the model more, and negative valence or arousal may not be regressed accurately. Furthermore, positive emotions are generally well reflected in facial expressions, whereas negative emotions are difficult to label from facial expressions because of severe individual differences. Hence, the labeling method based on facial expressions is vulnerable to negative emotional expression.
5.2. MERTI-Apps Dataset
The performance of the proposed GA–LSTM model was also verified using the MERTI-Apps dataset obtained from Asian participants. As with the MAHNOB-HCI dataset, 60% and 40% of the records, excluding the test records, were used for training and validation, respectively. The test performance was measured using 10-fold validation. The EEG signals were collected at 1024 Hz sampling from 12-channel electrodes and used as the LSTM input in the form of a one-dimensional feature vector computed from 2 s intervals.
Table 6 lists the performance using the MERTI-Apps dataset. The RMSE of 0.0579 for GA–LSTM when the three-domain set and the BL feature were added represents a 33% improvement over the RMSE of 0.0768 for LSTM-FC in the valence domain. With the MERTI-Apps dataset, the GA applying model selection and feature selection over various features was useful for emotion recognition. Figure 9 presents the estimated values of the annotation labeling (target) and the GA–LSTM output over time for positive and negative valence using the MERTI-Apps dataset. Figure 9a suggests that the estimated positive valence at the time of positive expression accurately tracks the target value. Figure 9b shows that the estimated negative valence differs from the target value at the time of negative expression; on the other hand, the direction and slope of the valence are similar. Table 6 also lists the results of the arousal experiments using the MERTI-Apps dataset. The regression performance using the arousal data (RMSE = 0.0287) was better than that using the valence data (RMSE = 0.0579), even though human expression was absent. Hence, continuous emotion regression using arousal data is robust for inner emotion recognition.
5.3. DEAP Dataset
In the field of emotion recognition using biosignals, there are more state-of-the-art studies on classification models than on regression models. Therefore, by converting the proposed regression model into a classification model, as shown in Table 6, the classification performance of the proposed GA-LSTM on the DEAP dataset was measured and compared with existing state-of-the-art studies [16,17,18,23,27,28]. The last column shows the classification accuracy in terms of valence and arousal. Alhagry et al. [16] and Salama et al. [17] examined the classification performance using the raw EEG signals in the DEAP dataset. Salama et al. reported the highest of the classification accuracies in Table 6; the accuracies with respect to valence and arousal were 87.44% and 88.49%, respectively, when emotion was divided into fear, sad, happy, and satisfied. The accuracies reported by Wang et al. [18] with respect to arousal and valence were 74.41% and 73.64%, respectively, when channel selection was applied. They conducted discrete emotion recognition experiments by classifying emotions into high valence, neutral, and low valence.
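The paper does not detail how the regressed valence is mapped to discrete classes for this comparison; the sketch below assumes a simple thresholding scheme with a small neutral band, purely for illustration.

```python
import numpy as np

def valence_to_class(v, neutral_band=0.02):
    """Convert a regressed valence value into a discrete label.
    The neutral band width and the mapping itself are assumptions; the paper
    only states that the regression model was converted to a classifier."""
    if v > neutral_band:
        return "positive"
    if v < -neutral_band:
        return "negative"
    return "neutral"

preds = np.array([-0.08, 0.01, 0.06])
print([valence_to_class(v) for v in preds])   # ['negative', 'neutral', 'positive']
```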
Owing to the high regression power of continuous emotion recognition, the accuracies of emotion classification using the proposed GA-LSTM were 91.3% and 94.8% with respect to valence and arousal, respectively. The proposed GA-LSTM showed good performance in discrete emotion recognition even when compared to state-of-the-art studies. Therefore, the GA-LSTM model is effective in both regression and classification.
6. Discussion and Conclusions
This paper proposed the MERTI-Apps dataset for Asian emotion recognition and developed a method of searching for the effective models and feature groups required for emotion classification with EEG signals. To collect data for dataset creation, short and robust stimulation videos were prepared through a content validity inspection during video collection. Training was conducted to enable an objective evaluation through an in-house annotation-labeling program. In addition, valence and arousal were evaluated according to complex human emotions. To minimize inconvenience to the users, the data were collected in three experiments to prevent excessive sensor attachment. The models and feature groups that were effective and ineffective for emotion classification were determined by identifying the useful models, features, and channels for emotion recognition. Figure 10 shows the feature and channel weights according to the GA–LSTM learning results when the MAHNOB dataset was applied; a darker color means that a feature is used more commonly in the GA-LSTM and can be regarded as a valid feature. In the MAHNOB-HCI dataset, the experimental result showed that the classification accuracy of GA-LSTM was 96.2%. Therefore, the effect of the GA increases with more data.
The regression performance of GA-LSTM was improved by 24% compared to Soleymani et al. [15], owing to effective model selection and active feature group selection using a GA. In the MERTI-Apps dataset based on Asian participants, similar trends were found in the regression performance and classification accuracy for emotion recognition. Although similar trends were obtained, the experiment showed that the RMSE in the MERTI-Apps dataset was 0.0579, whereas the RMSE in the MAHNOB dataset was 0.0156 in the valence domain. The MERTI-Apps dataset involved multi-step verification of annotation accuracy. For the MAHNOB-HCI dataset, however, the model can be assumed to have performed more rigorously because more experiments were conducted with more cases. In addition, annotating the regression value is more difficult than annotating the binary classification of positive and negative because the assessment of emotion is subjective. To compensate for this discrepancy, additional research on annotation and quantitative measurement methods that exclude subjective factors will be needed.
Based on Figure 6, valid channels were searched for and are expressed as shown in Figure 11. In Figure 6a, the MAHNOB dataset results indicate that the electrodes of the right brain, parietal lobes, and occipital lobe were helpful for emotion recognition. Similarly, in Figure 6b, the experimental results of MERTI-Apps confirm that the electrodes at similar positions are valid. In other words, there is no significant difference between the MAHNOB-HCI dataset based on Westerners and the MERTI-Apps dataset based on Asians. In the experiment using the MERTI-Apps dataset, the arousal performance was better than the valence performance when human expression was absent. Moreover, additional bio-signals, such as EMG, EOG, and PPG, can compensate for the characteristics not captured by EEG signals.
The RMSE value of the proposed GA–LSTM showed a 33% performance improvement over LSTM-FC owing to the weight evolution effects of the GA for selecting models, features, and channels. On the other hand, learning was slow because the emotion classification was conducted while exploring valid models, features, and channels through GA–LSTM. Therefore, it may be difficult to apply directly to real-time emotion recognition.
The experimental results using the MERTI-Apps dataset still showed a higher RMSE than MAHNOB-HCI. The number of data used was the same, but the MERTI-Apps dataset may contain more low-quality data than MAHNOB-HCI because it used fewer videos. Furthermore, EEG is a very sensitive and difficult signal to analyze, and additional bio-signals, such as PPG, GSR, EMG, and EOG, are needed to compensate for this. In experiment 1 of the MERTI-Apps dataset, PNS signals and their valence and arousal information were collected and studied to determine how to use them efficiently. Methods using observers, such as the participants' evaluation of emotion annotation and annotation labeling, were used because it is unclear what emotions the signals represent. Moreover, the labeling method based on facial expressions is vulnerable to negative emotional expression, as shown in Figure 8 and Figure 9. Therefore, it is important to consider annotation methods that can classify emotions more accurately. In addition, the fusion of biometric signals with voice and image data will improve emotion recognition. Figure 11 shows the training loss and validation loss of the GA-LSTM model on the MERTI-Apps, MAHNOB, and DEAP datasets. The stability of the model improved considerably owing to the evolutionary effect of the genetic algorithm. The present study used continuous-time annotation for labeling the MAHNOB-HCI and MERTI-Apps databases but only self-assessment for labeling the DEAP database. Because it is difficult to apply the labeling method using only self-assessment to the training of the proposed model, the loss rate cannot be reduced as quickly in the case of the DEAP database.
Database | No. Part. | Induced or Natural | Audio | Visual | Peripheral Physio. | EEG | Annotation Labeling |
---|---|---|---|---|---|---|---|
MIT [31] | 17 | Natural | No | No | Yes | No | No |
VAM [32] | 19 | Natural | Yes | Yes | No | No | No |
SEMAINE [33] | 20 | Induced | Yes | Yes | Yes | No | No |
DEAP [11] | 32 | Induced | No | Yes | Yes | Yes | No |
MAHNOB-HCI [12] | 28 | Induced | Yes | Yes (for 22) | Yes | Yes | Valence |
MERTI-Apps | 62 | Induced | Yes | Yes | Yes | Yes | Valence, Arousal |
Participants and Modalities | |
---|---|
Nr. of Participants | 27, 11 male and 16 female |
Recorded signals | 32-channel EEG (256 Hz), Peripheral physiological signals (256 Hz), Face and body video using 6 cameras (60 f/s), Eye gaze (60 Hz) and Audio (44.1 kHz) |
Self-report | Emotional keyword, arousal, valence, dominance, predictability |
Rating values | discrete scale of 1–9 |
Nr. of videos | 20 |
Selection method | Subset of online annotated videos |
Participants and Modalities | |
---|---|
Nr. of Participants | 32, 16 male and 16 female |
Recorded signals | 32-channel EEG (512 Hz), 13-channel peripheral signals: 4 EOG, 4 EMG, 2 GSR, BVP, temperature, and breathing |
Self-report | arousal, valence, like/dislike, dominance, familiarity |
Nr. of videos | 40 |
Video length | 1 min segments extracted through highlight detection |
Participants and Modalities | | |
---|---|---|
Nr. of Participants | 62, 28 male and 34 female | |
Recorded signals | 12-channel EEG (1 kHz), 2-channel EMG (1 kHz), 1-channel EOG (1 kHz), PPG (1 kHz), GSR (1 kHz) | |
Selection method | Content Validity Index (CVI) score | |
Self-report | arousal, valence | |
Rating values | 9 kinds of emotion | |
Experiment 1 (Emotion response to videos) | Nr. of videos | 5 (Induced: sad 1, happy 1, angry 2, scared 1) |
 | Signals used | EMG, EOG, PPG, GSR |
Experiment 2 (Emotion response to videos) | Nr. of videos | 5 (Induced: sad 1, happy 1, angry 1, scared 1, neutral 1) |
 | Signals used | EEG, EMG, EOG |
Experiment 3 (Emotion response to videos & tagging) | Nr. of videos | 5 (Induced: sad 1, happy 2, angry 1, scared 1) |
 | Signals used | EEG, EOG, PPG, GSR |
Domain | Features |
---|---|
Time domain | mean, max, min, 1st difference, normalized 1st difference |
Frequency domain | Power spectral density (PSD) of Slow alpha, Alpha, Beta, Gamma: mean, max, integral |
Time–frequency domain | Power of Discrete Wavelet Transform (DWT) of Slow alpha, Alpha, Beta, Gamma: mean, max, absolute; Log of power of DWT; Abs(Log) of power of DWT |
Brain lateralization feature | 1 × 5 (C3: F8 C4 T4 T6 O2); 5 × 1 (F7 C3 T3 T5 O1: C4); 3 × 3 (F7 T3 O1: F8 T4 O2); 5 × 5 (F7 C3 T3 T5 O1: F8 C4 T4 T6 O2) |
Authors | Dataset | Features | Feature Selection | Model Selection | Model | Emotion | RMSE | Accuracy |
---|---|---|---|---|---|---|---|---|
Alhagry et al. [16] | DEAP | raw EEG signals | N/A | N/A | LSTM-RNN | Valence, Arousal | N/A | 85.45%/85.65% |
Salama et al. [17] | DEAP | raw EEG signals | N/A | N/A | 3D-CNN | Valence, Arousal | N/A | 87.44%/88.49% |
Wang et al. [18] | DEAP | Spectrogram, Normalized mutual information | Channel selection | N/A | SVM | Valence, Arousal | N/A | 74.41%/73.64% |
Becker et al. [23] | HR-EEG recordings | Band power, Connectivity, Spectral Crest Factor | Channel/Brain region | N/A | SVM | Valence | N/A | 75% |
Halim et al. [27] | Drivers' EEG dataset | Time and frequency domain set | PCA, MI, RSFS | N/A | Ensemble of SVM, RF, and NN | Valence | N/A | 88.75% |
Krishna et al. [28] | Dataset of mentally impaired subjects | Mel-frequency Cepstrum Coefficients | N/A | N/A | Generalized mixture model | Valence | N/A | 89% |
Soleymani et al. [15] | MAHNOB-HCI | PSD | N/A | N/A | LSTM | Valence | 0.0530 ± 0.0290 | N/A |
Our work | MAHNOB-HCI | PSD | N/A | N/A | LSTM-FC | Valence | 0.0515 ± 0.0150 | N/A |
 | MAHNOB-HCI | 3 domain set | Genetic Algorithm | N/A | GA-LSTM | Valence | 0.0485 ± 0.0130 | N/A |
 | MAHNOB-HCI | 3 domain set, BL | Genetic Algorithm | Genetic Algorithm | GA-LSTM | Valence | 0.0156 ± 0.0110 | 96.2% |
 | MERTI-Apps | PSD | N/A | N/A | LSTM-FC | Valence | 0.0768 ± 0.0150 | N/A |
 | MERTI-Apps | 3 domain set | Genetic Algorithm | N/A | GA-LSTM | Valence, Arousal | 0.0679 ± 0.0150/0.0752 ± 0.0350 | N/A |
 | MERTI-Apps | 3 domain set, BL | Genetic Algorithm | Genetic Algorithm | GA-LSTM | Valence, Arousal | 0.0579 ± 0.019/0.0287 ± 0.0151 | 65.7%/88.3% |
 | DEAP | 3 domain set, BL | Genetic Algorithm | Genetic Algorithm | GA-LSTM | Valence, Arousal | 0.0290 ± 0.012/0.0249 ± 0.027 | 91.3%/94.8% |
Author Contributions
Conceptualization, J.-H.M., D.-H.K. (Dong-Hyun Kang) and D.-H.K. (Deok-Hwan Kim); methodology, J.-H.M.; software, J.-H.M. and D.-H.K. (Dong-Hyun Kang); validation, J.-H.M., D.-H.K. (Dong-Hyun Kang) and D.-H.K. (Deok-Hwan Kim); formal analysis, D.-H.K. (Deok-Hwan Kim); investigation, J.-H.M. and D.-H.K. (Dong-Hyun Kang); resources, J.-H.M.; data curation, J.-H.M. and D.-H.K. (Dong-Hyun Kang); writing-original draft preparation, J.-H.M.; writing-review and editing, D.-H.K. (Deok-Hwan Kim); visualization, J.-H.M.; supervision, D.-H.K. (Deok-Hwan Kim); project administration, D.-H.K. (Deok-Hwan Kim). All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Acknowledgments
This work was supported by the Industrial Technology Innovation Program funded by the Ministry of Trade, Industry, and Energy (MI, Korea) (10073154, Development of human-friendly human-robot interaction technologies using human internal emotional states) and in part by Inha University Research Grant.
Conflicts of Interest
The authors declare no conflict of interest.
Abstract
Emotional awareness is vital for advanced interactions between humans and computer systems. This paper introduces a new multimodal dataset called MERTI-Apps based on Asian physiological signals and proposes a genetic algorithm (GA)—long short-term memory (LSTM) deep learning model to derive the active feature groups for emotion recognition. This study developed an annotation labeling program for observers to tag the emotions of subjects by their arousal and valence during dataset creation. In the learning phase, a GA was used to select effective LSTM model parameters and determine the active feature group from 37 features and 25 brain lateralization features extracted from the electroencephalogram (EEG) time, frequency, and time–frequency domains. The proposed model achieved a root-mean-square error (RMSE) of 0.0156 in terms of the valence regression performance in the MAHNOB-HCI dataset, and RMSE performances of 0.0579 and 0.0287 in terms of valence and arousal regression performance, and 65.7% and 88.3% in terms of valence and arousal accuracy in the in-house MERTI-Apps dataset, which uses Asian-population-specific 12-channel EEG data and adds an additional brain lateralization (BL) feature. The results revealed 91.3% and 94.8% accuracy in the valence and arousal domain in the DEAP dataset owing to the effective model selection of a GA.