Music-driven dance generation is a novel and challenging motion generation task: given a piece of music and a short seed motion, the goal is to generate natural dance movements for the subsequent music. Transformer-based methods struggle to capture the nonlinear relationship between music and movement and the temporal structure of motion sequences, which can lead to joint deformation, character drift, floating, and dance movements that are inconsistent with the music. In this paper, we propose a quaternion-enhanced attention network (QEAN) for visual dance synthesis from a quaternion perspective, which consists of a spin position embedding (SPE) module and a quaternion rotary attention (QRA) module. First, SPE embeds position information into self-attention in a rotational manner, leading to better learning of the features of movement and audio sequences and an improved understanding of the connection between music and dance. Second, QRA represents and fuses 3D motion features and audio features as series of quaternions, enabling the model to better learn the temporal coordination of music and dance under the complex, cyclical temporal conditions of dance generation. Finally, we conducted experiments on the AIST++ dataset, and the results show that our approach achieves better and more robust performance in generating accurate, high-quality dance movements. Our source code and dataset are available from
Introduction
Dancing is a universal language across all cultures [1, 2] and is used by many as a powerful means of self-expression on online media platforms, becoming a dynamic tool for disseminating information on the Internet. Although dance is an art form, it requires professional practice and training to give dancers a rich expressive voice [3]. Therefore, from a computational point of view, music-conditioned 3D dance generation [4, 5, 6–7] has become a critical task that promises to open up a variety of practical applications. However, creating satisfying dance sequences that harmonize with specific music and body structures faces challenges because of our lack of understanding of the timing of human movements and the connection between music and dance. Overcoming these challenges is essential to achieve fluid movements with a high degree of kinematic complexity while ensuring consistency with the complex nonlinear relationships of the accompanying music.
Fig. 1 [Images not available. See PDF.]
Motivation of our method. The effectiveness of our method is compared with that of other approaches in generating dance movements from seed motions. In the top row labelled "other methods", two sets of images showcase the transformation of seed movements into unnatural final poses characterized by joint deformation and character drift. Conversely, in the bottom row labelled "our method", we demonstrate how the application of quaternion parameterization (P), spin position embedding (S), and quaternion attention (Q) yields natural-looking final poses. Each prediction produced by our method successfully learns the correlations between dance and music
Later, with the continuous development of deep learning, many deep learning methods [6, 7, 8, 9, 10–11] began to be applied to dance generation. First, RNN-based methods [8, 9] were used to simulate human dances, but they face the challenges of static poses and error accumulation, especially when the input data vary. Subsequently, some researchers used variational auto-encoders (VAEs) and generative adversarial networks (GANs) to model 2D dance movements [12], and LSTM auto-encoders were then used to model 3D dance movements directly from musical features [13]. Although these approaches alleviate the error accumulation of RNN methods, they suffer from instability and are prone to regressing to non-standard poses.
In recent years, transformers [12, 13, 14–15] have been widely adopted in natural language processing and visual processing, and some studies [16, 17] have made great progress in generating high-quality dance movements from a given piece of music. However, because transformers model the temporal dependence of sequences inadequately when dealing with time-series data, the generated dances suffer from problems such as drifting and foot sliding (as shown in Fig. 1). Moreover, given the nonlinear relationship between music and dance, existing transformer approaches do not fully model this relationship.
Quaternions are widely used as a mathematical tool for rotational expression and gesture control [18]. In view of this, we believe that the introduction of quaternions in the field of dance generation may be a promising approach. Compared to traditional Euler angles, quaternions are more effective in avoiding the “Gimbal Lock” problem and improving the stability of gesture representations. We expect to use quaternions to more accurately adapt dance movements to the rhythm and emotion of the music. By combining musical features with the correlation of quaternions, we can more accurately capture the complex relationship between music and dance.
In this paper, to address these challenges, we propose a quaternion-enhanced attention network for multi-modal dance synthesis (shown in Fig. 2). The network mainly consists of a spin position embedding (SPE) module and a quaternion rotary attention (QRA) module. The SPE module is used in the transformer structure of the network and embeds relative position information into self-attention. Audio and motion features are extracted by their respective transformers, and the SPE module combines the advantages of relative and absolute position encoding to maximize the model's representation of sequence features. The extracted audio and motion features are expanded to four dimensions by quaternion parameterization and then fused by concatenation in the quaternion rotary attention module. Compared to the work of Li et al. [16], the proposed SPE makes better use of positional information, and the QRA module better learns the relationship between audio and movement, improving the quality and robustness of the generated dances.
The contribution of the proposed network can be summarized as follows:
In this paper, we introduce a quaternion-enhanced attention network (QEAN) for generating multi-modal dances. This addresses challenges seen in current methods, like awkward joint movements and character inconsistency. QEAN uses quaternion operations to better capture the complex relationship between music and dance, improving the modelling of temporal dependencies.
We introduce the spin position embedding (SPE) module, which computes query and key vectors for features, applies rotational operations, and embeds results into self-attention. SPE addresses limitations of traditional position encoding by introducing relative position encoding based on rotations, enhancing modelling for variable-length sequences while solving length consistency and overfitting issues. Additionally, relative position information improves modelling of intrinsic feature associations, significantly enhancing the model’s representation and utilization of temporal order in human motion.
We introduce the quaternion perspective and propose the quaternion rotary attention (QRA) module. The QRA module maps audio and motion features to the quaternion space and explores the intrinsic correlation between the two using Hamiltonian multiplication, which enables the model to better learn the temporal coordination between music and dance, and generate smooth and natural dances coordinated with the music tempo.
Experimental results on the AIST++ dataset demonstrate that our proposed network is capable of effectively learning the connection between audio and movement, leading to the generation of higher quality dance movements. It outperforms other current state-of-the-art methods in terms of dance quality.
Related work
3D Human Motion Synthesis The research on generating realistic and controllable 3D human motion sequences, as discussed in [8, 19, 20–21], has seen significant advancements in recent years. Initial efforts utilized statistical models like kernel-based probability distributions [22] to synthesize motion, but these methods tended to oversimplify motion details. A subsequent breakthrough came with the introduction of the motion graph approach [23], which addressed this limitation by adopting a nonparametric method. This technique involved constructing a directed graph using motion capture datasets, where each node represented a pose, and edges denoted transitions between poses. Motion generation was achieved through random walks on this graph. However, a notable challenge in motion graphs was the generation of plausible transitions, and certain methods sought to overcome this by introducing parameterizations for transitions [24]. As deep learning gained prominence, several approaches explored the use of neural networks trained on extensive motion capture datasets to generate 3D motion. Various network architectures, including CNNs [20, 25], GANs [26], VAE [27], RNNs [6, 17], and transformers [4, 17], have been investigated. While auto-regressive models like RNNs and pure transformers [28] theoretically have the capacity to generate infinite motion, practical challenges such as mean regression arise. This phenomenon leads to motion “freezing” or drifting into unnatural movements after several iterations. To address this, some studies [28, 29] propose periodic usage of the network’s output as input during the training process. Additionally, phase function neural networks, and their variants have been introduced [30, 31] to tackle the mean regression issue by conditioning network weights on the phase. However, their scalability in representing diverse movements is limited.
Music-Driven Dance Generation In recent years, data-driven deep learning has become the dominant technique for dance generation. Bai et al. [32] used graph convolutional networks to learn from a variety of dance datasets and generate new dance sequences that are smooth and continuous; this approach significantly improves the quality and continuity of the generated dances. Holden et al. [33], Qiu et al. [34] and Zhu et al. [35] built on deep learning by strengthening long-term dependency modelling to generate more coherent long dance sequences; common approaches include integrating skeleton information and employing attention mechanisms. Li et al. [16] built on this previous work by proposing the full-attention cross-modal transformer (FACT), which can generate non-freezing, high-quality 3D motion sequences conditioned on music by learning audio–motion correspondences.
Quaternion Networks In various domains of deep learning such as few-shot segmentation [36], human motion synthesis [18], and multi-sensor signal processing, quaternion neural networks have made significant strides. Similar to the task discussed in this paper, quaternion representations are employed in neural network architectures as a parameterization for rotations. Quaternion networks, exemplified by QuaterNet [18], utilize quaternions to represent joint rotations in both recurrent neural networks (RNNs) and convolutional neural networks (CNNs). This approach addresses the discontinuity issues associated with Euler angles, achieving outstanding performance in long-term prediction tasks. In the context of our work, focused on music-driven dance generation, we propose constructing a learning process for the correlation between music and dance. This is essential as it requires consideration of the nonlinear characteristics of both motion and music. Therefore, our method involves exploring the relationship between audio and motion features using quaternions. By leveraging quaternions, we aim to enhance the correlation between audio and motion, facilitating the generation of high-quality dance sequences.
Fig. 2 [Images not available. See PDF.]
Overview of our method. a Describes the basic process, which contains three modules i, ii and iii. When the inputs are a motion sequence with a length of 120 frames and an audio sequence with a length of 240 frames, features are extracted by the motion transformer and the audio transformer, respectively. The extracted features are processed by a quaternion parameterization operation, which changes their dimension to 4. Through the spin position embedding (SPE) module, the corresponding four-dimensional features are rotated so that position information is embedded into the self-attention in a rotational manner. The information processed by the SPE is used to explore the coordination between the music and the dance through the quaternion attention transformer, and finally the corresponding dance is generated. i, ii and iii describe the quaternion parameterization, the spin position embedding and the basic structure of the quaternion attention transformer, respectively. Specific details are given in the Methods section
Methods
Overview of QEAN
In this paper, we propose a quaternion-enhanced attention network (QEAN) for generating high-quality dances under musical conditions, as illustrated in Fig. 2.
We are given random seed motions Y of length 120 frames and audio features Z of length 240 frames, where Y can be denoted as Y = (y_1, y_2, ..., y_120) and Z can be denoted as Z = (z_1, z_2, ..., z_240). Our objective is to generate the sequence of future motion Y' = (y_121, ..., y_T), where T > 120. QEAN first utilizes the two input transformers, the motion transformer and the audio transformer, to encode features and generate motion and audio embeddings, respectively. Next, these two embedded features are combined and subjected to a quaternion parameterization operation (see Sect. 3.2 for details). This operation maps the features to four dimensions and embeds the information into the self-attention in a rotational manner using spin position embedding (see Sect. 3.3 for details). Finally, the fused features are learned by a quaternion rotary attention transformer (see Sect. 3.4 for details) to generate the corresponding dance movements.
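To make this data flow concrete, the following is a minimal structural sketch in PyTorch. Only the quantities stated in the paper (120 motion frames with 219-D features, 240 audio frames with 35-D features, an 800-D hidden size and 16 heads) are taken from the text; the layer counts, the predicted-frame count and the use of stock encoder layers in place of the SPE and QRA components are placeholder assumptions, with the quaternion-specific pieces sketched in the following subsections.

```python
import torch
import torch.nn as nn

class QEANSketch(nn.Module):
    """Structural sketch only: SPE and QRA are replaced by stock encoder layers
    here and are sketched separately in the following subsections."""
    def __init__(self, d_model=800, n_heads=16, n_layers=2, future_frames=20):
        super().__init__()
        self.motion_embed = nn.Linear(219, d_model)   # 219-D motion features -> hidden
        self.audio_embed = nn.Linear(35, d_model)     # 35-D audio features -> hidden
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.motion_tf = nn.TransformerEncoder(layer, n_layers)  # stands in for the motion transformer
        self.audio_tf = nn.TransformerEncoder(layer, n_layers)   # stands in for the audio transformer
        self.cross_tf = nn.TransformerEncoder(layer, n_layers)   # stands in for the QRA transformer
        self.head = nn.Linear(d_model, 219)                      # predict future motion features
        self.future_frames = future_frames                       # placeholder output length

    def forward(self, motion, audio):
        # motion: (B, 120, 219) seed frames, audio: (B, 240, 35) music frames
        h_m = self.motion_tf(self.motion_embed(motion))
        h_a = self.audio_tf(self.audio_embed(audio))
        fused = torch.cat([h_m, h_a], dim=1)          # concatenate along time, as in Fig. 2
        out = self.cross_tf(fused)
        return self.head(out[:, : self.future_frames])
```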
Quaternion algorithms and quaternion parameterization
We begin by elucidating the fundamental concepts of quaternions crucial for understanding the context of this paper. Quaternions, classified as hyper-complex numbers of rank 4, stand out as a direct and non-commutative extension of complex-valued numbers. In our proposed methodologies, the intricate interplay between Hamilton products and quaternion algebra emerges as the linchpin, forming the cornerstone of our innovative approaches. This exploration of quaternion principles lays the groundwork for the subsequent discussions and applications detailed in this study.
A quaternion Q in the quaternion domain D, Q ∈ D, can be represented as:

Q = e + fi + gj + hk    (1)

where e, f, g and h are real numbers, and i, j and k are the quaternion unit bases. In a quaternion, e is the real part, and fi + gj + hk, with i^2 = j^2 = k^2 = ijk = -1, is the imaginary part. A pure quaternion is a quaternion whose real part is 0, resulting in the vector Q = fi + gj + hk. Operations on quaternions are defined as follows.
The addition of two quaternions is defined as:

Q + P = (Q_e + P_e) + (Q_f + P_f)i + (Q_g + P_g)j + (Q_h + P_h)k    (2)

where Q and P with subscripts denote the real and imaginary components of the quaternions Q and P. The multiplication by a scalar λ can be defined as:
λQ = λe + λfi + λgj + λhk    (3)
The conjugate of Q can be defined as:

Q* = e - fi - gj - hk    (4)
The multiplication (Hamilton product) of quaternions Q and R can be defined as follows:

Q ⊗ R = (e_Q e_R - f_Q f_R - g_Q g_R - h_Q h_R)
      + (e_Q f_R + f_Q e_R + g_Q h_R - h_Q g_R) i
      + (e_Q g_R - f_Q h_R + g_Q e_R + h_Q f_R) j
      + (e_Q h_R + f_Q g_R - g_Q f_R + h_Q e_R) k    (5)
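For reference, here is a minimal sketch of the Hamilton product of Eq. (5), assuming quaternions are stored as tensors whose last dimension holds the components [e, f, g, h]; this is an illustration rather than the paper's implementation.

```python
import torch

def hamilton_product(q, r):
    """Hamilton product of quaternions stored as (..., 4) tensors [e, f, g, h]."""
    e1, f1, g1, h1 = q.unbind(-1)
    e2, f2, g2, h2 = r.unbind(-1)
    return torch.stack([
        e1 * e2 - f1 * f2 - g1 * g2 - h1 * h2,   # real part
        e1 * f2 + f1 * e2 + g1 * h2 - h1 * g2,   # i component
        e1 * g2 - f1 * h2 + g1 * e2 + h1 * f2,   # j component
        e1 * h2 + f1 * g2 - g1 * f2 + h1 * e2,   # k component
    ], dim=-1)
```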
The equation above describes how quaternions Q and R interact under multiplication, indicating that the Hamilton product is essential in quaternion neural networks. In this study, we extensively employ the Hamilton product to learn the correlations between music and dance, which forms the foundation of QEAN and enhances its generalization ability. To implement our approach, we combine music and motion features into a single feature vector. Specifically, the 35-dimensional music features and the 219-dimensional motion features are concatenated into a 254-dimensional feature vector, based on dimension and time correspondence. Subsequently, we convert this concatenated feature vector into a sequence of quaternions: every group of four feature channels forms one quaternion, with one real part and three imaginary parts. As a result, the original 254-dimensional feature vector is transformed into a quaternion sequence of length 63 (the last two dimensions are disregarded as they are insufficient to form a complete quaternion). Finally, we feed this quaternion sequence into the spin position embedding for further processing, where a position encoding is assigned to each quaternion, enabling the capture of position information within the sequence.
By incorporating position information, the model gains a better understanding of the sequence and improves its performance accordingly.
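Below is a minimal sketch of this parameterization step, under the assumption that the per-frame music and motion features are simply concatenated and regrouped channel-wise into 4-tuples as described above; the function name and shapes are illustrative.

```python
import torch

def to_quaternion_sequence(music_feat, motion_feat):
    """Quaternion parameterization sketch: concatenate per-frame music (35-D) and
    motion (219-D) features into a 254-D vector, then regroup every 4 channels
    into one quaternion, discarding the 2 leftover channels (254 // 4 = 63)."""
    x = torch.cat([music_feat, motion_feat], dim=-1)      # (T, 254)
    n_quat = x.shape[-1] // 4                             # 63
    x = x[..., : n_quat * 4]                              # drop the last 2 channels
    return x.reshape(*x.shape[:-1], n_quat, 4)            # (T, 63, 4) quaternion sequence
```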
Fig. 3 [Images not available. See PDF.]
General overview of spin position embedding. Specifically, the input motion sequences and audio sequences are given feature vector representations after being encoded by their respective transformers. The feature vectors of the motion sequence are denoted x_m, and the feature vectors of the audio sequence are denoted x_n. These vectors are then multiplied by different rotation matrices R_m and R_n according to their positions m and n in their respective sequences, thereby fusing in the positional information. Finally, the encoded vectors of the motion sequence are transformed into query vectors q_m, and a dot product with the rotationally transformed key vectors k_n of the encoded audio sequence computes the correlation between the two modal sequences. With this spin position embedding, the model can better represent the positional information of the two sequences, as well as the correlation between them, thus improving the learning of cross-modal representations
Spin position embedding
The main types of position embedding are relative position embedding and absolute position embedding. The transformer model [14], proposed in 2017, introduced positional embedding to provide information about the position of each word or token in the input sequence, which is crucial for NLP tasks that rely heavily on the relative position of words. In the transformer, positional information is encoded with sine and cosine functions whose frequencies vary across dimensions, so that the model can learn the relative positions of tokens in the sequence. This sine–cosine scheme, also known as absolute position encoding, is easy to implement and depends directly on the absolute position without losing positional information. However, it generalizes poorly: the model adapts only to a specific sequence length, is prone to overfitting when the length varies, and performs poorly on long sequences. Relative positional embedding, on the other hand, derives the positional embedding from the relative distance or order between tokens rather than relying entirely on absolute positions. Such an embedding highlights the content relevance between tokens, which aids comprehension and avoids the excessive computation caused by absolute positional embeddings growing with position. Consequently, it improves generalization ability.
Motivated by the work of Su et al. [37], who proposed rotary position embedding, we design a positional embedding method that enhances the transformer architecture by integrating relative positional information into self-attention. The popular LLaMA 2 [9] model currently employs this kind of position embedding. We therefore borrow from Su et al. and embed the extracted audio and motion features into self-attention in the form of rotated positions, so as to better learn the features and improve computational efficiency. The basic idea is shown in Fig. 3.
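Before the detailed derivation, the following is a minimal sketch of the rotary mechanism that SPE builds on, assuming an even feature dimension and the standard inverse-frequency schedule of Su et al. [37]; names and defaults are illustrative.

```python
import torch

def spin_position_embedding(x, base=10000.0):
    """Apply rotary (spin) position embedding to query/key vectors.
    x: (B, T, d) with d even. Pairs of channels are rotated by an angle that
    grows linearly with the token position, so the score q_s . k_t depends on
    the positions only through s - t."""
    B, T, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)     # (d/2,)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None]  # (T, d/2)
    cos, sin = angles.cos()[None], angles.sin()[None]                     # (1, T, d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]                                   # channel pairs
    rot = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rot.flatten(-2)                                                # (B, T, d)
```

In use, the rotation is applied to the projected queries and keys, e.g. q = spin_position_embedding(x @ W_q) and k = spin_position_embedding(x @ W_k), so that the attention score between positions s and t depends only on their relative offset.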
First, we define a sequence of features of length N (since motion and audio features are processed in the same way by SPE, the subscript O in f_O below stands for the projection being applied, i.e., O ∈ {q, k, v}): S_N = {w_i}, i = 1, ..., N, where w_i represents the i-th token in the input sequence. The embedding corresponding to the input sequence is denoted E_N = {x_i}, i = 1, ..., N, where x_i ∈ R^d represents the d-dimensional embedding vector of the i-th token w_i.
Before performing the self-attention operations, we use the feature embedding vectors to calculate the q, k and v vectors and incorporate the corresponding positional information. The function expressions are as follows:

q_s = f_q(x_s, s)    (6)

k_t = f_k(x_t, t)    (7)

v_t = f_v(x_t, t)    (8)

Here, q_s represents the query vector for the s-th token with the positional information s integrated into the feature vector x_s, while k_t and v_t represent the key and value vectors for the t-th token with the positional information t integrated into the feature vector x_t. Then, we need to compute the output of self-attention for the s-th feature embedding vector x_s. This involves calculating an attention score between q_s and every k_t, multiplying the attention scores by the corresponding v_t, and summing to obtain the output vector o_s:

a_{s,t} = exp(q_s · k_t / √d) / Σ_{j=1}^{N} exp(q_s · k_j / √d)    (9)

o_s = Σ_{t=1}^{N} a_{s,t} v_t    (10)
Next, in order to leverage the relative positional relationships between the tokens, let us assume that the dot product operation between the query vector and the key vector is represented by a function g, whose inputs are the word embedding vectors x_s and x_t and their relative position s - t:

⟨f_q(x_s, s), f_k(x_t, t)⟩ = g(x_s, x_t, s - t)    (11)

We then look for an approach to position embedding that upholds the aforementioned relationship. In the two-dimensional case, a solution is:

f_q(x_s, s) = (W_q x_s) e^{isθ}    (12)

f_k(x_t, t) = (W_k x_t) e^{itθ}    (13)

g(x_s, x_t, s - t) = Re[(W_q x_s)(W_k x_t)* e^{i(s - t)θ}]    (14)

where * denotes the complex conjugate and θ is a preset non-zero constant. Here, x represents any real number, e is the base of the natural logarithm, and i is the imaginary unit of complex numbers. We can use Euler's formula, e^{ix} = cos x + i sin x, where cos x is the real part and sin x the imaginary part of the complex number.
After transformation, the formulas f_O and g can be changed to:

f_q(x_s, s) = (W_q x_s)(cos(sθ) + i sin(sθ))    (15)

f_k(x_t, t) = (W_k x_t)(cos(tθ) + i sin(tθ))    (16)

g(x_s, x_t, s - t) = Re[(W_q x_s)(W_k x_t)* (cos((s - t)θ) + i sin((s - t)θ))]    (17)

q_s = W_q x_s,    k_t = W_k x_t    (18)

Then, according to linear algebra, we can represent q_s using a matrix and, equivalently, as a complex number:

q_s = (q_s^{(1)}, q_s^{(2)})^T    (19)

q_s = q_s^{(1)} + i q_s^{(2)}    (20)

Therefore, multiplying these two complex numbers, we get the following result:

q_s e^{isθ} = (q_s^{(1)} cos(sθ) - q_s^{(2)} sin(sθ)) + i (q_s^{(1)} sin(sθ) + q_s^{(2)} cos(sθ))    (21)

Then, we find that the above expression is equal to the query vector multiplied by a rotation matrix:

f_q(x_s, s) = R(sθ) q_s,    where R(α) = [cos α  -sin α; sin α  cos α]    (22)

Similarly, the key vector can be represented as follows:

f_k(x_t, t) = R(tθ) k_t    (23)

By rearranging the above formulas, the dot product simplifies to an expression that depends on the positions s and t only through their difference:

⟨f_q(x_s, s), f_k(x_t, t)⟩ = q_s^T R((t - s)θ) k_t    (24)
With the above formulas, we can summarize the calculation process as follows: for each feature embedding vector in the token sequence, first compute its query and key vectors; then, for each token position, compute the corresponding rotary position embedding; next, apply the rotation transformation pairwise to the elements of the query and key vectors at each position; and finally, take the dot product between the query and key to obtain the result of self-attention.

Quaternion rotary attention
For the features produced by the spin position embedding module, we assume an N-length query series X and an M-length key–value series Z. First, the original X and Z are projected onto the representation space: Q = X W^Q, K = Z W^K and V = Z W^V.
Fig. 4 [Images not available. See PDF.]
Overview of our transformer structure. Our transformer structure enhances the generalization ability of the model by adding regularization such as dropout in multiple places and by adjusting the number of attention heads to expand the model capacity relative to the original design. The absolute position information of the input sequence is converted into a polar-coordinate representation of the relative position using spin position embedding, (r, θ), where r denotes the distance from the centre point and θ denotes the relative angle. This spin position embedding module provides better local relative positions with some rotational invariance. In this way, our model can better support tasks that are sensitive to position information, such as behavioural sequence modelling and 3D shape analysis
Here, Q represents the query vectors, K the keys, V the values, d the number of channels in the attention layer, and W the trainable weights. QRA then maps the query series to the output H using the key–value series through the following four steps (Fig. 4).
Frequency/phase generation:

ω^Q = ReLU(Conv1D(Q)),    φ^Q = σ(Conv1D(Q))    (25)

Series rotation:

Q′_s^{(p)} = ρ(ω^Q_{s,p} s + φ^Q_{s,p}) ⊗ Q_s,    K′_t^{(p)} = ρ(ω^K_{t,p} t + φ^K_{t,p}) ⊗ K_t    (26)

Series attention with a softmax kernel (shown in Fig. 5):

A_{s,t} = (1/P) Σ_{p=1}^{P} exp( ⟨Q′_s^{(p)}, K′_t^{(p)}⟩ / √d )    (27)

Series aggregation:

H_s = Σ_{t=1}^{M} ( A_{s,t} / Σ_{t'=1}^{M} A_{s,t'} ) V_t    (28)

where ρ(·) denotes the quaternion rotation by the learned angle, ⊗ denotes the Hamilton product, and σ(·) is the activation used for the latent phases.
Here, we hypothesize that the series have P periods, where P is a hyper-parameter. In the frequency/phase generation step, we utilize 1D convolutions with ReLU activation to generate the P latent frequencies ω^Q (ω^K is obtained similarly). Convolutions can effectively capture the local context of each time step to generate reliable latent frequencies, and these latent frequencies are not identical at each time step, implying variable periods. Moreover, to account for phase shifts, we additionally generate P latent phases φ^Q using 1D convolutions with an activation (φ^K is obtained similarly). In the series rotation step, we rotate the representations at each time step according to the latent frequencies and phases learned in the previous step. Each row vector of Q′ (resp. K′) is the quaternion form of the corresponding row vector of Q (resp. K), and s and t are the position indices of series Q and K, respectively. In the series attention step, to capture position-wise similarity under multiple periods in an integrated way, the unnormalized similarity is the mean of the quaternion dot products under the multiple rotations. Finally, in the series aggregation step, the output H is generated using the softmax-normalized similarity. In practice, we employ the multi-head variant of QRA and do not go into details here, as it can be derived quite directly. Notice that QRA is more expressive than canonical dot-product attention: when P = 1, ω = 0 and φ = 0, QRA degenerates into canonical attention.

Fig. 5 [Images not available. See PDF.]
A three-dimensional illustration of the rotated softmax kernel. The rotated softmax kernel represents the embeddings in quaternion form and rotates them by the angular frequency ω. Thus, embeddings with different phases can be distinguished. Finally, the similarity of the rotated embeddings is measured by the exponentiated dot product between them
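To ground the four steps above, here is a simplified, single-head sketch in PyTorch. It is not the paper's implementation: channel pairs are rotated as a complex-valued stand-in for the full quaternion rotation, the activations for the latent frequencies and phases (ReLU and tanh) are assumptions, and all module and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QRASketch(nn.Module):
    """Simplified single-head sketch of quaternion rotary attention with P latent
    periods: 1-D convolutions over time produce latent frequencies/phases, the
    projected series are rotated per time step and period, and the similarity is
    the mean dot product over periods, followed by softmax aggregation."""
    def __init__(self, d_model, n_periods=2, kernel_size=3):
        super().__init__()
        self.P = n_periods
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.freq = nn.Conv1d(d_model, n_periods, kernel_size, padding=kernel_size // 2)
        self.phase = nn.Conv1d(d_model, n_periods, kernel_size, padding=kernel_size // 2)

    def rotate(self, x, series):
        # x: (B, T, d) with d even; returns (B, P, T, d) rotated copies, one per period
        B, T, d = x.shape
        t = torch.arange(T, dtype=x.dtype, device=x.device)[None, None, :]   # (1, 1, T)
        omega = F.relu(self.freq(series.transpose(1, 2)))                    # (B, P, T) latent frequencies
        phi = torch.tanh(self.phase(series.transpose(1, 2)))                 # (B, P, T) latent phases (assumed activation)
        theta = (omega * t + phi).unsqueeze(-1)                              # (B, P, T, 1)
        x1, x2 = x[..., 0::2].unsqueeze(1), x[..., 1::2].unsqueeze(1)        # (B, 1, T, d/2)
        cos, sin = theta.cos(), theta.sin()
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    def forward(self, x_q, x_kv):
        q, k, v = self.q_proj(x_q), self.k_proj(x_kv), self.v_proj(x_kv)
        d = q.shape[-1]
        qr, kr = self.rotate(q, x_q), self.rotate(k, x_kv)                   # (B, P, N, d), (B, P, M, d)
        sim = torch.einsum('bpnd,bpmd->bnm', qr, kr) / (self.P * d ** 0.5)   # mean over periods, scaled
        return torch.softmax(sim, dim=-1) @ v                                # (B, N, d)
```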
Experiments
Datasets
The AIST++ [16] dance movement dataset was constructed from the AIST dance video database [38]. A well-developed pipeline was designed for estimating camera parameters, 3D human keypoints and 3D human dance movement sequences from multi-view videos. The dataset provides 3D human keypoint annotations and camera parameters for 10.1 million images covering 30 different subjects in 9 viewpoints. These features make it the largest and richest dataset containing 3D human keypoint annotations currently available. Additionally, the dataset contains 1408 3D human dance movement sequences represented as joint rotations and root trajectories. These dance movements are evenly distributed across 10 dance genres and contain hundreds of choreographies. The duration of the movements ranges from 7.4 to 48.0 s, and each dance movement is accompanied by corresponding music. Based on these annotations, AIST++ is designed to support multiple tasks including multi-view human keypoint estimation, human motion prediction/generation, and cross-modal analysis between human motion and music.
Implementation details
In our primary experiments, the model takes a seed motion sequence spanning 120 frames (2 s) and a music sequence covering 240 frames (4 s) as input. These two sequences are aligned at the initial frame, and the model's output is a future motion sequence supervised by an L2 loss. During inference, future motions are continuously generated in an auto-regressive manner at 60 FPS, with only the first predicted motion retained at each step. For music feature extraction, we employ the publicly available audio processing toolbox Librosa [39]; the features include a one-dimensional envelope, 20-dimensional MFCCs, 12-dimensional chroma, one-dimensional one-hot peaks and one-dimensional one-hot beats, resulting in a 35-dimensional music feature. The motion features combine a nine-dimensional rotation-matrix representation for all 24 joints with a three-dimensional global translation vector, resulting in a 219-dimensional motion feature. These raw audio and motion features are initially embedded into 800-dimensional hidden representations using linear layers, with learnable position embeddings added before inputting them into the transformer layers. All three transformers (audio, motion, cross-modal) have 16 attention heads and a hidden size of 800. In terms of training details, all experiments are trained using the Adam optimizer with a batch size of 16. The learning rate starts at 1e−4 and decays to 1e−5 and 1e−6 at 90k and 150k steps, respectively. Training concludes after 500k steps, taking approximately 2 days on one RTX 3090. The baseline comparison includes the latest works on 3D dance generation with music and seed motion as input, such as Li et al. [4, 16]. For a more comprehensive evaluation, we also compare with the recent state-of-the-art 2D dance generation method DanceRevolution [5], adapted to output 3D joint positions for a direct quantitative comparison with our results, even though joint positions do not allow for immediate re-targeting. The official code provided by the authors is used to train and test these baselines on the same dataset as ours.
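As an illustration, the 35-D music features could be assembled with Librosa roughly as follows; the sampling rate, hop length and the binarization of peaks and beats are assumptions chosen so that one feature frame corresponds to 1/60 s, not the paper's exact preprocessing.

```python
import librosa
import numpy as np

def extract_music_features(path):
    """Sketch of the 35-D music features: 1-D envelope, 20-D MFCC, 12-D chroma,
    1-D one-hot peaks and 1-D one-hot beats, stacked per frame."""
    hop = 512
    sr = 60 * hop                    # with hop_length=512 this yields 60 feature frames per second
    y, _ = librosa.load(path, sr=sr)
    envelope = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)        # (T,)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop, n_mfcc=20)         # (20, T)
    chroma = librosa.feature.chroma_cens(y=y, sr=sr, hop_length=hop)           # (12, T)
    T = envelope.shape[0]
    peak_frames = librosa.onset.onset_detect(onset_envelope=envelope, sr=sr, hop_length=hop)
    _, beat_frames = librosa.beat.beat_track(onset_envelope=envelope, sr=sr, hop_length=hop)
    peaks = np.zeros(T)
    peaks[peak_frames] = 1.0         # one-hot peaks
    beats = np.zeros(T)
    beats[beat_frames] = 1.0         # one-hot beats
    T = min(T, mfcc.shape[1], chroma.shape[1])
    return np.concatenate([envelope[None, :T], mfcc[:, :T], chroma[:, :T],
                           peaks[None, :T], beats[None, :T]], axis=0).T        # (T, 35)
```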
Quantitative evaluation
In this section, we assess the performance of our proposed QEAN across three key dimensions: (1) motion quality, (2) generation diversity, and (3) motion–music correlation. The results presented in Table 1 demonstrate that under identical experimental conditions, our model surpasses state-of-the-art methods [2, 6, 7] in these aspects.
Motion Quality Similar to previous studies, we assess the quality of generated motion by computing the Frechet inception distance (FID) [40], which measures the dissimilarity between the distributions of generated and ground-truth motion. To capture motion features, we utilize two carefully crafted motion feature extractors, as undisclosed motion encoders were employed in earlier works [41]. These extractors include: (1) a geometric feature extractor, generating a Boolean vector that represents geometric relationships among specific body points in the motion sequence, and (2) a dynamic feature extractor, mapping the motion sequence to dynamic aspects such as velocity and acceleration. We denote the FID based on these geometric and dynamic features as FID_g and FID_k, respectively. The metrics are computed by comparing the real dance motion sequences in the AIST++ test set with 40 generated motion sequences, each comprising 1200 frames (20 s). As depicted in Table 1, our generated motion sequences exhibit distributions that are closer to ground-truth motion than those of the three compared methods.
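For reference, here is a sketch of the Frechet distance underlying FID_k and FID_g, assuming the geometric or dynamic features have already been extracted; the feature extractors themselves are not reproduced here.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between two sets of motion features (rows = samples):
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):    # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))
```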
Table 1. Conditional motion generation evaluation on the AIST++ dataset

| Methods | Motion quality | | Motion diversity | | Motion–music corr |
| | FIDk | FIDg | Distk | Distg | BeatAlign |
|---|---|---|---|---|---|
| AIST++ | – | – | 9.057 | 7.556 | 0.292 |
| AIST++ (random) | – | – | – | – | 0.213 |
| Li et al. [5] | 86.43 | 20.58 | 6.85 | 4.93 | 0.232 |
| DanceNet [6] | 69.18 | 17.76 | 2.86 | 2.72 | 0.232 |
| DanceRevolution [7] | 73.42 | 31.01 | 3.52 | 2.46 | 0.22 |
| FACT (baseline) [16] | 48.95 | 28.1 | 4.9 | 6.69 | 0.232 |
| Ours | 30.1 | 11.5 | 7.82 | 9.37 | 0.239 |
Compared to the three recent state-of-the-art methods, our generated motions are more realistic, better correlated with the input music and more diverse
Table 2. Ablation study on spin position embedding and quaternion rotary attention
| Methods | FIDk | FIDg | BeatAlign |
|---|---|---|---|
| Baseline | 48.95 | 28.1 | 0.232 |
| Baseline + Spin Position Embedding | 30.1 | 11.5 | 0.239 |
| Baseline + QRA | 46.33 | 26.2 | 0.236 |
As illustrated in the table, the experimental results clearly demonstrate the effectiveness of our proposed method. The table shows a significant improvement in the performance metrics when compared to the baseline approach
Fig. 6 [Images not available. See PDF.]
Frame extraction. The visual representation clearly emphasizes the effectiveness of our proposed method
Fig. 7 [Images not available. See PDF.]
This image illustrates the frame-extraction effect of a dance generated by alternative methods. In the later part of the generated sequence, the dance movements exhibit pose collapse and implausible limb folding
Generation Diversity We also assess the model's capacity to generate diverse dance movements in response to different input music, comparing its performance with baseline methods. Following a methodology similar to previous research [42], we compute the average Euclidean distance in the feature space of 40 motions generated from the AIST++ test set to quantify diversity. The motion diversity in the geometric and dynamic feature spaces is denoted as Dist_g and Dist_k, respectively. Table 1 illustrates that our method excels in generating more diverse dance movements in comparison with the baselines, with the exception of Tendulkar et al. [26], which discretizes motions, resulting in discontinuous outputs and an elevated Dist_k.
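A sketch of this diversity measure is shown below, assuming one feature vector per generated motion; names are illustrative.

```python
import numpy as np

def average_feature_distance(feats):
    """Diversity metric sketch: mean pairwise Euclidean distance between the
    feature vectors of the generated motions (rows = generated sequences)."""
    n = feats.shape[0]
    dists = [np.linalg.norm(feats[i] - feats[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))
```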
Motion–Music Correlation Moreover, we gauge the correlation between the generated 3D motion and the input music by introducing a novel metric, the Beat Alignment Score, which evaluates the motion–music correlation by measuring the similarity between the beats in the motion and the music. Librosa [39] is employed to extract music beats, while motion beats are computed as local minima of the motion velocity. The Beat Alignment Score is articulated as the average distance between each motion beat and its nearest music beat. To be specific, our Beat Alignment Score is defined as:

BeatAlign = (1 / |B^m|) Σ_{t^m ∈ B^m} exp( - min_{t^y ∈ B^y} ||t^m - t^y||^2 / (2σ^2) )    (29)

where B^m is the set of motion beats, B^y is the set of music beats, and σ is a parameter for normalizing sequences with different FPS. We use the same σ for all experiments since the FPS of all our experimental sequences is 60. A similar metric called Beat Hit Rate has been introduced in prior work, but it relies on manually set thresholds for alignment ("hits") that depend on the dataset, while our metric directly measures distances. This metric is explicitly designed to be unidirectional, as dance movements do not necessarily need to match every music beat; on the other hand, each dynamic beat should have a corresponding music beat. To calibrate the results, we compute the correlation metrics for the entire AIST++ dataset (upper bound) and for randomly paired data (lower bound). As shown in Table 1, our generated motion shows better correlation with the input music compared to the baselines. However, there is still considerable room for improvement for all methods, including ours, when compared to real data, which reflects that music–motion correlation remains a challenging problem.
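A sketch of the Beat Alignment Score of Eq. (29), assuming beats are given as frame indices; the default value of σ is an assumption, not taken from the paper.

```python
import numpy as np

def beat_alignment_score(motion_beats, music_beats, sigma=3.0):
    """For every motion beat, take an exponential of the squared distance to its
    nearest music beat and average over all motion beats (Eq. 29). sigma
    normalizes sequences with different FPS (value assumed here)."""
    motion_beats = np.asarray(motion_beats, dtype=float)
    music_beats = np.asarray(music_beats, dtype=float)
    scores = [np.exp(-np.min((t - music_beats) ** 2) / (2.0 * sigma ** 2))
              for t in motion_beats]
    return float(np.mean(scores))
```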
Ablation study
We conducted ablation studies on the spin position embedding and multi-modal quaternion parameterization, respectively. The quantitative scores are shown in Table 2.
Position Embedding In the ablation experiments focused on position coding, we explore two distinct approaches and conduct experiments based on the following configurations: (1) a learnable coding approach for absolute positions (baseline), and (2) a rotary coding approach for relative positions. Method 2 was selected to introduce explicit relative position dependence in the self-attention formulation. This choice offers increased flexibility in sequence length, a potential reduction in dependencies between tokens, and the capacity to encode relative positions for linear self-attention. As illustrated in Table 2, we observe that the rotational embedding method of relative position leads to a significant reduction in the FID values compared to the original method. This indicates that dances generated using rotary position embedding are notably closer to reality.
Quaternion parameterization Here, we performed ablation experiments on the original baseline as well as with the addition of the QRA module. Through Table 2, we can observe that quaternion rotary attention (QRA), by introducing quaternion operations, is able to fully explore the relationship between audio and motion and achieves more significant enhancement results (Figs. 6, 7).
Conclusion
We propose a network called QEAN for generating 3D dance movements. QEAN employs spin position embedding (SPE) in the position encoding part to embed position information in a rotational manner into self-attention, which improves the model's representation of the sequences and enhances its understanding of the temporal order of human movements. Additionally, we propose quaternion rotary attention (QRA), a quaternion-valued relational learning module that uses quaternion values to explore the temporal coordination between music and dance. To demonstrate the superiority of QEAN, we conducted experiments on the AIST++ dataset. The experimental results demonstrate the superiority of our approach in the 3D dance generation task, and our ablation experiments illustrate the importance of SPE and QRA for this task.
Acknowledgements
This work was supported in part by National Natural Science Foundation under Grant 92267107, the Science and Technology Planning Project of Guangdong under Grant 2021B0101220006, Science and Technology Projects in Guangzhou under Grant 202201011706, Key Areas Research and Development Program of Guangzhou under Grant 2023B01J0029, Science and technology research in key areas in Foshan under Grant 2020001006832, Key Area Research and Development Program of Guangdong Province under Grant 2018B010109007 and 2019B010153002, Science and technology projects of Guangzhou under Grant 202007040006, and Guangdong Provincial Key Laboratory of Cyber-Physical System under Grant 2020B1212060069.
Author Contributions
Methodology, Zhizhen Zhou, Yejing Huo; Writing original draft, Zhizhen Zhou, Yejing Huo; Writing review and editing, Guoheng Huang, Xuhang Chen, Lian Huang, Zinuo Li; Supervision, Guoheng Huang and An Zeng; Funding acquisition, Guoheng Huang, Lian Huang, An Zeng. All authors have read and agreed to the published version of the manuscript.
Data availability
No datasets were generated or analysed during the current study.
Declarations
Conflict of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Yang, Y; Zhang, E. Cultural thought and philosophical elements of singing and dancing in Indian films. Trans/Form/A ç ã o; 2023; 46, pp. 315-328. [DOI: https://dx.doi.org/10.1590/0101-3173.2023.v46n4.p315]
2. Siciliano, M. A citation analysis of business librarianship: examining the Journal of Business and Finance Librarianship from 1990–2014. J. Bus. Finance Librariansh.; 2017; 22, pp. 81-96. [DOI: https://dx.doi.org/10.1080/08963568.2017.1285747]
3. Aristidou, A; Stavrakis, E; Papaefthimiou, M; Papagiannakis, G; Chrysanthou, Y. Style-based motion analysis for dance composition. Vis. Comput.; 2018; 34, pp. 1725-1737. [DOI: https://dx.doi.org/10.1007/s00371-017-1452-z]
4. Li, Ji., Yin, Y., Chu, H., Zhou, Y., Wang, T., Fidler, S., Li, H.: Learning to generate diverse dance motions with transformer. In: arXiv:2008.08171. https://api.semanticscholar.org/CorpusID:221173065 (2020)
5. Huang, R., Hu, H., Wu, W., Sawada, K., Zhang, M., Jiang, D.: Dance revolution: long-term dance generation with music via curriculum learning. In: International Conference on Learning Representations. https://api.semanticscholar.org/CorpusID:235614403 (2020)
6. Zhang, X., Xu, Y., Yang, S., Gao, L., Sun, H.: Dance generation with style embedding: learning and transferring latent representations of dance styles. In: arXiv:1041.4802. https://api.semanticscholar.org/CorpusID:233476346 (2021)
7. Huang, R., Hu, H., Wu, W., Sawada, K., Zhang, M., Jiang, D.: Dance revolution: long-term dance generation with music via curriculum learning. In: International conference on learning representations. https://api.semanticscholar.org/CorpusID:235614403 (2020)
8. Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.M.: Scheduled sampling for sequence prediction with recurrent neural networks. In: arXiv:1506.03099. https://api.semanticscholar.org/CorpusID:1820089 (2015)
9. Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., Malik, J.: Learning individual styles of conversational gesture. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3492–3501. https://api.semanticscholar.org/CorpusID:182952539 (2019)
10. Sheng, B; Li, P; Ali, R; Philip Chen, CL. Improving video temporal consistency via broad learning system. IEEE Trans. Cybern.; 2022; 52,
11. Xie, Z; Zhang, W; Sheng, B; Li, P; Chen, CP. BaGFN: broad attentive graph fusion network for high-order feature interactions. IEEE Trans. Neural Netw. Learn. Syst.; 2021; 34, pp. 4499-4513. [DOI: https://dx.doi.org/10.1109/TNNLS.2021.3116209]
12. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: transformers for image recognition at scale. In: arXiv:2010.11929. https://api.semanticscholar.org/CorpusID:225039882 (2020)
13. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF international conference on computer vision (ICCV), pp. 9992–10002. https://api.semanticscholar.org/CorpusID:232352874 (2021)
14. Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Neural Information Processing Systems. https://api.semanticscholar.org/CorpusID:13756489 (2017)
15. Lin, X; Sun, S; Huang, W; Sheng, B; Li, P; Feng, DD. EAPT: efficient attention pyramid transformer for image processing. IEEE Trans. Multimed.; 2021; 25, pp. 50-61. [DOI: https://dx.doi.org/10.1109/TMM.2021.3120873]
16. Li, R., Yang, S., Ross, D.A., Kanazawa, A.: AI choreographer: music conditioned 3D dance generation with AIST++. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13381–13392. https://api.semanticscholar.org/CorpusID:236882798 (2021)
17. Siyao, L., Yu, W., Gu, T., Lin, C., Wang, Q., Qian, C., Loy, C.C., Liu, Zi.: Bailando: 3D dance generation by actor-critic GPT with choreographic memory. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11040–11049. https://api.semanticscholar.org/CorpusID:247627867 (2022)
18. Pavllo, D; Feichtenhofer, C; Auli, M; Grangier, D. Modeling human motion with quaternion-based neural networks. Int. J. Comput. Vis.; 2019; 128, pp. 855-872. [DOI: https://dx.doi.org/10.1007/s11263-019-01245-6]
19. Ma, W., Yin, M., Li, G., Yang, F., Chang, K.: PCMG:3D point cloud human motion generation based on self-attention and transformer. In: The Visual Computer. https://api.semanticscholar.org/CorpusID:261566852 (2023)
20. Greenwood, D., Laycock, S.D., Matthews, I.: Predicting head pose from speech with a conditional variational autoencoder. In: Interspeech. https://api.semanticscholar.org/CorpusID:11113871 (2017)
21. Huang, Y., Zhang, J., Liu, S., Bao, Q., Zeng, D., Chen, Z., Liu, W.: Genre-conditioned long-term 3D dance generation driven by music. In: ICASSP 2022—2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4858–4862. https://api.semanticscholar.org/CorpusID:249437513 (2022)
22. Hochreiter, S; Schmidhuber, J. Long short-term memory. Neural Comput.; 1997; 9, pp. 1735-1780. [DOI: https://dx.doi.org/10.1162/neco.1997.9.8.1735]
23. Yu, Q., He, J., Deng, X., Shen, X., Chen, L.-C.: Convolutions die hard: open-vocabulary segmentation with single frozen convolutional CLIP. In: arXiv:2308.02487. https://api.semanticscholar.org/CorpusID:260611350 (2023)
24. Tsai, Y.H.H., Bai, S., Liang, P.P., Kolter, J.Z., Morency, L.P., Salakhutdinov, R.: Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the conference. Association for Computational Linguistics. Meeting 2019, pp. 6558–6569. https://api.semanticscholar.org/CorpusID:173990158 (2019)
25. Wu, Z., Xu, J., Zou, X., Huang, K., Shi, X., Huang, J.: EasyPhoto: your smart AI photo generator. https://api.semanticscholar.org/CorpusID:263829612 (2023)
26. Tendulkar, P., Das, A., Kembhavi, A., Parikh, D.: Feel the music: automatically generating a dance for an input song. In: arXiv:2006.11905. https://api.semanticscholar.org/CorpusID:219572850 (2020)
27. Kundu, J.N., Buckchash, H., Mandikal, P., Jamkhandi, A., Radhakrishnan, V.B.: Cross-conditioned recurrent networks for long-term synthesis of inter-person human motion interactions. In: 2020 IEEE winter conference on applications of computer vision (WACV), pp. 2713–2722. https://api.semanticscholar.org/CorpusID:214675800 (2020)
28. Li, L., Lei, J., Gan, Z., Yu, L., Chen, Y.-C., Pillai, R.K., Cheng, Y., Zhou, L., Wang, X.E., Wang, W.Y., Berg, T.L., Bansal, M., Liu, J., Wang, L., Liu, Z.: VALUE: a multi-task benchmark for video-and-language understanding evaluation. In: arXiv:2106.04632. https://api.semanticscholar.org/CorpusID:235377363 (2021)
29. Ghosh, P., Song, J., Aksan, E., Hilliges, O.: Learning human motion models for long-term predictions. In: 2017 International Conference on 3D Vision (3DV), pp. 458–466. https://api.semanticscholar.org/CorpusID:13549534 (2017)
30. Wu, C., Yin, S.-K., Qi, W., Wang, X., Tang, Z., Duan, N.: Visual ChatGPT: talking, drawing and editing with visual foundation models. In: arXiv:2303.04671. https://api.semanticscholar.org/CorpusID:257404891 (2023)
31. Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., Tang, J.: GLM: general language model pretraining with autoregressive blank infilling. In: Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:247519241 (2021)
32. Bai, Z; Chen, X; Zhou, M; Yi, T; Chien, W-C. Low-rank multimodal fusion algorithm based on context modeling. J. Internet Technol.; 2021; 22,
33. Holden, D., Saito, J., Komura, T.: A deep learning framework for character motion synthesis and editing. ACM Trans. Graph. (TOG) 35, 1–11 (2016)
34. Qiu, H., Wang, C., Wang, J., Wang, N., Zeng, W.: Cross view fusion for 3D human pose estimation. In: 2019 IEEE/CVF international conference on computer vision (ICCV), pp. 4341–4350. https://api.semanticscholar.org/CorpusID:201891326 (2019)
35. Zhu, Y., Olszewski, K., Wu, Y., Achlioptas, P., Chai, M., Yan, Y., Tulyakov, S.: Quantized GAN for complex music generation from dance videos. In: arXiv:2204.00604. https://api.semanticscholar.org/CorpusID:247922422 (2022)
36. Zheng, Z; Huang, G; Yuan, X; Pun, C-M; Liu, H; Ling, W-K. Quaternion-valued correlation learning for few-shot semantic segmentation. IEEE Trans. Circuits Syst. Video Technol.; 2023; 33, pp. 2102-2115. [DOI: https://dx.doi.org/10.1109/TCSVT.2022.3223150]
37. Su, J., Lu, Y., Pan, S., Wen, B., Liu, Y.: RoFormer: enhanced transformer with rotary position embedding. In: arXiv:2104.09864. https://api.semanticscholar.org/CorpusID:233307138 (2021)
38. Tsuchida, S., Fukayama, S., Hamasaki, M., Goto, M.: AIST dance video database: multi-genre, multi-dancer, and multi-camera database for dance information processing. In: International Society for Music Information Retrieval Conference. https://api.semanticscholar.org/CorpusID:208334750 (2019)
39. McFee, B., Raffel, C., Liang, D., Ellis, D.P.W., McVicar, M., Battenberg, E., Nieto, O.: librosa: audio and music signal analysis in python. In: SciPy. https://api.semanticscholar.org/CorpusID:33504 (2015)
40. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local nash equilibrium. In: Neural Information Processing Systems. https://api.semanticscholar.org/CorpusID:326772 (2017)
41. Onuma, K., Faloutsos, C., Hodgins, J.K.: FMDistance: a fast and effective distance function for motion capture data. In: Eurographics. https://api.semanticscholar.org/CorpusID:8323054 (2008)
42. Tan, H.H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: Conference on Empirical Methods in Natural Language Processing. https://api.semanticscholar.org/CorpusID:201103729 (2019)